Algorithms | Article | Open Access

5 February 2025

Open and Extensible Benchmark for Explainable Artificial Intelligence Methods

Faculty of Digital Transformations, ITMO University, Saint Petersburg 197101, Russia
This article belongs to the Section Evolutionary Algorithms and Machine Learning

Abstract

The interpretability requirement is one of the largest obstacles when deploying machine learning models in various practical fields. Methods of eXplainable Artificial Intelligence (XAI) address those issues. However, the growing number of different solutions in this field creates a demand to assess the quality of explanations and compare them. In recent years, several attempts have been made to consolidate scattered XAI quality assessment methods into a single benchmark. Those attempts usually suffered from a focus on feature importance only, a lack of customization, and the absence of an evaluation framework. In this work, the eXplainable Artificial Intelligence Benchmark (XAIB) is proposed. Compared to existing benchmarks, XAIB is more universal, extensible, and has a complete evaluation ontology in the form of the Co-12 Framework. Due to its special modular design, it is easy to add new datasets, models, explainers, and quality metrics. Furthermore, an additional abstraction layer built with an inversion of control principle makes them easier to use. The benchmark will contribute to artificial intelligence research by providing a platform for evaluation experiments and, at the same time, will contribute to engineering by providing a way to compare explainers using custom datasets and machine learning models, which brings evaluation closer to practice.

1. Introduction

Approaches based on machine learning (ML) models are incorporated into an increasing number of fields, replacing or augmenting traditional approaches. As the size of the models increases over time to tackle more complex tasks, it becomes harder to understand their outcomes. In addition, there is an increasing interest in systems that satisfy not only the accuracy criterion but also a set of additional criteria such as fairness, safety, or providing the right to explanation [1]. For some areas, satisfying those criteria is a make-or-break condition for the adoption of an ML system [2,3,4].
To address these issues, eXplainable Artificial Intelligence (XAI) has emerged [5]. It encompasses a variety of approaches, from highlighting relevant parts of the input data to showing similar examples or even providing verbal explanations.
While actively expanding the variety of explanation algorithms, the field of XAI is often criticized for its lack of rigor and evaluation standards [5,6]. Given the uncertainty of explainability definitions noted in many recent works [7], it seems necessary to explicitly state which one is used here. This work adopts the same definition as the work of Doshi-Velez: interpretability is the ability to explain or present information in understandable terms to a human. To further clarify, for inclusion purposes, no distinction is made between interpretability and explainability, similar to the work of Nauta et al. [8].
The growing number of explanation solutions with different approaches demands a way to compare them. Isolated evaluation-centered papers exist, but they are very limited compared to benchmarks.
In recent years, several attempts have been made to build a benchmark for XAI methods. However, the existing benchmarks have a number of drawbacks: for example, they focus only on feature importance, offer limited or no openness and extensibility, measure a limited set of interpretability properties that are usually not justified explicitly, rely on ground-truth-based metrics alone, and neglect important documentation, versioning, and software distribution aspects.
In this work, a new XAI Benchmark (XAIB) is proposed (https://oxid15.github.io/xai-benchmark/index.html (accessed on 29 October 2024)). It was designed keeping in mind the diversity of the XAI landscape. It was important for the generality to be able to include different explanation types, models, and datasets. At the start, it already features two different explanation types, as well as various datasets, models, and explainers that can be used in many combinations. Special efforts were made to make it easy to use and extend. It features thin, high-level interfaces for easier setup while not obfuscating the internals, opening itself for customization and extension. Finally, its XAI evaluation system is based on a comprehensive framework of 12 properties, ensuring the completeness of the evaluation for every explainer type.

3. Benchmark Structure

The main idea of the XAIB design is to push the trade-off between extensibility and ease of use. If the system is very extensible and customizable, it tends to overwhelm new users. In the opposite scenario, when it is very easy to use, it usually hides a lot of complexity under the hood, which can hinder customization for users with in-depth knowledge.
Since simplicity and customization create a trade-off, they cannot be significantly increased at the same time. Taking into account the current state of the XAI field, the XAIB prioritizes simplicity for the user first. The popularity of XAI and general interpretability awareness are still not high, so configuration complexity may overwhelm many users who are not familiar with XAI or programming in general [22]. This is the motivation for making the XAIB easy to use while allowing deep enough extensibility to include various explanation methods in one benchmark.
This section provides an overview of high-level entities, the XAIB modules, and its dependencies. Then, it focuses on how extensibility and consistency are ensured. The final subsections offer the categorization of potential users, their needs, and the use cases of the benchmark.

3.1. High-Level Entities

In order to name the entities that may be required for XAI evaluation, a typical workflow for this task should be summarized. It starts with the data, which are required to train the model, if necessary, and to evaluate it. A machine learning model and a dataset are then passed to the explainer to extract explanations. Those are, in turn, used to evaluate the explainer using a set of quality metrics. After the computation, the metric values should be stored for further analysis or visualization.
This is the general outline of an XAI evaluation experiment. Figure 1 shows how it is handled in the proposed benchmark. According to this description, each experiment requires the handling of data, models, explainers, metrics, and the experiment runs themselves.
Figure 1. Three stages of the general XAIB workflow—Setup, Experiment, and Visualization. Each Setup is a unit of evaluation. It contains all the parameters and entities needed to obtain the values. The execution pipeline takes setups and executes them, writing down the values. The values can then be manually analyzed or put into the visualization stage.

3.1.1. Data

The dataset is the first object required for the evaluation of any explainer, whether that is a training set used to fit a model or an evaluation set used to obtain explanations. This is why data handling is the first thing needed when building a benchmark. In the XAIB, data are represented by a special dataset class. Datasets in the XAIB were built with the diversity of the XAI landscape in mind. They represent an access point for any data source and encapsulate data loading and access while providing a unified interface. Each dataset is a sequence of data points, and each data point is represented by a Python dictionary. The reason for this is that datasets in machine learning are usually complex, with different layers of labeling, and they can also be multimodal. Dictionaries enable handling different data channels, for example, the features and labels, without the risk of confusion because each channel has a unique dictionary key. This configuration also enables automatic data validation to be built on top.
As an example, some datasets from the sklearn library were included in the benchmark. Since toy datasets are small, they are loaded into memory at the initialization of a dataset object and can then be accessed by index. For the classification task, the output object is a dictionary with the keys item and label.
The interface does not place strong requirements on how data should be stored or loaded, enabling us to include a variety of datasets into the benchmark and to access them in the same way, which is an important aspect of the extensibility of the benchmark.
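As an illustration, the following is a minimal sketch of a custom in-memory dataset. It assumes that the dataset base class is exposed as xaib.Dataset; the import path and the required methods are assumptions based on the description above rather than the exact XAIB API.

    import numpy as np
    from xaib import Dataset  # assumed import path

    class ArrayDataset(Dataset):
        # Wraps feature and label arrays and serves dictionary data points.
        def __init__(self, x, y, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.x = np.asarray(x)
            self.y = np.asarray(y)

        def __getitem__(self, index):
            # Each channel gets a unique key, mirroring the item/label convention.
            return {"item": self.x[index], "label": self.y[index]}

        def __len__(self):
            return len(self.x)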

3.1.2. Models

To explain a model’s predictions, the model itself is required. Models need special care since they come with a multitude of requirements. In the evaluation workflow, they may need to be trained, evaluated, and have their state saved and loaded later. All of this should be handled in the same fashion, regardless of the model type. For handling models, the XAIB features the model class. The model encapsulates all that is needed for inference and, if required, training. It is a wrapper around an inference method built to be adapted to any backend. Models can be trained, used for inference, evaluated, saved, and loaded in the same fashion. The interface is similar to the widely known sklearn library, which should simplify working with models for users who are already familiar with that package. As with the data, the interface is general enough to allow various types of models to be included. Datasets require a special output format, such as a dictionary; models do not have such requirements. However, explainers can have input requirements, and not every model is compatible with every explainer, at least by input type.
As with the datasets, the default wrapper for the sklearn models was implemented. It manages the training and evaluation of models while maintaining a common interface.
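As an example, the following is a hedged sketch of wrapping a trivial custom classifier. The base class import path and the exact method names (fit, predict) are assumptions based on the sklearn-like interface described above, not the exact XAIB API.

    import numpy as np
    from xaib import Model  # assumed import path

    class MajorityClassModel(Model):
        # Toy classifier that always predicts the most frequent training label.
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._label = None

        def fit(self, x, y):
            values, counts = np.unique(y, return_counts=True)
            self._label = values[np.argmax(counts)]

        def predict(self, x):
            return np.full(len(x), self._label)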

3.1.3. Explainers

Explainers are built similarly to the models. They include similar methods since their evaluation workflow is usually the same. Only one assumption was made: by default, the model is required at inference time. This is not required for the descendants of the explainer class but was made for consistency, since explainers usually require the model object at the time of obtaining explanations.
The general idea on which the XAIB is built is the self-sufficiency of entities. In the context of an explainer, this means that it should accept all configuration parameters on initialization and then be able to explain model predictions when given only the model and the data. This abstraction enables the unified handling of different entities within the benchmark. All models or explainers that comply with this can be treated the same within a single category that sets the input and output requirements. The outputs of feature importance and example-based approaches, for example, will certainly differ.

3.1.4. Metrics

The evaluation is handled using a set of metrics. Unlike other similar benchmarks, the XAIB treats metrics as separate objects with their own metadata and not just as functions. Metrics are handled like this because, in the XAIB, they are not just functions but have their own states. They store a creation time, name, value, direction (the higher the better or the lower the better, denoted by “up” and “down” in the library), and references to the dataset and model they were computed on.
A metric is a complex object that has three major roles in the XAIB. Its primary role is value storage: metrics store a single scalar value that is considered their main value. Metrics are also functions: they encapsulate the way they compute their values, similar to explainers and models, and after configuration, they should be able to be computed. The last role is metadata management: the metrics record the dataset and the model that were used for the calculation. This information is then used to trace values back to their setups and assess the quality of explanations in different cases.
To make metric design more flexible, given that the main value can only be a scalar, the XAIB also features fields like interval and extra. The interval can be used if a confidence interval was computed, so this field can store the upper and lower boundaries of a value. The extra field is made to be adjustable for more complex scenarios; it is a dictionary for additional metadata or supplementary values without format requirements. These measures should allow the metric class to be used flexibly to wrap more complex metrics from the field.
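The following self-contained snippet illustrates the metric structure described above (value, direction, interval, extra, and references to the dataset and model); it mirrors the description rather than reproducing the actual XAIB class.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Any, Dict, Optional, Tuple

    @dataclass
    class MetricRecord:
        name: str
        value: Optional[float] = None                         # the single scalar main value
        direction: str = "up"                                 # "up": higher is better, "down": lower is better
        interval: Optional[Tuple[float, float]] = None        # optional confidence interval bounds
        extra: Dict[str, Any] = field(default_factory=dict)   # free-form supplementary metadata
        dataset: Optional[str] = None                         # reference to the dataset used
        model: Optional[str] = None                           # reference to the model used
        created_at: datetime = field(default_factory=datetime.now)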

3.1.5. Cases

The XAIB evaluation framework is based on the Co-12 properties proposed in the work of Nauta et al. [8]. Their work is an attempt to gather and systematize the aspects on which XAI methods could be evaluated. They argue that interpretability is not a binary property and should not be validated with anecdotal cases alone. Thus, they propose the framework of 12 properties, where each property represents some desired quality of the method. Namely, these are as follows: Correctness, Completeness, Consistency, Continuity, Contrastivity, Covariate Complexity, Compactness, Composition, Confidence, Context, Coherence, and Controllability. This framework of properties was chosen mainly for its completeness; it is likely to cover most of the ways in which one can measure the quality of explanations.
To implement the idea of a property that can also be measured in one or more ways, the benchmark features cases. Each case is a representation of one of the properties. If a metric is implemented in the benchmark, it should be attached to a case to signify that it measures the property. Cases function not only as containers for metrics but also as evaluation units: they are where the created metrics are evaluated. The benchmark has a collection of predefined cases for the Co-12 properties that already feature corresponding metrics, but cases are not limited to these. Using the add_metric method, users can create custom evaluation units or extend existing ones.
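As a hedged sketch of this extension path, the snippet below attaches a placeholder custom metric to a predefined case. Only the add_metric call is taken from the description above; the import paths, the case name "compactness", and the metric attribute and method names are illustrative assumptions rather than the exact XAIB API.

    from xaib import Metric                  # assumed import path
    from xaib.evaluation import CaseFactory  # assumed location, by analogy with DatasetFactory

    class ZeroMetric(Metric):
        # Hypothetical placeholder metric that always reports 0.0; the
        # attribute and method names (name, direction, compute) are assumptions.
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.name = "zero_metric"
            self.direction = "up"

        def compute(self, explainer, *args, **kwargs):
            self.value = 0.0
            return self.value

    # Attach the custom metric to a predefined case for one of the Co-12 properties.
    case = CaseFactory().get("compactness")
    case.add_metric(ZeroMetric())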

3.1.6. Factories

There are also special interfaces that help users perform basic operations without having to dive deep into the documentation or implementation details of what they are using. These interfaces are factories, setups, and experiments. They are the tools that shift the trade-off between complexity and ease of use.
Factories, as the name suggests, provide a uniform interface for creating XAIB entities. They are used to create datasets, models, explainers, and cases without having to input any parameters except their names.
The main use case for this entity is the use of default factories. They can be found in the benchmark evaluation module, which is filled with already parameterized constructors of datasets and models. Since every configuration is already made, it is easy to create a default instance using a factory.

3.1.7. Experiments

There are no special objects to handle the experiments themselves, but there are default procedures that take a case, explainers, and other parameters, run evaluations, and save the results to disk. Similarly to factories, they represent a default experiment run. The XAIB entities can be used without such a utility; however, it is deemed helpful for users to easily compute and save evaluation results.
In its internal workflow, it is a decorator that wraps a case. It accepts a folder to store experiment results, a list of explainers to evaluate, and arguments for the case’s evaluation method, such as the batch size, for which a default is provided. After creating a case, it iterates over the given explainers, providing each with the case to evaluate. All metadata for the case are then saved to the provided folder in a structured manner using JSON files. The metadata contain an entry for every metric inside the case, which in turn records its dataset, model, and parameters. This traceability feature spans the whole benchmark system and allows entities and their parameters to be tracked, helping demonstrate how they influence the metrics.

3.1.8. Setups

Evaluation is a very high-dimensional problem where for every entity there are dozens of options, and the need to test all of those combinations creates a combinatorial explosion.
Some cases, in turn, require testing on only a subset of those options. This is where the XAIB setups can be used. The setup is a convenience class for creating enumerations of all objects included in the benchmark. Factories can contain a large number of defaults, and sometimes it is easier to specify which models, for example, should not be included in the overall evaluation. The setup accepts the factories for datasets, models, explainers, and cases upon creation, as well as the specifications of which entities should be created later. It is not responsible for initialization; rather, it is just a formal way of setting up an experiment. At the time of use, the setup merely returns the names of entities without creating them.
Specifications are set using the keywords datasets, models, explainers, and cases. They can be set using a list of names, enumerating all the entities that users need to include. For the default behavior and ease of use, there is also a special value, all, that frees the user from the need to input all available entities manually. If “all” is used, the setup can also exclude some of the options with the <entity>_except keywords that allow listing the options to exclude for each entity.
This interface allows for the flexible customization of the experimentation process and a readable way to specify what is inside each evaluation. The following Listing is a code excerpt showing how setups can be used in practice.
Listing 1. Setup example.
    factories = (
        DatasetFactory(), ModelFactory(), ExplainerFactory(), CaseFactory()
    )
    setups = [
        Setup(
            *factories,
            datasets=["iris", "digits"],
            models_except=["knn"]
        )
    ]

    for setup in setups:
        for dataset in setup.datasets:
            ...
It is important to note that the abstractions around evaluation do not take away the user’s ability to work with lower-level entities. They provide an easier way to work with defaults, and when users want to change default values, they can create entities manually. These abstractions were built so that users who are only interested in obtaining metric values do not need to be familiar with the benchmark internals or with the interfaces of each entity.

3.2. Modules and Dependencies

Working with different providers without relying on each of them specifically requires a way to deal with dependencies. By default, the XAIB does not require any of the explainers’ or models’ packages. This is a purposeful design decision to make the setup easier for those who do not need every explainer or every model. Unlike in other XAI benchmarks, dependencies are handled per module: if an explainer or model requires additional Python packages, they are installed separately. For a complete installation, the XAIB provides a file with all required dependencies that can be installed with a single pip install command.
The XAIB Python module is divided into a number of submodules that serve different purposes.
The design of the modules was considered important for several reasons. Modules are mainly used for handling external dependencies at import time and for ensuring clear imports on the user side. The submodule structure defines the point at which an external dependency is imported. The main idea was to isolate submodules so that users are not required to install dependencies they do not need, as long as they do not use certain functionalities. There is also import convenience to consider: when using the code, users should easily understand the project’s structure, which drives the choice of module names and their hierarchy. There is a certain trade-off between these two aspects, and in the case of the XAIB, the choice was made in favor of dependency isolation while still trying to build a comprehensible structure.
Most of the modules collect implementations of different entities and group them into one submodule.
There is base, which is a module for every base class. It has no internal dependencies, but almost every submodule depends on it.
There are entity submodules, namely datasets, models, explainers, metrics, and cases. Modules named datasets and models include the default implementations of the corresponding entities.
Modules with the names explainers, metrics, and cases do not include their entities directly, but are divided into submodules by the type of explanation. Currently, there are feature_importance and example_selection sections inside these modules. Other similar modules that collect entities do not need such division since they do not depend on the explanation type.
There is also a special section called evaluation which is not a module in a strict sense but a container for evaluation workflows and everything that helps them. It was also split into submodules for each type of explanation, but it has common tools for each of them. That module contains factories for datasets, models, explainers, and experiments. This is also where setups are implemented.
The modules mentioned create the structure of the benchmark. There are also additional modules in the project for common code, documentation, and tests.

3.3. Extensibility

Extensibility is a crucial aspect of the XAIB design that dictates most decisions.
The main idea is a benchmark that does not act as an arbiter of quality but rather facilitates research in quality estimation and provides a platform for experiments. Many traditional benchmarks are implemented in the following fashion: there are train and test datasets, the first of which is published while the second is kept secret. In this scenario, a benchmark accepts a solution in a predefined format, evaluates it on its own using the test dataset, and then places the solution on a leaderboard. All of these measures prevent competitors from overfitting or cheating by using test data.
Most existing explainability benchmarks are designed in a similar way. The data may be open, but this centralized design remains unchanged. Since the XAIB features a multitude of metrics and evaluation setups in general, it is difficult to “overfit” on every setup when designing an explainer. This is why the XAIB does not work like a regular benchmark, and this is why extensibility and unified interfaces are important for it. It allows for the creation of various ways to measure the quality of explainers without rewriting any existing entities or evaluation code.
One of the main features that facilitates this modular design is the way the interfaces work. XAIB entities are designed using the inversion of control principle: they receive as much configuration information as possible at the initialization stage rather than at the time of actual usage. For example, explainers accept both a test dataset and a model at initialization time rather than receiving them at the time of the actual explanation. Cases obtain a dataset, a model, and an explainer and can then be evaluated without specifying this information again. With this design, the whole evaluation experiment can be transparently created and configured in one place and then passed to the evaluation pipeline. Creation can be handled by the benchmark itself, which is what factories are for, or by the user for full control over parameters. The created evaluation pipeline can then abstract all implementation details: it simply runs an evaluation and records the results.
Further development is expected, not only by project maintainers but also with the help of the XAI community. Thus, the extensible design of the XAIB becomes not only an engineering solution for handling complexity but also a choice to be open and community-driven.

3.4. Versioning

It is hard to overestimate the importance of the thorough tracking of changes and the versioning of evaluation software. Missing versioning can cause reproducibility issues. For example, a researcher installs a copy of the evaluation software and obtains metric values for their method. Later, when the evaluation procedure is updated, another researcher evaluates their method using a newer copy. This situation leads to the two researchers reporting inconsistent results. Without an indication of the versions, these inconsistencies may remain unnoticed for a long time when different methods are compared.
Semantic versions provide a meaningful reference, indicating not only what copy was used but also what kind of changes were made. This is why the XAIB employs semantic versioning to enable reproducibility and prevent hidden results inconsistencies. Researchers are encouraged to report the full version number with evaluation results when using the benchmark.

3.5. Users and Use Cases

Since users are considered the main focus of the XAIB, it is important to understand the interests of possible users, which is why this question is considered separately. It was decided to divide users into several groups based on their interests: Researchers, Engineers, and Developers. The names do not define the groups themselves but serve as convenient associations for better explanation. All of these groups have different needs that should be addressed with proper tools within the benchmark. Their particular roles are less important than their goals when using the benchmark. A wider audience interested in XAI will also be taken into consideration, but it will not be treated as a separate group. The results of the user analysis are summarized in Figure 2. For the sake of simplicity, not every use case is presented in the diagram; instead, some of them are grouped together.
Figure 2. Use case diagram with groups of users. Arrows represent interactions with different components of the XAIB. Each group has different goals; therefore, their interactions are different. Developers contribute new functionalities and entities. Researchers and Engineers interact in a similar way but have different goals. Researchers propose their own method; for them, setup is a variable. When Engineers select a method for their own task, for them, the method is a variable.

3.5.1. Users

The Researchers group consists of people who are working on their own XAI solutions. The development of an explainer may be at different stages, and at each stage, they can benefit from a benchmark. In the early stages of development, prototypes can be quickly assessed and compared, and different ideas can be tested against several metrics. Low entry requirements are beneficial at this stage, making it possible to quickly integrate the benchmark into project operations. Users at these stages of development need a way to effortlessly test their own solution without rewriting it or going through a complicated process to obtain measurement results.
In the later stages of development, as well as for XAI solution maintainers, there is a need for comprehensive evaluation and comparison with other existing methods. To satisfy those needs, a benchmark should offer the ability to run a full evaluation without the need to manually recreate complex setups, as well as the ability to obtain the results of other methods for comparison.
In the context of evaluation, the ability to plug in one’s own data or model may also be important for people who develop their methods, as they may already have some datasets and models for debugging purposes. If the same data could be used for evaluation, it would help researchers better understand their method in a specific context.
People who intend to reproduce the evaluation results are also attributed to this category. Public benchmark results should be reproducible to address this requirement.
To meet the requirements of Researchers, a benchmark should make it possible to quickly evaluate a method without rewriting existing code, possibly using custom data and models, to perform a full evaluation and comparison with other methods, and to reproduce published results. For a quick start, it should feature clear installation instructions, explicitly defined dependencies, and an easy way to run experiments.
Engineers in our categorization scheme are the people who use XAI methods in their own machine learning tasks. Their most important goal with a benchmark could be choosing the most suitable method for their specific task. In various fields where machine learning is employed, the explainability criteria may be different. A benchmark should feature different measurements covering different facets of explainability that may be less or more important depending on the specific application.
Engineers choosing the most suitable solution may want to see the results of all measured methods on a benchmark’s data, but the ability to test several methods on their own setup is what sets the XAIB apart from existing solutions. Given the multitude of measurements a benchmark could have, Engineers may also need a guide for the properties and metrics. They need the ability to identify key properties for their applications and see how different measures can express those properties.
To meet the requirements of Engineers, a benchmark should have a customizable evaluation, published results, and information about properties and metrics it features. For trying out different methods, it should also provide a common, unified interface for each of them.
The group named Developers is made up of people who seek to broaden the picture of XAI research. Their interaction with a benchmark revolves more around development than usage. They may want to propose new ways to measure the quality of explainers or extend a benchmark in some other way, for example, by adding new types of methods to be measured and the corresponding metrics for them. This group of users should have easy access to the source code, documentation, and other materials that will help them effectively grasp the structure of a benchmark and contribute to its development. A benchmark should be documented and extensible enough to meet the needs of Developers.
A broader audience in the XAI or general ML field may be interested in finding out what comparison criteria are used, which methods are currently performing better, and other details. Their interests should also be considered, since some of these people can eventually join one of the three groups above.
In this section, users and their possible interests were discussed. Based on that premise, the main requirements were identified. In the next section, this information will be used to show how the XAIB satisfies these requirements.

3.5.2. Use Cases

The use cases presented in this section are based on the potential users identified in the previous one. The use cases are divided into groups depending on how users can interact with the benchmark and its main functions.
The first group of use cases is the utilization of the benchmark. This includes reproducing experiments to verify published metric values. There are also different kinds of evaluations: a single method on a specific setup, a method against the full set of metrics, several methods on some setup, etc.
To reproduce the experiment results on the full setup using the XAIB, one needs to install it and run an evaluation. The benchmark provides installation instructions on its documentation website and GitHub page. The instructions themselves are not so complex, since the package is available on PyPI and can be installed with a single command. The list of additional dependencies is also available and can be installed in the same fashion for users wanting a full setup. After installation, users can run the default evaluation scripts. The evaluation itself is separated into modules by explanation type. For each type of explanation, the results are written in files along with the visualizations in the same directory as the evaluation script.
Reproducing results is important, but for users who want to test a specific existing method or their own method, something less general is needed. For this purpose, users can use the special interfaces mentioned in the Structure section.
In Listing 2, an example of how to use an existing method is offered. The creation and configuration of entities that are not within the scope of this experiment are encapsulated within the benchmark itself so as not to overwhelm the user with additional complexity. Since the particular explainer is the focus of this experiment, the user is responsible for creating and configuring it.
If the user wants to evaluate this method on all of the cases, the previous example can be continued as shown in Listing 3.
The second group is result inspection. The main function of a benchmark is to provide users with a ranking of competing methods. To compare methods in general, users should be able to see the aggregated results for each method. For a more in-depth view, metric values for each method should also be presented separately. Since XAI evaluation is very high-dimensional, it requires special care when visualizing results. For a benchmark, the representation of results is one of the most important aspects.
Listing 2. Existing method evaluation example.
    from xaib.explainers.feature_importance.lime_explainer import LimeExplainer
    from xaib.evaluation import DatasetFactory, ModelFactory

    train_ds, test_ds = DatasetFactory().get("synthetic")
    model = ModelFactory(train_ds, test_ds).get("svm")
    explainer = LimeExplainer(train_ds, labels=[0, 1])
    sample = [test_ds[i]["item"] for i in range(10)]
    explanations = explainer.predict(sample, model)
Listing 3. Full evaluation of an existing method example.
    from xaib.evaluation.feature_importance import ExperimentFactory
    from xaib.evaluation.utils import visualize_results

    experiment_factory = ExperimentFactory(
        repo_path="results",
        explainers={"lime": explainer},
        test_ds=test_ds,
        model=model,
        labels=[0, 1],
        batch_size=10
    )
    experiments = experiment_factory.get("all")
    for name in experiments:
        experiments[name]()
    visualize_results("results", "results/results.png")
The XAIB can help those wanting to inspect evaluation results by using the information published on the documentation website. For a general comparison, there is an aggregated bar plot with all the methods averaged over all setups. The website also provides results for each method per dataset for comparing the influence of datasets on the method’s performance. In addition to that, a full table with all metric values for each combination tested is also published. All visualizations are automatically produced using saved evaluation results.
Aside from the results, the documentation itself provides information on every entity of the benchmark: datasets with general descriptions of their features and links to the sources; models with a categorization based on task and interpretability; and explainers, along with baseline explainers, brief descriptions, and source links. Explainers are divided according to their type. Metrics are divided in the same fashion; their descriptions contain information on the case they belong to as well as other information, such as the direction (the higher the better or the lower the better) and a link to the source code.
The third group of cases is extending the benchmark. For a benchmark to stay up-to-date, it needs to incorporate the latest methods. To grow and extend with the field itself, a benchmark should be easily extensible, even for an external developer. This requires a set of interfaces for each entity and detailed documentation on how to contribute.
Writing their own implementation of an explainer is required not only for users who want to extend the benchmark but also for those wanting to use their own solution that is not featured in the defaults. To satisfy the needs of those users, the XAIB provides documentation written specifically on the topic of implementing new explainers.
Listing 4 demonstrates the process of adding a new explainer. The user is required to set a name and implement a method that is used to obtain explanations.
Listing 4. Example of adding an explainer.
    import numpy as np
    from xaib import Explainer

    class NewExplainer(Explainer):
        def __init__(self, *args, **kwargs):
            self.name = "new_explainer"
            super().__init__(*args, **kwargs)

        def predict(self, x, model):
            # Dummy implementation: return random importance scores,
            # one value per feature for each input item.
            return np.random.rand(len(x), len(x[0]))
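Once implemented, the custom explainer can be plugged into the same evaluation workflow used for the built-in methods. The snippet below continues the earlier listings, reusing test_ds, model, and the parameters from Listing 3; it is only a sketch of how such an evaluation could look.

    # Evaluate the custom explainer on all cases, mirroring Listing 3.
    experiment_factory = ExperimentFactory(
        repo_path="results",
        explainers={"new_explainer": NewExplainer()},
        test_ds=test_ds,
        model=model,
        labels=[0, 1],
        batch_size=10
    )
    experiments = experiment_factory.get("all")
    for name in experiments:
        experiments[name]()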

4. Experimental Evaluation

This section provides a description of how experiments are conducted within the benchmark. First, it describes the setup needed for the experiments and then briefly describes the metrics used and the results obtained. Since the focus of this work is not the set of metrics, the description is very concise. In the final part, the novel property of the benchmark is described.
Experiments serve as a central evaluation workflow that is embedded in the benchmark structure as a separate module. It is implemented as a set of scripts using XAIB tools to evaluate every compatible combination of dataset, model, explainer, and metric. The main goals of building this default workflow are reproducibility and automation in obtaining complete results. However, the default workflow is just one of the many possible ways of using the XAIB to evaluate XAI methods, and users are encouraged to create their own procedures.
Experiments are divided into two parts according to explainer type: feature importance and example selection. For each of these groups, its own set of metrics was chosen for the different properties.
It is important to note the role of experiments and metrics in this work. Although metric values and concrete measurement details are at the core of the evaluation, in this paper, they are intentionally omitted to keep the focus of the work on the evaluation infrastructure itself. The experiments in this work demonstrate the capability of the benchmark to be extended, enabling the evaluation of various types of explainers. This is the reason some details about metrics may be omitted in the following sections. However, since the role of metrics is still crucial, they have a separate section on the benchmark’s documentation website.
In this section, the results of the experiments will be discussed. First, the setup will be briefly described, a set of metrics will be outlined, and finally, the actual metric values will be presented.

4.1. Experiment Setup

Corresponding to the different types of explanation methods, two experiment runs were conducted: one for feature importance and the other for example selection methods. In the following sections, the experiments are described, and their results are interpreted and discussed. To set up an experiment with the given set of metrics in the benchmark, one needs to provide a dataset, a trained model, and an explainer. Numerous combinations of datasets and models exist, but datasets, models, explainers, and cases are not always compatible with each other. To avoid invalid comparisons, incompatible combinations were manually excluded from the list using the setup functionality.

4.1.1. Datasets

All datasets that were included in the experiments are commonly known among machine learning researchers and are available through the scikit-learn library interface [23]. At this stage of benchmark development, only small datasets were included so that experiments could be run quickly during development, without the need for special treatment or hardware for large-scale datasets. This principle applies to models as well. Since the benchmark was built to be extensible, larger datasets and models will be integrated further along in the development process.
On the side of the benchmark, a special sklearn wrapper was implemented to adapt the interfaces. It accepts the name of a dataset from the library and the name of the split, either “train” or “test”. It then fetches the requested dataset, splits it, and returns the wrapped result. Wrapping datasets is crucial because of the differences between the interfaces. In the XAIB, datasets are represented as collections of data points that are accessible by index, and each data point is a dictionary whose keys represent the names of the columns or channels inside a dataset. The special wrappers for the sklearn datasets transform the raw numpy arrays returned by the library into the interface described above.
The datasets that were used in all experiment runs are the following: breast_cancer [24], digits [25], wine [26], iris [27], and two synthetic datasets named synthetic and synthetic_noisy that were generated with custom parameters.
The dataset named synthetic was generated using the standard scikit-learn method. Artificial data are crucial in these experiments for effectively debugging every method, since they put methods and models under controllable conditions. For the experiments, a dataset of 100 samples with 14 features was generated for the binary classification task. The feature values are both positive and negative, and all features were assigned to be informative. No feature interactions or repeated features were inserted. The points form two clusters, one for each class.
The dataset named synthetic_noisy was introduced to assess how robust the methods are. It was generated in the same fashion as the regular synthetic dataset but with different settings: 100 rows with 14 features each, of which 7 are informative, 5 are redundant, and 2 are repeated. Each class has two clusters. Each dataset was split using ratios of 0.8 and 0.2 for the training and evaluation sets, respectively.
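For illustration, the snippet below shows how a dataset with the characteristics described for synthetic_noisy could be generated with scikit-learn; the exact parameters used in the XAIB (random seed, class separation, and so on) are not specified here and are therefore assumptions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # 100 samples, 14 features: 7 informative, 5 redundant, 2 repeated,
    # two clusters per class, binary classification.
    X, y = make_classification(
        n_samples=100, n_features=14,
        n_informative=7, n_redundant=5, n_repeated=2,
        n_classes=2, n_clusters_per_class=2,
        random_state=0
    )
    # 0.8/0.2 split into training and evaluation sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)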

4.1.2. Models

The models used include an SVM-based classifier and a neural network for the feature importance experiments and K-nearest neighbors for the example selection ones. SVM and NN are considered black-box models, which is why they were chosen as the first models for evaluation. Similarly to the datasets, large-scale models were not included so as not to overcomplicate initial benchmark development, but they can be added later in the development and expansion. All models were initialized with the default library configurations.

4.1.3. Explainers

The compared methods include both baselines and real feature importance methods. The real methods are represented by shap [28] and LIME [29]. Shap uses the default method provided by the library with the default configuration. LIME was also used without any custom setup, with default values for the number of samples and distance metric.
In the case of example-based methods, KNN with different distance measures was used. KNN is an interpretable model and is used both as a model and as an explainer for itself.
It is important to note that every explanation method outputs a different distribution of values. Shap, for example, produces both positive and negative importance scores, indicating different contributions of features. To obtain comparable metric values for a correct comparison, all explanations were normalized for each batch: the minimum value was subtracted from each batch, and then all values were divided by the range of values. If the range is zero and all values are zero, the vector remains unchanged; if all values are equal, the all-ones vector is returned.
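A minimal sketch of this per-batch normalization is given below; it follows the description above and is not the exact XAIB implementation.

    import numpy as np

    def normalize_batch(batch):
        # Shift by the minimum and scale by the range so values fall in [0, 1].
        batch = np.asarray(batch, dtype=float)
        lo, hi = batch.min(), batch.max()
        if hi == lo:
            # Zero range: keep an all-zero batch as-is, otherwise return all ones.
            return batch if lo == 0 else np.ones_like(batch)
        return (batch - lo) / (hi - lo)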
Baselines are a very important part of the evaluation because they can serve as an estimate of the best, average, or worst theoretical case for a metric. The constant explainer was configured to always return dummy results in the form of all-ones vectors. Furthermore, to test the normalization, the random baseline was set to return values from an arbitrary negative range, from −25 to −5. Example selection baselines are different from feature importance ones: the constant one always returns the first element of the dataset, and the random one returns some random example for each call. Baselines are important for assessing the utility of metrics, as they can serve as sanity checks for them. A metric can be considered useful if it measures the desired property; so, when it is known with certainty that the property is not satisfied, the metric should indicate this by showing a difference between the dummy baseline and a real method.
After the preparation of datasets and training models, all metrics were computed for every combination of dataset, model, explainer, and case. For each type of explanation, separate experiment pipelines were carried out with common models and datasets. The version of the benchmark that is considered is 0.4.0, which was the latest at the moment of writing this article. All entities currently present in the XAIB are presented in Table 1.
Table 1. Summary of entities present in XAIB. The set of entities is not fixed by design and was created to support proof-of-concept with the intention to expand the number of available options. For the abbreviations of metrics, see Section 4.2.

4.2. Metrics

One of the main areas in which the XAIB contributes to the evaluation of XAI is putting the clear evaluation system proposed in the work of Nauta et al. [8] into practice. The Co-12 framework is a set of desirable properties for an explainer that can be used to evaluate and compare different methods.
Since the focus of this work is on the benchmark itself, the details of the metric implementations are left outside its scope, but they are still available in the documentation; their importance is well understood.
The results of the experiments still require explanations for the meaning of the metrics that were computed, and this is what this section is devoted to.
Not all of the 12 properties were covered for every explainer type during the implementation of metrics; some of them may be inapplicable, and some have simply not been covered yet. In what follows, only the properties currently addressed are described.
Correctness, for example, was measured for both types of explainers in different ways with a metric called the model randomization check (MRC). A Correctness metric should indicate that the explanations describe the model correctly and truthfully. The explanations may not always seem reasonable to the user, but they should be true to the model to satisfy this criterion.
Continuity is measured for both types with a small noise check (SNC), which measures the stability of the explanation function. Continuous functions are desirable because they are considered more predictable and comprehensible.
A Contrastivity measure should show how discriminative the explanation is with respect to different targets. The contrast between different concepts is very important, and the explanation method should explain instances of different classes in different ways. For this benchmark, we implemented label difference (LD) and target discriminativeness (TGD) for the feature importance and example selection types, respectively.
A Coherence metric should show to what extent the explanation is consistent with the relevant background knowledge, beliefs, and general consensus. The agreement with domain-specific knowledge can be measured, but it can be difficult to define and is very task-dependent. The measure can be proxied using the agreement between different methods. It is represented by a metric called different methods agreement (DMA) for feature importance and a same class check (SCC) for example selection.
Compactness measures the size of the explanations. Explanations should be sparse, short, and not redundant. The size can be measured directly in some cases. In the case of feature importance, the size is always the same, but the sparsity differs; this is why Compactness was measured through sparsity (SP) for feature importance.
Covariate Complexity means that the features used in the explanation should be comprehensible, and non-complex interactions between features are desired. It is measured as covariate regularity (CVR) for both types.
For the precise formulations and implementations of all the metrics and up-to-date information, the reader is encouraged to visit the documentation.
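To illustrate the general spirit of such measurements, the sketch below compares explanations on original and slightly perturbed inputs, in the manner of a continuity check. The noise scale, the distance measure, and the aggregation shown here are assumptions and do not reproduce the XAIB implementation of SNC.

    import numpy as np

    def continuity_check(explainer, model, batch, noise_scale=0.01):
        # Explain the original batch and a slightly perturbed copy of it.
        batch = np.asarray(batch, dtype=float)
        noisy = batch + np.random.normal(0.0, noise_scale, size=batch.shape)
        original = np.asarray(explainer.predict(batch, model))
        perturbed = np.asarray(explainer.predict(noisy, model))
        # Average distance between the two sets of explanations;
        # lower values indicate a more stable explanation function.
        return float(np.mean(np.linalg.norm(original - perturbed, axis=-1)))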
Although metrics were implemented from scratch, they are mostly very similar to the existing solutions. Measures of Correctness, Continuity, Contrastivity, and Coherence are well known. Compactness and Covariate Regularity may be more novel.
In this paper, the description of the metrics is intentionally not provided to emphasize a key idea. We believe that the benchmark is more than a set of metrics. The initial proof-of-concept features a set of metrics with the main intention of demonstrating the capabilities of the XAIB. Metrics are a central part of the benchmark, but they are outside the scope of this work. The work itself focuses on how to provide any metric to the user.
The development of new metrics is an area of research in itself and the detailed description of metrics is worth a separate paper. The flexibility of the XAIB should help facilitate further research in this area. The benchmark and its evaluation process should not be limited to the set of metrics mentioned; it is designed to be further extended to cover more properties. The set of properties already covered and the implementation details of the metrics may not be clearly justified in this paper due to scope limitations.
The properties themselves should not be considered covered when only one metric is implemented for them. No metric is perfect for every use case, and a more complete picture of a single property can be obtained when it is measured using different metrics.
Considering this, the set and formulations of metrics will always evolve with time, and the implementation of new metrics, or of metrics already known to the community, represents a direction for the future development of the XAIB.

4.3. Experiment Results

In Table 2, the results of the tested feature importance methods are presented for each metric. Bold represents the best result among the real methods; the baselines are excluded. The arrows represent the direction of a metric: up means “the greater the better”, and down means “the lower the better”.
Table 2. Evaluation results for the feature importance methods. The score for each method is averaged across datasets and models. Arrows represent the direction of the metric as follows: the greater the better (↑) or the lower the better (↓). The best performing method is shown in bold.
On the correctness metric, no real method performed better than the random baseline, which means that both real methods are not as true and sensitive to the model as they may seem. However, random explanations have a natural advantage when it comes to representing a random model, so this result is expected.
The method that is best for continuity is the constant explainer, because its explanations never change under any noise, but shap is notably very close to that value as well. Shap is also the most contrastive method. It is suggested that this is because it attributes positive or negative influences of different features on the label.
LIME is the simplest method in terms of covariate complexity; it generates more understandable vectors and is also more sensitive to the model.
The coherence values of the two methods are very close. This metric, in the way it is computed, benefits from more methods to form some sort of “common sense”, so there should be more methods to gain confidence in those values.
The results of the example selection methods are presented in Table 3. KNN-based approaches show the best results on almost all metrics. The most interesting result is the result on target discriminativeness (TGD). This means that the selection of examples from KNN trains a better model than a random selection of them, which may be the basis for further experiments on this matter.
Table 3. Evaluation results for the example selection methods. The score for each method is averaged across datasets and models. Arrows represent the direction of the metric as follows: the greater the better (↑) or the lower the better (↓). The best performing method is shown in bold.
The updated results, along with descriptions of the data, models, methods, and metrics are available on the documentation website. The development is active, which means that new entities will appear with each release, further broadening the XAI evaluation landscape.

4.4. Method Comparison

In terms of evaluation, one of the most important aspects that sets the XAIB apart from existing benchmarks is its ability to compare explainers in different contexts.
In most existing XAI evaluation solutions, this is not the case. Datasets and models are treated as constants in evaluation experiments and are usually embedded or hardcoded. This is acceptable for research and the theoretical study of explanation quality. However, when it comes to the evaluation of explainers for specific applications, where the models and datasets differ from those in the evaluation scenarios, this scheme is not flexible enough.
In the XAIB, the nature of most metrics makes data and models pluggable into the measurement process. Since the dataset and the model are variables, this easily enables, for example, experiments where one of them is varied while the other stays constant. These and other experiments, including evaluations on a specific scenario, also become available.
Using the terms established earlier, previous works on benchmarking XAI methods focused on the needs of Researchers, but not those of the Engineers. With the benchmark proposed, Engineers are now able to use their own models and data to evaluate several explainers on their own setup and figure out what works best for the application.
The following example demonstrates how different metric values can be obtained when the dataset–model pair changes while the explainers stay the same. In Figure 3, the results of shap and LIME with the SVM classifier trained on the breast_cancer dataset are shown against all metrics. Metric values were not taken as-is; they were transformed for visualization purposes only. They were normalized to the range [0, 1], and the ones with a downward direction (the lower the better) were multiplied by −1. This transformation makes all metrics comparable in one graph and gives them an upward direction. Taking these results into account, shap achieves better metric values than LIME on almost all metrics. In this case, if there are no preferences for some set of metrics, one could choose shap over LIME.
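A minimal sketch of this visualization transform is shown below, assuming each metric is rescaled across the compared methods; the function and variable names are illustrative.

    import numpy as np

    def rescale_for_plot(values, direction):
        # Scale one metric's values across methods to [0, 1]; metrics with a
        # "down" direction (lower is better) are multiplied by -1 so that
        # higher always means better in the plot.
        values = np.asarray(values, dtype=float)
        lo, hi = values.min(), values.max()
        scaled = (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)
        return -scaled if direction == "down" else scaled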
Figure 3. Results on the first setup—SVM—on the breast cancer dataset. Metric values are normalized for visualization. Each line represents a single explanation method. In this setup, shap outperforms LIME on the majority of metrics.
However, if the dataset–model pair changes (assuming that another task is solved), the results change, as shown in Figure 4. The methods change places, and LIME becomes better on the majority of metrics, so if there are no preferences among metrics, one could choose LIME for this application. This example demonstrates how the XAIB can facilitate experiments for Researchers and enable setup-specific evaluation for Engineers at the same time.
Figure 4. Results on the second setup—NN—on the synthetic_noisy dataset. Metric values are normalized for visualization. Each line represents a single explanation method. In this setup, LIME outperforms shap on the majority of metrics.

5. Discussion

In this section, a detailed comparison will be made, and the limitations of the benchmark will be discussed. The proposed solution has limitations both in the evaluation metrics and in the software itself. After that, possible directions for future research will be proposed.

5.1. Comparison

In order to highlight the contribution of this work, this section provides a comparative analysis of the proposed benchmark against the currently existing solutions. Flexibility, extensibility, and usability are the main focus of this analysis, although other criteria, such as interpretability and property coverage, were also considered. Table 4 summarizes the overview of the different aspects of the existing XAI benchmarks and also includes the one proposed in this work.
Table 4. Comparison table of existing XAI benchmarks.
The flexibility of an XAI benchmark can be achieved in many ways. Given the diversity of the ML landscape with different data types, tasks, models, and ways to explain them, a flexible benchmark should be able to cover as much as possible.
However, current solutions do not provide such flexibility in terms of tasks, models, and especially explanation types. Most of the works do not go beyond tabular data classification, and all of them focus solely on feature importance methods. The XAIB, in turn, provides an example of a flexible benchmark, offering the ability to evaluate not only feature importance explainers but also those that use examples as explanations. It can be argued that the only data type–task pair implemented in the XAIB is also classification on tabular data, but there is one crucial difference: the XAIB was built with different tasks in mind. Although tabular data classification was the first choice for obvious reasons, other data types and tasks can be added without rebuilding everything from scratch. The applicability row in Table 4 illustrates this clearly: most of the other solutions can only use predefined sets of datasets, models, explainers, or tests.
Extensibility is closely related to the previous property. An extensible benchmark should not only allow the evaluation of different explainers (this is the bare minimum) but should also provide means of experimenting with different data, models, and metrics.
The applicability and documentation rows in Table 4 suggest that some solutions offer the ability to evaluate custom explainers; for example, OpenXAI, Compare-xAI, and M4 have this extensibility option. Other extension directions, however, are not currently supported. For example, users who want to evaluate an explainer on their own dataset and model cannot use OpenXAI or Compare-xAI. They can leverage M4 if their data are images or text and their model is a neural network; in other cases, those users have no options.
In this regard, at present, only the XAIB provides a full set of extension directions. It gives the user the ability to experiment with their own implementations of datasets, models, explainers, and metrics. As long as interface compatibility is ensured, users can combine existing entities with custom ones, which is supported not only by the code but also by detailed documentation.
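As a rough illustration of this extension path, the sketch below defines a user-side dataset that conforms to a minimal, assumed interface; the `Dataset` base class and the dictionary keys are hypothetical and do not necessarily match the XAIB code.

```python
import numpy as np


class Dataset:
    """Minimal interface a pluggable dataset is assumed to expose."""

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError


class MyCsvDataset(Dataset):
    """User-defined dataset backed by a CSV file with the label in the last column."""

    def __init__(self, path):
        table = np.genfromtxt(path, delimiter=",", skip_header=1)
        self.X, self.y = table[:, :-1], table[:, -1]

    def __getitem__(self, index):
        return {"item": self.X[index], "label": self.y[index]}

    def __len__(self):
        return len(self.X)
```

Any object that answers the same calls as the built-in datasets can then be combined with the existing models, explainers, and metrics.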
Usability is an important criterion in the open source community. The benchmark should be open, providing documentation and a results table, and it should not be difficult to set up and run experiments.
Almost every existing benchmark provides some documentation that highlights different aspects of its usage. Mostly, this covers the basics, for example, how to reproduce the results. Usually, there is no information on how to submit a new method or contribute to the development in any other way. Providing up-to-date results is also an important and seemingly underrepresented feature, and the ability to easily install and use the package is missing from most of the solutions.
The proposed benchmark provides detailed documentation covering not only the reproduction of the results but also usage with custom entities, with separate sections on how to contribute. Users can also install the XAIB with a single command, making it much more accessible to users from different scientific backgrounds who may find manual installation difficult.
Table 5 shows which of the Co-12 properties are measured by the metrics of the existing benchmarks. The coverage was analyzed by attributing each metric presented in the respective papers to the property whose definition best fits it. Although the benchmarks feature numerous metrics that should, in principle, facilitate a comprehensive evaluation, they cover only a narrow set of properties. Moreover, the particular set of metrics chosen in each work is rarely discussed or justified, and the completeness of the chosen metrics is usually not brought up. Analyzing the proposed metrics through the lens of one of the most complete systems of properties available makes their coverage much clearer.
Table 5. Co-12 property coverage of existing XAI benchmarks. Properties do not correspond 1:1 to metrics. Each property can be measured in various ways. Existing XAI benchmarks can have a multitude of metrics while measuring only a limited amount of properties. Bold text marks properties that are unique contributions of the XAIB.
For example, the saliency eval benchmark provides the Human Agreement metric, which is considered to belong to the Context property; Faithfulness measures Correctness; Confidence Indication is conceptually very similar to Contrastivity; and the Rationale/Dataset Consistency metrics belong to the Consistency property.
The XAI-Bench is very narrow in terms of properties; however, it goes deep into the measures of Completeness: Faithfulness, ROAR, Monotonicity, and Infidelity are all different measures of this property and can reflect different aspects of it. GT-Shapley is likely to measure Coherence, since this ground truth can reflect some form of “common sense”.
OpenXAI, although featuring 22 metrics, does not cover as many properties. Its GT Faithfulness, similarly to the previous case, is a measure of Coherence. Aside from that, the metrics are divided into two groups: Faithfulness as a measure of Correctness and Stability as a measure of Continuity.
Compare-xAI provides a multitude of tests divided into six categories: Fidelity and Fragility as measures of Correctness; Stability, which, as in OpenXAI, is a metric of Continuity; Simplicity, which is likely to measure Coherence; and Stress and Other, which relate to the Context property.
M4 features five metrics, but they are all different measures of Faithfulness, which in our terminology is named Correctness.
Analyzing the table according to the classification made by Nauta et al., the properties categorized as Content (Correctness, Completeness, Consistency, Continuity, Contrastivity, Covariate Complexity) are the most covered. Properties in the categories Presentation (Compactness, Composition, Confidence) and User (Context, Coherence, Controllability) are the ones that are poorly covered at this moment.
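As a compact restatement of the attributions discussed above (taken directly from this section, not from the original papers), the per-benchmark coverage can be tallied as follows.

```python
# Property attributions as discussed in this section; "M4" is written without
# a superscript here for simplicity.
coverage = {
    "saliency eval": {"Context", "Correctness", "Contrastivity", "Consistency"},
    "XAI-Bench": {"Completeness", "Coherence"},
    "OpenXAI": {"Coherence", "Correctness", "Continuity"},
    "Compare-xAI": {"Correctness", "Continuity", "Coherence", "Context"},
    "M4": {"Correctness"},
}

content = {"Correctness", "Completeness", "Consistency",
           "Continuity", "Contrastivity", "Covariate Complexity"}

for name, props in coverage.items():
    print(f"{name}: {len(props)} properties, {len(props & content)} from the Content category")
```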
Considering the comparison made, the XAIB is more flexible, allowing for deep customization of the evaluation process. It helps widen the scope of XAI evaluation by featuring more types of explanation and by allowing the use of custom datasets and models in the evaluation. It opens new extensibility directions that were previously unaddressed, while taking steps towards usability by introducing clean versioning, dependency management, and distribution.
In addition, it also covers most of the properties, making it the broadest existing XAI benchmark. Although the XAIB implements only six out of twelve properties across different explainer types at the initial stage, all properties can be covered in the future.

5.2. Limitations

Functional ways to assess quality seem to be a good solution. They are very cheap to compute compared to user studies and provide clear comparison criteria. As long as the metrics themselves are interpretable enough, it should be easy to make a decision by comparing numbers. Those may be the reasons why functional evaluation is gaining more popularity at the moment. There is a certain demand to bring XAI evaluation to the standard of AI evaluation. However, this does not seem possible. The main difference between AI and XAI, and the main reason for this impossibility, is the presence of a human. Since interpretability depends solely on human perception, it becomes difficult to formalize, in contrast to performance measures, which are formulated mathematically in the first place and, in essence, are a convenient way to aggregate lots of observations. In addition to that, accuracy (in a broad sense) and its properties are well defined, while interpretability is not, which is a major point of criticism for the field. Considering this, one should approach functional measures with care and perceive them only as guides and proxies of the real interpretability properties they are trying to represent.
All of the above means that when working in high-risk conditions and when building reliable machine learning systems that will have a great influence on human lives, one should not rely solely on the values of quantitatively and functionally obtained quality metrics. For a complete evaluation, human-grounded and functionally grounded experiments are required. Only by using insights from every method of quality measurement can one make an informed decision.
In the following paragraphs, the limitations of the existing implementation are considered. The XAIB was designed to be universal and easily extensible; however, those qualities come with their own set of trade-offs. Analyzing them, the following difficulties were identified: performance issues, combinatorial explosion, compatibility management issues, and the cost of implementation.
Performance issues arise when building systems composed of many independent components. This independence impedes advanced optimizations that could be made if the solution were monolithic and task-specific, and the abstractness of the components requires additional management work, which adds to the performance costs. In addition, machine learning in general tends to be demanding in terms of computation, and handling compute- and memory-demanding solutions creates challenges that should be addressed in the future. The limited set of entities currently featured in the benchmark is an intentional choice, made to ease development by focusing on the main benchmark principles rather than on premature optimizations for demanding solutions.
Considering the number of different entities that make up an XAI benchmark, the evaluation task is high-dimensional. Adding a single new entity creates a number of new combinations with entities of other types, which could potentially lead to an exponential increase in evaluation runs. Although this issue is virtually unavoidable, its computational costs can be mitigated, and performance in this case does not look like a major concern. The management of such a system seems to be the most difficult obstacle to overcome. Each entity, be it a dataset, a model, or an explainer, can be incompatible with the others. For example, a dataset without class labels is incompatible with a classification model, which, in turn, can be incompatible with some types of explainers that do not work with some metrics, and so on. Not only can this property of the evaluation task cause errors when incompatible entities are combined, but the solutions to this problem must also be approached carefully: some of them can introduce a lot of additional complexity, so a good trade-off is needed to ensure ease of use.
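The combinatorial growth and the compatibility issue can be sketched as follows; the entity names and the compatibility rule are purely illustrative.

```python
from itertools import product

# Illustrative entity pools; each new entity multiplies the number of runs.
datasets = ["breast_cancer", "wine", "synthetic_noisy"]
models = ["svm", "knn", "neural_net"]
explainers = ["shap", "lime", "nearest_neighbors"]
metrics = ["correctness", "continuity", "coherence"]


def compatible(dataset, model, explainer, metric):
    # placeholder rule: a real check would consult entity metadata
    return not (explainer == "nearest_neighbors" and metric == "correctness")


runs = [combo for combo in product(datasets, models, explainers, metrics)
        if compatible(*combo)]
print(f"{len(runs)} of {3 ** 4} possible combinations are compatible")
```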
Aside from those difficulties, result aggregation and interpretation also become problems that should be addressed in future research. This brings us to the last limitation mentioned: the implementation cost. When the system is abstract and every entity carries a lot of metadata required for the system to work, it becomes difficult to derive new solutions. Although this is not the case at the current stage, as the benchmark expands, a metadata handling and validation system should be built to ensure efficient compatibility management. Furthermore, it seems inevitable that more weight will be put on the shoulders of new contributors.
Each limitation can be regarded as a challenge to future research and the development of the XAIB.

5.3. Future Work

Future work on the XAIB could focus on the extension of the evaluation landscape and addressing the numerous challenges that come along with this process.
Evaluation can be extended in numerous directions, but filling the gaps in case coverage and in the support of different explanation types should be a priority.
Performance issues when dealing with compute-intensive solutions can be addressed from the benchmark perspective by minimizing the number of calls to heavy algorithms. One way this could be implemented is through extensive caching, for example, storing and retrieving already trained models and caching predictions and explanations between different experiments.
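A minimal sketch of such caching, assuming joblib is used for on-disk memoization; this illustrates the idea only and is not how the XAIB currently handles caching.

```python
from joblib import Memory
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

# Results are pickled to disk and reused across runs with the same arguments.
memory = Memory("./xaib_cache", verbose=0)


@memory.cache
def train_model(C):
    # trained once; subsequent calls with the same C load the fitted model from disk
    X, y = load_breast_cancer(return_X_y=True)
    return SVC(C=C).fit(X, y)


model = train_model(C=1.0)
```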
One of the most important issues—compatibility—could be addressed by implementing a data validation mechanism inside the benchmark that would not allow the use of incompatible entities and would help users test the correctness of their own implementations.
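One possible shape of such a mechanism is sketched below; the metadata fields are assumptions about what entities could declare about themselves rather than an existing XAIB feature.

```python
def validate(dataset_meta, model_meta, explainer_meta):
    """Collect human-readable compatibility problems before running an experiment."""
    problems = []
    if model_meta["task"] == "classification" and not dataset_meta.get("has_labels", False):
        problems.append("a classification model requires a labeled dataset")
    if explainer_meta["explanation_type"] not in model_meta["supported_explanations"]:
        problems.append("the explainer type is not supported by the model wrapper")
    return problems


issues = validate(
    {"has_labels": False},
    {"task": "classification", "supported_explanations": ["feature_importance"]},
    {"explanation_type": "example"},
)
for issue in issues:
    print("incompatible:", issue)
```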
Research can also benefit from a more detailed comparison of XAI benchmarks. Since usability is one of the main focuses of the XAIB, the usability comparison with similar solutions can be deepened.
Future development should focus on expanding the benchmark, addressing the challenges mentioned, and others that will arise as the development progresses.

6. Conclusions

Being a relatively new field, XAI seems to lack common ground on numerous questions. This disagreement ranges from different definitions of central terms to the varying language used to describe similar phenomena, and it does not match the level of responsibility placed on the field. Solutions that emerge as attempts to create evaluation standards have a very narrow scope, either in terms of the types of explanations (almost always feature importance) or of the evaluation metrics.
The benchmark proposed in this work is an attempt to fill those gaps. It is designed to include various explanation types and metrics, to feature interfaces built to be extensible, and to provide documentation covering every aspect of this process. The use of the latest advancements in XAI evaluation enabled us to build the XAIB on a complete framework of interpretability properties.
Providing easier access to the evaluation of explainers, the XAIB aims to become not only a benchmark in the traditional sense but also a platform for evaluation experiments, which is the foundation for further research in XAI.

Author Contributions

Conceptualization, S.K.; methodology, K.B.; software, I.M.; validation, K.B.; formal analysis, I.M.; investigation, I.M.; resources, I.M.; data curation, I.M.; writing—original draft preparation, I.M.; writing—review and editing, K.B.; visualization, I.M.; supervision, K.B.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the Russian Science Foundation, agreement No. 24-11-00272, https://rscf.ru/project/24-11-00272/ (accessed on 29 October 2024).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Goodman, B.; Flaxman, S. European Union regulations on algorithmic decision-making and a “right to explanation”. arXiv 2016, arXiv:1606.08813. [Google Scholar]
  2. Markus, A.F.; Kors, J.A.; Rijnbeek, P.R. The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies. J. Biomed. Inform. 2021, 113, 103655. [Google Scholar] [CrossRef] [PubMed]
  3. Abdullah, T.A.; Zahid, M.S.M.; Ali, W. A review of interpretable ML in healthcare: Taxonomy, applications, challenges, and future directions. Symmetry 2021, 13, 2439. [Google Scholar] [CrossRef]
  4. Molnar, C.; Casalicchio, G.; Bischl, B. Interpretable machine learning—A brief history, state-of-the-art and challenges. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2021; pp. 417–431. [Google Scholar]
  5. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
  6. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
  7. Saeed, W.; Omlin, C. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl.-Based Syst. 2023, 263, 110273. [Google Scholar] [CrossRef]
  8. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; Van Keulen, M.; Seifert, C. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Comput. Surv. 2023, 55, 1–42. [Google Scholar] [CrossRef]
  9. Bodria, F.; Giannotti, F.; Guidotti, R.; Naretto, F.; Pedreschi, D.; Rinzivillo, S. Benchmarking and survey of explanation methods for black box models. Data Min. Knowl. Discov. 2023, 37, 1719–1778. [Google Scholar] [CrossRef]
  10. Huysmans, J.; Dejaeger, K.; Mues, C.; Vanthienen, J.; Baesens, B. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decis. Support Syst. 2011, 51, 141–154. [Google Scholar] [CrossRef]
  11. Kulesza, T.; Stumpf, S.; Burnett, M.; Yang, S.; Kwan, I.; Wong, W.K. Too much, too little, or just right? Ways explanations impact end users’ mental models. In Proceedings of the 2013 IEEE Symposium on Visual Languages and Human Centric Computing, San Jose, CA, USA, 15–19 September 2013; pp. 3–10. [Google Scholar]
  12. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 2018, 31, 9525–9536. [Google Scholar]
  13. Hase, P.; Bansal, M. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? arXiv 2020, arXiv:2005.01831. [Google Scholar]
  14. Zhang, H.; Chen, J.; Xue, H.; Zhang, Q. Towards a unified evaluation of explanation methods without ground truth. arXiv 2019, arXiv:1911.09017. [Google Scholar]
  15. Bhatt, U.; Weller, A.; Moura, J.M. Evaluating and aggregating feature-based model explanations. arXiv 2020, arXiv:2005.00631. [Google Scholar]
  16. Sokol, K.; Flach, P. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 56–67. [Google Scholar]
  17. Atanasova, P.; Simonsen, J.G.; Lioma, C.; Augenstein, I. A diagnostic study of explainability techniques for text classification. arXiv 2020, arXiv:2009.13295. [Google Scholar]
  18. Liu, Y.; Khandagale, S.; White, C.; Neiswanger, W. Synthetic benchmarks for scientific research in explainable machine learning. arXiv 2021, arXiv:2106.12543. [Google Scholar]
  19. Agarwal, C.; Krishna, S.; Saxena, E.; Pawelczyk, M.; Johnson, N.; Puri, I.; Zitnik, M.; Lakkaraju, H. Openxai: Towards a transparent evaluation of model explanations. Adv. Neural Inf. Process. Syst. 2022, 35, 15784–15799. [Google Scholar]
  20. Belaid, M.K.; Hüllermeier, E.; Rabus, M.; Krestel, R. Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark. arXiv 2022, arXiv:2207.14160. [Google Scholar]
  21. Li, X.; Du, M.; Chen, J.; Chai, Y.; Lakkaraju, H.; Xiong, H. M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models. In Proceedings of the NeurIPS, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  22. Naser, M. An engineer’s guide to eXplainable Artificial Intelligence and Interpretable Machine Learning: Navigating causality, forced goodness, and the false perception of inference. Autom. Constr. 2021, 129, 103821. [Google Scholar] [CrossRef]
  23. Scikit Learn. Toy Datasets. Available online: https://scikit-learn.org/stable/datasets/toy_dataset.html (accessed on 29 October 2024).
  24. Wolberg, W.; Mangasarian, O.; Street, N.; Street, W. Wisconsin Diagnostic Breast Cancer Database; UCI Machine Learning Repository. 1993. Available online: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic (accessed on 28 October 2024). [CrossRef]
  25. Garris, M.D.; Blue, J.L.; Candela, G.T.; Grother, P.J.; Janet, S.; Wilson, C.L. NIST Form-Based Handprint Recognition System; National Institute of Standards and Technology: Gaithersburg, MD, USA, 1997. [Google Scholar]
  26. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, J.; Reis, J. Wine Quality. UCI Machine Learning Repository. 2009. Available online: https://archive.ics.uci.edu/dataset/186/wine+quality (accessed on 28 October 2024). [CrossRef]
  27. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  28. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
  29. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
