In this section, a detailed comparison with existing solutions will be made, and the limitations of the benchmark, both in its evaluation metrics and in its software implementation, will be discussed. After that, possible directions for future research will be proposed.
5.1. Comparison
In order to highlight the contribution of this work, this section provides a comparative analysis of the proposed benchmark against currently existing solutions. Flexibility, extensibility, and usability are the main focus of this analysis, although other criteria, such as interpretability and property coverage, were also considered.
Table 4 presents an overview of different aspects of the existing XAI benchmarks and also includes the one proposed in this work.
The flexibility of an XAI benchmark can be achieved in many ways. Given the diversity of the ML landscape, with its different data types, tasks, models, and ways to explain them, a flexible benchmark should cover as much of this variety as possible.
However, current solutions do not provide such flexibility in terms of tasks, models, and especially explanation types. Most of the works do not go beyond tabular data classification, and all of them focus solely on feature importance methods. The XAIB, in turn, provides an example of a flexible benchmark, offering the ability to evaluate not only feature importance explainers but also those that use examples as explanations. It can be argued that the only data type and task pair implemented in the XAIB is also classification over tabular data, but there is one crucial difference: compared to other solutions, the XAIB was built with different tasks in mind. Although tabular data classification was the first choice for obvious reasons, other data types and tasks can be added without rebuilding everything from scratch. The applicability row in Table 4 illustrates this very clearly. Most of the other solutions can only use predefined sets of datasets, models, explainers, or tests.
Extensibility is closely related to the previous property. An extensible benchmark should not only allow the evaluation of different explainers (this is the bare minimum), but should also provide the means to experiment with different data, models, and metrics.
The applicability and documentation rows in Table 4 suggest that some solutions offer the ability to evaluate custom explainers. For example, OpenXAI, Compare-xAI, and
have this extensibility option. Other extension directions, however, are not currently supported. For example, users who want to evaluate an explainer on their own dataset and model cannot use OpenXAI or Compare-xAI. They can leverage
if their data are images or text and their model is a neural network. In other cases, those users have no options.
In this regard, at present, only the XAIB provides a full set of extension directions. It provides the user with the ability to experiment with their own implementations of datasets, models, explainers, and metrics. As long as interface compatibility is ensured, users can combine existing entities with custom ones, which is supported not only with the code but also with the detailed documentation.
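To make this extension workflow concrete, the sketch below shows what wrapping a custom dataset and model behind benchmark-compatible interfaces could look like. It is a minimal illustration only: the class names, method signatures, and the final evaluation call are assumptions made for this example and do not reproduce the actual XAIB API.

```python
# Illustrative sketch only: the interfaces below are hypothetical stand-ins for
# the kind of entities the XAIB combines (dataset, model, explainer, metric),
# not the library's actual API.

import numpy as np


class CustomDataset:
    """A user-provided tabular dataset exposing a minimal item-based interface."""

    def __init__(self, features: np.ndarray, labels: np.ndarray) -> None:
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, idx: int) -> dict:
        # Entities exchange simple dict-based items so they stay interchangeable.
        return {"item": self.features[idx], "label": self.labels[idx]}


class CustomModel:
    """A wrapper adapting any fit/predict estimator to the same interface family."""

    def __init__(self, estimator) -> None:
        self.estimator = estimator

    def fit(self, dataset: CustomDataset) -> None:
        self.estimator.fit(dataset.features, dataset.labels)

    def predict(self, x: np.ndarray) -> np.ndarray:
        return self.estimator.predict(x)


# With compatible interfaces in place, custom entities could be mixed with
# built-in explainers and metrics in a single run, conceptually:
#   evaluate(dataset=CustomDataset(X, y), model=CustomModel(clf),
#            explainers=[...], metrics=[...])
```

Keeping the contract this small is what allows existing and custom entities to be combined freely, as long as every entity honors the expected inputs and outputs.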
Usability is an important criterion in the open-source community. The benchmark should be open, providing documentation and a results table, and it should not be difficult to set up and run experiments.
Almost every existing benchmark provides some documentation that highlights different aspects of its usage. Mostly, this covers the basics, for example, how to reproduce the results. Usually, there is no information on how to submit a new method or contribute to the development in any other way. Providing up-to-date results is also an important and seemingly underrepresented feature. The ability to easily install and use the package is likewise missing from most of the existing solutions.
The proposed benchmark provides detailed documentation, not only on reproducing the results but also on usage with custom entities and separate sections on how to contribute. Users can also install the XAIB with a single command, making it much more accessible to users from different scientific backgrounds, who may find manual installation more difficult.
Table 5 shows which of the Co-12 properties are measured by the metrics of the existing benchmarks. The coverage was analyzed by attributing each metric presented in the respective papers to the property whose definition best fits it. The existing benchmarks feature numerous metrics that should, in principle, facilitate a comprehensive evaluation; however, the particular set of metrics chosen each time is rarely discussed or justified, and the completeness of the chosen metrics is usually not brought up in the papers. When the proposed metrics are analyzed through the lens of one of the most complete systems of properties available, their actual coverage becomes clearer: in terms of property coverage, the existing benchmarks lack variability.
For example, the saliency eval benchmark provides the Human Agreement metric, which is considered to belong to the Context property; Faithfulness measures Correctness; Confidence Indication is conceptually very similar to Contrastivity; and the Rationale/Dataset Consistency metrics could belong to the Consistency property.
The XAI-Bench is very narrow in terms of properties; however, it goes deep into the measures of Completeness. Faithfulness, ROAR, Monotonicity, and Infidelity are all different measures of the same property and can reflect different aspects of it. GT-Shapley is likely to measure Coherence, since this ground truth can reflect some form of “common sense”.
OpenXAI, although featuring 22 metrics, does not cover as many properties. It features GT Faithfulness, which, similarly to the previous case, is a measure of Coherence. Aside from that, its metrics are divided into two groups: Faithfulness as a measure of Correctness and Stability as a measure of Continuity.
Compare-xAI provides a multitude of tests, which are divided into six categories: Fidelity and Fragility as measures of Correctness; Stability, which, as in OpenXAI, is a measure of Continuity; Simplicity, which is likely to measure Coherence; and Stress and Other, which relate to the Context property.
features five metrics, even though they are all different measures of Faithfulness, which in our case is named Correctness.
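For reference, the attribution described in this section can be condensed into a single mapping. The snippet below merely restates the analysis above for the four benchmarks named in the text, using the Co-12 property names; it is our interpretation for illustration, not data shipped with any of the benchmarks.

```python
# Metric-to-property attribution as discussed in this section (Co-12 names).
# This is an interpretation of the respective papers, not an official mapping.
METRIC_TO_CO12_PROPERTY = {
    "saliency eval": {
        "Human Agreement": "Context",
        "Faithfulness": "Correctness",
        "Confidence Indication": "Contrastivity",
        "Rationale/Dataset Consistency": "Consistency",
    },
    "XAI-Bench": {
        "Faithfulness": "Completeness",
        "ROAR": "Completeness",
        "Monotonicity": "Completeness",
        "Infidelity": "Completeness",
        "GT-Shapley": "Coherence",
    },
    "OpenXAI": {
        "GT Faithfulness": "Coherence",
        "Faithfulness": "Correctness",
        "Stability": "Continuity",
    },
    "Compare-xAI": {
        "Fidelity": "Correctness",
        "Fragility": "Correctness",
        "Stability": "Continuity",
        "Simplicity": "Coherence",
        "Stress": "Context",
        "Other": "Context",
    },
}
```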
An analysis of the table according to the classification made by Nauta et al. shows that the properties categorized as Content (Correctness, Completeness, Consistency, Continuity, Contrastivity, Covariate Complexity) are the most covered, while the properties in the Presentation (Compactness, Composition, Confidence) and User (Context, Coherence, Controllability) categories are currently poorly covered.
Considering the comparison made, the XAIB is more flexible, allowing for deep customization of the evaluation process. It helps widen the scope of XAI evaluation by featuring more types of explanations and allowing the use of custom datasets and models in the evaluation. It opens new extensibility directions that were previously unaddressed, while making steps towards usability by introducing clean versioning, dependency management, and distribution.
In addition, it covers most of the properties, making it the broadest existing XAI benchmark. Although the XAIB implements only six out of twelve properties across different explainer types at the initial stage, all properties can be covered in the future.
5.2. Limitations
Functional ways to assess quality seem to be a good solution. They are very cheap to compute compared to user studies and provide clear comparison criteria. As long as the metrics themselves are interpretable enough, a decision should be easy to make by comparing numbers. These may be the reasons why functional evaluation is currently gaining popularity. There is a certain demand to bring XAI evaluation to the standard of AI evaluation; however, this does not seem possible. The main difference between AI and XAI, and the main reason for this impossibility, is the presence of a human. Since interpretability depends solely on human perception, it is difficult to formalize, in contrast to performance measures, which are formulated mathematically in the first place and are, in essence, a convenient way to aggregate many observations. In addition, accuracy (in a broad sense) and its properties are well defined, while interpretability is not, which is a major point of criticism for the field. Considering this, one should approach functional measures with care and perceive them only as guides and proxies for the real interpretability properties they are trying to represent.
All of the above means that when working in high-risk conditions and when building reliable machine learning systems that will have a great influence on human lives, one should not rely solely on the values of quantitatively and functionally obtained quality metrics. For a complete evaluation, human-grounded and functionally grounded experiments are required. Only by using insights from every method of quality measurement can one make an informed decision.
In the following paragraphs, the limitations of the existing implementation are considered. The XAIB was designed to be universal and easily extensible; however, those qualities come with their own set of trade-offs. Analyzing them, the following difficulties were identified: performance issues, combinatorial explosion, compatibility management issues, and the cost of implementation.
Performance issues arise when building systems that consist of many independent components. This structure impedes the advanced optimizations that would be possible if the solution were monolithic and task-specific. The abstractness and independence of the components also require additional management work, which adds to the performance costs. In addition, machine learning in general tends to be demanding in terms of computation, and having compute- and memory-demanding solutions creates challenges that should be addressed in the future. The limited set of entities currently featured in the benchmark is an intentional choice made to ease development: it avoids premature optimization for demanding solutions and keeps the focus on the main benchmark principles.
Considering the number of different entities that make up an XAI benchmark, the evaluation task is highly combinatorial. Adding a single new entity means creating a number of combinations with entities of every other type, which could potentially lead to an exponential increase in evaluation runs for each method. Although this issue is virtually unavoidable, its computational costs can still be mitigated, and performance in this case does not look like a major concern. The management of such a system seems to be the most difficult thing to overcome. Each entity, be it a dataset, a model, or an explainer, can be incompatible with the others. For example, a dataset without class labels can be incompatible with a classification model, which, in turn, can be incompatible with some types of explainers that do not work with some metrics, and so on. Not only can this property of the evaluation task cause errors when incompatible entities are combined, but solutions to this problem also need to be chosen carefully: some of them introduce a lot of additional complexity, so a good trade-off is needed to preserve ease of use.
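As an illustration of how such incompatibilities could be caught before an evaluation run, the sketch below performs a minimal metadata-based check. The helper name and metadata fields (task, has_labels, requires_labels) are hypothetical choices for this example and are not fields defined by the XAIB.

```python
# Hypothetical metadata-based compatibility check; field names are illustrative.
from typing import Any, Dict


def check_compatibility(dataset_meta: Dict[str, Any], model_meta: Dict[str, Any]) -> None:
    """Fail early when a dataset cannot provide what a model requires."""
    if model_meta.get("requires_labels") and not dataset_meta.get("has_labels"):
        raise ValueError("Model expects labeled data, but the dataset has no class labels.")
    if model_meta.get("task") != dataset_meta.get("task"):
        raise ValueError(
            f"Task mismatch: dataset is for {dataset_meta.get('task')!r}, "
            f"model is for {model_meta.get('task')!r}."
        )


# Example: an unlabeled dataset paired with a classification model is rejected
# before any evaluation is attempted.
try:
    check_compatibility(
        dataset_meta={"task": "classification", "has_labels": False},
        model_meta={"task": "classification", "requires_labels": True},
    )
except ValueError as err:
    print(f"Incompatible pair detected: {err}")
```

Checks of this kind keep incompatibility errors explicit, but every new entity type adds more rules to maintain, which is exactly the management cost discussed above.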
Aside from those difficulties, result aggregation and interpretation also become problems that should be addressed in future research. This brings us to the last limitation mentioned: the implementation cost. When the system is abstract and every entity carries a lot of metadata required for the system to work, it becomes more difficult to implement new solutions. Although this is not the case at the current stage, as the benchmark expands, a metadata handling and validation system will need to be built to ensure efficient compatibility management. Furthermore, it seems inevitable that more of this weight will fall on the shoulders of new contributors.
Each limitation can be regarded as a challenge to future research and the development of the XAIB.