State-of-the-Art Explainability Methods with Focus on Visual Analytics Showcased by Glioma Classification

Abstract: This study aims to reflect on a list of libraries providing decision support to AI models. The goal is to assist in finding suitable libraries that support visual explainability and interpretability of the output of their AI model. Especially in sensitive application areas, such as medicine, this is crucial for understanding the decision-making process and for a safe application. Therefore, we use a glioma classification model's reasoning as an underlying case. We present a comparison of 11 identified Python libraries that provide an addition to the better known SHAP and LIME libraries for visualizing explainability. The libraries are selected based on certain attributes, such as being implemented in Python, supporting visual analysis, thorough documentation, and active maintenance. We showcase and compare four libraries for global interpretations (ELI5, Dalex, InterpretML, and SHAP) and three libraries for local interpretations (Lime, Dalex, and InterpretML). As a use case, we process a combination of openly available data sets on glioma for the task of studying feature importance when classifying the grade II, III, and IV brain tumor subtypes glioblastoma multiforme (GBM), anaplastic astrocytoma (AASTR), and oligodendroglioma (ODG), out of 1276 samples and 252 attributes. The exemplified model confirms known variations, and studying local explainability contributes to revealing less known variations as putative biomarkers. The full comparison spreadsheet and implementation examples can be found in the appendix.


Introduction
In recent years, the successful application of machine learning (ML) algorithms has brought extensive benefits to different application areas. In particular, the success of deep learning (DL) approaches is transforming the way we approach real-world tasks performed by humans. ML and DL establish artificial intelligence (AI) models which can be applied in many different fields of research such as healthcare [1], cancer classification [2][3][4], autonomous robots and vehicles [5], image processing [6], manufacturing, and many more [7][8][9][10], thus enhancing and providing various benefits in the corresponding fields. Moreover, these models resulting from ML are suitable for performing different tasks, such as recommendation, ranking, forecasting, classification, or clustering. The variety and the nature of these approaches make them complex to understand and interpret. In the [...]ity [35]. For example, grading of diffuse gliomas (DIFG) is still an ongoing discussion and is currently defined by tumor nomenclature [36]. The process involves molecular and histological features in order to revise risk stratification. Common molecular biomarkers used for clinical classification of glioma include α-thalassemia/mental retardation syndrome X-linked (ATRX), isocitrate dehydrogenase 1 (IDH1), tumor protein p53 (TP53), telomerase reverse transcriptase (TERT), phosphatase and tensin homolog (PTEN), and the epidermal growth factor receptor (EGFR), among others [34,37]. We have recently highlighted age-based differences in brain tumor diseases using an explainable classification approach [22]. We now extend our studies to include several xAI methods for classifying DIFGs.

Theoretical Background on xAI
xAI was first defined in 2004 by Van Lent et al. [38] as a research field that explains the behavior of AI models in a more understandable way. Focus on the topic of xAI has recently been increasing [32] due to increased attention and improvements around the topic of AI/ML across different fields. However, along with highly accurate results, a more human-centric explanation of the decision-making process of these models is required. This leads the focus toward xAI in the current age. Furthermore, the increase in complexity of ML models has led to the requirement for developing principles for algorithmic decision-making, such as the fairness, accountability, and transparency (FAT) principles [39], which are especially relevant in highly regulated and mission-critical scenarios.
There are several perspectives on the explainability of an AI model (e.g., scope, stage, and problem type). The scope perspective regards the global and local view on model explanations. AI models can be explained either at the global level or at the local level. Global-level interpretation is known as global interpretability in the literature [32], where the entire model behavior is analyzed, e.g., via feature importance. Global interpretability summarizes the impact of input features on the model, as well as the model as a whole, while local-level interpretation is known as local interpretability and aims to understand the behavior of single predictions and decisions made by the model.
Another perspective on the explainability of an AI model is associated with the type of AI model itself. Overall, two types of models exist, white-box and black-box models. White-box models are made to be explainable by design, resulting in no requirement of additional xAI methods for the model to be explainable. Contrarily, black-box models are not explainable by design, so other techniques have to be applied to extract reasoning for certain decisions and predictions.
In regard to xAI methods, a recent study [32] reviewed more than 200 scientific articles that aimed to develop new methods for explainability. However, discussing these methods and other xAI concepts falls outside of the scope of this paper. We encourage the reader to consult the work discussed in [30][31][32] for more details about these concepts.

Dataset
Data on glioma samples were downloaded from cBioPortal [40,41], filtering for the 6 studies gbm_mayo_pdx_sarkaria_2019, gbm_tcga_pub2013, glioma_mskcc_2019, lgg_tcga, lgg_ucsf_2014, and odg_msk_2017. Only data with the 7 attributes "Oncotree Code", "Mutation Count", "Overall Survival (Months)", "Overall Survival Status", "Sex", "Somatic Status", and "Diagnosis Age" were used. Sample rows without complete data were removed. The data were extended with gene mutation data of the top 246 mutated genes within the selected studies.
The top three diffuse glioma (DIFG) subtypes (Glioblastoma multiforme (GBM), Anaplastic Astrocytoma (AASTR), and Oligodendroglioma (ODG)) were further selected and analyzed within this work. We filtered and further processed the data for model building, comprising 1276 sample rows with 253 columns out of the 5 studies gbm_mayo_pdx_sarkaria_2019, gbm_tcga_pub2013, glioma_mskcc_2019, lgg_tcga, and lgg_ucsf_2014. The Oncotree Code was selected as the target and the other 252 data columns were selected as features, with 872 GBM sample rows, 234 AASTR sample rows, and 170 ODG sample rows. The data preprocessing and model building can be found at https://github.com/mathabaws/SOTA_xAI_Visual_analytics/blob/main/notebooks/diffuseglioma-dataset-processing.ipynb (accessed on 12 January 2022).
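As a minimal sketch of the filtering described above, the following Python snippet drops incomplete rows, keeps the three selected subtypes, and separates the target column from the features. The toy records, the placeholder code "OTHER", and the helper name split_features_target are hypothetical and serve illustration only; the actual preprocessing is contained in the linked notebook.

```python
# Toy records standing in for the downloaded clinical table (hypothetical values).
rows = [
    {"Oncotree Code": "GBM", "Diagnosis Age": 61, "Sex": "Male", "Mutation Count": 48},
    {"Oncotree Code": "ODG", "Diagnosis Age": None, "Sex": "Female", "Mutation Count": 20},
    {"Oncotree Code": "AASTR", "Diagnosis Age": 37, "Sex": "Female", "Mutation Count": 31},
    {"Oncotree Code": "OTHER", "Diagnosis Age": 50, "Sex": "Male", "Mutation Count": 12},
]

SELECTED_SUBTYPES = {"GBM", "AASTR", "ODG"}
TARGET = "Oncotree Code"


def split_features_target(records):
    """Drop incomplete rows, keep the selected subtypes, split features and target."""
    features, target = [], []
    for record in records:
        if any(value is None for value in record.values()):
            continue  # sample rows without complete data are removed
        if record[TARGET] not in SELECTED_SUBTYPES:
            continue  # only the three selected subtypes are kept
        features.append({k: v for k, v in record.items() if k != TARGET})
        target.append(record[TARGET])
    return features, target


X, y = split_features_target(rows)

# Consistency check on the counts reported in the text.
class_counts = {"GBM": 872, "AASTR": 234, "ODG": 170}
total_samples = sum(class_counts.values())
n_features = 253 - 1  # all columns except the target
```

The class counts reported in the text add up to the stated 1276 sample rows, and removing the target from the 253 columns leaves the 252 features.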

Implementation
We conducted a structured review with the goal of investigating current developments and state-of-the-art xAI libraries focusing on model interpretation and visualization techniques. State of the art means most up to date, publicly available, implemented consistently with the requirements of current software technology, and following common Python patterns. Moreover, this review aims to investigate various relevant aspects of xAI libraries such as maturity level, documentation, supported programming languages, models and different machine learning tasks, support for data types, etc. The structured review closely follows the methodology for Structured Literature Review (SLR) from Webster and Watson [42]. Additionally, we take necessary attributes for a software selection process into account.
The initial set of available libraries was acquired through a search on GitHub. Keywords and the type of the results are the two key limiting factors guiding the initial set of results. For the first limiting factor, the keywords "explainable AI" and "interpretability" were used. The second limiting factor was the type of results, which was set to "repository"; this excluded all results containing these keywords only in, e.g., the code itself, discussions, issues, or commits. Applying these limiting factors yielded 57 results. To further narrow down the results, three rules were developed for the initial scan of the libraries, as shown below:
1. Result has to be a repository of a Python library or a software package;
2. Result has to implement at least one xAI method;
3. Result has to be an overview repository (a repository that provides an overview of xAI libraries).
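The initial scan can be sketched as a simple filter over repository metadata. The text does not state how the three rules combine; the sketch below assumes that a result is kept if it satisfies rules 1 and 2 together, or rule 3 on its own. The Repo class, its fields, and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    name: str
    is_python_package: bool      # rule 1
    implements_xai_method: bool  # rule 2
    is_overview: bool            # rule 3


def passes_initial_scan(repo: Repo) -> bool:
    # Assumption: rules 1 and 2 must hold together, or rule 3 holds on its own.
    return (repo.is_python_package and repo.implements_xai_method) or repo.is_overview


# Hypothetical search results.
candidates = [
    Repo("xai-lib", True, True, False),
    Repo("awesome-xai-list", False, False, True),
    Repo("unrelated-tool", True, False, False),
]
kept = [repo.name for repo in candidates if passes_initial_scan(repo)]
```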
Supplementary source code, together with an overview of library versions and descriptions to recreate the exact development environment used for these experiments, can be found on GitHub at the following URL: https://github.com/mathabaws/SOTA_xAI_Visual_analytics (accessed on 12 January 2022).

Library Comparison on Glioma Subtype Classification
Using the processed data from the combined studies described in the materials section, we trained a model to classify cancer subtypes by distinguishing between the Oncotree codes GBM, AASTR, and ODG. These are the three most frequent diffuse glioma subtypes among the samples.
In total, 1020 instances were used for training and 256 for testing. The testing data remained unbalanced, representing a realistic scenario. Ten-fold cross-validation scored a mean accuracy of 0.87 with a standard deviation of 0.02. The results of the trained model are shown in Table 1. In the next subsections, the Python libraries suitable for xAI and VA selected for in-depth analysis are presented, including results from tests with the model described above.
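To illustrate how a ten-fold cross-validation score with mean and standard deviation is obtained, the following sketch evaluates a majority-class baseline on labels with the class sizes of this data set. It is an assumption-laden illustration, not the evaluation code used in this work: the baseline stands in for the trained model, the fold assignment is a plain shuffle without stratification, and the sample standard deviation is used because the text does not specify the estimator.

```python
import random
import statistics
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k=10):
    """Accuracy per fold of a majority-class baseline (stand-in for a real model)."""
    folds = k_fold_indices(len(labels), k)
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in test_set]
        majority = Counter(train_labels).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Labels with the class sizes reported for the data set.
labels = ["GBM"] * 872 + ["AASTR"] * 234 + ["ODG"] * 170
scores = cross_validate(labels, k=10)
mean_accuracy = statistics.mean(scores)
std_accuracy = statistics.stdev(scores)
```

Such a baseline always predicts GBM and therefore scores close to the GBM share of the data (872 of 1276 samples, about 0.68), which gives a reference point for the reported mean accuracy of 0.87.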

Python Libraries for Explainability
Applying the method described in the previous section, 52 relevant repositories were identified. Moreover, several overview repositories on the topic of xAI were identified. These overview repositories provided information on libraries other than the ones identified through the initial scan and were further used for backward and forward search. Next, a process resembling an abstract and conclusion scan was conducted to filter out the libraries not focused on xAI and/or VA. In other words, the documentation and implementation of the libraries were scrutinized to identify their focus and scope. As a result, 48 libraries were selected as relevant. These libraries were analyzed, interpreted, and summarized in a concept-centric way [42]. Through an in-depth analysis, metadata was collected, and core libraries and frameworks were identified for further exploration. Figure 1 provides an overview of the process. As a first step, we drilled down the initial results described in the previous section to the most important libraries aiming for xAI using visualization tools. The complete comparison table can be found in Appendix A.1. We then defined structured rules that help us to identify relevant libraries for further analysis and experimentation. Firstly, we select only those libraries that are implemented in Python and integrate visualization features to communicate xAI results. Furthermore, the chosen libraries are able to explain classification models. Last but not least, these libraries are open source, provide good documentation, and support tabular data.
After filtering, we identified 11 relevant libraries. Selected libraries based on the aforementioned rules are listed in Table 2. We excluded 6 of the 11 identified libraries as missing criteria were revealed during the in-depth inspection. The remaining relevant libraries were grouped into three different groups: libraries aiming for global explainability in general, libraries focusing on local explanation, and, in particular, libraries which support Lime and SHAP approaches. In the first group, the following libraries are selected: ELI5 [43], Dalex [29], InterpretML [28], and SHAP [17]. In the second group, i.e., local explainability, Lime and SHAP approaches are explored in more detail. Three different libraries focusing on Lime are analyzed: Lime [16], Dalex [29], and InterpretML. Finally, three different libraries focusing on SHAP approaches are analyzed in detail: InterpretML, Dalex, and SHAP. The selected libraries are analyzed and compared within the groups and the results are shown in the sections below. The complete overview table can be found on the GitHub repository (Appendix A.2). All experiments concerning the analyzed libraries in depth are conducted using a notebook with the following characteristics: Lenovo ThinkPad L470, Intel(R) Core(TM) 2.70GHz -2.90GHz, 16 GB RAM, Windows 10.

Global Explainability
Several libraries were identified with implementation of different feature importance methods. These are methods that rely on assigning a score to input features based on the predictive performance they add to the model. We are starting this overview with the focus on (1) methods for global explainability of the model and (2) methods that use visualization to communicate the explainability results. During the in-depth analysis, four libraries were identified to contain feature importance visualizations, namely ELI5, Dalex, InterpretML, and SHAP.
ELI5 focuses on feature selection with the implementation of permutation importance. It enables extraction and visualization of feature weights and their contribution from the model as a form of global explanations. Visualizations are based on the list view of the features and their weights in a tabular form. The gradient of green and red color indicates the positive or negative impact on the model decisions, and there are no interactive options. Figure 2 depicts feature importance visualization implemented in the ELI5 library. Furthermore, model inspection on the prediction level is supported, which uses similar visualization with weights adding up to either probability of a class in classification models or predicted value in case of regression models.
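Permutation importance, the method ELI5 implements, can be illustrated without any xAI library. The sketch below shuffles one feature column at a time and records the resulting drop in accuracy. It is a simplified illustration under stated assumptions, not the ELI5 implementation: the toy data, the toy model, and the choice of 10 shuffling rounds are hypothetical.

```python
import random


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_rounds=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column n_rounds times."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    n_features = len(X[0])
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_rounds):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_rounds)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is noise.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    # Hypothetical "trained" model that uses only feature 0.
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, X, y)
```

Shuffling the feature the model relies on degrades accuracy noticeably, while shuffling the ignored feature leaves it unchanged, which is the intuition behind ranking features by this score.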
Dalex implements a method called variable importance, which provides global explanations of a model based on Permutational Variable Importance [44]. In this method, each variable is randomly shuffled, and the model is inspected for its predictive performance. Intuitively, shuffling more important features degrades the model performance more than shuffling less important ones. Finally, after 10 permutation rounds for each feature, a visualization is created, showing the impact of each feature on the model. Such a visualization provided by the Dalex library is depicted in Figure 3. Furthermore, the Dalex library provides a simple interactive overview when hovering the mouse over the visualization. This interactive window quantifies the influence of the features on the model and provides additional information. The Dalex library also provides the option to tune hyperparameters, such as the number of permutation rounds and various thresholds, and enables grouping of the features.
The SHAP library provides the opportunity to analyze the model at the global level. This method helps to interpret the model by estimating feature importance together with feature effects on the prediction with respect to the raw data (as shown in Figure 4). The importance of features is shown along the x-axis, with important features listed at the top. For each feature, the contribution towards the specific classes is shown using the corresponding color, as shown in Figure 4a. Furthermore, SHAP provides the opportunity to conduct global interpretation for specific classes, as shown in Figure 4b. In this case, the contribution of specific features is shown along the x-axis, where the contribution can be either positive (contributing toward the prediction of this class) or negative. Each data point stacked vertically within this visualization represents the contribution for a specific instance. The color gradient encodes the raw values, with blue representing the lowest and red the highest value.
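A global ranking of the kind described for SHAP can be derived from per-instance contributions by averaging their absolute values per feature. The following sketch assumes this aggregation (mean absolute contribution), which is a common convention for SHAP bar summaries; the contribution values below are hypothetical and not results from our model.

```python
def global_importance(shap_values, feature_names):
    """Rank features by the mean absolute per-instance contribution."""
    n_instances = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_instances
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-instance contributions for one class (rows = instances).
feature_names = ["Diagnosis Age", "CIC", "IDH1"]
shap_values = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.15, 0.00],
    [0.25, -0.05, -0.10],
]
ranking = global_importance(shap_values, feature_names)
```

Taking absolute values before averaging prevents positive and negative contributions of the same feature from cancelling each other out in the ranking.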
As mentioned in Section 3.1, InterpretML is focused on navigation through different views and interactive application of different methods. One of the methods that is provided by the library is the overall importance. Overall importance presents the global feature importance of the model. InterpretML makes the distinction of algorithms that are applied in two different model types. These are glassbox models and black-box explainers. To be able to apply and extract global feature importance, a glassbox model needs to be trained. These models are structured for direct interpretability, contrary to the black-box models that provide approximations of explanations. This introduces additional overhead in utilizing InterpretML for model explainability, as an additional model had to be trained to extract important features of the model. An example of such feature importance visualization provided by InterpretML is depicted in Figure 5. Based on the popular visualization library Plotly [45], InterpretML allows simple interaction with the visualization (e.g., zoom-in, selection, export to image format, etc.).
Summarizing the libraries for global explanation analysis, in terms of computational load, ELI5 provides the most lightweight solution for feature inspection. A simple and unified application programming interface enables a virtually instant overview of the features. In contrast, all other libraries require some degree of further processing to provide global explainability information. In the context of tabular data, the only supported visualization in ELI5 is a table overview with a gradient of green and red color encoding to indicate the importance of a feature in model predictions. The SHAP library provides more variety in terms of visualization with the implementation of a bar chart and a summary plot, which combines feature importance with feature effects. In regard to interactivity, the visualizations provided by SHAP in the context of global importance are static and do not provide any further interactive features. Furthermore, in comparison to ELI5, SHAP requires an additional computational load that comes with the calculation of SHAP values. The Dalex library implements additional interactivity features in the model-level variable importance calculation. The visualization implemented in Dalex contains a list of features and their impact on predictions, with additional information provided upon the selection of a feature, which proved particularly useful when inspecting models with large numbers of features. However, this interactivity comes with additional computational load, which was significant in comparison with the other libraries. Calculation of the feature importance for the previously developed model took from 1.5 to 5 min, depending on the number of permutation rounds for each feature. Finally, InterpretML provided the most interactivity out of all previously described libraries. Invoking global explanation functions provided a menu system alongside visualizations to investigate feature importance and feature interaction. Each visualization enabled extensive inspection through zoom, select, lasso, and export functionality. Despite this interactivity, the InterpretML library is limited by the requirement to use built-in glassbox models such as ExplainableBoostingClassifier. Although these models show comparable performance, the restriction to built-in models is quite significant. Furthermore, the computational overhead of training an additional model should not be overlooked. Overall, from the perspective of global explainability, all identified libraries provide useful insight into the model behavior, and each comes with its merits and limits from the perspective of visualization options, interactivity, and computational overhead.

Local Explainability
Models that produce accurate predictions and, at the same time, can explain such predictions are crucial. Researchers often generate global explanations, which try to explain predictions of black-box learning algorithms. However, such a global explanation cannot clarify the prediction of every single instance in the model. Local explainability focuses on gaining the user's trust for individual predictions and then trusting the model as a whole. Interpretation should make sense from the point of view of individual prediction. Globally important features may not be important locally and vice versa. In this case, the aim is to understand model decisions with respect to local context rather than the global behavior of the model.
Several solutions are mentioned in this paper; in this section, we focus on local explanations and the two most relevant Python libraries, SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations) [16], identified by the selection rule mentioned in the previous section.

Local Explainability with SHAP
We identified three libraries that fit the selection rule for the most relevant libraries and implement the SHAP approach: InterpretML [28], Dalex [29], and SHAP [17]. Consequently, we compared and analyzed these libraries, showing the state of the art on the topic of SHAP values aimed at the interpretation of black-box models.
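The quantity that these libraries estimate can be made concrete with an exact computation on a small example. The sketch below enumerates all feature coalitions for a three-feature toy model and computes the exact Shapley value of each feature. It is an illustration under simplifying assumptions, not the algorithm of any of the three libraries: absent features are replaced by a single baseline value, and the toy model, instance, and baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(instance)
    players = range(n)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in players]
        return predict(mixed)

    phi = []
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi


def toy_model(x):
    # Hypothetical score with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2] + 1.0


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two interacting features. Since the number of coalitions doubles with every added feature, exact enumeration is only feasible for a handful of features, which is why practical implementations rely on approximations.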
Dalex (shown in Figure 6) offers basic interaction such as hovering over the visualization. This provides an opportunity to navigate through the results easily. Moreover, it provides the option to download the chart directly from generated visualization.
SHAP offers various visualizations, such as waterfall graphs for global analysis and force plots for local analysis. We specifically compared local interpretation based on the force plots shown in Figures 7 and 8. SHAP provides many alternatives to interpret black-box behavior, such as the force plot of a single prediction shown in Figure 7, which is a static visualization. Additionally, Figure 8 shows a grouped analysis of all predicted instances, where the single instances are stacked along the x-axis. This interactive visualization provides a drop-down menu at the top of the chart to select the method (e.g., ordered by similarity) used to order and group the instances along the x-axis. Moreover, a drop-down menu on the y-axis offers the option to select the feature which the user wants to analyze. Hovering over the chart highlights different details, thus increasing the level of information provided by this approach.
In contrast, InterpretML provides an opportunity to navigate through different instances using a drop-down menu, presented in Figure 9. The estimated SHAP results for a specific instance are shown automatically when that instance is selected. This makes it possible to navigate through different instances, gain a better overview of the results, and compare the output of different instances faster. In particular, information such as the predicted class, actual class, and residual error for each instance is shown in the drop-down menu, as well as in the main window. This allows similar instances to be compared based on predicted class, actual class, or residual error, and thus supports a more comprehensive understanding of a model's class prediction. Moreover, interactions such as zoom in, zoom out, pan, select, and download are supported. However, InterpretML supports only KernelSHAP methods.
Along the x-axis, the local explanation results of every instance are stacked. The y-axis shows the contribution to the prediction and offers the option to select the feature that will be explored for every instance in terms of SHAP contribution.

Local Explainability with LIME
LIME (Local Interpretable Model-Agnostic Explanations) is a popular technique that tries to explain the predictions of any classifier by learning an interpretable model locally around the prediction. The key idea behind LIME is that it is easier to approximate a black-box model by a simple model locally. The Lime library can explain any black-box classifier with two or more classes. The visualization output of the Lime library is a list of explanations, reflecting the contribution of each feature to the instance prediction (Figure 10a). The visualization provides local explainability and helps to investigate which feature changes will have the most impact on the instance prediction.
Taking Figure 10a as an example, features are represented in two colors: blue and light sea-green. The blue bars indicate supporting (positive) scores towards an instance being predicted as GBM, while the light sea-green bars indicate contradicting (negative) scores towards this prediction. Floating-point numbers on the horizontal bars represent the relative importance of these features. We can see in Figure 10a that the genes CIC, BCL6, PKD1L1, and ATRX have the highest positive influence.
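The idea of a local surrogate can be sketched in a few lines: perturb the instance, weight the perturbed samples by their proximity to it, and fit a weighted linear model whose coefficients serve as the explanation. The sketch below is a simplified illustration and not the Lime library's implementation, which additionally handles discretization, feature selection, and regularization. The Gaussian perturbation, the exponential kernel, all parameter values, and the toy black-box function are assumptions.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def local_surrogate(predict, instance, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around `instance`.

    Returns [intercept, coef_0, coef_1, ...].
    """
    rng = random.Random(seed)
    d = len(instance)
    XtWX = [[0.0] * (d + 1) for _ in range(d + 1)]
    XtWy = [0.0] * (d + 1)
    for _ in range(n_samples):
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        dist2 = sum((s - v) ** 2 for s, v in zip(sample, instance))
        w = math.exp(-dist2 / kernel_width ** 2)  # closer samples count more
        z = [1.0] + sample
        target = predict(sample)
        # Accumulate the weighted normal equations (X^T W X) beta = X^T W y.
        for a in range(d + 1):
            XtWy[a] += w * z[a] * target
            for c in range(d + 1):
                XtWX[a][c] += w * z[a] * z[c]
    return solve_linear_system(XtWX, XtWy)


def black_box(x):
    # Hypothetical model; linear here so the surrogate can recover it exactly.
    return 2.0 * x[0] - 3.0 * x[1] + 1.0


coefficients = local_surrogate(black_box, instance=[0.5, 1.5])
```

For a black-box function that is exactly linear, the surrogate recovers its coefficients; for a non-linear model, the coefficients describe the behavior only in the neighborhood of the explained instance, and their signs correspond to the supporting and contradicting bars in a LIME-style plot.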
As with the SHAP approach, besides the Lime library itself, InterpretML and Dalex are the most relevant libraries implementing the LIME approach, based on our selection rule. The libraries Dalex and InterpretML were already mentioned and explained in previous sections. The resulting plot for Dalex is shown in Figure 11. The figure shows an explanation for an instance predicted as class GBM. The length of the bar indicates the magnitude, while the color indicates the sign (red for negative, green for positive) of the estimated coefficient. In the previous examples, Dalex offered basic interaction such as hovering over the visualization, as well as the ability to navigate through the results easily. Unfortunately, the resulting plots for the LIME method do not provide any of these features. InterpretML using the LIME approach is shown in Figure 12. As in previous examples (see Figure 9), InterpretML provides an opportunity to navigate through different instances using a drop-down menu. Selecting a specific instance displays its explanation, which gives a better overview of the results across instances. Regarding computation time, as can be seen in Table 3

Biomedical Implication of Features
The evaluation of features affecting the classification between the diffuse glioma (DIFG) subtypes Glioblastoma multiforme (GBM), Anaplastic Astrocytoma (AASTR), and Oligodendroglioma (ODG) highlights various mutated genes and clinical variables depending on the underlying xAI method. Diagnosis age and survival are among the most important predictors in all of the methods, followed by varying gene mutations. Capicua (CIC) represents an important feature in all approaches and is the most valuable gene feature in Dalex, second in SHAP and InterpretML, and fourth in ELI5. Mutated IDH1 is among the top features and, from a clinical point of view, is commonly used for survival prognosis in patients suffering from glioma [34]. [...] Figure 10). Variables changed place in the hierarchy of importance, and additional information is given on a particular variable's impact on the prediction, shown as a negative or positive factor towards the particular class in the local view.

Overview of xAI Approaches
The comparison overview and ranking are shown in Table 3. The table summarizes the global and local explainability comparison results for SHAP and LIME.
In the context of global explainability, similar criteria can be used for the selection of libraries, i.e., computational overhead, implemented visualizations, and interactivity. From the perspective of the computational overhead, ELI5 provides the most lightweight solution both in terms of computational overhead and implemented visualizations and interactivity. The simple interface provides a good basis for a quick inspection of the existing model and overall model debugging. Feature importance alongside other implemented functionality (e.g., feature selection) of ELI5 can be convenient during the model development process. Increased interactivity and visualization options come with the additional computational overhead in SHAP, Dalex, and InterpretML libraries. From the perspective of interactivity in global explainability, InterpretML provides the most interactive solution. The addition of menu components to select different model components makes it easy to switch between analysis perspectives and extensive visualization features (zoom, lasso, select, and others). This provides excellent analytical insights. However, these functionalities come with limitations in terms of the limited scope of built-in Glassbox models that can be used and additional computation overhead caused by model retraining. In terms of visualization, SHAP and Dalex are in between ELI5 and InterpretML. Compared to ELI5, Dalex requires more computational overhead but provides additional interactivity and visualizations. On the other hand, SHAP requires even more computational overhead but provides excellent visualization options that enable a complex analysis of the interplay between feature importance and feature effect. From the perspective of the stage of the development of the predictive model, ELI5 and Dalex seem to be focused on the model analysis, while SHAP and InterpretML put focus on the underlying data and how this data impacts the model decisions.
Regarding local explainability using the SHAP approach, we identified different outcomes. In general, to explain a black box in the big data context, it is important to find the trade-off between computational resources and explainable results. In the context of local explainability, SHAP outperformed the other libraries in terms of computational resources and in providing an interactive way to explore the different model predictions. In terms of interactivity, both SHAP and InterpretML outperform Dalex and provide many options to analyze explainable results of multiple instances interactively. However, if the goal is to find a trade-off between computational overhead and interactivity, then Dalex seems to be the optimal solution in this context. Finally, if the focus is on exploring the features, the SHAP force plot grouping methods provide many advantages. However, InterpretML offers the option to compare different instances in terms of feature contribution, predicted class, actual class, and residual error. This provides a considerable advantage over the other methods for analyzing the behavior of black-box models in terms of predicted/actual class.
Compared to SHAP, LIME has advantages in terms of speed, as it builds the model around individual predictions. In the case of large datasets, using SHAP might not be feasible due to the large computational overhead caused by the calculation of all global permutations. Despite the performance overhead, SHAP provides a unified solution, which, once computed, offers a more refined explainability and analytical experience. LIME provides an intuitive instance explanation. The Lime library builds the model around individual predictions (neighborhood); thus, it does not take additional time to compute the model for all instances. On the other hand, the resulting plots do not provide any interactivity. Using Dalex for the LIME approach likewise does not offer any interaction. InterpretML is the only library providing interactivity while using the LIME approach. In comparison with the LIME plot, InterpretML's resulting plot does not offer an extensive summary of features.
The main advantage of SHAP for local explanation is that it is the only xAI method based on solid theory (the Shapley value) [46]. Moreover, SHAP guarantees that the prediction is fairly distributed among all feature values. On the other hand, LIME for local explanation is faster than SHAP concerning computation time. In particular, if the aim is to analyze huge data sets, then LIME will provide a suitable alternative to the time-consuming computation of Shapley values. The SHAP approach addresses this challenge by using approximation and optimization; however, not all model types are supported yet. In addition, LIME supports tabular data, text, and images. In other xAI methods, it is rare that all these types of data are supported.
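For reference, the Shapley value underlying this argument can be written as follows. The notation is introduced here for illustration and is an assumption: N denotes the set of features, v(S) the model output when only the features in S are known, and φ_i the contribution assigned to feature i.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr),
\qquad
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset).
```

The second identity is the efficiency property: the contributions of all features add up exactly to the difference between the prediction for the instance and the baseline output, which is the sense in which the prediction is fairly distributed among the feature values. The sum over all subsets S also shows why the exact computation becomes expensive as the number of features grows.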

Discussion
The output of any ML model should be comparable and interpretable. This is of particular interest to researchers in the medical domain, for example in cancer research, where model performance may be compared with that of clinicians [47]. Some experts from the medical domain argue that transparency of black boxes is not of primary interest for AI applications in their domain, as doctors make diagnoses based on their experience, and complete information on the causality of medical issues is rare [48,49]. However, xAI methods can help to gain new insights and advance biomedical knowledge to better understand interrelated characteristics and signaling components in pathologies.
As a modeling approach, the classification of glioma subtypes serves as an example. The chosen dataset combines data from different brain tumor studies and comprises sample data primarily from the glioma subtypes GBM, AASTR, and ODG; these three disease types were therefore chosen for classification in order to apply VA methods for interpreting global as well as local feature importance. The dataset provides the Oncotreecode as identifier. GBM, AASTR, and ODG are all DIFG subtypes. Even combining data from six different studies resulted in a lack of samples for specific subtypes; therefore, only the top three were chosen. Open data resources are still set to develop further and to be extended [50]. The chosen dataset is unbalanced and fits this use case insofar as it represents a challenge often found in molecular sciences. This study aims to describe xAI tools rather than to provide a highly performing classifier; still, classifying glioma subtypes is a challenging task, which makes it an ideal example for comparing VA features in xAI. Cross-validation of xAI is not applicable to date, as it remains a matter of ongoing research.
From a biomedical point of view, many of the important variables highlighted by the various xAI methods are already known to be involved in cancer signaling and represent common biomarkers in glioma. Generally, such insights into the model can be used for validation. The transcriptional repressor CIC is part of the tyrosine kinase signaling pathway, which is known to be involved in tumorigenesis, especially in GBM [51]. Other gene features impacting the classification include mutated IDH1, ATRX, TP53, PTEN, TERT, NF1, and EGFR, all of which are known to be involved in DIFG [22,52]. Among the important variables are also members of the mucin protein family (MUC16 and MUC17), which are involved in epithelial barrier formation and are potential biomarkers for favorable prognosis in DIFG, and lysine methyl transferase (KMT2B), which has also been shown to be a player in gliomagenesis [53,54]. As one example, the type I transmembrane protein Notch 1 receptor (Notch1) is involved in the NF-κB signaling pathway affecting cancer development and progression, especially in GBM [55]. Notch1 is listed among the global top 20 variables by SHAP, but not by Dalex. Still, in SHAP it distinguishes primarily between ODG and GBM. Some gene mutations are not primarily characteristic of one disease subclass, but can increase or mitigate cancer malignancy, as illustrated by the example of IDH1. Mutated IDH1 will lead to a favorable outcome, but a complete genetic profile could tell more about cases not concordant with standard prognoses [56]. In the case of local explanations as given in Figure 10b, IDH1 is selected in favor of the ODG class. Local explanations can thereby provide further insight into individual cases instead of presenting the big picture of global classes.
The local explanation in Figure 12 shows that the low mutation count has been used to select the GBM class for this instance. A high mutational burden is indicative of an unfavorable prognosis, as is the case for GBM, which would contradict the observation in this local view. This could be seen as a limitation of model accuracy or be used for future investigations of individual cases and underlying experimental constraints. In Figure 10, we can see another local explanation for GBM classification which is supported by a low mutation count. This could be due to the fact that a high number of samples originate from GBM biopsies, so that samples with a low mutation count can also be found frequently. This unbalanced data source can be seen as a certain limitation of the presented model; however, combining the local explanations in Figure 10 with the global explanations in Figure 4, we can see that even if the mutation count is among the top-rated features, there are also other important features that should be taken into account for further analysis. Diagnosis age and overall survival are preferably incorporated by the different algorithms on a global basis. Further local instances by InterpretML and Dalex are presented in Figures 9 and 11. For example, gene mutations of With-No-Lysine Kinase 1 (WNK1) are ranked among the top important features, highlighting a possible role of WNK1 in glioma, which has yet to be shown for WNK3 [57]. One local instance presented by Lime in Figure 10a ranks AT-Rich Interaction Domain 1B (ARID1B), shown as a putative driver gene in glioma [58], among the most important variables for classifying GBM. The feature is followed by others such as Protein Kinase DNA-Activated Catalytic Subunit (PRKDC), a component of the autophagy-regulating signaling cascades that is also altered in glioma [59], and the Anaplastic Lymphoma Receptor Tyrosine Kinase (ALK), whose variation has been implicated in pediatric glioma [60].
Another local instance by InterpretML, shown in Figure 9, includes Polycystic Kidney And Hepatic Disease 1 Protein (PKHD1), shown as a variant in GBM [61], in the top feature list, followed by Insulin Receptor Substrate 2 (IRS2) [62] and Dynein Axonemal Heavy Chain 11 (DNAH11), which has recently been linked to immune cell infiltration in glioma [63].
Applying xAI methods further facilitates the refinement of the model's underlying data and thereby helps to understand and enhance a model. By studying the results of the local explainability methods, we found an error in the algorithm for computing the number of mutations per gene. The value "NA" had been counted as 1 rather than 0, because the different gene mutations in the processed data are handled as strings separated by spaces. After evaluating and comparing the results, we corrected the model and revisited the comparison, which led to better results in terms of the reproducibility of already known markers, quality, and model performance.
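The kind of error described above can be reproduced in a few lines. The following is a sketch under assumptions, not the code from the study's notebooks: the function names and the mutation labels are hypothetical, and the only behavior taken from the text is that mutations arrive as space-separated strings and that "NA" must count as 0.

```python
def naive_mutation_count(cell):
    """Buggy variant: counts every space-separated token, including "NA"."""
    return len(cell.split())


def mutation_count(cell):
    """Count mutations in a space-separated string; missing values count as 0."""
    if cell is None:
        return 0
    tokens = cell.split()
    # Skip the missing-value marker instead of treating it as a mutation.
    return sum(1 for token in tokens if token.upper() != "NA")


# Hypothetical cell values for one gene column (labels are illustrative only).
cells = ["NA", "A123T", "A123T G45D"]
naive_counts = [naive_mutation_count(c) for c in cells]
fixed_counts = [mutation_count(c) for c in cells]
```

With the naive variant, a wild-type sample marked "NA" is indistinguishable from a sample carrying exactly one mutation, which is the sort of artifact that becomes visible when a local explanation attributes importance to a gene in a sample without any mutation.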
The comparison of xAI libraries can be used for gaining biomedical insights, but also to detail the advantages and challenges of using these tools in certain application scenarios. Figures 8 and 11 show two diverging examples of the range of VA features, such as interactivity or details on demand, with regard to xAI quality and quantity. Ultimately, which library and approach to choose depends on the use case, such as finding novel biomarkers by analyzing classification feature importance or investigating survival prediction. Therefore, we compared libraries regarding their global xAI features separately from those with local ones. With the detailed descriptions above, we aim to support the decision-making process of choosing a suitable library. For instance, ELI5 is optimal regarding computational load, while InterpretML offers the most interactivity at the expense of computation time.

Conclusions
We present a comparison of the ease of use of current xAI libraries and exemplify how to support the understanding of a black-box model's results in glioma classification to find novel biomarkers. Thereby, we describe ways to integrate VA features for xAI. We only scratch the surface when it comes to going beyond xAI. The process of understanding can be supported by interactivity and other features to assess the quality of explanations [64]. Future work may also include taking the type of mutation into account by incorporating various types of mutations as different features; for now, the model differentiates between wild-type and mutated genes and uses the number of mutations if there is more than one mutation for the same gene. Additionally, data could be integrated from miscellaneous sources and cover further subclasses or clinical features, while adding use cases of survival prediction or clustering approaches for signaling insights. Performance experiments for further information on requirements and recommendations could also be part of future work. Finally, we believe that the presented approach, using open data, providing an open source implementation, and focusing on ease of use, as well as showcasing the application of xAI to real scientific problems, can contribute to the research fields of cancer science and beyond.

Data Availability Statement:
Preprocessed data and implementations such as notebooks can be found on https://github.com/mathabaws/SOTA_xAI_Visual_analytics/ (accessed on 12 January 2022).

Acknowledgments:
We thank the cBioPortal maintainers and collaborators for providing data on cancer and all the other data providers to make open science possible. We dedicate our work in memoriam to our family members and friends we have lost.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: