Hereafter, for each class, we present the original formulation of the technique, even when it was not expressly devised for transformers but was later applied to them with some modifications, and we describe the main papers that have adopted it.
5.1. Activation-Based Methods
One of the main ideas to identify the contribution of each input feature to the output is to use the activation of the neurons, going through the network back to the input. Developing this idea, Bach et al. presented in [
38] the Layer-wise Relevance Propagation (LRP) technique. It is based on the assumption that the activation of a neuron signals its relevance for the output. Moreover, the relationship between layers is modeled by expressing the relevance of a neuron in a layer as a linear combination of the relevances of the neurons in the following layer; this relationship allows us to track neuron relevances from the output back to the input. Furthermore, the basic method assumes that the network uses ReLU activations in all layers.
Using $a_i$ as the dummy variable describing the activation of neuron $i$, $w_{ij}$ as the weight in the layer from the input $i$ to the output $j$, and $R_i^{(l)}$ as the relevance of the neuron $i$ in the layer $l$, we consider the following relation to hold, which allows us to compute the relevance of any neuron in a layer based on the relevances of all neurons in the subsequent layer:
$$ R_i^{(l)} = \sum_{j} \frac{a_i\, w_{ij}}{\sum_{i'} a_{i'}\, w_{i'j}}\, R_j^{(l+1)} $$
We can apply this backtracking relationship starting from the output layer and going through the network back to the input layer, hence identifying the relevance of each input feature.
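To make the propagation rule concrete, the following minimal NumPy sketch applies the above relation to a toy two-layer ReLU network; the function name lrp_backward, the toy network, and its weights are illustrative assumptions, not code from [38].

```python
import numpy as np

def lrp_backward(activations, weights, relevance_out, eps=1e-9):
    """Propagate relevance from the output layer back to the input.

    activations   : list of per-layer activation vectors [a^(0), ..., a^(L-1)]
                    (a^(0) is the input, a^(L-1) feeds the output layer).
    weights       : list of weight matrices; weights[l] maps layer l to layer l+1.
    relevance_out : relevance assigned to the output neurons (e.g., the winning
                    logit, zeroed elsewhere).
    """
    R = relevance_out
    # Walk the layers from the output back to the input.
    for a, W in zip(reversed(activations), reversed(weights)):
        z = a[:, None] * W                  # contribution of neuron i to neuron j: a_i * w_ij
        denom = z.sum(axis=0) + eps         # total contribution received by each neuron j
        R = (z / denom) @ R                 # redistribute R_j proportionally to the contributions
    return R                                # relevance of each input feature

# Toy example: a two-layer ReLU network with random weights.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
x = rng.normal(size=4)
h = np.maximum(x @ W1, 0)                   # hidden ReLU activations
logits = h @ W2
R_out = np.where(np.arange(2) == logits.argmax(), logits, 0.0)  # relevance = winning logit
print(lrp_backward([x, h], [W1, W2], R_out))
```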
Even though the LRP method was developed to be applied to CNNs for images (which explains the assumption of ReLU activation), it has been explored and modified to be applied to other types of NNs. Ding et al. applied LRP for machine translation tasks in attention-based encoder–decoder RNNs [
39]. Unlike the original method, they computed the relevance score for input vectors instead of single features. So, the relevance score for an embedding vector corresponds to the sum of the relevance scores of its features. Furthermore, they proposed to ignore the non-linear activation functions during LRP computations, based on the assumption that the choice of non-linear functions is irrelevant for LRP [
38].
The application of LRP to transformers requires coping with three more issues: different activation functions, skip connections, and matrix multiplications. Indeed, comparing a CNN architecture with the one defined in Equations (1)–(3) allows us to spot many differences; first of all, the matrix multiplication within the attention mechanism and the additive term used to implement the skip connections. Following the implementation of [39], Voita et al. presented Partial-LRP to identify the most relevant heads in a transformer-based model and prune the least relevant ones [40]. In applying LRP to their models, they considered a value $R_{u \leftarrow v}$ as the relevance of neuron $u$ for neuron $v$, which can be defined as
$$ R_{u \leftarrow v} = \sum_{z \in \mathrm{OUT}(u)} w_{u,z}\, R_{z \leftarrow v}, $$
where $\mathrm{OUT}(u)$ is the set of nodes directly connected to $u$ in the next layer, and $w_{u,z}$ is the weight ratio that measures the contribution of $u$ to $z$, defined as
$$ w_{u,v} = \frac{W_{u,v}\, a_u}{\sum_{u' \in \mathrm{IN}(v)} W_{u',v}\, a_{u'}}, $$
where $W_{u,v}$ is the weight learned between neurons $u$ and $v$, and $\mathrm{IN}(v)$ is the set of nodes directly connected to $v$ in the previous layer.
Chefer et al. in [41] proposed a variation of LRP where the issue of different activation functions is settled by considering only the elements that turn out to be non-negative. Their analysis of the other two issues shows that both operations can be seen as relevance propagation through two different tensors instead of a single one. So, they expanded the definition of relevance so that the relevance of an operation involving two tensors $u$ and $v$ is redistributed to both of them in proportion to their contribution to the output. Finally, they introduced a normalization step that rescales the two resulting relevance maps so that their total equals the relevance of the output, preserving the conservation property of LRP. Even though this relevance score could be used to provide explanations for each attention layer (as in standard LRP), they used it as a building block for a different score that combines LRP and Attention Rollout, as described in more detail in
Section 5.5.1.
In order to improve the distinction between the positive and negative relevance of neurons, Nam et al. in [
42] proposed a variation of LRP called Relative Attributing Propagation (RAP), which normalizes relevances before propagating positive and negative relevances separately in each layer.
Since the results of LRP are class-independent, several methods have tried to make them class-specific. Gu et al. in [43] presented Contrastive Layer-wise Relevance Propagation (CLRP), which computes LRP both for the class of interest and for the aggregation of all the other classes, keeping only the positive differences between the two sets of relevances obtained.
In image classification, one of the most interesting aspects of prediction is understanding which image features were most influential in predicting the class. To estimate this type of influence, Zhou et al. introduced the Class Activation Mapping (CAM) technique in [
44]. CAM considers a CNN with a global average pool just before the last fully connected layer (which could employ, e.g., a softmax function for classification or a ReLU activation function for regression). Its applicability is quite limited, given the large number of architectures other than CNNs that have been proposed in the literature. The method estimates the score of each area of the input image by multiplying the activation of each filter of the last convolutional layer (before the global average pool) by the weight learned between the average of that filter and the output neuron representing the class. This multiplication outputs, for each filter, a matrix with the same size as the last convolutional layer; summing them provides a relevance matrix for a single output class. The last step consists of visualizing this relevance matrix over the original input, resizing it (usually with some form of interpolation) to the same size as the original input image. Formally, given the last convolutional layer of a CNN architecture with the activation matrix $A$, composed of $F$ filters of size $n \times m$, followed by a global average pool and a vector representing the output classes connected by a weight matrix $W$, we compute the relevance score $S_c(x,y)$ for a super pixel at position $(x,y)$ for the class $c$ as
$$ S_c(x, y) = \sum_{f=1}^{F} W_{f,c}\, A_f(x, y). $$
A super pixel is a cell of the activation matrix obtained after the resolution reduction performed by the convolutional filters, and it basically corresponds to a patch of the input image.
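As an illustration, the snippet below sketches the CAM computation for given activations and classifier weights; the function name class_activation_map, the toy shapes, and the nearest-neighbour resizing are our own simplifying assumptions (the original work uses smoother interpolation).

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx, out_size=None):
    """CAM relevance map for one class.

    feature_maps : array of shape (F, H, W), activations of the last conv layer.
    fc_weights   : array of shape (F, C), weights of the fully connected layer
                   that follows the global average pool.
    class_idx    : index c of the class to explain.
    """
    w_c = fc_weights[:, class_idx]                      # one weight per filter
    cam = np.tensordot(w_c, feature_maps, axes=1)       # sum_f W_{f,c} * A_f(x, y) -> (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                                # normalize to [0, 1] for visualization
    if out_size is not None:                            # naive nearest-neighbour upsampling
        rows = np.linspace(0, cam.shape[0] - 1, out_size[0]).astype(int)
        cols = np.linspace(0, cam.shape[1] - 1, out_size[1]).astype(int)
        cam = cam[np.ix_(rows, cols)]
    return cam

# Toy example: 8 filters of size 7x7, 10 output classes.
rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))
W = rng.normal(size=(8, 10))
heatmap = class_activation_map(A, W, class_idx=3, out_size=(224, 224))
print(heatmap.shape)  # (224, 224)
```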
A different idea was proposed by Ferrando et al. in [
45], with a method called ALTI (Aggregation of Layer-wise Token-to-Token Interactions). The method computes the contribution of each input component of a transformer block to each output of the block. Roughly speaking, the idea is to measure how far the transformed representation of the $j$-th input token (from Equation (2)) is from the $i$-th output of the block, and to turn this distance into the contribution of the $j$-th component to the $i$-th output. All the matrices composed using these contributions are then combined using the same rules as Attention Rollout (explained in
Section 5.2).
Focusing on the differences appearing in the output due to different inputs, Li et al. in [
46] proposed a method (subsequently) called Input Erasure, which masks part of the input with 0s to measure the contribution (i.e., the relevance) of that part to the output; the masked part can be as small as a single embedding dimension.
A simple approach was proposed by Kim et al. in [
47] with a method called Concept Activation Vectors (CAVs). They considered the activations of a layer for many input samples, both for the target class and for the non-target class (in a binary problem). Afterward, they trained a linear model on these activations to distinguish between the two classes; the linear model's weights serve as the relevance scores of the features for the class.
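A minimal sketch of this idea, assuming pre-extracted layer activations and using a scikit-learn logistic regression as the linear model; all names and data below are illustrative stand-ins, not the original implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical layer activations collected for samples of the target class and for
# non-target samples; shapes are (n_samples, n_hidden_units).
rng = np.random.default_rng(0)
acts_target = rng.normal(loc=0.5, size=(100, 64))
acts_other = rng.normal(loc=0.0, size=(100, 64))

X = np.vstack([acts_target, acts_other])
y = np.array([1] * len(acts_target) + [0] * len(acts_other))

# A linear probe separating the two sets of activations: its weight vector is the
# Concept Activation Vector, and each coefficient scores a hidden unit's relevance.
probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_.ravel()
print("top-5 most relevant units:", np.argsort(-np.abs(cav))[:5])
```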
Muhammad and Yeasin in [
48] proposed a method called Eigen-CAM, inspired by CAM, that is based on Singular Value Decomposition (SVD). They combined the weight matrices of the first k convolutional layers, multiplying them by the input image matrix; the saliency map is the projection of the resulting matrix onto its first eigenvector.
Building on the techniques described so far, many papers have applied them in different contexts, often employing several of them at the same time in search of the best one.
For example, Mishra et al. in [
49] compared different methods (LRP and perturbation methods like LIME and SHAP, described in
Section 5.4) to explain models for hate speech detection. Instead of straightforwardly evaluating models, Thorn Jakobsen et al. in [
50] turned their attention to the datasets employed to evaluate explainability methods, proposing new datasets and using LRP and Attention Rollout (see
Section 5.2). Other authors focused on LRP for several purposes. Yu and Xiang in [
51] proposed a model for neural network pruning, visualizing relevances using LRP. Chan et al. proposed a new method to perform early crop classification by exploiting LRP in [
52].
CAM was instead the method chosen in [
53] to explain remote sensing classification performed through a transformer-based architecture.
The ALTI method was instead chosen and slightly modified in [
54] by Ferrando et al., who proposed a variant called ALTI-logit, where each token contribution is multiplied by the projection matrix located just before the final softmax function in the network. They further proposed Contrastive ALTI-logit, which measures the difference between two alternative output tokens by subtracting the ALTI-logit results obtained for each of them. They also compared their methods with the Contrastive Gradient Norm and Contrastive InputXGradient.
CAVs were instead applied by Madsen et al. in [
55] on an EEG classification model.
A comparison for (multi-modal) transformers was carried out in [
56], namely between Optimal Transport, which considers activations of different input types, and Label Attribution, which is a variation of TMME (see
Section 5.5.4). An even wider comparison was carried out by Hroub et al. in [
57], where different models were employed for pneumonia and COVID-19 detection from X-rays; the set of methods included Grad-CAM, Grad-CAM++, Eigen-Grad-CAM, and AblationCAM to produce saliency maps.
Another group of papers focused on visualization techniques. Alammar in [
58] presented a tool (Ecco) that provides different visualization techniques for transformers. Each of them can be classified into one of two classes: Gradient × Input or functions of the activations (in the latter case, dimensionality reduction over the responses is also performed). Van Aken et al. in [
59] presented VisBERT, a visualization tool for BERT-like architectures, which is an activation-based method (they used the responses of each layer) followed by dimensionality reduction techniques (t-SNE, PCA, and ICA) to project input tokens on a 2D plane.
Gao et al. in [
60] proposed a new architecture for table explanation, supplying three different methods of explanation: local, global, and structural. The local explanation consists of computing the output embeddings of the text and extracting many subsequences, each represented as the mean of the embeddings that compose it minus the embedding of the [CLS] (classification) token; this matrix is then multiplied by the weight matrix of the last layer, just before the sigmoid activation, and the results are the relevance scores of the subsequences. The global explanation consists of computing the cosine similarity between the [CLS] embeddings of each sample in the dataset and assigning a relevance score to each of them accordingly. The structural explanation consists of building a graph from the dataset, computing node embeddings, multiplying each [CLS] embedding by those of its neighbors, normalizing, multiplying again by the neighbors' embeddings, and finally summing the scores of the neighbors and using the results as relevance scores.
5.2. Attention-Based Methods
Since the introduction of the attention mechanism in [
16], attention weights have been one of the go-to indicators to estimate explanations. In [
61], Abnar and Zuidema introduced the Attention Rollout and Attention Flow techniques. They share the same ground assumptions but differ in how the information flow through the neural network is modeled. The shared assumption is that the raw attention of the last layer alone cannot be considered a proxy for explanation; the attention in the earlier layers must also be taken into account to measure the contribution of each token to the result. They also underlined the importance of the residual connections with respect to the information flow, so instead of using the raw attention weights, they augmented the attention weight matrix of a layer $l$ as
$$ \hat{A}^{(l)} = \tfrac{1}{2}\left(A^{(l)} + I\right), $$
where $A^{(l)}$ is the average of the attention weight matrices over all the heads of layer $l$ of the transformer and $I$ is the identity matrix.
In the Attention Rollout technique, a chain of cumulative attention matrices is formed by recursively multiplying the augmented attention matrix $\hat{A}^{(l)}$ of the $l$-th layer by the cumulative matrix of the layers below it:
$$ \tilde{A}^{(l)} = \hat{A}^{(l)}\, \tilde{A}^{(l-1)}, \qquad \tilde{A}^{(1)} = \hat{A}^{(1)}. $$
The result of the multiplication can be easily shown as a heatmap of the relevance of each token in the input with respect to each token in the output.
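A minimal NumPy sketch of Attention Rollout under the above formulation; the attention tensor here is randomly generated for illustration, and attention_rollout is a hypothetical helper name rather than code from [61].

```python
import numpy as np

def attention_rollout(attentions):
    """Attention Rollout over a stack of per-layer attention matrices.

    attentions : array of shape (L, H, T, T) — attention weights of H heads
                 over T tokens for each of the L layers.
    Returns a (T, T) matrix whose entry (i, j) scores the contribution of
    input token j to output position i.
    """
    T = attentions.shape[-1]
    rollout = np.eye(T)
    for layer_att in attentions:                       # bottom layer first
        A = layer_att.mean(axis=0)                     # average over heads
        A_aug = 0.5 * A + 0.5 * np.eye(T)              # account for the residual connection
        A_aug /= A_aug.sum(axis=-1, keepdims=True)     # keep rows normalized
        rollout = A_aug @ rollout                      # chain with the layers below
    return rollout

# Toy example: 4 layers, 8 heads, 10 tokens of random (row-normalized) attention.
rng = np.random.default_rng(0)
att = rng.random((4, 8, 10, 10))
att /= att.sum(axis=-1, keepdims=True)
print(attention_rollout(att)[0])  # contributions of each input token to output position 0
```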
In the Attention Flow technique, the neural network is seen as a graph whose edges are weighted by the attention weights of the pertaining layer. Considering the weights as capacities, the input tokens as source nodes (one at a time), and the output tokens as sink nodes (again one at a time), we can compute the max flow of the network and consider it as the relevance score for the (source, sink) pair.
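A possible sketch of Attention Flow using networkx to compute the max flow between one input token and one output position, assuming head-averaged, residual-augmented attention matrices as above; the graph construction details are our own simplification.

```python
import networkx as nx
import numpy as np

def attention_flow(attentions, source_token, sink_token):
    """Max flow from an input token to an output token when the augmented
    attention weights are treated as edge capacities.

    attentions : array of shape (L, T, T) — head-averaged attention per layer,
                 already augmented with the residual connection as in Rollout.
    """
    L, T, _ = attentions.shape
    G = nx.DiGraph()
    for l in range(L):
        for i in range(T):            # node (l+1, i) attends over nodes (l, j)
            for j in range(T):
                G.add_edge((l, j), (l + 1, i), capacity=float(attentions[l, i, j]))
    flow_value, _ = nx.maximum_flow(G, (0, source_token), (L, sink_token))
    return flow_value                  # relevance score for the (source, sink) pair

rng = np.random.default_rng(0)
att = rng.random((3, 6, 6))
att = 0.5 * att / att.sum(-1, keepdims=True) + 0.5 * np.eye(6)   # residual augmentation
print(attention_flow(att, source_token=2, sink_token=0))
```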
When we come to the application of attention-based techniques, we can recognize three streams: (1) the papers proposing the usage of attention weights; (2) the papers using the Attention Rollout that we described above; and (3) the papers exploiting visualization techniques for attention weights.
In the first group, we find Renz et al., who proposed two different models for route planning in complex environments using the sum of attention weights as relevance scores [
62]. Feng et al. proposed a model for early stroke mortality prediction, using attention weights as relevance scores [
63]. A more complex function of attention weights was employed by Trisedya et al. to explain the output in knowledge graph alignment [
64]. Applications in the medical field were considered by Graca et al., who proposed a framework for Single Nucleotide Polymorphisms (SNPs) classification, using attention weights to explain the classification [
65]; by Kim et al., who used attention weights to score text in input and explain decisions taken by a model dedicated to medical codes prediction [
66]; and by Clauwaert et al., who focused on automatic genomics transcription, analyzing the attention weights of the trained model to prove the specialization of each head with respect to some input feature [
67]. Sebbaq and El Faddouli proposed a new architecture to perform a taxonomy-based classification of e-learning materials, using attention weights for explainability [
68]. In their model for sequential recommendation, Chen et al. also relied on attention weights for explainability [
69]. An aggregation of attention weights was employed by Wantiez et al. to explain the results of their architecture for visual question answering in autonomous driving [
70]. The context considered by Ou et al. to use attention weights was instead next-action prediction in reinforcement learning [
71]. An aggregation over all attention weights of all heads in all layers was employed by Schwenke et al. to process time series data using symbolic abstraction [
72,
73]. Bacco et al. trained a transformer to perform sentiment analysis, using a function of the attention weights to select the sentences in the input that best justify the classification [
74]. The same subject was more extensively dealt with in [
75]. Humphreys et al. proposed an architecture for predicting defects, using the sum of attention weights over all layers [
76]. Attention weights were employed for searching in a transformer-based model dedicated to multi-document summarization in [
77].
The following six papers used the Attention Rollout technique. Di Nardo et al. proposed a transformer-based architecture for visual object-tracking tasks, using Attention Rollout for explainability [
78]. Cremer et al. tested an architecture on 3 datasets for drug toxicity classification [
79]. A variation of Attention Rollout was employed by Pasquadibisceglie et al. to generate heatmaps in a framework for next-activity prediction in process monitoring [
80]. Attention Rollout was employed in conjunction with Grad-CAM (see
Section 5.3) by Neto et al. to detect metaplasia in upper gastrointestinal endoscopy [
81]. Both Attention Rollout and LRP were tested by Thorn Jakobsen et al. on new datasets in [
50]. Finally, Komorowski et al. compared LIME, Attention Rollout, and LRP-Rollout (see
Section 5.5.1) for a model dedicated to COVID-19 detection from X-ray images [
82].
A large group of papers have addressed the use of the visualization of attention weights to help explain the outcome of transformers. Fiok et al. used both BertViz (a visualization tool for attention weights) and TreeSHAP (a variation of SHAP for tree-based models) [
83]. Again, Tagarelli et al. employed BertViz after training a BERT-like model on the Italian Civil Code [
84]. Lal et al. proposed a tool to explain transformers’ decisions by visualizing attention weights in many ways, including dimensionality reduction [
85]. Dai et al. adopted the visualization of the attention weights to explain a classification model to infer personality traits based on the HEXACO model [
86]. Gaiger et al. considered a general transformer-based model [
87]. Zeng et al. used the visualization of attention weights to explain a new framework for DNA methylation sites prediction [
88]. Textual dialogue interaction was instead the application of interest in [
89]. Ye et al. employed attention weights visualization to classify eye diseases from medical records [
90]. Neuroscience was the domain of application considered in [
91], where a new architecture for brain function analysis was proposed, and [
92], where a new architecture was proposed based on the graph representation of neurons from fMRI images to predict cognitive features of the brain. Sonth et al. trained a model for driver distraction identification [
93]. Kohama et al. proposed a new architecture for learning action recommendations in [
94]. Wang et al. proposed a new architecture for medical image segmentation, using visualizations of both attention weights and gradient values to explain the output [
95]. Kim et al. proposed an architecture for water temperature prediction [
96]. Monteiro et al. proposed an architecture to predict the 1D binding pocket and the binding affinity of drug–target interaction pairs in [
97]. Finally, Yadav et al. compared different explainability methods (LIME, SHAP, and Attention visualization) for hate speech detection models in [
98].
A related research stream considered the use of transformers for images, i.e., visual transformers. Ma et al. computed an indiscriminative score for each patch of an image as a function of the attention weights in all the layers [
99].
5.3. Gradient-Based Methods
Most of the training algorithms for neural networks are based on some form of gradient backpropagation from the loss function to the input. Many explainability methods are based on different functions of the gradient computed at different points in the neural network.
One of the first works using this approach was by Simonyan et al. in [
100]. They presented a method (subsequently) called saliency, which computes the gradient of the class score $f_c(x)$ with respect to the input $x$. Employing this gradient in the linear approximation of $f_c$ in a neighborhood of $x$ via its Taylor series is analogous to interpreting the coefficients of a linear regression model as a measure of feature importance. Furthermore, another work presented by Springenberg et al. in [101] introduced a class of methods that includes Guided Backpropagation. This method consists of a forward pass through a CNN to reach a selected layer and then, after zeroing all the features but one, a backward pass to the input (filtering out all the non-positive gradients) to compute the relevance of the feature. After those papers, Kindermans et al. in [102] proposed to scale the scores obtained with saliency by multiplying them by the input, in a method (subsequently) called InputXGradient.
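Both scores can be obtained with a few lines of autograd code; the sketch below uses a toy PyTorch model as a stand-in for any differentiable network, and the variable names are illustrative.

```python
import torch

# Toy differentiable model standing in for f_c(x); any torch module works the same way.
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))

x = torch.randn(1, 8, requires_grad=True)
target_class = 1

score = model(x)[0, target_class]
score.backward()                       # gradient of the class score w.r.t. the input

saliency = x.grad.abs()                # Simonyan et al.: |d f_c / d x|
input_x_gradient = x.grad * x          # Kindermans et al.: gradient scaled by the input
print(saliency, input_x_gradient)
```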
Extending the work in [
100], Yin and Neubig in [
103] computed the gradient of the difference between two different outputs for the same input (formally, $\nabla_x\left(f_{c_t}(x) - f_{c_f}(x)\right)$, where $c_t$ is the target output and $c_f$ is the contrastive, or foil, output) as the saliency score, and they called it the Contrastive Gradient Norm. They also used the same gradient to extend the work in [
102], calling it the Contrastive InputXGradient.
To generalize the CAM technique, Selvaraju et al. introduced in [
104] the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. The most important advantage of Grad-CAM over CAM is its compatibility with any CNN-based architecture. The method consists of backpropagating the output score until the last convolutional layer. The gradients propagated back are averaged to obtain a vector with a size equal to the number of filters. This vector is then used just like the learned weights in the CAM method, multiplying it by the activations of the filters in the last convolutional layer. As the last step, ReLU is applied over the result of the linear combination to filter out negative scores. Formally, given the last convolutional layer of a CNN architecture with the activation matrix $A$, composed of $F$ filters of size $n \times m$, and the final output for a class $c$ represented as $y^c$, we obtain:
$$ \alpha_f^{c} = \frac{1}{n \cdot m} \sum_{x,y} \frac{\partial y^c}{\partial A_f(x,y)}, \qquad S_c(x, y) = \mathrm{ReLU}\!\left( \sum_{f=1}^{F} \alpha_f^{c}\, A_f(x, y) \right). $$
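A minimal PyTorch sketch of this computation on a toy convolutional model; the model, the class index, and the bilinear resizing are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

# Tiny CNN standing in for any convolutional backbone.
conv = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
head = torch.nn.Linear(8, 10)          # classifier applied after global average pooling

x = torch.randn(1, 3, 32, 32)
A = conv(x)                            # activations of the last convolutional layer (1, F, H, W)
A.retain_grad()
logits = head(A.mean(dim=(2, 3)))      # global average pool + linear classifier
class_idx = 5
logits[0, class_idx].backward()        # backpropagate the class score to the activations

alpha = A.grad.mean(dim=(2, 3))        # one averaged gradient per filter
cam = F.relu((alpha[:, :, None, None] * A).sum(dim=1))   # ReLU(sum_f alpha_f * A_f)
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]
print(cam.shape)                       # (32, 32) saliency map over the input
```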
Grad-CAM++, presented in [105], is a variation of Grad-CAM in which the ReLU is moved to the partial derivatives, and a different coefficient is used for each combination of position, filter, and class. These coefficients are computed as a function of the gradients backpropagated from the last layer of the network (before the activation function).
We can now see the papers employing one or more of the techniques described so far. Grad-CAM alone was employed by Sobahi et al., who proposed a model to detect COVID-19 by using cough sound recordings [
106]; Thon et al., who proposed a model to perform a three-class severity classification of COVID-19 from chest radiographs [
107]; and Vaid et al., who trained a model for ECG 2D representation classification [
108]. Wang et al. proposed a transformer-based architecture for medical 3D image segmentation, using Grad-CAM++ in [
109].
More often we find Grad-CAM employed with other techniques in a comparative fashion. Wollek et al. compared TMME and Grad-CAM for pneumothorax classification from chest radiographs [
110]. Neto et al. employed Grad-CAM and Attention Rollout to explain a model for metaplasia detection in upper gastrointestinal endoscopy [
81]. Thakur et al. compared LIME and Grad-CAM for plant disease identification from leaves images [
111]. Kadir et al. compared Soundness Saliency and Grad-CAM for image classification [
112]. Vareille et al. employed a host of methods (SHAP, Grad-CAM, Integrated Gradients, and Occlusion) for multivariate time series analysis [
113]. Hroub et al. compared different models for pneumonia and COVID-19 prediction from X-rays, using Grad-CAM, Grad-CAM++, Eigen-Grad-CAM, and AblationCAM to produce saliency maps [
57].
A wider selection, not including Grad-CAM, was employed in other papers. Cornia et al. proposed a method to explain transformers’ decisions in visual captioning by applying three different gradient-based methods (saliency, Guided Backpropagation, and Integrated Gradients) [
114]. Poulton et al. applied saliency, InputXGradient, Integrated Gradients, Occlusion, and GradientSHAP to explain the decisions of transformers concerning the automatic short-answer grading task [
115].
Finally, visualization techniques were considered by Alammar, who presented a tool (Ecco) to provide different visualization techniques for transformers [
58]. Wang et al. proposed a new architecture for medical image segmentation, using the visualization of both attention weights and gradient values to explain the output [
95].
As an example of the results that can be obtained with such methods, a visualization dedicated to image classification can be seen in
Figure 7. In this image, each pixel is colored according to the score assigned by the different methods (most of them already explained in this section) so as to compare them. This type of visualization, consisting just of a heatmap over the input image, is usually called a saliency map.
5.4. Perturbation-Based Methods
The perturbation approach identifies the relevance of a portion of the input by masking it and checking the consequences on the output.
Zeiler and Fergus in [
117] introduced an approach (subsequently) called Occlusion, where part of the input is masked with 0s and the difference in the output is measured. By sliding the occluded region over the input, we can use the differences in the output as a measure of the relevance of each masked part of the input.
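A simple NumPy sketch of this procedure, assuming a generic predict function that returns class probabilities; the patch size, stride, and toy model are arbitrary choices for illustration.

```python
import numpy as np

def occlusion_relevance(predict, x, patch=8, stride=8, target=0):
    """Slide a zero patch over the input and record the drop in the target score.

    predict : callable mapping an image of shape (H, W, C) to class probabilities.
    """
    base = predict(x)[target]
    H, W = x.shape[:2]
    relevance = np.zeros((H, W))
    for top in range(0, H - patch + 1, stride):
        for left in range(0, W - patch + 1, stride):
            occluded = x.copy()
            occluded[top:top + patch, left:left + patch] = 0.0     # mask this region
            drop = base - predict(occluded)[target]                # contribution of the region
            relevance[top:top + patch, left:left + patch] = drop
    return relevance

# Toy "model": the probability of class 0 grows with the mean brightness of the image.
predict = lambda img: np.array([img.mean(), 1 - img.mean()])
x = np.random.default_rng(0).random((32, 32, 3))
print(occlusion_relevance(predict, x).shape)   # (32, 32)
```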
Comparing neural networks and linear models, it is known that even though neural networks provide better results, linear models are much more easily explainable. Starting from this fundamental difference, Ribeiro et al. introduced Local Interpretable Model-agnostic Explanations (LIME) in [
118], trying to build a linear model that explains the decision taken by the neural network by lightly perturbing the input and measuring the difference in the output. Formally, given a probability function $f$ with respect to a class (the model we are trying to explain), a class of explainable models $G$ (it could be the class of linear models), a function $\pi_x$ that measures the proximity distance with respect to $x$, and a function $\Omega(g)$ that measures the complexity of a model $g$ of the class $G$, we try to find a model $M$ for the input $x$ defined as
$$ M(x) = \operatorname*{arg\,min}_{g \in G}\; \ell(f, g, \pi_x) + \Omega(g), $$
where the function $\ell$ returns a measure of how unfaithful the model $g$ is in approximating $f$ in the locality defined by $\pi_x$.
The definition is very general and could be arranged in many different ways. For the sake of clarity, we take the class of linear models as an example for $G$, defined as $g(z') = w_g \cdot z'$. Considering this type of model, we would define $\pi_x$ as
$$ \pi_x(z) = \exp\!\left( -\frac{D(x, z)^2}{\sigma^2} \right), $$
with the function $D$ defined as a distance function (e.g., cosine distance for text and L2 distance for images) and $\sigma$ as a weight factor. Furthermore, we would define the function $\ell$ as
$$ \ell(f, g, \pi_x) = \sum_{z, z' \in Z} \pi_x(z)\, \big( f(z) - g(z') \big)^2, $$
with $Z$ defined as the set of samples obtained by perturbing the initial input $x$.
It is important to notice that this type of technique guarantees a faithful local explanation but not a global one.
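A compact sketch of the LIME procedure under the linear-surrogate instantiation above, using binary feature masking as the perturbation and a scikit-learn ridge regression as the explainable model; the function name lime_explain and the toy black-box predict function are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict, x, n_samples=500, sigma=1.0, seed=0):
    """Fit a weighted linear surrogate around x.

    predict : callable mapping a feature vector to the probability of the class
              of interest (the black-box model f).
    Returns the surrogate's coefficients, one relevance score per feature.
    """
    rng = np.random.default_rng(seed)
    # Perturb x by randomly switching features off (binary interpretable representation).
    masks = rng.integers(0, 2, size=(n_samples, x.shape[0]))
    Z = masks * x                                             # perturbed neighbours of x
    y = np.array([predict(z) for z in Z])                     # black-box outputs
    dist = np.linalg.norm(Z - x, axis=1)                      # L2 distance to the original input
    pi = np.exp(-(dist ** 2) / sigma ** 2)                    # proximity kernel pi_x
    surrogate = Ridge(alpha=1.0).fit(masks, y, sample_weight=pi)
    return surrogate.coef_

# Toy black box: a fixed linear scorer passed through a sigmoid.
w_true = np.array([2.0, -1.0, 0.0, 0.5])
predict = lambda z: 1.0 / (1.0 + np.exp(-(z @ w_true)))
print(lime_explain(predict, np.array([1.0, 1.0, 1.0, 1.0])))
```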
Lundberg et al. introduced SHAP in [
119], a method based on the game-theoretic notion of Shapley values developed by Shapley in [120]. In their work, Lundberg et al. connected pre-existing methods (such as LIME and DeepLIFT) to Shapley values. They proved that LIME could return valid Shapley values after some variations. First of all, the perturbing method should introduce random masking over the features, replacing the missing ones with values sampled from a marginal distribution computed on the original input; the masking is represented by the function $h_x(z')$, which maps a binary mask $z'$ back to the original input space. The proximity distance should be defined as
$$ \pi_x(z') = \frac{M - 1}{\binom{M}{|z'|}\, |z'|\, (M - |z'|)}, $$
where $Z$ is the set of masked samples, $z' \in Z$, $|z'|$ is the number of unmasked features, and $M$ is the maximum number of unmasked features among the samples in $Z$. The model to be used should be a weighted linear regression model defined as
$$ g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i\, z'_i. $$
Finally, the loss function $\ell$ is defined as
$$ \ell(f, g, \pi_x) = \sum_{z' \in Z} \pi_x(z')\, \big( f(h_x(z')) - g(z') \big)^2. $$
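A rough sketch of the Kernel SHAP estimation implied by these definitions, approximating the masking distribution with a single background vector instead of sampling from the marginal; the function names and the toy model are illustrative assumptions, not the reference implementation.

```python
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression

def shap_kernel_weights(M, sizes):
    """SHAP kernel pi(z') as a function of the number of unmasked features."""
    return np.array([(M - 1) / (comb(M, int(s)) * s * (M - s)) for s in sizes])

def kernel_shap(predict, x, background, n_samples=2000, seed=0):
    """Approximate Shapley values with the SHAP kernel and a weighted linear model.

    predict    : callable mapping a feature vector to a scalar model output.
    background : vector used to replace masked features (stand-in for sampling
                 from the marginal distribution).
    """
    rng = np.random.default_rng(seed)
    M = x.shape[0]
    masks = rng.integers(0, 2, size=(n_samples, M))
    masks = masks[(masks.sum(1) > 0) & (masks.sum(1) < M)]    # kernel is undefined at 0 and M
    y = np.array([predict(np.where(m == 1, x, background)) for m in masks])
    weights = shap_kernel_weights(M, masks.sum(1))
    reg = LinearRegression().fit(masks, y, sample_weight=weights)
    return reg.coef_                                          # approximate Shapley values

# Toy model: a linear function, whose Shapley values are recovered up to sampling noise.
w_true = np.array([1.0, -2.0, 0.5, 0.0])
predict = lambda z: float(z @ w_true)
print(kernel_shap(predict, x=np.ones(4), background=np.zeros(4)))
```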
A different approach, called Anchors, was proposed by Ribeiro et al. in [
121]. They focused on finding a subset of input features (an anchor) that keeps the same output classification (with a high probability) even when the remaining features change.
Petsiuk et al. in [
122] presented a method called RISE (Randomized Input Sampling for Explanation), which is suitable for any image classification model. It generates many random masks that are applied to the input image. The probability of classification is measured for each masked image, and then the relevance map is composed as a weighted sum of the masks (with respect to the probabilities measured in the output) after applying a normalization with respect to how many times a pixel was in a mask.
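A simplified NumPy sketch of RISE with coarse random masks and nearest-neighbour upsampling (the original method uses smooth upsampling of randomly shifted masks); the toy predict function stands in for an image classifier and all names are illustrative.

```python
import numpy as np

def rise_saliency(predict, x, n_masks=500, grid=7, p_keep=0.5, seed=0):
    """RISE: average random binary masks weighted by the masked-input class score.

    predict : callable mapping an image (H, W, C) to the probability of the target class.
    """
    rng = np.random.default_rng(seed)
    H, W = x.shape[:2]
    saliency = np.zeros((H, W))
    coverage = np.zeros((H, W))
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < p_keep).astype(float)
        # Upsample the coarse mask to the input resolution (nearest neighbour for brevity).
        cell = (int(np.ceil(H / grid)), int(np.ceil(W / grid)))
        mask = np.kron(coarse, np.ones(cell))[:H, :W]
        score = predict(x * mask[..., None])
        saliency += score * mask
        coverage += mask
    return saliency / np.maximum(coverage, 1e-9)      # normalize by how often each pixel was kept

predict = lambda img: float(img.mean())               # toy score increasing with brightness
x = np.random.default_rng(0).random((28, 28, 3))
print(rise_saliency(predict, x).shape)                # (28, 28)
```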
Gupta et al. in [
123] proposed a method (subsequently) called Soundness Saliency, which consists of learning a matrix (with the same size as the input image) used to mask the original input, such that the expectation of the negative logarithm of the classification output probability is minimized. Unlike other methods, the values used to replace the masked pixels are taken from an image randomly picked from the training set. The saliency map corresponds to the learned masking matrix.
A variation of CAM, called AblationCAM, was proposed by Desai and Ramaswamy in [
124], also inspired by Grad-CAM but without using gradients. They predicted the probability of an output class for an input image, considering the activation matrix of the last layer, just before the classification layer. They then computed the relative drop in the output probability obtained by zeroing each activation (of each filter) in that layer and predicting the output probability again. After that, they computed a saliency map (reshaped to the size of the input image) composed of the activation of each position times the relative drop computed for the same position.
When we examine the papers reporting the application of perturbation-based methods, we see that the great majority use the two best-known methods, LIME and SHAP; in most cases, they use one of them alone.
As for LIME, Mehta et al. applied LIME to explain the decisions of BERT-like models in a hate speech detection task [
125]. Rodrigues et al. extended LIME to meta-embedding inputs in a model for a semantic textual similarity task [
126]. Janssens et al. employed LIME when comparing different models to detect rumors in tweets [
127]. Chen et al. compared different models to perform Patient Safety Events classification using a proprietary dataset [
128]. Collini et al. proposed a framework for online reputation and tourist attraction estimation, explaining the output with LIME [
129]. Finally, Silva and Frommholz employed LIME as a model to perform multi-author attribution [
130].
A similar-size group of papers have instead employed SHAP for explanation purposes. In the following, we report the task for which explainability was sought through SHAP. Upadhyay et al. proposed a new model for fake health news detection [
131]. Abbruzzese et al. proposed a new architecture for OCR anomaly detection and correction [
132]. Benedetto et al. proposed an architecture for emotional reaction prediction for social posts [
133]. Rizinski et al. proposed a framework to perform a lexicon-based sentiment analysis [
134]. Sageshima et al. proposed a method to classify donors with high-risk kidneys in [
135].
Then came several papers that compared different explainability approaches, including LIME and SHAP, either alone or in combination. Most of them considered gradient-based approaches. El Zini et al. employed LIME, Anchors, and SHAP when introducing a new dataset to evaluate the performances of different models for sentiment analysis [
136]. Lottridge et al. compared human annotations with the explanations provided by both LIME and Integrated Gradients within the scope of crisis alert identification [
137]. Arashpour et al. compared a wide range of explainability methods falling into the classes of perturbation-based methods and gradient-based methods (Integrated Gradients, Gradient SHAP, Occlusion, the Fast Gradient Sign Method, Projected Gradient Descent, Minimal Perturbation, and Feature Ablation) for waste categorization in images [
138]. Neely et al. compared LIME, Integrated Gradients, DeepLIFT, Grad-SHAP, and Deep-SHAP to measure their degree of concordance [
139]. Komorowski et al. compared LIME, Attention Rollout, and LRP-Rollout for a model to detect COVID-19 based on X-ray images [
82]. Thakur et al. used LIME and Grad-CAM in [
111] to compare different models for plant disease identification from leaf images. Tornqvist et al. proposed integrating SHAP and Integrated Gradients for automatic short-answer grading (ASAG) [
140]. Vareille et al. compared different explainability methods (SHAP, Grad-CAM, Integrated Gradients, Occlusion, and different variations of them dedicated to the task) for multivariate time series analysis [
113]. Mishra et al. compared LIME, SHAP, and LRP in explaining models for hate speech detection [
49]. For the same task, Yadav et al. compared LIME, SHAP, and Attention visualization in [
98]. Malhotra and Jindal used SHAP and LIME in models for depressive and suicidal behavior detection [
141]. Abdalla et al. employed LIME and SHAP when introducing a dataset to be used as a benchmark for human-written papers classification [
142]. Fiok et al. used both TreeSHAP (a variation of SHAP for tree-based models) and BertViz in [
83].
A rather smaller group of papers considered perturbation methods other than LIME or SHAP. Poulton et al. applied different methods (saliency, InputXGradient, Integrated Gradients, Occlusion, and GradientSHAP) to explain the decisions of transformers for the automatic short-answer grading task [
115]. Kadir et al. compared Soundness Saliency and Grad-CAM with respect to image classification [
112]. Tang et al. proposed a new method for explainability that consists of finding the relevance matrix minimizing the difference between the loss computed on the original image and that computed on the perturbed version obtained by masking the image with the relevance matrix; they applied this new technique to a new model for cancer survival analysis [
143].