Article

EvoDropX: Evolutionary Optimization of Feature Corruption Sequences for Faithful Explanations of Transformer Models

Biocomputing and Development Systems (BDS) Group, Computer Science and Information Systems (CSIS) Department, University of Limerick, V94 T9PX Limerick, Ireland
*
Authors to whom correspondence should be addressed.
Algorithms 2026, 19(3), 187; https://doi.org/10.3390/a19030187
Submission received: 4 February 2026 / Revised: 26 February 2026 / Accepted: 27 February 2026 / Published: 2 March 2026

Abstract

As deep learning models become increasingly integrated into critical decision-making systems, the need for explainable Artificial Intelligence (xAI) has become paramount to ensure transparency, accountability, and trust. Post hoc explainability methods, which analyse trained models to interpret their predictions without modifying the underlying architecture, have become increasingly important, especially in fields such as healthcare and finance. Modern xAI techniques often produce feature importance rankings that fail to capture the true causal influence of features, particularly in transformer-based models. Recent quantitative metrics, such as Symmetric Relevance Gain (SRG), which measures the area between the feature corruption performance curves of the Most Important Feature (MIF) and the Least Important Feature (LIF), provide a more rigorous basis for evaluating explanation fidelity. In this study, we first show that existing xAI methods exhibit consistently poor performance under the SRG criterion when explaining transformer-based text classifiers. To address these limitations, we introduce EvoDropX, a novel framework that formulates explanation as an optimisation problem. EvoDropX leverages Grammatical Evolution (GE) to evolve sequences of feature corruption with the explicit objective of maximising SRG, thereby identifying features that most strongly influence model predictions. EvoDropX provides interventional, input–output (behavioural) explanations and does not attempt to infer or interpret internal model mechanisms.
Through comprehensive experiments across multiple datasets (IMDb movie reviews (IMDB), Stanford Sentiment Treebank (SST-2), Amazon Polarity (AP)), multiple transformer models (Bidirectional Encoder Representations from Transformers (BERT), RoBERTa, DistilBERT), and multiple metrics (SRG, MIF, LIF, Counterfactual Conciseness (CFC)), we demonstrate that EvoDropX significantly outperforms all state-of-the-art (SOTA) xAI baselines, including Attention-Aware Layer-Wise Relevance Propagation for Transformers (AttnLRP), SHapley Additive exPlanations (SHAP), and Local Interpretable Model-agnostic Explanations (LIME), when evaluated using intervention-based faithfulness criteria. Notably, EvoDropX achieves a 74.77% improvement in SRG over the best-performing baseline on the IMDB dataset with the BERT model, with consistent improvements observed across all dataset-model pairs. Finally, qualitative and linguistic analyses reveal that EvoDropX captures both sentiment-bearing terms and their structural relationships within sentences, yielding explanations that are both faithful and interpretable.

1. Introduction

The rapid integration of Artificial Intelligence (AI) into critical domains such as healthcare, finance, and autonomous systems has increased the need for transparency and accountability. Although state-of-the-art (SOTA) transformer models [1] have significantly enhanced predictive performance, their inherent opacity has driven the demand for explainable Artificial Intelligence (xAI) techniques [2,3]. xAI aims to bridge this gap by generating human-interpretable explanations, allowing stakeholders to evaluate, validate, and, when necessary, challenge model decisions in high-stakes environments [4,5,6,7,8,9]. Furthermore, recent regulatory frameworks, such as the European Union (EU) AI Act [10], have introduced stringent transparency and accountability requirements, further accelerating the push for more robust xAI methodologies.
Post hoc xAI techniques revolve primarily around methods such as Local Interpretable Model-agnostic Explanations (LIME) [11], SHapley Additive exPlanations (SHAP) [5], saliency maps [12], and Layer-wise Relevance Propagation (LRP) [13]. These methods assign importance scores to individual features, identifying which inputs have the greatest influence on a model’s output. However, growing evidence points to significant limitations. For example, gradient-based techniques are highly sensitive to small input perturbations, leading to fragile and inconsistent explanations [14,15]. Similarly, SHAP often assumes feature independence, an unrealistic simplification that can undermine the reliability of the resulting attributions [16,17].
Additionally, xAI methods have drawn criticism from both theoretical and practical points of view [2,18,19,20,21], contributing to the phenomenon known as the disagreement problem, where different methods produce contradictory and inconclusive explanations [22]. These shortcomings undermine the trustworthiness of AI systems and raise ethical concerns, particularly in regulated applications.
Recent studies have proposed quantitative metrics, such as Symmetric Relevance Gain (SRG) [23], to address these challenges and evaluate the reliability of explainability methods. SRG quantifies the area between performance degradation curves obtained by corrupting the Most Important Feature (MIF) and the Least Important Feature (LIF). SRG offers an objective benchmark for assessing the faithfulness of explanations and comparing different xAI methods.
This study demonstrates that existing xAI methods exhibit subpar performance when explaining transformer model decisions on text-based tasks. This suggests that many current approaches fail to faithfully capture transformer behaviour. Building on this insight, we introduce EvoDropX, a novel behavioural interpretability framework that provides more faithful explanations. EvoDropX employs Grammatical Evolution (GE) [24] to optimise the order of feature removal with the specific objective of maximising SRG, yielding a more precise understanding of feature importance. Unlike gradient-based methods, which rely on differentiable loss functions and are susceptible to local minima [25], GE employs a global search strategy. EvoDropX explores diverse candidate solutions using evolutionary operators such as mutation and crossover, effectively navigating the complex and discontinuous search space. Additionally, integrating a context-free grammar (CFG) enables the incorporation of domain-specific constraints, steering the evolutionary process toward more meaningful dropout sequences.
Our experimental evaluation on real-world datasets shows that EvoDropX consistently outperforms existing xAI methods, achieving an average 74.77% increase in SRG. This substantial improvement highlights the potential of using evolutionary algorithms (EAs) to generate more faithful and reliable explanations, enhancing transparency in AI systems.
The remainder of this paper is organised as follows: Section 2 reviews the pertinent literature on xAI and discusses the shortcomings of current methodologies. Section 3 introduces the EvoDropX framework, details its implementation and the experimental setup. Section 4 outlines the comparative results obtained. Finally, Section 5 concludes with a discussion of our findings and outlines potential avenues for future research. The code for the EvoDropX framework architecture and experimental evaluation is publicly available on our GitHub repository (https://github.com/DhirajLERO/EvoDropX (accessed on 26 February 2026)). All experiments were conducted using Python 3.11.

2. Related Work

2.1. Post Hoc xAI Methods for Transformer Models

Explaining complex machine learning (ML) architectures requires identifying the salient features that influence decision-making, often under the simplifying assumption that feature importances are mutually independent. Existing approaches to explaining transformer models fall into four main categories: attention-based, activation-based, gradient-based, and perturbation-based techniques [26].
Within this landscape, explainability research divides into two complementary paradigms: mechanistic and behavioural interpretability. Mechanistic interpretability seeks to reverse engineer how model internals, such as attention heads, neurons, or intermediate representations, encode specific functions or concepts, revealing the model’s inner computational structure [27]. Behavioural interpretability, by contrast, focuses on observable input–output behaviour under controlled perturbations, quantifying how specific input features causally affect predictions without probing internal mechanisms. Although each category offers unique insights into transformer behaviour, they share inherent limitations that undermine the reliability and trustworthiness of explanations.
Transformers rely heavily on attention mechanisms, making attention-based explainability a natural starting point. Methods such as attention rollout and attention flow trace information pathways by aggregating attention weights across layers [28]. However, multiple studies have shown that attention scores do not always align with feature importance and may not accurately reflect a model’s reasoning process [29,30].
Activation-based explainability methods analyse neuron activations to trace how input features influence model predictions. A key technique in this category is LRP, which distributes relevance scores from the output layer back to the input layer. However, these methods often assume a direct correlation between activation magnitudes and feature importance, a relationship that may not always hold in complex transformer architectures. Additionally, non-linearity and skip connections in transformers make the direct application of activation-based relevance propagation challenging without significant modifications [26].
Gradient-based xAI techniques compute the gradients of model outputs with respect to inputs to identify influential features. Notable methods include saliency maps, Integrated Gradients (IG), and Gradient-weighted Class Activation Mapping (Grad-CAM) [12,31,32]. While these approaches can highlight relevant input regions, they are prone to gradient saturation, in which gradients diminish in deep layers and fail to reflect the true importance of a feature. Additionally, gradient-based methods are highly sensitive to input perturbations, often producing inconsistent explanations for similar inputs [15].
Perturbation-based explainability techniques evaluate feature significance by systematically modifying or masking input features and observing their impact on model predictions. Methods like LIME approximate model behaviour using simpler surrogate models, while SHAP applies game theory to quantify the contribution of each feature to a prediction. However, these techniques often rely on unrealistic assumptions, such as feature independence, leading to potentially erroneous attributions [16,17]. Moreover, perturbation-based methods can be computationally expensive, making them impractical for large language models [26].
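The occlusion idea these perturbation-based methods share can be illustrated with a minimal leave-one-out sketch (a toy scoring function stands in for a real transformer; all names here are illustrative, not the paper's implementation):

```python
def occlusion_importance(tokens, score_fn, mask="[MASK]"):
    """Score each token by the confidence drop observed when it is masked."""
    base = score_fn(tokens)
    drops = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        drops.append(base - score_fn(occluded))  # large drop = important token
    return drops

# Toy "model": confidence is high only while the word "great" is present.
def toy_score(tokens):
    return 0.9 if "great" in tokens else 0.1

scores = occlusion_importance(["the", "movie", "was", "great"], toy_score)
```

Even this toy version exposes the combinatorial cost: one forward pass per token, per sample, which is why perturbation-based methods scale poorly to large models.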

2.2. xAI Evaluation

Researchers have developed a comprehensive suite of techniques to assess the quality of xAI systematically [33]. Common approaches include pixel flipping, which iteratively masks input regions to quantify feature relevance [9,34,35,36,37], and sanity checks, which ensure attribution patterns remain stable under controlled variations [38,39]. Axiomatic evaluations test whether explanations adhere to theoretical principles such as sensitivity or completeness [31,40,41]. Furthermore, synthetic benchmarks provide ground truth validation by allowing controlled evaluations of simulated data with predefined feature-outcome relationships [33,42,43,44].
Hedström et al. [37] further classify these evaluation methods into six distinct categories, as shown in Figure 1: robustness [6,45], localisation masks [42,46], complexity [41,47], randomisation [38,48], axiomatic measures [31,40], and faithfulness [41,45,47,49].
Among these categories, faithfulness is particularly critical, as it measures how accurately an explanation reflects the model’s true predictive behaviour. Ensuring that identified features genuinely influence the output is essential for trust in xAI methods. A widely used approach to assess faithfulness is input perturbation experiments [37,50], where the most important tokens in the input domain are iteratively replaced with a baseline value. If the attribution method correctly identifies key features, the model’s confidence in its prediction should decrease sharply. Conversely, perturbing the least relevant tokens first should have minimal impact, gradually decreasing confidence.
One of the earliest implementations of feature corruption, commonly known as pixel flipping, was introduced by Samek et al. [50], who proposed the Area Over the Perturbation Curve (AOPC) metric to measure the area over the most-relevant-first perturbation curve. While AOPC remains widely used, it has notable limitations, including sensitivity to baseline selection and the risk of introducing out-of-distribution inputs [51]. To address these challenges, Brocki and Chung [52] and Blücher et al. [23] developed an improved metric, SRG, which quantifies the area between the least- and most-relevant-order perturbation curves. This refinement offers a more robust and reliable measure of faithfulness in xAI evaluations [23,52].

2.3. Evolutionary Computation and Grammatical Evolution

Inspired by Darwinian principles of natural selection, EAs—such as Genetic Programming (GP) [53] and GE—iteratively refine solutions using genetic operators such as crossover and mutation. By emphasising the “survival of the fittest”, these algorithms drive optimisation and improve the quality of the solution over successive generations.
From a theoretical perspective, the convergence behaviour of EAs is well studied through runtime analysis, which quantifies the expected number of iterations (or fitness evaluations) required for an algorithm to discover the global optimum. These analyses provide theoretical guarantees on convergence speed and establish bounds on computational complexity.
For the simple $(1+1)$ Evolutionary Algorithm (EA), where a single parent produces a single offspring through mutation, classical results show that the expected runtime on OneMax for problem size $n$ with mutation rate $1/n$ is $O(n \log n)$ [54]. This provides a baseline understanding of single-parent evolutionary dynamics. However, the convergence behaviour becomes considerably more complex when multi-parent populations are introduced.
The theoretical understanding of multi-parent population-based EAs has advanced significantly through the work of Antipov et al. [55], which resolved the long-standing open problem of characterising the asymptotic runtime of the $(\mu + \lambda)$ EA for arbitrary parent population size $\mu$ and offspring population size $\lambda$. For the $(\mu + \lambda)$ EA with standard bit mutation (mutation rate $p = 1/n$), the expected number of iterations $T$ until the optimum is found satisfies
$$\mathbb{E}[T] = \Theta\!\left( \frac{n \log n}{\lambda} + \frac{n}{\lambda/\mu} + \frac{n \log^{+}\log^{+}(\lambda/\mu)}{\log^{+}(\lambda/\mu)} \right),$$
where $\log^{+} x := \max\{1, \log x\}$ for all $x > 0$. This bound is asymptotically tight for all values of $\mu$ and $\lambda$.
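For intuition, the baseline (1+1) EA analysed above can be written in a few lines (a toy illustration on OneMax, not part of the EvoDropX pipeline; parameter names are illustrative):

```python
import random

def one_plus_one_ea(n, max_iters=20000, seed=42):
    """(1+1) EA on OneMax: flip each bit with probability 1/n and
    keep the offspring if it is at least as fit as the parent."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n)]
    best = sum(parent)                        # OneMax fitness: number of 1-bits
    for _ in range(max_iters):
        child = [b ^ (rng.random() < 1 / n) for b in parent]
        if sum(child) >= best:
            parent, best = child, sum(child)
        if best == n:                         # global optimum reached
            break
    return best
```

The expected $O(n \log n)$ scaling means a 20-bit instance is typically solved within a few hundred evaluations, far below the iteration budget used here.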
While classical Genetic Algorithms (GAs) operate on fixed-length real-valued vectors or binary strings within predefined search spaces, many real-world optimisation problems demand the evolution of structured, syntactically valid solutions such as programs, mathematical expressions, or constraint sets. GE extends the theoretical and practical foundations of evolutionary algorithms to these domains by introducing a principled genotype-phenotype separation mediated by a formal grammar. In GE, individuals are represented as variable-length integer sequences (codons), which serve as genotypes. These genotypes are translated into phenotypes, which are executable programs or mathematical expressions, using a predefined grammar.
Evolution in GE occurs at the genotypic level, where genetic operators modify codons, while fitness evaluation takes place at the phenotypic level, assessing functionality against problem-specific criteria [56]. During evolution, crossover and mutation are applied to selected parents to generate offspring for the next generation. Crossover combines codon sequences from two parents to produce new offspring, while mutation alters individual codons within a single parent, introducing variation.
The context-free grammars (CFGs) used in GE are typically expressed in Backus-Naur Form (BNF) and defined by the tuple ⟨N, T, P, S⟩, where
  • N represents the set of non-terminals, which serve as intermediary structures in the mapping process.
  • T represents the set of terminals, the elements that appear in the final program.
  • P denotes the production rules that define how non-terminals can be expanded.
  • S is the starting symbol from which the derivation begins.
GE inherits the convergence properties of its underlying evolutionary mechanism; the ( μ + λ ) runtime bounds thus apply directly to genotype evolution. Well-designed grammars further accelerate convergence by constraining the effective search space to the subset of problem-relevant derivations [57].
It is important to note that while the $(\mu + \lambda)$ runtime bounds provide theoretical guarantees for standard evolutionary search in bitstring spaces such as OneMax, the search space of EvoDropX is substantially more complex. Specifically, EvoDropX operates over permutations of token indices, corresponding to a factorial search space of size $n!$, which can be approximated via Stirling’s formula as
$$n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^{n},$$
highlighting the combinatorial explosion compared to $2^{n}$ for OneMax.
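The scale of this gap is easy to verify numerically (a quick sanity check, not taken from the paper):

```python
import math

n = 10
stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
ratio = math.factorial(n) / stirling   # Stirling slightly underestimates n!

# The permutation space dwarfs the bitstring space already at n = 20:
gap = math.factorial(20) / 2 ** 20
```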
Moreover, unlike OneMax, where each bit flip provides a clear fitness gradient, perturbations in EvoDropX (e.g., adjacent token transpositions) can increase or decrease the SRG score non-monotonically. GE helps navigate this challenge through grammar-guided movement constraints: at the level of a single token, only a limited set of permissible moves is allowed, and multiple sequential moves can be composed to explore the space effectively. By structuring the search in this way, GE implicitly partitions the factorial search space into smaller, more tractable subspaces, thereby improving the locality of search without eliminating global exploration.
In this work, GE is used as the optimisation engine for searching over structured feature-corruption sequences; the paper’s core contribution is the formulation of explanation generation as an optimisation problem, rather than the use of GE itself. GE was selected over alternative strategies such as GP, reinforcement learning, or heuristic-based methods due to its ability to enforce structured constraints through CFGs, its established success in symbolic search tasks, its explainable nature and its flexibility in handling variable-length feature sequences in a non-differentiable, combinatorial optimisation landscape. These properties make GE well-suited to evolving feature corruption sequences. Implementation details of the mapping process and its use for generating corruption sequences are provided in Section 3.

3. Methodology and Experiment

Figure 2 illustrates the overall architecture of EvoDropX. EvoDropX frames post hoc explanation as a constrained optimisation task: given an input and a model, it searches for a feature-corruption sequence that maximises SRG. To perform this search in a large, discrete space while preserving validity, EvoDropX uses GE as an optimisation engine. Our methodology consists of three key stages:
  • Automated Grammar Generation: We introduce a method for automatically generating a CFG for GE based on the input text.
  • GE-based Feature Corruption Sequence Generation: Using the generated grammar and input text, we evolve an optimal feature corruption sequence that maximises SRG.
  • Feature Attribution based on Generated Sequence: We compute feature importance using the generated sequence and the probability drop associated with the corruption sequence.

3.1. GE-Based Feature Attribution Generation

To generate an explanation for a given text, the required inputs are an ML model M (which, in this case, is a transformer-based binary classifier) and a text input x i , which can be of variable length. While we assume M is a binary classifier, the proposed methodology is, in principle, applicable to any ML model that outputs a class probability or score.

3.1.1. Automated Grammar Construction

The first stage involves dynamically constructing a CFG tailored to the input text’s token length. As shown in Listing 1, the generated grammar for a seven-token input defines rules for permuting the initial corruption order, which is assumed to follow the original token sequence in the sentence. The grammar operates hierarchically, with production rules structured to manipulate the corruption sequence through three fundamental components:
  • Feature Selection: The <token_choice> non-terminal specifies which token (e.g., index 0–6 for a seven-token text) is selected for reordering.
  • Movement Direction: The <direction> rule determines whether the chosen token shifts forward (later in the sequence) or backward (earlier in the sequence).
  • Step Size: The <steps> non-terminal defines how many positions the token moves (e.g., 0–6 steps).
The grammar dynamically scales with the input’s token length: for an N-token text, both the <token_choice> and <steps> ranges extend to N, ensuring adaptability across varying input sizes.
Listing 1. Sample CFG for input of token length 6.
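The dynamic construction described in this subsection can be sketched as a small builder (a hypothetical helper; the `move(...)` terminal form and rule names are illustrative stand-ins for the exact productions of Listing 1):

```python
def build_grammar(n_tokens: int) -> str:
    """Build a BNF grammar whose <token_choice> and <steps> ranges
    scale with the input's token length."""
    indices = " | ".join(str(i) for i in range(n_tokens))
    return "\n".join([
        "<start> ::= <line> | <line> ; <start>",
        "<line> ::= move(<token_choice>, <direction>, <steps>)",
        f"<token_choice> ::= {indices}",
        "<direction> ::= forward | backward",
        f"<steps> ::= {indices}",
    ])
```

Calling `build_grammar(7)` yields index ranges 0–6 for both `<token_choice>` and `<steps>`, matching the seven-token example above.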

3.1.2. Genotype to Corruption Sequence Mapping

In GE, candidate solutions are evolved at the genotypic level as variable-length integer strings (codons), but are evaluated at the phenotypic level after being mapped through a grammar. The genotype–phenotype mapping, therefore, acts as the bridge between evolutionary variation (mutation/crossover on codons) and the fitness evaluation.
The modulo operator in the grammar selects production rules during the mapping process as shown in Algorithm 1 and Example 1, ensuring that codons are mapped to valid grammatical structures. This grammar-based approach enables flexible, expressive search spaces tailored to specific problem domains.
Algorithm 1 Genotype-to-Phenotype Mapping in Grammatical Evolution
Require: Genotype $G = [g_1, g_2, \ldots, g_L]$, grammar $\mathcal{G} = (N, T, S, P)$
Ensure: Phenotype $\Phi$
1: Initialise derivation $D \leftarrow [S]$
2: Initialise codon index $i \leftarrow 1$
3: Initialise wrap count $w \leftarrow 0$
4: while a non-terminal exists in $D$ and $w < w_{\max}$ do
5:    Select the leftmost non-terminal $X \in D$
6:    Let $R_X = \{r_1, r_2, \ldots, r_{|R_X|}\}$ be the production rules for $X$
7:    Compute rule index $j = g_i \bmod |R_X|$
8:    Replace $X$ in $D$ with rule $r_j$
9:    $i \leftarrow i + 1$
10:   if $i > L$ then
11:      $i \leftarrow 1$
12:      $w \leftarrow w + 1$
13:   end if
14: end while
15: if no non-terminals remain in $D$ then
16:   $\Phi \leftarrow D$
17: else
18:   $\Phi \leftarrow$ Invalid Mapping
19: end if
20: return $\Phi$
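Algorithm 1 translates directly into Python (a minimal sketch; the toy grammar below is illustrative, with each production given as a list of symbols):

```python
def map_genotype(genotype, grammar, start="<start>", max_wraps=3):
    """Expand the leftmost non-terminal using codon mod rule-count,
    wrapping the codon string at most max_wraps times (Algorithm 1)."""
    derivation = [start]
    i, wraps = 0, 0
    while wraps < max_wraps:
        positions = [k for k, s in enumerate(derivation) if s in grammar]
        if not positions:                        # no non-terminals left
            return " ".join(derivation)          # valid phenotype
        k = positions[0]                         # leftmost non-terminal
        rules = grammar[derivation[k]]
        chosen = rules[genotype[i] % len(rules)]  # modulo rule selection
        derivation[k:k + 1] = chosen
        i += 1
        if i >= len(genotype):                   # wrap the codon index
            i, wraps = 0, wraps + 1
    return None                                  # invalid mapping

toy_grammar = {
    "<start>": [["<line>"], ["<line>", ";", "<start>"]],
    "<line>":  [["move(", "<tok>", ",", "<dir>", ")"]],
    "<tok>":   [["0"], ["1"], ["2"]],
    "<dir>":   [["forward"], ["backward"]],
}
phenotype = map_genotype([0, 1, 0], toy_grammar)
```

With genotype [0, 1, 0], the codon string wraps once before the derivation completes, producing a single movement instruction.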
As shown in Listing 1, the <start> rule recursively combines corruption instructions, enabling multiple tokens to be reordered within a single sequence. For example, a derivation like <start> ::= <line> | <line> ; <start> allows iterative modifications, such as first shifting Token 5 backwards by two steps, followed by moving Token 1 forward by three steps. Example 2 demonstrates this complete workflow, detailing the transformation from a genotype to the final corruption sequence.
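The derivation described above, first shifting Token 5 backwards by two steps and then Token 1 forward by three, can be applied to an initial corruption order as follows (an illustrative helper, not the paper's exact implementation):

```python
def apply_moves(order, moves):
    """Apply (token, direction, steps) reordering instructions
    to an initial corruption order."""
    order = list(order)
    for token, direction, steps in moves:
        idx = order.index(token)
        target = idx + steps if direction == "forward" else idx - steps
        target = max(0, min(len(order) - 1, target))  # clamp to valid range
        order.pop(idx)
        order.insert(target, token)
    return order

# Initial order follows the original token sequence of a 7-token input.
sequence = apply_moves(range(7), [(5, "backward", 2), (1, "forward", 3)])
```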

3.1.3. Optimisation Objective

We formulate this as a maximisation problem to determine the optimal sequence of feature corruption that maximises SRG. We employ SRG as the fitness criterion to evaluate the suitability of individuals for selection in the next generation. For clarity and efficiency, we illustrate our approach using a sentiment classification task, specifically focusing on identifying the most important features for positive sentiment in a given transformer model.
Given a text of token length N, let
  • $x_m \in [0, 1]$ and $x_l \in [0, 1]$ denote the fractions of features corrupted in descending and ascending order of relevance, respectively.
  • $P_{MIF}(x_m)$ and $P_{LIF}(x_l)$ represent the model’s output probability after corrupting the first $x_m \times 100\%$ and $x_l \times 100\%$ of features, respectively.
The areas under the MIF and LIF curves are then defined as
$$\mathrm{MIF}_{area} = \int_{0}^{1} P_{MIF}(x_m)\, dx_m$$
$$\mathrm{LIF}_{area} = \int_{0}^{1} P_{LIF}(x_l)\, dx_l$$
The SRG is computed as the area between the MIF and LIF curves, as defined in Equation (3):
$$\mathrm{SRG} = \mathrm{LIF}_{area} - \mathrm{MIF}_{area} = \int_{0}^{1} P_{LIF}(x_l)\, dx_l - \int_{0}^{1} P_{MIF}(x_m)\, dx_m$$
so that a faithful corruption ordering, which collapses confidence quickly under MIF corruption while preserving it under LIF corruption, yields a high SRG.
We corrupt each feature by setting its embedding vector to zero, following the approach of Achtibat et al. [58].
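In practice, the integrals are approximated from discrete corruption curves. The sketch below assumes `mif_curve` and `lif_curve` store the model's confidence after corrupting 0, 1, …, N tokens, and takes SRG as the LIF area minus the MIF area so that a faithful ordering scores higher:

```python
def area_under(curve):
    """Trapezoidal area under a curve sampled uniformly on [0, 1]."""
    n = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n

def srg(mif_curve, lif_curve):
    """Area between the LIF and MIF corruption curves."""
    return area_under(lif_curve) - area_under(mif_curve)

# A faithful ordering: MIF corruption collapses confidence quickly,
# while LIF corruption barely moves it until the end.
score = srg([1.0, 0.2, 0.0, 0.0], [1.0, 1.0, 0.9, 0.0])
```

This quantity serves directly as the fitness value assigned to each evolved corruption sequence.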

3.1.4. Feature Importance Calculation

To quantify the significance of each feature, we utilise the MIF corruption sequence generated by EvoDropX. This sequence prioritises features such that perturbing them earlier results in the largest cumulative drop in model confidence. We compute importance scores using Sequential Redistribution with Equal Splitting (SR-ES), which distributes the probability drop at each step among all features removed up to that point.
Let σ = [ h 1 , h 2 , , h N ] denote the MIF corruption sequence for an input x with N tokens, where h k represents the k-th corrupted feature. Let x ( k ) be the perturbed input after corrupting the first k tokens in σ , with x ( 0 ) = x (original input) and x ( N ) = b (fully corrupted baseline). The importance score ϕ i for feature h i is then computed as
$$\phi_i = \sum_{k=i}^{N} \frac{f\left(x^{(k-1)}\right) - f\left(x^{(k)}\right)}{k},$$
where $f(x^{(k-1)})$ and $f(x^{(k)})$ are the model’s confidence scores before and after corrupting $h_k$ at step $k$. SR-ES splits the probability drop $\Delta P_k = f(x^{(k-1)}) - f(x^{(k)})$ at step $k$ into $k$ equal parts, assigning $\Delta P_k / k$ to each of the first $k$ features, as shown in Example 3. Thus, earlier features accumulate contributions from all subsequent corruption steps. To ensure comparability across inputs, scores are normalised by the total probability drop:
$$\hat{\phi}_i = \frac{\phi_i}{\sum_{j=1}^{N} \phi_j}$$
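The SR-ES scheme above can be implemented in a few lines (a sketch; `confidences` is assumed to hold $f(x^{(0)})$ through $f(x^{(N)})$ along the MIF sequence):

```python
def sres_importance(confidences):
    """Sequential Redistribution with Equal Splitting: the probability
    drop at step k is shared equally among the first k corrupted
    features, then scores are normalised by the total drop."""
    n = len(confidences) - 1
    phi = [0.0] * n
    for k in range(1, n + 1):
        drop = confidences[k - 1] - confidences[k]
        for i in range(k):                  # split Delta P_k into k parts
            phi[i] += drop / k
    total = sum(phi)
    return [p / total for p in phi] if total else phi

# Confidence falls 0.9 -> 0.5 -> 0.3 -> 0.1 along the MIF order:
weights = sres_importance([0.9, 0.5, 0.3, 0.1])
```

Because each feature accumulates shares from every later step, the first-corrupted feature always receives the largest score, consistent with its position in the MIF ordering.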

3.1.5. Evaluation

To systematically evaluate and compare explanation techniques for transformer models on text classification tasks, we employ a suite of complementary metrics that assess explanation faithfulness from multiple perspectives. Since longer sequences naturally inflate area-based metrics, all such measures are normalised by sentence length N to enable fair comparisons across varying input lengths. The evaluation metrics are
  • Regularised SRG: This is the primary faithfulness metric used in our study. It quantifies the area between the MIF and LIF curves, normalised by sentence length:
$$\mathrm{SRG}_{reg} = \frac{\mathrm{SRG}}{N} = \frac{1}{N} \left( \int_{0}^{1} P_{LIF}(x_l)\, dx_l - \int_{0}^{1} P_{MIF}(x_m)\, dx_m \right)$$
    Higher $\mathrm{SRG}_{reg}$ values indicate that the explanation method effectively distinguishes between important and unimportant features.
  • Regularised MIF: This metric measures the area under the MIF curve, normalised by sentence length:
$$\mathrm{MIF}_{reg} = \frac{1}{N} \int_{0}^{1} P_{MIF}(x_m)\, dx_m$$
    where $P_{MIF}$ and $P_{LIF}$ denote the model’s confidence in the originally predicted class as features are progressively corrupted. Lower $\mathrm{MIF}_{reg}$ values indicate that corrupting the highest-ranked features rapidly degrades the model’s confidence.
  • Regularised LIF: This metric computes the area under the LIF curve, normalised by sentence length:
$$\mathrm{LIF}_{reg} = \frac{1}{N} \int_{0}^{1} P_{LIF}(x_l)\, dx_l$$
    Higher $\mathrm{LIF}_{reg}$ values suggest that removing the least-important features has minimal impact on model predictions.
  • Performance at K% Token Corruption (p@K): We evaluate p@10, p@20, and p@30, which represent the model’s confidence after corrupting a specified percentage of tokens ranked as most important by the explanation method. For a given input with N tokens and corruption percentage K, we compute
$$p@K = f\!\left(x_{corrupted}^{(K)}\right)$$
    where $x_{corrupted}^{(K)}$ denotes the input after corrupting the top $N \times K/100$ tokens according to the explanation method’s importance ranking, and $f(\cdot)$ represents the model’s confidence score for the originally predicted class. Lower p@K values indicate better explanation quality.
  • Counterfactual conciseness (CFC): This metric measures the average number of token corruptions required to flip the model’s prediction to a different class. Lower CFC values indicate more concise explanations, as they identify the minimal set of features whose perturbation can alter the model’s decision.
To compute average performance across the test dataset containing M samples, we aggregate each metric as follows:
$$\overline{\mathrm{Metric}} = \frac{1}{M} \sum_{j=1}^{M} \mathrm{Metric}^{(j)}$$
where Metric represents any of the aforementioned evaluation measures.
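The p@K computation and the dataset-level average can be sketched together (hypothetical helpers: `model` is any confidence function, `ranking` an importance ordering of token indices, and the `"[MASK]"` placeholder stands in for the zero-embedding corruption used in the paper):

```python
def p_at_k(model, tokens, ranking, k_percent, mask="[MASK]"):
    """Model confidence after corrupting the top K% of tokens
    according to the explanation's importance ranking."""
    n_corrupt = int(len(tokens) * k_percent / 100)
    corrupted = list(tokens)
    for idx in ranking[:n_corrupt]:
        corrupted[idx] = mask
    return model(corrupted)

def average_metric(values):
    """Mean of a metric over the M test samples."""
    return sum(values) / len(values)

# Toy model: confidence = fraction of tokens equal to "good".
toy = lambda toks: sum(t == "good" for t in toks) / len(toks)
conf = p_at_k(toy, ["good", "bad", "good", "ok"], [0, 2, 1, 3], 50)
```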

3.1.6. Extending EvoDropX

EvoDropX’s modular architecture enables principled extension across data modalities and model types by decoupling feature corruption strategies from fitness evaluation. This extensibility proceeds along two axes.
First, the feature representation and corruption methodology are defined by a modality-specific grammar, G. For text, G generates token-corruption sequences as defined in Listing 1. For an image input I, G can be adapted to operate over superpixels or patches $s_i$. In this domain, the corruption function $C(s_i)$ may replace the zero-embedding substitution used for text with domain-appropriate perturbations such as Gaussian noise, $I'(s_i) = I(s_i) + \mathcal{N}(0, \sigma^2)$, or mean-value in-painting, subject to the requirements of the task and model.
Second, while the overarching evolutionary objective remains the maximisation of the SRG, defined as the area between the MIF and LIF trajectories, the underlying metric used to construct these curves can be adapted to the specific task. In the binary classification setting (Section 3), the curves track the degradation in confidence for the predicted class, $P(c \mid X)$. This formulation can be extended to multi-class scenarios by tracking the probability shift toward a target counterfactual class $c_j$, where the SRG quantifies the gap between the most and least effective features for inducing this class transition. Similarly, for regression tasks with output $f(X)$, the metric can be defined in terms of the deviation in predicted value, $D(M) = |f(X) - f(M(X))|$, where $M(X)$ denotes the input $X$ subject to a corruption mask $M$. In this case, optimisation seeks to maximise the integral difference between the curves of maximum deviation (MIF) and minimum deviation (LIF). Thus, the core principle of optimising the integral gap between most- and least-influential features is preserved, with the prediction function serving as the task-specific component of the objective.
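The regression variant of the objective reduces to a one-liner (a sketch; `f` is any regressor and `corrupt` applies a hypothetical corruption mask):

```python
def deviation(f, x, corrupt):
    """D(M) = |f(X) - f(M(X))|: prediction shift under a corruption mask."""
    return abs(f(x) - f(corrupt(x)))

# Toy regressor (sum of features) with the first feature zeroed out:
d = deviation(lambda v: sum(v), [1.0, 2.0, 3.0], lambda v: [0.0] + v[1:])
```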

3.2. Experiment Details

We evaluate EvoDropX against six established explainability techniques across three transformer architectures—Bidirectional Encoder Representations from Transformers (BERT) [59], RoBERTa [60], and DistilBERT [61]—fine-tuned on three binary sentiment classification datasets: (1) IMDb movie reviews (IMDB) [62], (2) Stanford Sentiment Treebank (SST-2) [63], and (3) the Amazon Polarity (AP) Classification Dataset [64,65,66].
To ensure consistency across experiments, we constructed test sets comprising 80 input texts per dataset–model combination, selecting instances that were correctly classified as belonging to the positive class (label = 1). This restriction controls for prediction correctness and enables fair comparison across explanation methods. While our evaluation focuses on positive-class predictions, the framework readily generalises to other target classes and multi-class settings.
Table 1 summarises the token length distributions for each dataset–model pairing. The results show that AP reviews are substantially shorter (mean ≈ 23 tokens) than SST-2 texts (mean ≈ 31 tokens) and IMDB reviews (mean ≈ 53 tokens). Within each dataset, token length distributions remain nearly identical across models, ensuring comparable experimental conditions.
Evolutionary optimisation parameters used by EvoDropX are reported in Table 2. These values were selected based on standard practice in grammatical evolution and preliminary exploratory runs, and were fixed across all experiments.
For comparison, we selected baselines spanning explainability paradigms: gradient-based (gradient [12]), propagation-based (Attention-Aware Layer-Wise Relevance Propagation for Transformers (AttnLRP) [58]), perturbation-based (LIME, SHAP), and axiomatic methods (IG, IG × Input [31]). This mix includes foundational techniques, model-agnostic methods, and transformer-specific approaches like AttnLRP.

4. Results

Table 3 demonstrates that EvoDropX consistently outperforms competing methods in regularised SRG ( SRG ¯ reg ) across all datasets and model architectures, suggesting it more effectively distinguishes between important and unimportant features. EvoDropX achieves the highest SRG scores in every experimental configuration across IMDB, SST-2, and AP, with most differences reaching statistical significance (p < 0.05). On IMDB with RoBERTa, for example, EvoDropX yields 0.83 ± 0.12 , substantially higher than LIME’s 0.68 ± 0.25 and SHAP’s 0.58 ± 0.24 , while gradient-based methods often produce values approaching zero. We observe similar patterns on SST-2, where EvoDropX attains 0.81 ± 0.13 with RoBERTa, markedly surpassing AttnLRP ( 0.44 ± 0.29 ) and IG, which points to more dependable separation of relevant and irrelevant tokens regardless of input length.
Beyond these SRG improvements, EvoDropX strikes a favourable balance between sensitivity to important features and robustness to unimportant ones. It records the lowest MIF ¯ reg values across all experimental settings, meaning that removing its top-ranked tokens causes sharp drops in model confidence, while concurrently preserving high LIF ¯ reg scores. On SST-2 with BERT, EvoDropX achieves MIF ¯ reg = 0.19 ± 0.13 versus 0.54 ± 0.23 for LIME and 0.66 ± 0.25 for AttnLRP, all while maintaining LIF ¯ reg = 0.95 ± 0.01 . Gradient-based methods, by comparison, tend to show higher MIF ¯ reg coupled with lower LIF ¯ reg , indicating weaker discrimination between causally relevant and irrelevant features.
This enhanced feature ranking translates into a clear, progressive decline in model confidence as an increasing proportion of top-ranked tokens are corrupted. Across all datasets and architectures, EvoDropX exhibits a monotonic decrease in p @ K ¯ values as corruption levels rise from 10% to 20% and 30%, confirming that its highest-ranked tokens are collectively crucial for the model’s predictions. On IMDB with BERT, confidence falls from 0.55 ± 0.47 at p @ 10 ¯ to 0.28 ± 0.43 at p @ 20 ¯ and further to 0.18 ± 0.34 at p @ 30 ¯ , with comparable monotonic patterns emerging on SST-2 and AP. Baseline methods generally show flatter or more erratic declines, implying their top-ranked tokens carry less causal weight. These results align with EvoDropX’s superior counterfactual conciseness, requiring fewer perturbations to reverse model predictions (e.g., 5.1 ± 3.4 on SST-2 with BERT, compared to 11.7 ± 9.2 with LIME), especially on longer, more complex inputs.
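Our reading of the p@K measure used above, as a hedged sketch: corrupt the top-K% ranked tokens and re-query the model. `model_confidence` is a hypothetical callback, and the `[MASK]` substitution stands in for the paper's zero-embedding corruption:

```python
def p_at_k(tokens, ranking, k_percent, model_confidence):
    """Model confidence after corrupting the top-k% ranked tokens.

    ranking: token indices ordered from most to least important.
    model_confidence: callable returning P(c|X) for a token list.
    Corruption is sketched as a [MASK] substitution; the paper's pipeline
    zeroes token embeddings instead.
    """
    n_corrupt = max(1, round(len(tokens) * k_percent / 100))
    corrupted = list(tokens)
    for idx in ranking[:n_corrupt]:
        corrupted[idx] = "[MASK]"
    return model_confidence(corrupted)
```

A faithful ranking makes this quantity fall monotonically as K grows from 10 to 30, which is exactly the pattern reported for EvoDropX.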
To assess the statistical significance of these findings, we performed a one-way Analysis of Variance (ANOVA) followed by Tukey’s Honestly Significant Difference (HSD) and Games–Howell post hoc multiple comparison tests, using a significance threshold of p < 0.05 . Tukey’s test was applied under the assumption of equal variances, while Games–Howell was used to account for potential heteroscedasticity. All corresponding p-values for statistically significant comparisons are reported in Table 3.
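This testing pipeline can be sketched with SciPy. Games–Howell is not available in SciPy (packages such as pingouin provide `pairwise_gameshowell`), so only the ANOVA and Tukey HSD stages are shown; the function name and threshold handling are our illustrative choices:

```python
from scipy import stats

def compare_methods(scores_by_method, alpha=0.05):
    """One-way ANOVA over per-sample scores of several xAI methods,
    followed by Tukey's HSD (equal variances assumed) when the omnibus
    test is significant. Returns (F statistic, ANOVA p-value, HSD result)."""
    groups = list(scores_by_method.values())
    f_stat, p_anova = stats.f_oneway(*groups)
    hsd = stats.tukey_hsd(*groups) if p_anova < alpha else None
    return f_stat, p_anova, hsd
```

The returned `hsd.pvalue` is a symmetric matrix of pairwise p-values; when the equal-variance assumption is doubtful, the Games–Howell test replaces the HSD step.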
In addition, Figure 3 visualises these differences for the SST-2 dataset with DistilBERT by plotting a single confidence interval per method, following the recommendations of Hochberg and Tamhane [67]. This configuration is shown for clarity; analogous plots for the remaining model–dataset combinations are provided in Appendix A.1.
Complementing this quantitative analysis, Figure 4 presents qualitative visualisations comparing EvoDropX with the top-performing baselines (AttnLRP and LIME) on the IMDB dataset using the BERT model. In these visualisations, a cool colourmap highlights the importance of individual words, making it easy to compare how each method ranks features. EvoDropX consistently emphasises words whose perturbation leads to a substantial drop in model confidence. This is reflected in the probability drop curves displayed in the diagram: for each sample, the MIF curve for EvoDropX shows a much steeper descent than those of AttnLRP and LIME. Conversely, the LIF curves remain relatively flat, confirming that EvoDropX more accurately isolates features critical to the model’s decision-making process. Moreover, EvoDropX exhibits refined selectivity in its attributions: compared to the other methods, it highlights fewer tokens as high-importance, focusing on the truly critical features and reducing noise in the explanation, consistent with its superior performance on the counterfactual conciseness (CFC) metric.
To probe the behavioural differences between explanation methods beyond aggregate faithfulness metrics, we conducted a linguistic analysis of the extracted rationales, focusing specifically on EvoDropX versus AttnLRP. We applied part-of-speech (POS) tagging to all selected tokens and compared the composition of the top 10% most highly ranked words across methods. Figure 5 shows the resulting distributions for all three backbones, providing a model-agnostic view of the linguistic evidence each explainer prioritises.
Across all models, AttnLRP consistently assigns more weight to adjectives (ADJ). These overt descriptive tokens, such as ‘great’ or ‘excellent’, directly express sentiment and align with surface-level sentiment indicators. In contrast, EvoDropX not only captures these core sentiment-bearing adjectives but also assigns significant importance to connecting words, including pronouns (PRON), adpositions (ADP), and proper nouns (PROPN). Pronouns are reference words such as “I”, “it”, and “this”, while adpositions are relation words (mostly prepositions) such as “of”, “in”, “with”, and “than” that link entities and concepts.
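The composition analysis can be sketched as follows. In practice the (token, tag) pairs would come from a POS tagger such as spaCy with the universal tag set; the sample data and function name below are hypothetical:

```python
from collections import Counter

def pos_composition(tagged_tokens, scores, top_frac=0.10):
    """Distribution of universal POS tags among the top-ranked tokens.

    tagged_tokens: list of (token, pos_tag) pairs.
    scores: one importance score per token, same length and order.
    """
    k = max(1, int(len(tagged_tokens) * top_frac))
    ranked = sorted(zip(tagged_tokens, scores), key=lambda p: p[1], reverse=True)
    tags = Counter(tag for (_, tag), _ in ranked[:k])
    total = sum(tags.values())
    return {tag: n / total for tag, n in tags.items()}
```

Running this per method over all rationales, then comparing the resulting distributions, yields the ADJ-versus-PRON/ADP/PROPN contrast plotted in Figure 5.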
This POS-level difference shows that EvoDropX can recover decision-relevant patterns that go beyond individual sentiment words by also modelling how these cues are connected within the sentence. Figure 6 supports this observation via a t-SNE projection of the top 10% most relevant words across all combined datasets for BERT: EvoDropX (blue) forms a dense cluster in the bottom-left region, corresponding largely to PRON, PROPN, and ADP tokens that AttnLRP rarely highlights, indicating that it attends to a complementary portion of the feature space that includes both sentiment-bearing terms and their structural roles. The t-SNE plots for RoBERTa and DistilBERT show a similar trend and are reported in Appendix A.2.
The analysis of feature importance is incomplete without considering the sequential dynamics of corruption, which reveals not just which words matter, but how their removal impacts model confidence step-by-step. Figure 7 captures this granular view by plotting the MIF and LIF trajectories for an example from IMDB, tracking the specific word corrupted at each index.
A close examination of the corruption order validates our earlier findings regarding linguistic structure: while both methods identify the same core sentiment anchors, their prioritisation strategies differ fundamentally. Both EvoDropX and AttnLRP correctly flag “wonderful” and “film” as the most critical tokens. Beyond these obvious features, however, EvoDropX identifies a sequence that respects the sentence’s natural syntax, removing “that”, “mixes”, “and”, “in”, “that”, and “makes” in a structured flow. By interleaving functional connectives (PRON and ADP) with content words, it effectively dismantles the semantic bridge between concepts. In contrast, AttnLRP adopts a disjointed, keyword-focused approach, skipping over these connectives to aggressively group content terms (targeting “wonderful”, “film”, “mixes”, “makes”, “a”, “way”, and “that”) before addressing structure. This divergence indicates that whereas AttnLRP treats the input akin to a bag of sentiment words, EvoDropX preserves the linguistic dependencies essential for the model’s decision-making process.
The impact of this difference is visible in the probability curves. EvoDropX drives a steeper initial decline in the MIF curve, confirming that its holistic combination of sentiment and structure is more essential to the model’s prediction than keywords alone. Furthermore, the AttnLRP ranking exhibits inconsistencies; specifically, it removes “way” before “spectator”, leading to a counterintuitive rise in probability when “spectator” is subsequently removed. EvoDropX avoids such monotonicity violations by correctly prioritising “spectator” over “way”, maintaining a smooth, reliable decay that better reflects the model’s decision boundary.
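Such violations can be detected with a simple scan of the MIF trajectory; this is a sketch, and the tolerance value is our choice:

```python
def monotonicity_violations(mif_curve, tol=1e-6):
    """Indices where corrupting the next 'important' token *raised* the
    model's confidence: each one signals a mis-ordered ranking."""
    return [i for i in range(1, len(mif_curve))
            if mif_curve[i] > mif_curve[i - 1] + tol]
```

For example, `monotonicity_violations([0.9, 0.6, 0.7, 0.3])` flags index 2, the kind of counterintuitive rise seen when “spectator” is removed after “way”, while a smoothly decaying curve returns an empty list.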
Beyond these comparative results, Figure 8 reveals how the evolutionary search progresses in EvoDropX. The figure tracks the average fitness of the best individual at each generation across all three runs, with the shaded region showing variability between runs. A consistent pattern emerges: fitness rises sharply in early generations as the evolutionary process rapidly identifies high-quality corruption sequences, then gradually plateaus as the search shifts toward fine-tuning these solutions.
In most cases, fitness stabilises well before reaching the maximum generation budget, signalling convergence with diminishing gains from continued evolution. As generations advance, the shaded bands narrow, indicating that independent runs increasingly agree on strong solutions once the population focuses on similar high-performing structures. These patterns suggest that the evolutionary process efficiently reaches near-optimal performance, demonstrating stable convergence rather than unstable, run-dependent results. This reliability in evolutionary search underlies the superior, consistent performance of EvoDropX shown in Table 3.
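The convergence behaviour can be illustrated with a deliberately simplified, permutation-based evolutionary loop. This is a stand-in for the paper's grammatical evolution, reduced to elitist survival and swap mutation; all names, parameters, and the toy fitness are illustrative:

```python
import random

def evolve_order(n_tokens, fitness, pop_size=20, generations=30, seed=0):
    """Evolve a feature-corruption order that maximises `fitness(order)`.

    `fitness` stands in for the SRG objective: it scores a permutation of
    token indices, higher being better. Elitist survival keeps the best
    individuals, so the best-of-generation fitness never decreases.
    """
    rng = random.Random(seed)
    pop = [rng.sample(range(n_tokens), n_tokens) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))              # best-of-generation trace
        parents = pop[: pop_size // 4]               # elitist survival
        offspring = []
        while len(parents) + len(offspring) < pop_size:
            child = list(rng.choice(parents))        # clone an elite parent
            i, j = rng.sample(range(n_tokens), 2)
            child[i], child[j] = child[j], child[i]  # swap mutation
            offspring.append(child)
        pop = parents + offspring
    return max(pop, key=fitness), history
```

Because the elite always survives, the fitness trace in `history` is non-decreasing, mirroring the sharp-rise-then-plateau pattern of Figure 8.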
While the evolutionary process converges efficiently, the computational demands of this robust search warrant discussion. As detailed in Table 4, the theoretical sequential complexity of EvoDropX is O(P · G · L · F), as it requires evaluating P individuals across L corruption steps for G generations, where F denotes the cost of a single model forward pass. However, this worst-case analysis ignores the parallel capabilities of modern hardware. In our optimised implementation, we mitigate the latency by vectorising the evaluation: the corrupted inputs for the entire population are aggregated into high-dimensional batches. This allows the model to process an entire generation simultaneously, reducing the effective wall-clock complexity to approximately O(G · F + ε), where ε accounts for the overhead of batch computation. This parallelisation strategy ensures that, while EvoDropX performs a rigorous search, it remains computationally feasible for real-world deployment.
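The batching idea can be sketched with NumPy; `model` is a hypothetical batched predictor, and the shapes and names are illustrative rather than the paper's implementation:

```python
import numpy as np

def evaluate_generation(embeddings, population, model):
    """Score a whole generation of corruption orders in one forward pass.

    embeddings: (L, d) token embeddings of a single input.
    population: list of P corruption orders (permutations of range(L)).
    model: callable mapping an (N, L, d) batch to (N,) confidences.

    For each individual we build L progressively corrupted copies (the
    steps of its MIF trajectory) and stack all P * L copies into one
    batch, so the model processes an entire generation simultaneously.
    """
    L, d = embeddings.shape
    P = len(population)
    batch = np.repeat(embeddings[None], P * L, axis=0)        # independent copies
    for p, order in enumerate(population):
        for step in range(L):
            batch[p * L + step, order[: step + 1]] = 0.0      # zero first step+1 tokens
    return model(batch).reshape(P, L)                         # (P, L) trajectories
```

A single model call then yields all P corruption trajectories for the generation, which is what collapses the P · L sequential evaluations into one batched pass.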
Overall, these results underscore the advantages of EvoDropX: it not only generates more faithful explanations, as measured by the regularised SRG metric, but also offers greater consistency, robustness across diverse input lengths, and precise attribution of critical features. The statistical significance of these findings, combined with the clear qualitative differences observed in the visualisations, strongly supports the efficacy of EvoDropX over conventional xAI methods in explaining transformer model behaviour.

5. Conclusions, Limitations and Future Scope

In this study, we introduced EvoDropX, a novel framework that leverages GE to improve the faithfulness of explanations for transformer-based models. By optimising the order of feature removal to maximise the SRG metric, EvoDropX mitigates key limitations of existing xAI methods, which often produce inconsistent or weakly faithful attributions when evaluated under rigorous faithfulness criteria. Our empirical analysis shows that EvoDropX consistently outperforms state-of-the-art approaches under intervention-based faithfulness criteria across all evaluated model–dataset combinations, while also highlighting that many current xAI methods perform poorly under the SRG metric when applied to transformer architectures.
Despite its promising performance, EvoDropX presents several limitations. Its computational complexity is the primary concern: the evolutionary optimisation process is resource-intensive, making deployment on large-scale datasets or in real-time applications difficult. This computational burden stems from the algorithm’s iterative nature, which requires evaluating numerous candidate solutions before converging on an optimal feature corruption sequence. However, as GPU acceleration and high-performance computing infrastructure evolve, these constraints are expected to become increasingly manageable.
Looking ahead, several promising avenues for future research exist. First, we plan to explore strategies to reduce computational overhead, such as refining evolutionary operators, adopting more efficient search strategies, and incorporating parallel processing. Second, extending EvoDropX beyond text-based tasks to other data modalities, such as images and tabular data, could further demonstrate its versatility and broaden its applicability. Additionally, we aim to develop principled characterisations of feature importance in transformer architectures, explicitly distinguishing between standalone (intrinsic) importance and context-dependent importance arising from feature interactions. Finally, we will investigate the theoretical properties of the SRG metric, including conditions under which monotonicity with respect to interaction-aware feature importance may hold, as well as its sensitivity to perturbation order and baseline choice, supported by controlled empirical analyses.
Overall, EvoDropX represents a promising advance toward more trustworthy and interpretable AI. Delivering more faithful feature attributions sets the stage for future innovations in transparent and accountable decision-making across high-stakes applications.

Author Contributions

Conceptualisation, D.K.S. and C.R.; methodology, D.K.S.; software, D.K.S.; validation, D.K.S.; formal analysis, D.K.S.; investigation, D.K.S. and C.R.; resources, D.K.S.; writing—original draft preparation, D.K.S.; writing—review and editing, D.K.S. and C.R.; visualisation, D.K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This publication has emanated from research conducted with the financial support of Taighde Éireann–Research Ireland under Grant number 18/CRT/6223.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors acknowledge Lero–the Research Ireland Centre for Software and the CSIS Department at the University of Limerick for providing a world-class compute environment.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

This appendix provides the significance-comparison plots for the remaining model–dataset combinations not shown in the main text. In the main paper, Figure 3 presents the SST-2 + DistilBERT case using one confidence interval per method following Hochberg and Tamhane [67]. Figure A1, Figure A2 and Figure A3 report the significance plots for the IMDB dataset; Figure A4, Figure A5 and Figure A6 report the corresponding results for AP; and Figure A7 and Figure A8 report the corresponding results for SST-2.
Figure A1. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on IMDB using BERT.
Figure A2. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on IMDB using RoBERTa.
Figure A3. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on IMDB using DistilBERT.
Figure A4. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on AP using BERT.
Figure A5. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on AP using RoBERTa.
Figure A6. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on AP using DistilBERT.
Figure A7. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on SST-2 using BERT.
Figure A8. Significant differences in group means for multiple evaluation metrics ( SRG ¯ reg , MIF ¯ reg , LIF ¯ reg , P @ 10 ¯ , P @ 20 ¯ , P @ 30 ¯ , CFC ¯ ) across different xAI methods on SST-2 using RoBERTa.

Appendix A.2

Following the analysis in Figure 6 for BERT, this section reports the corresponding t-SNE projections for RoBERTa and DistilBERT. Figure A9 shows the t-SNE visualisation for RoBERTa, and Figure A10 shows the corresponding visualisation for DistilBERT.
Figure A9. t-SNE (t-distributed stochastic neighbour embedding) visualisation of the top 10% most relevant words selected from each sentence by EvoDropX and AttnLRP for the RoBERTa model on all combined datasets. Blue points denote words selected by EvoDropX, while yellow points denote words selected by AttnLRP; the size of each point is proportional to the frequency of that word appearing in the top 10% selection. The orange box highlights the dense cluster formed predominantly by EvoDropX-selected words, with comparatively fewer words selected by AttnLRP.
Figure A10. t-SNE (t-distributed stochastic neighbour embedding) visualisation of the top 10% most relevant words selected from each sentence by EvoDropX and AttnLRP for the DistilBERT model on all combined datasets. Blue points denote words selected by EvoDropX, while yellow points denote words selected by AttnLRP; the size of each point is proportional to the frequency of that word appearing in the top 10% selection. The orange box highlights the dense cluster formed predominantly by EvoDropX-selected words, with comparatively fewer words selected by AttnLRP.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  2. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  3. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
  4. Murdoch, W.J.; Singh, C.; Kumbier, K.; Abbasi-Asl, R.; Yu, B. Interpretable machine learning: Definitions, methods, and applications. arXiv 2019, arXiv:1901.04592. [Google Scholar] [CrossRef]
  5. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  6. Montavon, G.; Samek, W.; Müller, K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018, 73, 1–15. [Google Scholar] [CrossRef]
  7. Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L.K.; Müller, K.R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer Nature: Cham, Switzerland, 2019; Volume 11700. [Google Scholar]
  8. Covert, I.; Lundberg, S.; Lee, S.I. Explaining by removing: A unified framework for model explanation. J. Mach. Learn. Res. 2021, 22, 1–90. [Google Scholar]
  9. Samek, W.; Montavon, G.; Lapuschkin, S.; Anders, C.J.; Müller, K.R. Explaining deep neural networks and beyond: A review of methods and applications. Proc. IEEE 2021, 109, 247–278. [Google Scholar] [CrossRef]
  10. EU Artificial Intelligence Act. The EU Artificial Intelligence Act, 2024. Available online: https://artificialintelligenceact.eu/ (accessed on 26 February 2026).
  11. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar]
  12. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  13. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 2015, 10, e0130140. [Google Scholar] [CrossRef]
  14. Srinivas, S.; Fleuret, F. Rethinking the role of gradient-based attribution methods for model interpretability. arXiv 2020, arXiv:2006.09128. [Google Scholar]
  15. Ghorbani, A.; Abid, A.; Zou, J. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Washington, DC, USA, 2019; Volume 33, pp. 3681–3688. [Google Scholar]
  16. Kumar, I.E.; Venkatasubramanian, S.; Scheidegger, C.; Friedler, S. Problems with Shapley-value-based explanations as feature importance measures. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; JMLR: New York, NY, USA, 2020; pp. 5491–5500. [Google Scholar]
  17. Huang, X.; Marques-Silva, J. On the failings of Shapley values for explainability. Int. J. Approx. Reason. 2024, 171, 109112. [Google Scholar] [CrossRef]
  18. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar] [CrossRef]
  19. Wilming, R.; Kieslich, L.; Clark, B.; Haufe, S. Theoretical behavior of XAI methods in the presence of suppressor variables. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; JMLR: New York, NY, USA, 2023; pp. 37091–37107. [Google Scholar]
  20. Molnar, C.; König, G.; Herbinger, J.; Freiesleben, T.; Dandl, S.; Scholbeck, C.A.; Casalicchio, G.; Grosse-Wentrup, M.; Bischl, B. General pitfalls of model-agnostic interpretation methods for machine learning models. In Proceedings of the International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, Vienna, Austria, 18 July 2020; Springer: Cham, Switzerland, 2020; pp. 39–68. [Google Scholar]
  21. Freiesleben, T.; König, G. Dear XAI community, we need to talk! Fundamental misconceptions in current XAI research. In Proceedings of the World Conference on Explainable Artificial Intelligence, Lisbon, Portugal, 26–28 July 2023; Springer: Cham, Switzerland, 2023; pp. 48–65. [Google Scholar]
  22. Krishna, S.; Han, T.; Gu, A.; Wu, S.; Jabbari, S.; Lakkaraju, H. The disagreement problem in explainable machine learning: A practitioner’s perspective. arXiv 2022, arXiv:2202.01602. [Google Scholar]
  23. Blücher, S.; Vielhaben, J.; Strodthoff, N. Decoupling pixel flipping and occlusion strategy for consistent XAI benchmarks. arXiv 2024, arXiv:2401.06654. [Google Scholar] [CrossRef]
  24. Ryan, C.; Collins, J.J.; Neill, M.O. Grammatical evolution: Evolving programs for an arbitrary language. In Proceedings of the Genetic Programming: First European Workshop, EuroGP’98, Paris, France, 14–15 April 1998; Proceedings 1; Springer: Berlin/Heidelberg, Germany, 1998; pp. 83–96. [Google Scholar]
  25. Li, H.; Xu, Z.; Taylor, G.; Studer, C.; Goldstein, T. Visualizing the loss landscape of neural nets. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  26. Fantozzi, P.; Naldi, M. The explainability of transformers: Current status and directions. Computers 2024, 13, 92. [Google Scholar] [CrossRef]
  27. Nanda, N.; Chan, L.; Lieberum, T.; Smith, J.; Steinhardt, J. Progress measures for grokking via mechanistic interpretability. arXiv 2023, arXiv:2301.05217. [Google Scholar] [CrossRef]
  28. Abnar, S.; Zuidema, W. Quantifying Attention Flow in Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4190–4197. [Google Scholar] [CrossRef]
  29. Jain, S.; Wallace, B.C. Attention is not explanation. arXiv 2019, arXiv:1902.10186. [Google Scholar]
  30. Bibal, A.; Cardon, R.; Alfter, D.; Wilkens, R.; Wang, X.; François, T.; Watrin, P. Is attention explanation? an introduction to the debate. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 3889–3900. [Google Scholar]
  31. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, PMLR; JMLR: New York, NY, USA, 2017; pp. 3319–3328. [Google Scholar]
  32. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 618–626. [Google Scholar]
  33. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; Van Keulen, M.; Seifert, C. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Comput. Surv. 2023, 55, 1–42. [Google Scholar] [CrossRef]
  34. Rieger, L.; Hansen, L.K. Irof: A low resource evaluation metric for explanation methods. arXiv 2020, arXiv:2003.08747. [Google Scholar] [CrossRef]
  35. Gevaert, A.; Rousseau, A.J.; Becker, T.; Valkenborg, D.; De Bie, T.; Saeys, Y. Evaluating feature attribution methods in the image domain. Mach. Learn. 2024, 113, 6019–6064. [Google Scholar] [CrossRef]
  36. Li, X.; Du, M.; Chen, J.; Chai, Y.; Lakkaraju, H.; Xiong, H. M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models. Adv. Neural Inf. Process. Syst. 2023, 36, 1630–1643. [Google Scholar]
  37. Hedström, A.; Weber, L.; Krakowczyk, D.; Bareeva, D.; Motzkus, F.; Samek, W.; Lapuschkin, S.; Höhne, M.M.C. Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond. J. Mach. Learn. Res. 2023, 24, 1–11. [Google Scholar]
  38. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  39. Binder, A.; Weber, L.; Lapuschkin, S.; Montavon, G.; Müller, K.R.; Samek, W. Shortcomings of top-down randomization-based sanity checks for evaluations of deep neural network explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 16143–16152. [Google Scholar]
  40. Kindermans, P.J.; Hooker, S.; Adebayo, J.; Alber, M.; Schütt, K.T.; Dähne, S.; Erhan, D.; Kim, B. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer Nature: Cham, Switzerland, 2019; pp. 267–280. [Google Scholar]
  41. Nguyen, A.P.; Martínez, M.R. On quantitative aspects of model interpretability. arXiv 2020, arXiv:2007.07584. [Google Scholar] [CrossRef]
  42. Arras, L.; Osman, A.; Samek, W. CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations. Inf. Fusion 2022, 81, 14–40. [Google Scholar] [CrossRef]
  43. DeYoung, J.; Jain, S.; Lehman, E.; Rajani, N.; Socher, R.; Wallace, B.C. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
  44. Budding, C.; Eitel, F.; Ritter, K.; Haufe, S. Evaluating saliency methods on artificial data with different background types. arXiv 2021, arXiv:2112.04882. [Google Scholar] [CrossRef]
  45. Alvarez Melis, D.; Jaakkola, T. Towards robust interpretability with self-explaining neural networks. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  46. Zhang, J.; Bargal, S.A.; Lin, Z.; Brandt, J.; Shen, X.; Sclaroff, S. Top-down neural attention by excitation backprop. Int. J. Comput. Vis. 2018, 126, 1084–1102. [Google Scholar] [CrossRef]
  47. Bhatt, U.; Weller, A.; Moura, J.M. Evaluating and aggregating feature-based model explanations. arXiv 2020, arXiv:2005.00631. [Google Scholar] [CrossRef]
  48. Sixt, L.; Granz, M.; Landgraf, T. When explanations lie: Why many modified BP attributions fail. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; JMLR: New York, NY, USA, 2020; pp. 9046–9057. [Google Scholar]
  49. Arya, V.; Bellamy, R.K.; Chen, P.Y.; Dhurandhar, A.; Hind, M.; Hoffman, S.C.; Houde, S.; Liao, Q.V.; Luss, R.; Mojsilović, A.; et al. One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques. arXiv 2019, arXiv:1909.03012. [Google Scholar] [CrossRef]
  50. Samek, W.; Binder, A.; Montavon, G.; Lapuschkin, S.; Müller, K.R. Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2660–2673. [Google Scholar] [CrossRef] [PubMed]
  51. Chang, C.H.; Creager, E.; Goldenberg, A.; Duvenaud, D. Explaining image classifiers by counterfactual generation. arXiv 2018, arXiv:1807.08024. [Google Scholar]
  52. Brocki, L.; Chung, N.C. Feature perturbation augmentation for reliable evaluation of importance estimators in neural networks. Pattern Recognit. Lett. 2023, 176, 131–139. [Google Scholar] [CrossRef]
  53. Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, USA, 1992. [Google Scholar]
  54. Droste, S.; Jansen, T.; Wegener, I. On the analysis of the (1 + 1) evolutionary algorithm. Theor. Comput. Sci. 2002, 276, 51–81. [Google Scholar] [CrossRef]
  55. Antipov, D.; Doerr, B.; Fang, J.; Hetet, T. A tight runtime analysis for the (μ + λ) EA. In Proceedings of the Genetic and Evolutionary Computation Conference, Kyoto, Japan, 15–19 July 2018; ACM: New York, NY, USA, 2018; pp. 1459–1466. [Google Scholar]
  56. O’Neill, M.; Ryan, C. Grammatical evolution. IEEE Trans. Evol. Comput. 2001, 5, 349–358. [Google Scholar] [CrossRef]
  57. McKay, R.I.; Hoai, N.X.; Whigham, P.A.; Shan, Y.; O’Neill, M. Grammar-based genetic programming: A survey. Genet. Program. Evolvable Mach. 2010, 11, 365–396. [Google Scholar] [CrossRef]
  58. Achtibat, R.; Hatefi, S.M.V.; Dreyer, M.; Jain, A.; Wiegand, T.; Lapuschkin, S.; Samek, W. AttnLRP: Attention-aware layer-wise relevance propagation for transformers. arXiv 2024, arXiv:2402.05602. [Google Scholar]
  59. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  60. Zhuang, L.; Wayne, L.; Ya, S.; Jun, Z. A Robustly Optimized BERT Pre-training Approach with Post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, Huhhot, China, 13–15 August 2021; Li, S., Sun, M., Liu, Y., Wu, H., Liu, K., Che, W., He, S., Rao, G., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1218–1227. [Google Scholar]
  61. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  62. Maas, A.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 142–150. [Google Scholar]
  63. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1631–1642. [Google Scholar]
  64. Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive Text Embedding Benchmark. arXiv 2022, arXiv:2210.07316. [Google Scholar]
  65. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys ’13: Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; ACM: New York, NY, USA, 2013. [Google Scholar]
  66. Enevoldsen, K.; Chung, I.; Kerboua, I.; Kardos, M.; Mathur, A.; Stap, D.; Gala, J.; Siblini, W.; Krzemiński, D.; Winata, G.I.; et al. MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv 2025, arXiv:2502.13595. [Google Scholar] [CrossRef]
  67. Hochberg, Y.; Tamhane, A.C. Multiple Comparison Procedures; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1987. [Google Scholar]
Figure 1. Classification of xAI evaluation techniques by Hedström et al. [37]. The figure categorises evaluation methods for xAI into six distinct groups: robustness, localisation, complexity, randomisation, axiomatic measures, and faithfulness.
Figure 2. Proposed Methodology: This figure illustrates the process of generating explanations for any given input. The first phase automatically generates a context-free grammar for the input. The subsequent phase employs GE to evolve a diverse set of candidate corruption sequences and identifies the sequence that maximises SRG. In the final phase, the selected corruption sequence and the associated probability drops are used to determine each feature’s significance.
Figure 3. The figure shows the significant differences in group means for various evaluation metrics (SRG¯_reg, MIF¯_reg, LIF¯_reg, P@10¯, P@20¯, P@30¯, CFC¯) across different xAI methods on the SST-2 dataset using the DistilBERT model.
Figure 4. Qualitative comparison on the IMDB dataset with the BERT model: This figure compares EvoDropX with the two top-performing baseline methods. It presents four examples, each showing an SRG plot that visually quantifies the effect of feature corruption on model prediction performance, with the y-axis denoting the probability of the positive class and the x-axis denoting the number of features corrupted. Accompanying each plot is the corresponding textual input, annotated to reflect feature importance using a cool colour map, where pink indicates the most important features and light blue the least important ones.
Figure 5. Part-of-speech distribution of the top 10% highest-ranked words extracted by EvoDropX and AttnLRP across BERT, RoBERTa, and DistilBERT models on all combined datasets.
Figure 6. t-SNE (t-distributed stochastic neighbour embedding) visualisation of the top 10% most relevant words selected from each sentence by EvoDropX and AttnLRP for the BERT model on all combined datasets. Blue points denote words selected by EvoDropX, while yellow points denote words selected by AttnLRP; the size of each point is proportional to the frequency of that word appearing in the top 10% selection. The orange box highlights the dense cluster formed predominantly by EvoDropX-selected words, with comparatively fewer words selected by AttnLRP.
Figure 7. This figure illustrates the sequential corruption process for EvoDropX (top) and AttnLRP (bottom) on a single example sentence (shown at the top). Each plot displays the MIF and LIF curves, with the y-axis representing the probability of the positive class and the x-axis representing the corruption step index. At each step along the x-axis, the specific word being corrupted is annotated, enabling direct visualisation of how removing individual features affects model confidence.
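The area between the LIF and MIF corruption curves shown in the figure is what SRG captures: a faithful ranking collapses the prediction quickly when important features are corrupted first, while leaving it almost unchanged when unimportant features are corrupted first. A minimal sketch of this area computation (not the paper’s implementation; the helper name `srg_area` and the toy probability sequences are assumptions, and SRG is approximated here with a simple trapezoidal rule over a normalised x-axis):

```python
def srg_area(mif_probs, lif_probs):
    """Approximate SRG as the area between the LIF and MIF corruption
    curves (trapezoidal rule, x-axis normalised to [0, 1]).

    mif_probs / lif_probs: class probabilities after corrupting
    0..n features in most-/least-important-first order.
    """
    assert len(mif_probs) == len(lif_probs) and len(mif_probs) > 1
    diff = [l - m for m, l in zip(mif_probs, lif_probs)]
    dx = 1.0 / (len(diff) - 1)  # normalised step between corruption stages
    return sum((a + b) / 2.0 * dx for a, b in zip(diff, diff[1:]))

# Toy curves: corrupting important features first collapses the
# probability quickly (MIF), corrupting unimportant ones barely moves it (LIF).
mif = [0.95, 0.40, 0.20, 0.10, 0.05]
lif = [0.95, 0.94, 0.93, 0.92, 0.90]
print(round(srg_area(mif, lif), 3))  # larger area = more faithful ranking
```

Identical MIF and LIF curves give an area of zero, which is why uninformative attributions score poorly under this criterion.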
Figure 8. Evolutionary fitness progression across generations for each dataset–model combination for a single example. Each panel shows how the best individual’s fitness evolves over generations, averaged across three independent runs, with shaded regions indicating standard deviation. The y-axis represents SRG fitness, and the x-axis shows the generation number. Subplot titles indicate the corresponding dataset and model.
Table 1. Token length statistics (mean ± standard deviation) for test sets across dataset–model combinations. Each test set contains 80 instances correctly classified as positive (class 1).

| Model | IMDb | Amazon Polarity | SST-2 |
|---|---|---|---|
| BERT | 53.57 ± 8.09 | 23.42 ± 2.36 | 31.61 ± 1.07 |
| RoBERTa | 53.40 ± 7.33 | 23.58 ± 2.30 | 31.26 ± 1.06 |
| DistilBERT | 53.73 ± 8.61 | 23.35 ± 2.33 | 31.61 ± 1.07 |
Table 2. Evolution Parameters.

| Evolution Parameter | Value |
|---|---|
| Population size | 400 |
| Maximum initial tree depth | 35 |
| Minimum initial tree depth | 4 |
| Maximum number of generations | 200 |
| Probability of crossover | 0.8 |
| Probability of mutation | 0.01 |
| Elitism size | 1 |
| Hall of fame size | 1 |
| Maximum tree depth | 35 |
| Number of runs (different seeds) | 3 |
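To make the Table 2 settings concrete, the following is a minimal generational-EA sketch showing how population size, crossover/mutation probabilities, and elitism interact. It is not the authors’ GE implementation: the grammar mapping, tree-depth limits, and hall of fame are omitted, the truncation-style parent selection and `toy_fitness` objective (standing in for SRG) are assumptions, and the demo run is scaled down from the paper’s 400 × 200 budget so it finishes quickly.

```python
import random

PARAMS = {  # headline settings from Table 2
    "population_size": 400,
    "generations": 200,
    "p_crossover": 0.8,
    "p_mutation": 0.01,  # applied per codon in this sketch
    "elitism": 1,
}

GENOME_LEN = 64   # assumed codon count for the sketch
CODON_MAX = 255   # assumed codon range

def toy_fitness(genome):
    # Stand-in for the SRG objective: fraction of even codons.
    return sum(1 for c in genome if c % 2 == 0) / len(genome)

def evolve(params, seed=42):
    rng = random.Random(seed)
    pop = [[rng.randint(0, CODON_MAX) for _ in range(GENOME_LEN)]
           for _ in range(params["population_size"])]
    for _ in range(params["generations"]):
        scored = sorted(pop, key=toy_fitness, reverse=True)
        parents = scored[:max(2, len(pop) // 4)]  # truncation selection (assumption)
        nxt = scored[:params["elitism"]]          # elitism: carry over the best
        while len(nxt) < len(pop):
            a, b = rng.sample(parents, 2)
            child = list(a)
            if rng.random() < params["p_crossover"]:
                cut = rng.randrange(1, GENOME_LEN)  # one-point crossover
                child = a[:cut] + b[cut:]
            child = [rng.randint(0, CODON_MAX)      # per-codon mutation
                     if rng.random() < params["p_mutation"] else c
                     for c in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=toy_fitness)

# Scaled-down demo so the sketch runs in seconds.
demo = dict(PARAMS, population_size=40, generations=30)
best = evolve(demo)
print(round(toy_fitness(best), 2))
```

Because of the elitism slot, the best fitness is monotonically non-decreasing across generations, matching the ratcheting curves reported in Figure 8.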
Table 3. Performance comparison across all datasets and models (mean ± std). Here, SRG denotes SRG¯_reg, MIF denotes MIF¯_reg, LIF denotes LIF¯_reg, p@10 denotes p@10¯, p@20 denotes p@20¯, p@30 denotes p@30¯, and CFC denotes CFC¯. The upward arrow (↑) indicates that higher values correspond to better performance, whereas the downward arrow (↓) indicates that lower values correspond to better performance. Higher SRG¯_reg and LIF¯_reg and lower MIF¯_reg, p@K¯, and CFC¯ indicate better explanations. The best performance is highlighted in green and the second-best in orange. The best-performing value is additionally shown in bold when the difference between the best and second-best methods is statistically significant.

| Dataset | Model | Metric | EvoDropX | AttnLRP | SHAP | LIME | Gradient | IG | IG × Input | p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| IMDb | BERT | SRG ↑ | 0.76 ± 0.14 | 0.39 ± 0.25 | 0.30 ± 0.25 | 0.44 ± 0.23 | 0.05 ± 0.14 | 0.01 ± 0.16 | 0.41 ± 0.24 | <1 × 10−3 |
| | | MIF ↓ | 0.19 ± 0.14 | 0.51 ± 0.25 | 0.58 ± 0.26 | 0.44 ± 0.23 | 0.82 ± 0.11 | 0.79 ± 0.15 | 0.50 ± 0.24 | <1 × 10−3 |
| | | LIF ↑ | 0.95 ± 0.01 | 0.90 ± 0.03 | 0.88 ± 0.05 | 0.88 ± 0.05 | 0.87 ± 0.12 | 0.80 ± 0.13 | 0.91 ± 0.03 | <1 × 10−3 |
| | | p@10 ↓ | 0.55 ± 0.47 | 0.77 ± 0.39 | 0.82 ± 0.37 | 0.72 ± 0.43 | 0.95 ± 0.19 | 0.97 ± 0.16 | 0.80 ± 0.37 | 7 × 10−2 |
| | | p@20 ↓ | 0.28 ± 0.43 | 0.65 ± 0.44 | 0.70 ± 0.43 | 0.49 ± 0.46 | 0.87 ± 0.29 | 0.94 ± 0.21 | 0.61 ± 0.45 | 2 × 10−2 |
| | | p@30 ↓ | 0.18 ± 0.34 | 0.58 ± 0.42 | 0.63 ± 0.45 | 0.36 ± 0.43 | 0.87 ± 0.24 | 0.91 ± 0.26 | 0.52 ± 0.44 | 17 × 10−2 |
| | | CFC ↓ | 9.2 ± 7.0 | 20.3 ± 15.5 | 25.2 ± 18.7 | 14.4 ± 12.1 | 38.8 ± 20.0 | 39.6 ± 16.5 | 18.1 ± 14.4 | 37 × 10−2 |
| | RoBERTa | SRG ↑ | 0.83 ± 0.12 | 0.54 ± 0.27 | 0.58 ± 0.24 | 0.68 ± 0.25 | 0.30 ± 0.22 | 0.04 ± 0.29 | 0.61 ± 0.25 | <1 × 10−3 |
| | | MIF ↓ | 0.14 ± 0.13 | 0.42 ± 0.27 | 0.37 ± 0.24 | 0.29 ± 0.25 | 0.62 ± 0.21 | 0.77 ± 0.24 | 0.36 ± 0.25 | <1 × 10−3 |
| | | LIF ↑ | 0.97 ± 0.00 | 0.97 ± 0.01 | 0.95 ± 0.05 | 0.96 ± 0.01 | 0.91 ± 0.12 | 0.81 ± 0.20 | 0.97 ± 0.01 | 99 × 10−2 |
| | | p@10 ↓ | 0.51 ± 0.49 | 0.74 ± 0.40 | 0.80 ± 0.39 | 0.60 ± 0.48 | 0.90 ± 0.26 | 0.94 ± 0.21 | 0.79 ± 0.38 | 73 × 10−2 |
| | | p@20 ↓ | 0.20 ± 0.38 | 0.60 ± 0.46 | 0.69 ± 0.45 | 0.49 ± 0.48 | 0.79 ± 0.37 | 0.92 ± 0.26 | 0.55 ± 0.46 | <1 × 10−3 |
| | | p@30 ↓ | 0.08 ± 0.26 | 0.45 ± 0.46 | 0.48 ± 0.47 | 0.34 ± 0.44 | 0.73 ± 0.41 | 0.84 ± 0.33 | 0.41 ± 0.44 | <1 × 10−3 |
| | | CFC ↓ | 7.7 ± 6.5 | 18.8 ± 15.9 | 17.1 ± 11.9 | 13.9 ± 12.8 | 24.8 ± 16.2 | 37.2 ± 18.5 | 16.2 ± 12.3 | 7 × 10−2 |
| | DistilBERT | SRG ↑ | 0.68 ± 0.11 | 0.53 ± 0.16 | 0.53 ± 0.16 | 0.54 ± 0.14 | 0.32 ± 0.15 | 0.01 ± 0.19 | 0.55 ± 0.14 | <1 × 10−3 |
| | | MIF ↓ | 0.20 ± 0.11 | 0.31 ± 0.16 | 0.31 ± 0.16 | 0.29 ± 0.14 | 0.49 ± 0.14 | 0.62 ± 0.17 | 0.30 ± 0.14 | <1 × 10−3 |
| | | LIF ↑ | 0.88 ± 0.04 | 0.84 ± 0.06 | 0.84 ± 0.07 | 0.83 ± 0.06 | 0.82 ± 0.11 | 0.63 ± 0.18 | 0.85 ± 0.06 | 41 × 10−2 |
| | | p@10 ↓ | 0.45 ± 0.38 | 0.60 ± 0.35 | 0.61 ± 0.37 | 0.55 ± 0.37 | 0.72 ± 0.29 | 0.89 ± 0.20 | 0.62 ± 0.36 | 44 × 10−2 |
| | | p@20 ↓ | 0.18 ± 0.25 | 0.39 ± 0.33 | 0.35 ± 0.33 | 0.28 ± 0.31 | 0.53 ± 0.31 | 0.79 ± 0.29 | 0.39 ± 0.34 | 37 × 10−2 |
| | | p@30 ↓ | 0.09 ± 0.13 | 0.29 ± 0.28 | 0.24 ± 0.26 | 0.20 ± 0.24 | 0.46 ± 0.28 | 0.72 ± 0.32 | 0.25 ± 0.27 | 9 × 10−2 |
| | | CFC ↓ | 6.5 ± 4.4 | 10.3 ± 7.7 | 10.4 ± 8.0 | 8.7 ± 6.9 | 14.6 ± 13.1 | 29.0 ± 16.1 | 10.5 ± 7.4 | 79 × 10−2 |
| SST-2 | BERT | SRG ↑ | 0.76 ± 0.12 | 0.27 ± 0.26 | 0.33 ± 0.23 | 0.39 ± 0.23 | 0.17 ± 0.18 | 0.02 ± 0.22 | 0.32 ± 0.23 | <1 × 10−3 |
| | | MIF ↓ | 0.19 ± 0.13 | 0.66 ± 0.25 | 0.59 ± 0.23 | 0.54 ± 0.23 | 0.75 ± 0.18 | 0.84 ± 0.17 | 0.61 ± 0.23 | <1 × 10−3 |
| | | LIF ↑ | 0.95 ± 0.01 | 0.93 ± 0.05 | 0.92 ± 0.05 | 0.93 ± 0.03 | 0.92 ± 0.07 | 0.85 ± 0.12 | 0.93 ± 0.04 | 22 × 10−2 |
| | | p@10 ↓ | 0.47 ± 0.44 | 0.82 ± 0.33 | 0.83 ± 0.34 | 0.77 ± 0.37 | 0.92 ± 0.22 | 0.91 ± 0.24 | 0.81 ± 0.35 | <1 × 10−3 |
| | | p@20 ↓ | 0.26 ± 0.38 | 0.74 ± 0.35 | 0.69 ± 0.43 | 0.58 ± 0.44 | 0.83 ± 0.31 | 0.89 ± 0.25 | 0.71 ± 0.40 | <1 × 10−3 |
| | | p@30 ↓ | 0.13 ± 0.25 | 0.66 ± 0.40 | 0.56 ± 0.45 | 0.52 ± 0.43 | 0.76 ± 0.35 | 0.89 ± 0.25 | 0.64 ± 0.41 | <1 × 10−3 |
| | | CFC ↓ | 5.1 ± 3.4 | 16.4 ± 11.7 | 14.2 ± 10.1 | 11.7 ± 9.2 | 17.6 ± 11.2 | 24.1 ± 10.1 | 14.0 ± 9.8 | <1 × 10−3 |
| | RoBERTa | SRG ↑ | 0.81 ± 0.13 | 0.44 ± 0.29 | 0.37 ± 0.31 | 0.57 ± 0.27 | 0.10 ± 0.20 | 0.00 ± 0.21 | 0.53 ± 0.30 | <1 × 10−3 |
| | | MIF ↓ | 0.16 ± 0.13 | 0.53 ± 0.29 | 0.59 ± 0.31 | 0.39 ± 0.27 | 0.84 ± 0.17 | 0.88 ± 0.17 | 0.43 ± 0.31 | <1 × 10−3 |
| | | LIF ↑ | 0.97 ± 0.00 | 0.96 ± 0.02 | 0.96 ± 0.05 | 0.96 ± 0.05 | 0.94 ± 0.08 | 0.88 ± 0.19 | 0.96 ± 0.06 | 99 × 10−2 |
| | | p@10 ↓ | 0.45 ± 0.47 | 0.79 ± 0.39 | 0.90 ± 0.28 | 0.69 ± 0.44 | 0.89 ± 0.29 | 0.95 ± 0.21 | 0.77 ± 0.40 | <1 × 10−3 |
| | | p@20 ↓ | 0.21 ± 0.38 | 0.55 ± 0.48 | 0.72 ± 0.42 | 0.46 ± 0.48 | 0.84 ± 0.35 | 0.95 ± 0.21 | 0.57 ± 0.47 | <1 × 10−3 |
| | | p@30 ↓ | 0.09 ± 0.26 | 0.49 ± 0.47 | 0.58 ± 0.47 | 0.40 ± 0.47 | 0.79 ± 0.39 | 0.95 ± 0.19 | 0.43 ± 0.48 | <1 × 10−3 |
| | | CFC ↓ | 5.3 ± 3.3 | 12.2 ± 9.9 | 15.7 ± 10.5 | 9.6 ± 7.8 | 19.9 ± 11.1 | 25.0 ± 8.9 | 11.4 ± 8.9 | <1 × 10−3 |
| | DistilBERT | SRG ↑ | 0.69 ± 0.08 | 0.26 ± 0.13 | 0.30 ± 0.14 | 0.26 ± 0.12 | 0.39 ± 0.16 | 0.00 ± 0.20 | 0.27 ± 0.14 | <1 × 10−3 |
| | | MIF ↓ | 0.07 ± 0.05 | 0.13 ± 0.08 | 0.13 ± 0.09 | 0.12 ± 0.09 | 0.15 ± 0.08 | 0.29 ± 0.13 | 0.15 ± 0.10 | 2 × 10−3 |
| | | LIF ↑ | 0.75 ± 0.08 | 0.39 ± 0.11 | 0.42 ± 0.09 | 0.38 ± 0.09 | 0.54 ± 0.17 | 0.29 ± 0.17 | 0.41 ± 0.11 | <1 × 10−3 |
| | | p@10 ↓ | 0.20 ± 0.34 | 0.54 ± 0.46 | 0.54 ± 0.46 | 0.49 ± 0.47 | 0.66 ± 0.43 | 0.88 ± 0.28 | 0.58 ± 0.47 | <1 × 10−3 |
| | | p@20 ↓ | 0.03 ± 0.12 | 0.23 ± 0.35 | 0.20 ± 0.34 | 0.21 ± 0.35 | 0.28 ± 0.38 | 0.73 ± 0.38 | 0.32 ± 0.42 | 2 × 10−2 |
| | | p@30 ↓ | 0.01 ± 0.01 | 0.05 ± 0.15 | 0.06 ± 0.18 | 0.05 ± 0.14 | 0.07 ± 0.19 | 0.46 ± 0.43 | 0.09 ± 0.23 | 99 × 10−2 |
| | | CFC ↓ | 3.1 ± 1.5 | 4.9 ± 2.7 | 4.8 ± 2.7 | 4.8 ± 2.9 | 5.4 ± 2.9 | 9.7 ± 4.8 | 5.5 ± 3.1 | 1 × 10−2 |
| AP | BERT | SRG ↑ | 0.62 ± 0.22 | 0.25 ± 0.29 | 0.20 ± 0.23 | 0.26 ± 0.27 | 0.01 ± 0.12 | −0.03 ± 0.15 | 0.25 ± 0.26 | <1 × 10−3 |
| | | MIF ↓ | 0.33 ± 0.23 | 0.70 ± 0.29 | 0.75 ± 0.24 | 0.70 ± 0.27 | 0.91 ± 0.09 | 0.92 ± 0.08 | 0.70 ± 0.26 | <1 × 10−3 |
| | | LIF ↑ | 0.96 ± 0.01 | 0.95 ± 0.01 | 0.95 ± 0.01 | 0.96 ± 0.01 | 0.93 ± 0.09 | 0.89 ± 0.14 | 0.96 ± 0.01 | 99 × 10−2 |
| | | p@10 ↓ | 0.59 ± 0.49 | 0.72 ± 0.44 | 0.78 ± 0.41 | 0.68 ± 0.46 | 0.91 ± 0.27 | 0.96 ± 0.19 | 0.79 ± 0.41 | 74 × 10−2 |
| | | p@20 ↓ | 0.39 ± 0.49 | 0.71 ± 0.45 | 0.69 ± 0.46 | 0.59 ± 0.49 | 0.91 ± 0.27 | 0.96 ± 0.19 | 0.71 ± 0.45 | 4 × 10−2 |
| | | p@30 ↓ | 0.25 ± 0.42 | 0.68 ± 0.46 | 0.64 ± 0.47 | 0.56 ± 0.49 | 0.92 ± 0.25 | 0.95 ± 0.19 | 0.64 ± 0.47 | <1 × 10−3 |
| | | CFC ↓ | 5.6 ± 4.8 | 13.2 ± 9.9 | 12.6 ± 9.6 | 11.6 ± 9.6 | 16.8 ± 8.7 | 19.3 ± 7.4 | 12.4 ± 9.1 | <1 × 10−3 |
| | RoBERTa | SRG ↑ | 0.66 ± 0.19 | 0.19 ± 0.23 | 0.25 ± 0.27 | 0.31 ± 0.25 | 0.05 ± 0.14 | −0.01 ± 0.16 | 0.27 ± 0.26 | <1 × 10−3 |
| | | MIF ↓ | 0.29 ± 0.19 | 0.76 ± 0.23 | 0.69 ± 0.27 | 0.64 ± 0.26 | 0.86 ± 0.12 | 0.88 ± 0.14 | 0.68 ± 0.26 | <1 × 10−3 |
| | | LIF ↑ | 0.95 ± 0.01 | 0.95 ± 0.02 | 0.94 ± 0.03 | 0.95 ± 0.03 | 0.91 ± 0.10 | 0.87 ± 0.13 | 0.95 ± 0.02 | 99 × 10−2 |
| | | p@10 ↓ | 0.54 ± 0.49 | 0.84 ± 0.35 | 0.80 ± 0.39 | 0.69 ± 0.45 | 0.89 ± 0.30 | 0.94 ± 0.24 | 0.79 ± 0.40 | 17 × 10−2 |
| | | p@20 ↓ | 0.21 ± 0.39 | 0.72 ± 0.42 | 0.72 ± 0.44 | 0.51 ± 0.49 | 0.82 ± 0.36 | 0.91 ± 0.28 | 0.60 ± 0.47 | 1 × 10−3 |
| | | p@30 ↓ | 0.04 ± 0.19 | 0.64 ± 0.44 | 0.59 ± 0.47 | 0.46 ± 0.48 | 0.79 ± 0.38 | 0.85 ± 0.33 | 0.55 ± 0.46 | <1 × 10−3 |
| | | CFC ↓ | 4.2 ± 3.2 | 13.2 ± 9.7 | 11.9 ± 8.7 | 9.4 ± 8.3 | 14.9 ± 8.9 | 17.4 ± 8.0 | 10.9 ± 8.6 | 1 × 10−3 |
| | DistilBERT | SRG ↑ | 0.42 ± 0.09 | 0.22 ± 0.11 | 0.22 ± 0.12 | 0.23 ± 0.12 | 0.16 ± 0.12 | −0.01 ± 0.14 | 0.23 ± 0.12 | <1 × 10−3 |
| | | MIF ↓ | 0.34 ± 0.10 | 0.44 ± 0.13 | 0.43 ± 0.13 | 0.42 ± 0.13 | 0.50 ± 0.08 | 0.57 ± 0.13 | 0.43 ± 0.13 | 3 × 10−3 |
| | | LIF ↑ | 0.76 ± 0.04 | 0.65 ± 0.07 | 0.65 ± 0.06 | 0.65 ± 0.06 | 0.66 ± 0.13 | 0.57 ± 0.11 | 0.66 ± 0.07 | <1 × 10−3 |
| | | p@10 ↓ | 0.56 ± 0.34 | 0.63 ± 0.31 | 0.62 ± 0.33 | 0.62 ± 0.31 | 0.72 ± 0.29 | 0.81 ± 0.27 | 0.63 ± 0.31 | 82 × 10−2 |
| | | p@20 ↓ | 0.36 ± 0.25 | 0.49 ± 0.28 | 0.48 ± 0.29 | 0.45 ± 0.29 | 0.59 ± 0.24 | 0.76 ± 0.27 | 0.49 ± 0.29 | 36 × 10−2 |
| | | p@30 ↓ | 0.29 ± 0.16 | 0.40 ± 0.22 | 0.39 ± 0.23 | 0.36 ± 0.21 | 0.50 ± 0.17 | 0.68 ± 0.25 | 0.39 ± 0.22 | 6 × 10−2 |
| | | CFC ↓ | 4.3 ± 2.7 | 6.6 ± 4.8 | 6.5 ± 5.3 | 6.0 ± 5.0 | 7.7 ± 5.6 | 12.6 ± 7.0 | 6.5 ± 5.2 | 4 × 10−1 |
Table 4. Computational complexity analysis of EvoDropX compared to baseline attribution methods. Complexity is measured in terms of the model’s forward (F) and backward (B) passes. We distinguish between the theoretical sequential complexity of EvoDropX and its optimised parallel performance, where population evaluation is vectorised.

| Method | Model Access | Complexity | Parameter Definition |
|---|---|---|---|
| White-Box (Model-Specific) | | | |
| Gradient (Saliency) | Required | 1F + 1B | None |
| Gradient × Input | Required | 1F + 1B | None |
| AttnLRP | Required | 1F + 1B | None |
| Integrated Gradients (IG) | Required | m(1F + 1B) | m: integral approx. steps |
| Black-Box (Model-Agnostic) | | | |
| LIME | Agnostic | N × F | N: perturbation samples |
| Partition SHAP | Agnostic | k × F | k: feature coalitions |
| EvoDropX (Sequential) | Agnostic | (P × G × L) × F | P: population, G: generations, L: tokens |
| EvoDropX (Parallel) | Agnostic | G × F + ϵ | ϵ: batching overhead |
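The budgets in Table 4 are easy to compare numerically. A small sketch (the function name and the default parameter values are illustrative assumptions, not figures from the paper; the backward passes required by the gradient-based methods are noted in comments but not counted here):

```python
def forward_passes(method, *, m=50, n=1000, k=500, P=400, G=200, L=30):
    """Rough forward-pass budget per explanation, following Table 4.

    Assumed defaults: m = IG integration steps, n = LIME perturbation
    samples, k = SHAP feature coalitions, P = population size,
    G = generations, L = tokens per input.
    """
    budgets = {
        "gradient": 1,                 # plus 1 backward pass
        "attnlrp": 1,                  # plus 1 backward pass
        "integrated_gradients": m,     # plus m backward passes
        "lime": n,
        "partition_shap": k,
        "evodropx_sequential": P * G * L,
        "evodropx_parallel": G,        # population scored as one batch per generation
    }
    return budgets[method]

for name in ("lime", "evodropx_sequential", "evodropx_parallel"):
    print(name, forward_passes(name))
```

Under these assumed settings the sequential budget (P × G × L forward passes) dwarfs every baseline, which is why the vectorised variant, amortising each generation into a single batched forward pass, is the practical configuration.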

Share and Cite

MDPI and ACS Style

Singh, D.K.; Ryan, C. EvoDropX: Evolutionary Optimization of Feature Corruption Sequences for Faithful Explanations of Transformer Models. Algorithms 2026, 19, 187. https://doi.org/10.3390/a19030187
