Article

SegClarity: An Attribution-Based XAI Workflow for Evaluating Historical Document Layout Models

by
Iheb Brini
1,2,*,
Najoua Rahal
2,
Maroua Mehri
1,
Rolf Ingold
2 and
Najoua Essoukri Ben Amara
1
1
Ecole Nationale d’Ingénieurs de Sousse, Laboratory of Advanced Technology and Intelligent Systems (LATIS), Université de Sousse, Sousse 4054, Tunisia
2
DIVA Group, University of Fribourg, 1700 Fribourg, Switzerland
*
Author to whom correspondence should be addressed.
J. Imaging 2025, 11(12), 424; https://doi.org/10.3390/jimaging11120424
Submission received: 26 September 2025 / Revised: 8 November 2025 / Accepted: 24 November 2025 / Published: 28 November 2025
(This article belongs to the Special Issue Explainable AI in Computer Vision)

Abstract

In recent years, deep learning networks have demonstrated remarkable progress in the semantic segmentation of historical documents. Nonetheless, their limited explainability remains a critical concern, as these models frequently operate as black boxes, thereby constraining confidence in the trustworthiness of their outputs. To enhance transparency and reliability in their deployment, increasing attention has been directed toward explainable artificial intelligence (XAI) techniques. These techniques typically produce fine-grained attribution maps in the form of heatmaps, illustrating feature contributions from different blocks and layers within a deep neural network (DNN). However, such maps often closely resemble the segmentation outputs themselves, and there is currently no consensus regarding appropriate explainability metrics for semantic segmentation. To overcome these challenges, we present SegClarity, a novel workflow designed to integrate explainability into the analysis of historical documents. The workflow combines visual and quantitative evaluations specifically tailored to segmentation-based applications. Furthermore, we introduce the Attribution Concordance Score (ACS), a new explainability metric that provides quantitative insights into the consistency and reliability of attribution maps. To evaluate the effectiveness of our approach, we conducted extensive qualitative and quantitative experiments using two datasets of historical document images, two U-Net model variants, and four attribution-based XAI methods. A qualitative assessment involved four XAI methods across multiple U-Net layers, including comparisons at the input level with state-of-the-art perturbation methods RISE and MiSuRe. Quantitatively, five XAI evaluation metrics were employed to benchmark these approaches comprehensively. 
Beyond historical document analysis, we further validated the workflow’s generalization by demonstrating its transferability to the Cityscapes dataset, a challenging benchmark for urban scene segmentation. The results demonstrate that the proposed workflow substantially improves the interpretability and reliability of deep learning models applied to the semantic segmentation of historical documents. To enhance reproducibility, we have released SegClarity’s source code along with interactive examples of the proposed workflow.

1. Introduction

Historical documents preserved in museums, libraries, and archives represent an essential part of the cultural heritage of human history and civilization, containing valuable information that can provide deep insights into the past [1]. However, these documents are highly vulnerable to degradation over time. Consequently, the effective protection, preservation, and valorization of such materials has become an urgent objective. A widely adopted solution is to convert them into digital form. In recent years, the digitization of historical documents has emerged as a paramount task. As a result, millions of digitized documents are now stored on servers, offering rapid and remote access. Beyond this significant achievement in terms of accessibility and preservation, the next major step and real promise of digitization lies in making the content of these vast collections of stored documents exploitable. Ongoing efforts aim to provide researchers with the tools and opportunities to apply computational techniques that advance historical document image processing across numerous tasks. This momentum is particularly evident in the area of layout analysis, which has witnessed a remarkable increase in research initiatives over the past few decades [2].
Layout analysis is a fundamental component of document image processing and a prerequisite for text recognition. It enables the segmentation of a document into semantically homogeneous units such as background, text blocks, tables, and other structural elements [3]. The main challenges of layout analysis in Historical Document Images (HDIs) stem from the heterogeneity and complexity of layouts, as well as from various degradations. As illustrated in Figure 1, HDIs suffer from various forms of degradation, such as stains, scanning artifacts, character alteration due to staining or damage, cluttered backgrounds, ink fading, and ink intensity variation.
Recent advances in document image processing and multimedia security have also demonstrated the effectiveness of hybrid-domain approaches that integrate multiple transform techniques to improve robustness, stability, and interpretability. Shubuh et al. [4] proposed a hybrid watermarking scheme that distributes watermark data across multiple transform blocks to resist geometric and noise attacks while maintaining high perceptual quality. Similarly, Kusuma and Panggabean [5] combined the Discrete Wavelet Transform (DWT), Hessenberg Decomposition (HD), and Singular Value Decomposition (SVD) to achieve enhanced numerical stability and imperceptibility under filtering, noise, and compression conditions. In a complementary review, Amrullah and Aminuddin [6] examined fragile and semi-fragile watermarking methods for tamper detection and restoration, emphasizing the need for adaptive algorithms that maintain both visual fidelity and robustness. Collectively, these studies underline how hybrid-domain fusion and intelligent restoration strategies can strengthen the reliability and interpretability of imaging systems. The introduction of deep neural networks (DNNs) has enabled recent research to substantially solve many challenges of HDIs by achieving near-perfect accuracy in different layout analysis tasks, particularly through pixel-wise semantic segmentation. They generally focus on page segmentation [7,8] as well as text line detection and classification [9,10,11,12]. However, these models are commonly considered as “black boxes”, which pose significant challenges in rendering their predictions interpretable to humans. Not even engineers or data scientists are able to clearly interpret what happens inside these models or how their results are generated. Explainable Artificial Intelligence (XAI) has emerged as a response to these challenges, aiming to enhance the transparency and interpretability of DNNs [13,14]. 
In the context of HDI analysis, the need for XAI has become particularly pressing, as it provides clear and concise explanations of model predictions.
The XAI approaches can broadly be categorized into attribution-based, perturbation-based, attention-based, and transformer-based approaches [14]. In this paper, we focus on attribution-based approaches, also referred to as visualization approaches [15]. The attribution-based approaches generally aim to highlight the most relevant features or attributes in the input that influence the model’s decision, typically through various forms of visual representation. The most common approaches are the Gradient-based methods that leverage gradients to identify the most influential input features contributing to a model’s predictions. These methods analyze how small changes in input affect the model’s output, highlighting critical regions in images for interpretability. The most widely used techniques are Layerwise Relevance Propagation (LRP) [16], DeepLift [17], and GradCAM [18].
Figure 1. Zoomed-in regions of HDIs exhibiting various types of degradation. Examples are drawn from the CB55 [19] and UTP-110 [20] datasets.
In this work, we present SegClarity at https://github.com/iheb-brini/SegClarity (accessed on 8 November 2025), a continuation of our previous framework DocSegExp [21]. SegClarity expands the methodological scope and analytical depth of explainable segmentation models by integrating additional XAI methods, refined evaluation metrics, and broader experimental validation. The key distinctions between the two works are summarized below:
  • Broader XAI Method Integration: SegClarity extends beyond the limited set of attribution-based methods used in DocSegExp by incorporating four XAI techniques: Grad-CAM, DeepLIFT, LRP, and Gradient × Input.
  • Compatibility with Perturbation-Based XAI: SegClarity is fully compatible with perturbation-based methods such as RISE and MiSuRe, revealing that document layout analysis, unlike other segmentation domains, supports only a subset of XAI techniques that can effectively capture its complex structural characteristics (see Section 5).
  • Layer-Wise Attribution Analysis: Unlike DocSegExp, which generated attributions solely from the final layer, SegClarity performs multilayer attribution analysis across five U-Net decoder layers, enabling a deeper understanding of model behavior throughout the feature hierarchy.
  • Expanded Evaluation Metrics: While DocSegExp relied primarily on ADCC and wADCC, SegClarity introduces a more comprehensive evaluation suite, including Infidelity [22], Sensitivity [22], content heatmap (CH) [23], pointing game (PG) [24], and our novel Attribution Concordance Score (ACS) for assessing both robustness and interpretability.
  • Dual Evaluation Framework: SegClarity introduces a combined quantitative and qualitative evaluation strategy. Quantitative metrics measure faithfulness and stability, while qualitative visualizations enhanced through post-processing and normalization improve interpretability and human alignment.
  • Enhanced and Diverse Experimental Validation: DocSegExp was evaluated only on a synthetic dataset, whereas SegClarity includes two real-world historical document datasets (CB55 and UTP-110), offering a more diverse and robust validation.
  • Cross-Domain Generalization: Finally, beyond document layout analysis, SegClarity demonstrates domain transferability by successfully adapting the workflow to urban scene segmentation using the Cityscapes dataset, confirming its generalizability across semantic segmentation domains.
In this paper, we investigate the explainability of DNNs in the analysis of HDIs, with a particular focus on richly decorated medieval manuscripts. These documents present specific challenges due to the presence of diverse ornaments and decorations that need to be distinguished. Additional difficulties include decorative scripts and ink bleed-through, as well as colorful and intricately decorated objects.
The main contributions of this work are summarized below:
  • We propose SegClarity, a comprehensive workflow dedicated to evaluating segmentation-based applications through advanced visualization and normalization tools and using a broad set of evaluation metrics. SegClarity contributes to making DNNs for pixel-wise semantic segmentation of HDIs more human-understandable.
  • We adapt and evaluate six state-of-the-art XAI methods and four metrics for HDI, enabling a rigorous assessment of model explanation quality.
  • We present a targeted perturbation technique that leverages semantic annotations to generate meaningful perturbations, specifically designed to enable the accurate evaluation of interpretability in pixel-wise prediction tasks with faithfulness-based metrics.
  • We introduce a metric, called the Attribution Concordance Score (ACS), designed to enhance the robustness of explainability assessments. This metric is tailored to evaluate the alignment of model attributions with the semantic and structural characteristics of HDI layouts. We demonstrate the effectiveness of our metric both quantitatively, by comparison with two state-of-the-art metrics, and qualitatively, through expert and non-expert human assessments across three different datasets.
  • We evaluate the domain transferability of the proposed workflow and validate its generalization by extending our experiments beyond HDI analysis to include Cityscapes, one of the most widely used and complex benchmarks in the state-of-the-art for urban scene segmentation.
The remainder of this paper is organized as follows. Section 2 is dedicated to a literature review of layout analysis, semantic segmentation, and explainability. Section 3 introduces the proposed attribution-based XAI workflow, presents the adapted evaluation metrics, and details both the targeted perturbation technique and the proposed ACS metric. Section 4 describes the experimental corpora and protocol, analyzes the results, and concludes with a comparative study of XAI metrics. Section 5 extends our work by illustrating the use of SegClarity on a different application domain, namely, Urban Street View, with the Cityscapes dataset. Section 6 highlights key investigations, findings, and limitations. Finally, Section 7 presents our conclusion and further work.

2. Literature Review

2.1. Explainable Semantic Segmentation Models

The general context of our work is semantic segmentation, which is the process of grouping regions of an image into classes at the pixel level. It means that each pixel in an image is assigned to a class label. Recently, several research studies have been conducted to propose reliable semantic segmentation of HDIs. In this section, we present the most relevant works in this area. Then, an overview of XAI methods applied to semantic segmentation is provided.
Grüning et al. [9] proposed ARU-Net, a variant of the U-Net model [25], for baseline detection in historical documents. Boillet et al. [26] presented the Doc-UFCN model, inspired by the dhSegment model [8], for text line segmentation. Rahal et al. [12] proposed L-U-Net to address two sub-tasks of layout analysis of HDI: page segmentation and text line detection. They showed that a smaller network with fewer parameters is well suited for the semantic segmentation of HDI. Rahal et al. [11] addressed the text line detection and classification with transfer learning strategies when only a few annotated training data are available. Da et al. [27] introduced a novel model called the Vision Grid Transformer (VGT), specifically designed for document layout analysis. Binmakhashen et al. [28] presented an extensive review of various methods and approaches used in document layout analysis. They cover techniques for page segmentation, text zone detection, text line extraction, and character recognition.
The previously mentioned works highlight the significance of the results achieved by deep learning models in the field of semantic segmentation for historical document analysis. Nevertheless, a major limitation persists: the absence of mechanisms that allow humans to understand, interpret, and trust the outputs generated by these models. This shortcoming is mainly due to the fact that the integration of XAI in this field remains largely unexplored. In contrast, most existing applications of XAI have focused on image classification tasks. For an accessible introduction to explainable image classification, we refer the reader to [29], while a more comprehensive survey on the subject can be found in [14]. However, despite being a ubiquitous task, interpretability in semantic image segmentation has not received the same level of attention. It remains a particularly challenging area of research. While it can be regarded as an extension of the relatively more intuitive task of interpretable image classification, it requires accounting for the combined influence of individually classified pixels of interest [30]. In the following, we present the most relevant works on XAI applied to semantic image segmentation.
In [31], Seg-Grad-CAM was introduced as an extension of Grad-CAM [18]. It was among the most well-known explainability techniques for semantic image segmentation. The generated saliency maps were obtained through a weighted sum of selected feature maps. Its effectiveness was demonstrated on a U-Net model trained on the Cityscapes [32] dataset. Hasany et al. [33] proposed MiSuRe, a model-agnostic two-stage method for generating saliency maps in image segmentation. They demonstrated its applicability on three diverse datasets: an artificial dataset (Triangle Dataset) [34], a medical dataset (Synapse multiorgan CT) [35], and a natural dataset (COCO-2017) [36], using both convolutional and transformer-based architectures.
The previously mentioned works indicate that the main application domains of XAI for semantic image segmentation have been focused on medicine and industry, such as building detection, pedestrian environments, and common objects. To the best of our knowledge, we are the first to tackle the challenge of applying XAI to historical document images, a particularly complex and underexplored domain.
In this context, we proposed DocSegExp, the first framework to extend XAI to the semantic segmentation of HDIs [21]. We introduced an adaptation of the evaluation metric ADCC [37], originally developed for classification, and proposed wADCC (weighted ADCC), a modification of the original metric. The metric wADCC is computed by considering the weighted average of each target class ADCC and its pixel distribution. The three attribution-based XAI methods, GradCAM [18], Gradient × Input [38], and DeepLIFT [17], were also repurposed from the classification task to the semantic segmentation of HDIs. The experiments were conducted on SynDoc12K [39], a synthetic dataset comprising 12,000 annotated historical document images, using two architectures: FCN101 [40] and ResUNet [41]. The obtained results revealed that the predictions made with ResUNet outperformed those of FCN101, although the ADCC metric indicated the opposite. This discrepancy arose because ADCC does not handle target classes equitably. In contrast, by assigning a weight to each class in the wADCC metric, we ensured a more balanced evaluation, which subsequently reflected the superior performance of ResUNet over FCN101 in the predictions.
In [42], we proposed DocXAI-Pruner, an explainable AI-driven approach for pruning semantic segmentation models for HDIs. Traditional pruning techniques rely on gradients or filter magnitudes. In contrast, DocXAI-Pruner utilizes Grad-CAM attribution maps to assess the contribution of individual channels, identifying those that are consistently less relevant across foreground target classes, while excluding the background class. Experiments were carried out on two U-Net variants: the standard U-Net (∼31 M parameters) [25] and L-U-Net (∼17,000 parameters) [12]. Models were trained on synthetic HDIs and fine-tuned on CB55, a subset of the DIVA-HisDB dataset [19]. Both structured and unstructured pruning strategies were tested, with fine-tuning configurations including full retraining, freezing pruned layers, and freezing individual channels. Results show that unstructured pruning combined with channel-freezing fine-tuning achieved the best trade-off, with parameter reductions of up to 26% and improved or preserved performance. In particular, the standard U-Net benefited most from pruning, achieving notable accuracy gains despite reduced complexity, whereas L-U-Net showed limited improvement due to its already compact design.

2.2. XAI Method Categorization

The broader AI community has given rise to XAI, a burgeoning field focused on shedding light on the rationale behind model outputs. However, given the diverse nature of tasks, it is essential to customize XAI solutions for each specific domain, as different approaches are required to address the explainability challenge effectively. XAI methods have distinct characteristics, as outlined by the research community [13,43,44]. An agnostic method is designed to operate with any model architecture, irrespective of its complexity or internal structure. In contrast, most XAI methods are model-specific, meaning they are tailored for particular architectures, such as CNNs, recurrent neural networks (RNNs), or simple multilayer perceptrons (MLPs). Another key distinction lies in the scope of the explanations: some methods focus on explaining individual predictions, known as local explanations, while others aim to provide insights into the model’s overall behavior, referred to as global explanations. XAI methods can be categorized into three classes:
  • Gradient-based methods: These are a class of XAI techniques that leverage gradients to identify the most influential input features contributing to a model’s predictions. These methods analyze how small changes in input affect the model’s output, highlighting critical regions in images for interpretability.
     One of the most widely used techniques is Grad-CAM, which generates a localization map by using gradients within a CNN to highlight critical regions for a target concept [18]. Grad-CAM uses the gradients of the target class score $y^c$ with respect to the feature map activations $A^k$ of a convolutional layer. The importance weight $\alpha_k^c$ for each feature map is computed as follows:
     $$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}$$
     where $c$ is the target class, $k$ is the filter index of a convolutional layer, $i$ and $j$ denote the spatial coordinates of the feature map, and $Z$ is the number of pixels in the feature map. The Grad-CAM heatmap $L^c$ is then obtained as follows:
     $$L^c = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$
     While Grad-CAM effectively identifies discriminative regions, it often produces coarse explanations due to its dependence on high-level feature maps. To overcome this limitation, Guided Grad-CAM [18] combines Grad-CAM with Guided Backpropagation [45], which preserves fine-grained details from lower convolutional layers. Guided Backpropagation modifies the backward ReLU operation by allowing only positive gradients to flow:
     $$R^{l} = \mathbb{1}\!\left[R^{l+1} > 0\right] \cdot \mathbb{1}\!\left[x^{l} > 0\right] \cdot \frac{\partial x^{l+1}}{\partial x^{l}}$$
     where $R^{l}$ represents the relevance map at layer $l$, and $x^{l}$ denotes the activations.
     The final Guided Grad-CAM map is obtained by element-wise multiplying the high-resolution Guided Backpropagation map with the coarse Grad-CAM map:
     $$L^c_{\text{Guided-GC}} = L^c_{\text{Grad-CAM}} \odot L^c_{\text{GuidedBP}}$$
     G*I (Gradient × Input) is a different gradient method that multiplies the gradient of the output with respect to the input by the input itself, producing a saliency map that identifies the most influential pixels on the outcome [38]. The attribution for each input pixel $x_{ij}$ in $x$ is given by the following:
     $$A_{ij} = x_{ij} \cdot \frac{\partial y}{\partial x_{ij}}$$
     where $i$ and $j$ denote the spatial coordinates of the input image.
     G*I provides an intuitive measure of how changes in the input directly affect the output prediction, highlighting influential regions in the image.
  • Perturbation-based methods: These assess feature importance by systematically modifying input data and analyzing the model response to these alterations.
    Local interpretable model-agnostic explanations (LIME) perturbs input features and fits a locally interpretable model to approximate the behavior of complex models, highlighting the contribution of individual features [46].
    Similarly, Shapley additive explanations (SHAP) leverages Shapley values from cooperative game theory, systematically perturbing input features to quantify their influence on model predictions [47]. In the same vein, the RISE method [48] generates a large number of random binary masks that selectively occlude parts of the input image. Each mask’s contribution to the model prediction is measured, and the importance of each pixel is estimated as the weighted average of these random perturbations as described in Figure 2. Unlike LIME or SHAP, RISE is model-agnostic and gradient-free, making it applicable even to non-differentiable models.
     While RISE uses random perturbations to estimate pixel importance, MiSuRe [33] takes a more targeted approach by providing a dual explanation framework. MiSuRe systematically identifies both the minimal set of features sufficient for maintaining the current prediction (sufficiency) and the minimal modifications required to alter the prediction to a different class (necessity). Through an optimization-based formulation that balances explanation compactness and prediction confidence, MiSuRe delivers complementary perspectives on feature importance via sufficient and counterfactual explanations.
  • Decomposition-based methods: These focus on breaking down a model decision-making process into interpretable components by redistributing relevance scores across input features. Unlike perturbation-based methods, which remove or modify input regions, decomposition-based techniques directly trace the flow of information through a neural network to determine how different parts of the input contribute to the output.
     A prominent example is LRP, which assigns importance scores to input pixels by backpropagating relevance from the output layer to the input, ensuring that the total relevance is conserved across layers [16]. LRP redistributes relevance scores $R$, layer by layer, using the following propagation rule:
     $$R_p = \sum_q \frac{z_{pq}}{\sum_{p'} z_{p'q}} R_q$$
     where $z_{pq}$ represents the contribution of neuron $p$ to neuron $q$, and $R_q$ is the relevance of neuron $q$ in the upper layer. The redistribution follows conservation principles, ensuring relevance is neither created nor lost.
     Another well-known method is DeepLIFT, which calculates contribution scores by comparing neuron activations to a reference, enhancing traditional gradient methods by considering neuron reference activations [17]. DeepLIFT defines the contribution score $C$ of an input neuron $x_n$ to an output neuron $y$ as
     $$C_{\Delta x_n} = (y - y_{\text{ref}}) \cdot \frac{\Delta x_n}{\sum_{n'} \Delta x_{n'}}$$
     where $y_{\text{ref}}$ is the reference activation, and $\Delta x_n$ is the difference between the actual and reference activation of neuron $n$.
    DeepLIFT ensures that small variations in the input, which do not significantly affect the output, receive small attributions.
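As a minimal, self-contained illustration of the attribution formulas above, the following sketch (ours, not part of the SegClarity codebase) computes Gradient × Input attributions for a toy linear model, where the gradient $\partial y / \partial x$ is available in closed form:

```python
import numpy as np

# Toy "model": y = sum(W * x), so dy/dx_ij = W_ij exactly.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))

def model(x):
    return float(np.sum(W * x))

def gradient_x_input(x):
    """Gradient x Input attribution: A_ij = x_ij * dy/dx_ij.

    For this linear model the gradient is simply W; for a real
    network it would come from automatic differentiation.
    """
    grad = W  # closed-form gradient of y w.r.t. x
    return x * grad

x = rng.normal(size=(4, 4))
A = gradient_x_input(x)

# Completeness check for a linear model with zero baseline:
# the attributions sum exactly to the output score.
assert np.isclose(A.sum(), model(x))
```

For an actual segmentation network, the gradient would be obtained via automatic differentiation (e.g., PyTorch autograd) rather than in closed form; the attribution rule itself is unchanged.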
In this paper, we evaluate six attribution methods, summarized in Table 1, covering diverse categories of explainability approaches. These include four gradient and decomposition methods and two perturbation-based techniques. We classify each method according to its ability to provide explanations at the input or intermediate layer levels. All six methods support input-layer explanations, while only the perturbation-based methods (RISE and MiSuRe) do not support layer-wise analysis.
For consistency, we adopt a unified notation for Grad-CAM: when applied to the input layer, we use its variant Guided Grad-CAM. However, for simplicity and readability, we collectively refer to both as Grad-CAM throughout the paper.
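For concreteness, the Grad-CAM weighting and ReLU-aggregation steps described above can be sketched as follows. Here `A` holds the feature-map activations and `dYdA` the gradients of the target-class score with respect to them; both would come from a trained network in practice, and are random placeholders in this illustrative sketch:

```python
import numpy as np

def grad_cam(A, dYdA):
    """Compute a Grad-CAM heatmap from activations and gradients.

    A, dYdA: arrays of shape (K, H, W) -- K feature maps and the
    gradients of the target-class score w.r.t. each of them.
    """
    K, H, W = A.shape
    # alpha_k^c = (1/Z) * sum_ij dY/dA^k_ij  (global average pooling)
    alpha = dYdA.reshape(K, -1).mean(axis=1)
    # L^c = ReLU(sum_k alpha_k^c * A^k)
    cam = np.maximum(np.tensordot(alpha, A, axes=1), 0.0)
    return cam

rng = np.random.default_rng(42)
A = rng.normal(size=(8, 16, 16))      # placeholder activations
dYdA = rng.normal(size=(8, 16, 16))   # placeholder gradients
heatmap = grad_cam(A, dYdA)
assert heatmap.shape == (16, 16) and (heatmap >= 0).all()
```

The resulting map has the spatial resolution of the chosen convolutional layer, which is why Grad-CAM explanations are coarse and why Guided Grad-CAM multiplies them with a high-resolution Guided Backpropagation map.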

3. Proposed Attribution-Based XAI Workflow

In this work, we propose a comprehensive workflow that elucidates the layer contributions within a DNN. Our workflow utilizes the input image, the DNN parameters, the identified or target class, the chosen XAI technique, and a specific layer of interest within the network.
This section follows the pipeline of the proposed workflow illustrated in Figure 3 and unfolds into five steps.
  • Attribution map generation has two complementary stages:
    (a)
    Segmentation adaptation reshapes dense segmentation outputs into a classification-like form (via the adapter in Algorithm A1);
    (b)
     Attribution computation using four different XAI methods to generate layer-wise maps.
  • Post-processing improves readability through a clipping step to suppress outliers and a normalization step that separates positive/negative evidence.
  • Quantitative evaluation measures explanation quality using Infidelity with targeted perturbations, Sensitivity_max, content heatmap (CH), pointing game (PG), and our novel metric, the Attribution Concordance Score (ACS).
  • Qualitative evaluation visually inspects positive/negative and blended overlays on the input images to contextualize the maps.
  • Hybrid evaluation integrates the two evaluation phases into a hybrid evaluation, by leveraging human assessments of the generated heatmaps to examine how well the evaluation metrics align with human interpretation.

3.1. Attribution Map Generation

Attribution maps serve as heatmaps conveying XAI insights to interpret model decisions. As most attribution methods were initially designed for classification models that output a single score vector per image, their application to semantic segmentation requires specific adaptations. To address this challenge, this phase is divided into two complementary stages: segmentation adaptation and attribution computation, which are described below (Further theoretical details are provided in Appendix A.1.1 and Appendix A.1.2):
  • Segmentation adaptation: We reshape the dense outputs of segmentation networks into a classification-like format, enabling the direct use of established attribution methods.
  • Attribution computation: We adapt and apply attribution methods to produce fine-grained attribution maps from selected layers of the network.
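One common way to realize the segmentation-adaptation stage is to collapse the dense per-pixel logits of the target class into a single scalar score, so that classification-style attribution methods apply directly. The sketch below illustrates this idea; it is a plausible reading of the adaptation, and the paper's Algorithm A1 may differ in detail:

```python
import numpy as np

def segmentation_to_score(logits, target_class):
    """Collapse dense segmentation logits to one scalar score,
    so classification-style attribution methods can be applied.

    logits: (C, H, W) raw class scores for one image.
    Returns the target-class logits summed over the pixels the
    model assigns to that class (a common adaptation strategy;
    the paper's Algorithm A1 may differ in detail).
    """
    pred = logits.argmax(axis=0)        # (H, W) predicted labels
    mask = (pred == target_class)       # pixels of interest
    return float(logits[target_class][mask].sum())

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 32, 32))   # 4 classes, 32x32 map
score = segmentation_to_score(logits, target_class=2)
```

Differentiating this scalar with respect to the input (or an intermediate layer) then yields an attribution map for the chosen class, exactly as in the classification setting.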

3.2. Post-Processing

Visualizing the attribution maps can be a challenging task for several reasons, primarily:
  • Noise and artifacts: Attribution maps may include noise or irrelevant artifacts inherent to the dataset [49] that can dominate the visualization if not addressed. Therefore, applying normalization and post-processing steps (e.g., smoothing, denoising, or thresholding) helps to highlight the most relevant regions and suppress irrelevant details.
  • Dynamic range and scaling: Attribution maps often have values in a wide or inconsistent range or very small floating-point values depending on the method and the software implementation used. Ignoring normalization or scaling can obscure patterns, as raw values may not map well to a perceivable range of colors or intensities.
To enhance the comprehension of the attribution maps generated from the first step of our workflow and extract relevant information encompassed inside them, two post-processing steps are introduced: (1) a Clipping step (Section 3.2.1) used to remove outliers, and (2) a Normalization step to distinguish positive and negative attribution and improve their visualization (Section 3.2.2).

3.2.1. Clipping Step

Some generated attribution maps contain exceptional values, leading to imbalances in visualization. Histogram analysis of the attribution values, as shown in Figure A1, indicates a bias toward both higher and lower values, reducing the significance of intermediate values. To tackle this issue, a refinement step is introduced to either substitute or truncate outlier data (see Figure A2).
The clipping algorithm, detailed in Algorithm A2, is designed to enhance the visualization of generated attribution maps by truncating extreme values while preserving the range of intermediate values. It first sorts the input vector of attribution values V and identifies thresholds for high and low outliers based on a user-defined percentile parameter p. Values exceeding these thresholds are replaced with the closest permissible values below or above them. This step reduces the visual imbalances in the normalized attribution maps, as demonstrated in Figure A2, which also reveals previously hidden attributions once clipping eliminates the large gap in the distribution of values. The optimal value of the parameter p was determined empirically; in what follows, p was set to 5% when generating attribution maps.
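The clipping step can be sketched with NumPy's percentile utilities; this is a plausible reading of Algorithm A2 (truncation at the p-th and (100 − p)-th percentiles), and the released code may differ in detail:

```python
import numpy as np

def clip_attributions(values, p=5.0):
    """Truncate outlier attribution values beyond the p-th and
    (100 - p)-th percentiles, preserving intermediate values."""
    lo, hi = np.percentile(values, [p, 100.0 - p])
    return np.clip(values, lo, hi)

attr = np.array([-50.0, -0.2, 0.0, 0.1, 0.3, 0.5, 80.0])
clipped = clip_attributions(attr, p=5.0)

# Extremes are pulled in; interior values are untouched.
assert clipped.min() > -50.0 and clipped.max() < 80.0
assert np.allclose(clipped[1:-1], attr[1:-1])
```

After clipping, the value distribution no longer contains a large empty gap between the bulk of the attributions and a few extreme outliers, so the subsequent normalization maps intermediate values onto a perceivable color range.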

3.2.2. Normalization Step

Conventional normalization techniques have the potential to alter neutral (zero) values, which can impact the comprehensibility of the data. In this work, we consider three different normalization techniques (symmetric, max-abs, and bipolar range) and analyze their effects through heatmap visualizations and kernel density estimation (KDE) plots.
Symmetric normalization has a major drawback: any attribution with zero (neutral) values will shift in position if the range [a,b] is not symmetric.
Figure A3 illustrates the results of normalizing two different attribution maps by means of the three aforementioned techniques (symmetric, max-abs, and bipolar range).
A comparison of the first two columns in Figure A3 reveals that the symmetric normalization (first column) differs noticeably from the max-abs normalization (second column). The symmetric normalization fails to maintain neutral attributions, resulting in misleading heatmaps. Using the symmetric normalization, we observe that in the first attribution map (first row), neutral values (yellow) are incorrectly assigned as negative (red), while in the second attribution map (second row), neutral values are misrepresented as positive (green). On the other hand, using the max-abs normalization, we note that the neutral values are preserved; however, a visual gap between the positive and negative values appears if the distance between a and b is relatively large. The KDE plots illustrated in Figure A4 confirm that the density around zero is preserved more accurately with the max-abs normalization than with the symmetric one.
A comparison of the second and third columns in Figure A3 reveals that, for the first attribution map (first row), negative values (red) are present but less apparent between the green lines in the center when using the bipolar range normalization method (third column). Similarly, for the second attribution map (second row), some positive values (green) are visible only in the third column. The KDE plots in Figure A4 confirm this observation, as the bipolar range normalization method enhances the density of extreme values while retaining neutral attributions.
Based on these observations, bipolar range normalization is the most suitable for our work, as it offers the following advantages:
  • Clear distinction between positive and negative values;
  • Representation of the relationship between attribution pixels of the target and non-target classes, enabling comparison with the ground truth;
  • Improved visuals for qualitative inspection.
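For reference, the max-abs and bipolar range normalizations discussed above can be sketched as follows (the helper names `maxabs_norm` and `bipolar_norm` are illustrative, not from the paper). Max-abs divides by the largest absolute value, preserving zeros; bipolar range rescales the positive and negative parts independently so that each spans its full half of [−1, 1].

```python
import numpy as np

def maxabs_norm(a: np.ndarray) -> np.ndarray:
    """Scale to [-1, 1] by the largest absolute value; zeros stay at zero."""
    m = np.max(np.abs(a))
    return a / m if m > 0 else a

def bipolar_norm(a: np.ndarray) -> np.ndarray:
    """Rescale positives to (0, 1] and negatives to [-1, 0) independently,
    removing the visual gap that max-abs leaves when |max| >> |min|."""
    out = np.zeros_like(a, dtype=float)
    pos, neg = a > 0, a < 0
    if pos.any():
        out[pos] = a[pos] / a[pos].max()
    if neg.any():
        out[neg] = a[neg] / abs(a[neg].min())
    return out
```

With an imbalanced map such as `[-2, 0, 1, 4]`, max-abs compresses the negatives to −0.5 at most, while the bipolar version stretches them down to −1 without touching neutral values.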

3.3. Quantitative Evaluation

In the literature, various metrics have been used to evaluate the performance of DNNs dedicated to the layout analysis task. Quantitative evaluation is crucial for objectively assessing the accuracy and limitations of these networks. To measure their performance, we calculated the accuracy and intersection over union metrics. The higher the values of the computed performance evaluation metrics, the better the results.
In addition to visually evaluating the attribution maps, their quality can be objectively measured using dedicated evaluation metrics.
Although XAI-based metrics have been widely explored in the literature, their application to semantic segmentation tasks is still relatively uncommon. In our work, we focus on quantitatively assessing the performance of XAI-based methods across different classes defined in the ground truth for semantic segmentation-based applications. To ensure a fair and constructive comparison between the four XAI methods evaluated in this work, four state-of-the-art evaluation metrics are computed.

3.4. Qualitative Evaluation

After quantitatively assessing model performance, we perform a deeper analysis of interpretability by examining attribution maps generated through the SegClarity workflow. The purpose of this evaluation is to measure how effectively input relevance explains the model predictions.
A visual example is illustrated in Figure 4, where heatmaps are used to highlight the contribution of different pixel groups to the final prediction. An association between the input regions and the model prediction is established, which will be further detailed in Section 4.5.3.

3.5. Hybrid Evaluation

In this section, we propose a hybrid approach to attribution evaluation that combines qualitative and quantitative methods. It starts by calculating objective scores from evaluation metrics and then performs human-level assessments of the generated heatmaps to evaluate how well each metric aligns with human perception. To this end, we examine the C H metric by highlighting its limitations, and introduce our novel metric, called the Attribution Concordance Score ( A C S ). A C S is specifically designed to enhance the robustness of attribution-based explanation assessments. A C S shares the properties of C H and P G , while being tailored to the specifics of our evaluation setting.

3.5.1. Infidelity

Infidelity ( I n f i d ) measures the quality of an explanation by quantifying how well it aligns with the predictor response to significant input perturbations. I n f i d is designed to evaluate the quality of an explanation ϕ ( x ) for a model f at a given input x. It specifically quantifies the expected discrepancy between the dot product of the input perturbation and the explanation, and the resulting change in the model output, thereby capturing variations in the values of f, as formalized below [22].
$\mathrm{Infid}(\phi, x) = \mathbb{E}_{\delta \sim D}\left[\left(\delta^{\top}\phi(x) - \left(f(x) - f(x-\delta)\right)\right)^{2}\right]$
where
  • ϕ ( x ) is the explanation map for input x;
  • δ denotes a perturbation on the input x sampled from a distribution D;
  • f ( x ) represents the model output on x;
  • f ( x − δ ) represents the model output on the perturbed input x − δ ;
  • E δ D denotes the expected value over δ .
Figure 4. Generated explanations for the GL class using Grad-CAM on Dec2 of the S-U-Net model. (a): Complete explanation heatmap; (b): Positive attribution heatmap; (c): Negative attribution heatmap; (d): Blended attribution heatmap superimposed on the input image.
The concept of fidelity and faithfulness has been recently applied to segmentation tasks [50]. In the context of image modality, we distinguish the target class to interpret c l and the perturbation technique. I n f i d is adapted to semantic segmentation on a class c l according to the following equation:
$\mathrm{Infid}_{cl}(\phi, x) = \mathbb{E}_{\delta \sim D}\left[\left(\phi_{cl}(x)\cdot\delta_{cl} - \left(f_{cl}(x) - f_{cl}(x-\delta_{cl})\right)\right)^{2}\right]$
where
  • ϕ c l ( x ) is the explanation map specific to class label c l for input image x;
  • δ c l denotes a perturbation that depends on the input x, the label c l , sampled from a distribution D;
  • f c l ( x ) represents the model output for label c l at input x, producing a per-pixel confidence score or probability map (i.e., selecting probability map only from target c l );
  • ϕ c l ( x ) δ c l denotes the element-wise dot product, reflecting the predicted change in the model output according to the explanation ϕ c l ( x ) ;
  • E δ D denotes the expected value over δ .
In our implementation, the perturbation distribution D corresponds to a uniform random sampling of square-like localized masks applied over the target class regions (more details in Figure 6). Each perturbation δ c l D is generated by randomly selecting spatial positions and mask sizes within a predefined range, ensuring that every sampled mask occludes a portion of the class-relevant area while preserving the overall image structure.
The I n f i d values for each individual target class are then aggregated to compute the global I n f i d score. The selected perturbations δ c l are applied to the input image x through localized noise or occlusions, to more accurately capture the structured nature of segmentation tasks.
The closer the I n f i d value is to zero, the more accurately the explanation ϕ ( x ) aligns with the true behavior of the model under perturbations. A lower I n f i d value indicates that the explanation ϕ ( x ) reliably captures the model behavior.
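Under the definition above, a Monte-Carlo estimate of I n f i d can be sketched as follows. This is a minimal illustration assuming a scalar class score f and subtractive perturbations x − δ; the function and argument names are ours, not the paper's implementation.

```python
import numpy as np

def infidelity(attr, model_out_fn, x, perturbations):
    """Monte-Carlo estimate of infidelity for one target class.

    attr: attribution map (same shape as x);
    model_out_fn(x) -> scalar class score f(x);
    perturbations: iterable of maps delta applied as x - delta.
    """
    base = model_out_fn(x)
    errs = []
    for delta in perturbations:
        pred_change = float((attr * delta).sum())      # <delta, phi(x)>
        true_change = base - model_out_fn(x - delta)   # f(x) - f(x - delta)
        errs.append((pred_change - true_change) ** 2)
    return float(np.mean(errs))
```

For a perfectly faithful explanation of a linear model (attributions equal to the weights), the predicted and true changes coincide and the estimate is zero, matching the intuition that lower values are better.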
Measuring the faithfulness of XAI methods quantifies how accurately an explanation reflects the decision-making process of a DNN and the contribution of input features. Metrics based on ablation are particularly well suited for this purpose, as they systematically remove parts of the input using perturbation methods and evaluate the resulting impact on the overall interpretation. Common perturbation methods include Gaussian noise, structured noise, or targeted occlusions, chosen based on the specific interpretability task and the characteristics of the input [22]. However, perturbations involving random pixel subsets in high-dimensional images often fail to meaningfully affect model predictions, as minor, unstructured pixel loss is typically compensated for by the surrounding context, as shown in Figure 5. This issue becomes particularly critical in tasks involving structured images, where spatial relationships are crucial, such as document layout analysis.
Figure 5 illustrates the effect of applying a random square perturbation to a localized region of an HDI and its impact on the segmentation maps. In Figure 5, a zoomed-in portion of the input image is shown, focusing on TXT and HL regions. Figure 5 displays the model prediction on the original image, which correctly segments the main text lines and highlights, compared to the ground truth shown in Figure 5. In Figure 5, a square perturbation is applied, occluding a central portion of the input. The prediction on this perturbed image, shown in Figure 5, reveals a significant degradation in segmentation performance. Notably, the occlusion introduces segmentation errors, including the appearance of new, unintended classes, such as GL, within the main text area.
In this work, we propose our perturbation method based on semantic annotations, which consists of a square removal technique for generating perturbations, specifically suited to the task of HDI segmentation. We evaluate our method using the I n f i d metric, which we specifically selected due to its design to work under large perturbations area, as opposed to sensitivity-based metrics that focus on minor input variations.
Our targeted perturbation method is based on constraining perturbations to only pixels that belong to the selected target label, as illustrated in Figure 6. This targeted perturbation method ensures that only the regions relevant to the label of interest are masked, thereby enhancing both the interpretability and relevance of the perturbation on the selected target. By selecting square regions from HDIs, where only the target pixels are considered active, we effectively evaluate the impact on model predictions while preserving the surrounding context. Our square removal technique is a suitable perturbation method for HDI segmentation, particularly in scenarios where regions have semantic significance.
Using the targeted perturbation method, we iteratively generate perturbation masks (i.e., a perturbation mask contains randomly placed square-like patches) until all target classes are covered for all the images of a dataset (see Algorithm 1). The square-like perturbation patches are randomly generated with the condition that each target class present in the input (i.e., where pixels are non-zero) is properly represented in the perturbation mask. Our method ensures that the perturbed regions contain at least one group of pixels from each class in the set of target classes T.
Algorithm 1 Procedure
Input: X (input image), T (set of target classes), GT (ground truth mask)
Output: M (perturbation mask containing multiple square patches and covering all target classes present in X)
1: repeat
2:     M ← GenerateTargetedPerturbationMask(X)
3:     C ← {GT_{i,j} | M_{i,j} = 1} ∩ T
4: until C = T
5: return M
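A minimal NumPy sketch of Algorithm 1 might look as follows; `generate_targeted_mask` folds the mask generation and the coverage test into one loop, and the patch size and count are illustrative defaults rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_targeted_mask(gt, targets, patch=4, n_patches=20):
    """Randomly place square patches until every target class is hit.

    gt: (H, W) integer ground-truth map; targets: set of class ids.
    Returns a boolean mask restricted to pixels of the target classes,
    so only class-relevant regions are occluded.
    """
    h, w = gt.shape
    target_area = np.isin(gt, list(targets))
    while True:
        mask = np.zeros((h, w), dtype=bool)
        for _ in range(n_patches):
            i = rng.integers(0, max(1, h - patch))
            j = rng.integers(0, max(1, w - patch))
            mask[i:i + patch, j:j + patch] = True
        mask &= target_area                    # keep only target-class pixels
        covered = set(np.unique(gt[mask]))     # C = {GT | M = 1} ∩ T
        if covered >= targets:                 # "until C = T"
            return mask
```

Restricting the mask to `target_area` implements the constraint that the perturbation only touches pixels of the selected labels while preserving the surrounding context.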

3.5.2. Sensitivity

Sensitivity maximum ( S e n s _ M a x ) measures the robustness of an explanation by evaluating how much it changes in response to small variations in the input. S e n s _ M a x , specifically, focuses on the maximum change in the explanation. It identifies the largest deviation in the explanation, highlighting features or areas most affected by input perturbations [22].
Formally, S e n s _ M a x for an explanation Φ of a model f at an input x with a neighborhood radius r is defined as
$\mathrm{Sens\_Max}(\Phi, f, x, r) = \max_{\|\epsilon\| < r}\left\|\Phi(f, x+\epsilon) - \Phi(f, x)\right\|$
where
  • ϵ denotes a perturbation vector applied to the input x, with  | ϵ | < r constraining it to a neighborhood of radius r around x;
  • Φ ( f , x ) is the explanation map generated by the explanation functional Φ for the model f at the input x;
  • Φ ( f , x + ϵ ) is the explanation produced for the perturbed input x + ϵ , providing insight into the model sensitivity to local changes;
  • · represents a norm (commonly the L 2 -norm), reflecting the magnitude of the difference in the explanation values.
S e n s _ M a x captures the peak response of the explanation to perturbations, indicating how stable or robust the explanation is under local changes in the input.
A low S e n s _ M a x value suggests that the explanation remains relatively consistent even when the input is slightly altered, which is desirable for trustworthy and interpretable models. In contrast, a high S e n s _ M a x value may indicate that the explanation is highly sensitive to small changes, potentially reducing the reliability of the interpretability of the adopted XAI-based method.
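In practice the maximum over the neighborhood is approximated by sampling a finite number of perturbations within the radius r (as done, for instance, in Captum's sensitivity implementation). A hedged sketch, with names of our own choosing, could be:

```python
import numpy as np

def sens_max(explainer, x, r=0.2, n_samples=10, rng=None):
    """Monte-Carlo approximation of maximum sensitivity.

    explainer(x) -> attribution map; perturbations eps are drawn uniformly
    with entries in (-r, r), and the largest L2 change in the explanation
    over the samples is returned.
    """
    rng = rng or np.random.default_rng(0)
    base = explainer(x)
    worst = 0.0
    for _ in range(n_samples):
        eps = rng.uniform(-r, r, size=x.shape)
        diff = explainer(x + eps) - base
        worst = max(worst, float(np.linalg.norm(diff)))
    return worst
```

A perfectly stable explainer (one whose output does not depend on the input perturbation) yields a score of zero, the ideal case described above.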

3.5.3. Pointing Game

Pointing game ( P G ) evaluates the precision of attribution maps by checking whether their most salient point falls within the ground truth region of the target class. Unlike overlap-based metrics, PG focuses on the single pixel with the highest attribution score and assesses whether this “pointing” corresponds to a semantically meaningful location.
A higher PG score indicates that the attribution method successfully highlights the most relevant region for the target class.
In our work, we adopt the P G metric following the formulation introduced by Zhang et al. [24]. Originally proposed for weakly supervised object localization, we extend this approach to semantic segmentation, enabling pixel-level evaluation of attribution methods.
P G c l for a target class c l is defined according to the following equation:
$PG_{cl} = \begin{cases} 1, & \text{if } \arg\max_{i,j} A_{ij} \in \{(i,j) : M_{cl}^{ij} = 1\}, \\ 0, & \text{otherwise,} \end{cases}$
where
  • P G c l represents the P G score for class c l ;
  • A i j is the attribution value at pixel ( i , j ) ;
  • M c l i j is a binary mask indicating whether pixel ( i , j ) belongs to class c l .
The overall PG score is defined as the mean over N evaluated samples, computed as
$PG = \frac{1}{N}\sum_{n=1}^{N} PG_{cl}^{(n)}.$
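The per-class hit test and its mean over samples translate directly into code; a sketch (function names ours) follows.

```python
import numpy as np

def pointing_game(attr, mask):
    """PG hit test: 1 if the max-attribution pixel lies inside the class mask."""
    i, j = np.unravel_index(np.argmax(attr), attr.shape)
    return int(mask[i, j] == 1)

def pointing_game_score(attrs, masks):
    """Mean PG score over a set of (attribution map, binary mask) pairs."""
    return float(np.mean([pointing_game(a, m) for a, m in zip(attrs, masks)]))
```

Because only the single most salient pixel is checked, PG is insensitive to how the remaining attribution mass is distributed, which is exactly the behavior the text describes.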

3.5.4. Content Heatmap

The content heatmap ( C H ) measures the overlap between the attribution values and the segmentation mask, enabling a more contextually relevant evaluation of model explanations. It works by measuring the proportion of attribution values that fall within the regions corresponding to the target class, effectively capturing the contextual relevance of the explanation. A higher CH score indicates that the model focuses its attention on semantically meaningful regions when making predictions. In our work, we use the C H metric by leveraging the entire segmentation map, following the approach proposed by Wand et al. [51]. This approach provides a novel perspective, differing from previous applications of C H that were not tailored to segmentation-based applications.
$CH_{cl} = \frac{\sum_{i,j} A_{ij} \times M_{cl}^{ij}}{\sum_{i,j} M_{cl}^{ij}}$
where
  • C H c l represents the C H value for class c l ;
  • A i j is the attribution value at pixel ( i , j ) ;
  • M c l i j is a binary mask indicating whether pixel ( i , j ) belongs to class c l .
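A direct transcription of the CH formula as printed (attribution mass inside the class mask, normalized by the mask size) might read:

```python
import numpy as np

def content_heatmap(attr, mask):
    """CH_cl: mean attribution over the pixels of the target-class mask."""
    denom = mask.sum()
    return float((attr * mask).sum() / denom) if denom else 0.0
```

Note that attributions outside the mask do not appear anywhere in this computation, which is the blind spot that motivates the ACS metric introduced next.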

3.5.5. Proposed ACS Metric

C H has been used to evaluate XAI methods by quantifying the proportion of heatmap regions overlapping with the ground truth mask of the target object relative to the entire heatmap (see Section 3.4).
Nonetheless, it suffers from a major limitation: attribution values falling outside the boundaries of the target class are disregarded. To illustrate this shortcoming, we construct a forged attribution map A F (see Equation (16)) from an original attribution map A generated using the proposed workflow (see Section 3). In  A F , the heatmap of the target region (T) remains visually unchanged, while the heatmap outside the target region is altered. The forged attribution map is computed as follows:
$\chi_T(x) = \begin{cases} 1, & \text{if } x = T \\ 0, & \text{otherwise} \end{cases}$
$M_T(A) = \left[\chi_T(A(i,j))\right]_{i \leq n,\, j \leq m}$
$A_F = A \odot M_T(A) + \mathcal{N}_{[0,1]}\left(A \odot M_{OT}(A)\right) \cdot \left(A \odot M_T(A)\right)$
M T ( A ) isolates target pixels T, M O T ( A ) isolates non-target regions, and  N [ 0 , 1 ] normalizes the irrelevant heatmap outside T. The normalization ensures consistent visual comparison after modifying non-target attributions.
Figure A5 illustrates two different attribution maps, the original attribution map generated by GradCAM (see Figure A5) and the forged attribution map (see Figure A5). Both attribution maps have identical C H values, underscoring the inability of C H to penalize attributions that fall outside the boundaries of the target class. This highlights the need for an optimized evaluation metric that accounts for attribution distributions both within and beyond the target boundaries. To overcome this limitation, we introduce in this work the A C S metric, a metric designed to evaluate attribution maps more comprehensively. An effective attribution map should fulfill several key criteria. In the context of a target class T, it is expected that the positive attributions within the attribution map will mostly emphasize pixels that are associated with class T, while the negative attributions should primarily emphasize pixels that are not associated with class T.
To implement the A C S metric, as illustrated in Figure 7, we start by extracting the attribution maps and generating a binary mask that represents class T. In this mask, pixels that belong to class T are assigned a value of 1, while pixels that are not part of T are assigned a value of 0. Then, the mask is applied to the attribution map through pixel-wise multiplication, resulting in a new attribution map referred to as the attribution on the target class, A T . This map highlights the attributions that are directly associated with the target class. Inverting the mask values yields another attribution map, referred to as the attribution on the non-target class, A T ¯ . This map includes attributions that are not linked to the target class. Our objective is to ensure that positive attributions are primarily represented in A T , while negative attributions are more concentrated in A T ¯ . To formalize this analysis, we introduce terminology that distinguishes between true positives and false negatives in A T , and false positives and true negatives in A T ¯ . By leveraging this terminology, we compute conventional evaluation metrics, such as precision, recall, accuracy, and F1-score. To enhance the robustness of attribution-based explanation assessments, we propose the A C S metric. A C S can be seen as the F1-score counterpart in our attribution-based XAI workflow.
Given a model M and an attribution map A, we first decouple A into two different components:
  • Attribution on target mask, defined as
    $A_T = A \odot \left[\chi_T(A(i,j))\right]_{i \leq n,\, j \leq m}$
    where χ is the indicator function defined in Equation (14).
  • Attribution on non-target mask, defined as
    $A_{\bar{T}} = A - A_T$
Second, we compute the following measures:
$TP = \sum_{i,j} \max\left(0, A_T(i,j)\right) \quad (\text{True Positive});$
$FN = \left|\sum_{i,j} \min\left(0, A_T(i,j)\right)\right| \quad (\text{False Negative});$
$FP = \sum_{i,j} \max\left(0, A_{\bar{T}}(i,j)\right) \quad (\text{False Positive});$
$TN = \left|\sum_{i,j} \min\left(0, A_{\bar{T}}(i,j)\right)\right| \quad (\text{True Negative}).$
After computing T P , F N , F P , and  T N , we calculate the two following evaluation metrics: precision (P) (see Equation (19)) and recall (R) (see Equation (20)).
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
Based on the precision and recall metrics, we propose the A C S metric, as defined in Equation (21). A C S corresponds to the harmonic mean of precision and recall. Its formulation is similar to that of the F1-score metric.
$ACS = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
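Putting the pieces together, the ACS computation can be sketched as follows. One assumption to flag: the summed negative attributions are taken in absolute value so that all four counts are nonnegative, which the harmonic-mean formula requires.

```python
import numpy as np

def acs(attr, target_mask):
    """Attribution Concordance Score: F1-style score over signed attributions.

    Positive mass inside the target mask counts as TP, negative mass inside
    as FN, positive mass outside as FP (all as magnitudes of summed values).
    """
    a_t = attr * target_mask            # attributions on the target class
    a_nt = attr * (1 - target_mask)     # attributions on non-target regions
    tp = np.maximum(0, a_t).sum()
    fn = abs(np.minimum(0, a_t).sum())
    fp = np.maximum(0, a_nt).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

An ideal map (positive exactly on the target, negative elsewhere) scores 1.0, while a map that is uniformly positive everywhere is penalized through the precision term, unlike CH.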

3.5.6. Adaptive ACS Metric

A C S is well suited to the context of our work and yields a more accurate evaluation than C H ; however, qualitative assessment of some attribution maps reveals cases of misalignment with the quantitative evaluation, as illustrated in Figure A6. The low A C S score can be explained by the distribution of positive attributions over non-target classes, which reduces the precision term of the A C S metric. In the first example in Figure A6, the positive attributions are concentrated on the target class GL, while other lower-valued positive attributions (in light green) appear on the background pixels. Each of the four examples illustrates positive attribution for the target class GL, though affected to some extent by the background class.
To address this issue, we propose a thresholding step that enhances the saliency of key attribution areas and decreases the impact of noise from less significant attribution pixels. The thresholding step is based on using the Otsu method [52]. The Otsu method is a thresholding technique used widely in image processing to separate the foreground from the background. It works by finding the threshold that minimizes the intra-class variance, or equivalently maximizes the inter-class variance. The Otsu threshold t corresponds to the value of t that maximizes σ B 2 ( t ) . σ B 2 ( t ) is defined according to the following equation:
$\sigma_B^2(t) = \omega_0(t)\,\omega_1(t)\left(\mu_0(t) - \mu_1(t)\right)^2$
where
  • σ B 2 ( t ) is the between-class variance at threshold t;
  • ω 0 ( t ) and ω 1 ( t ) are the probabilities of the two classes separated by the threshold t;
  • μ 0 ( t ) and μ 1 ( t ) are the means of the two classes separated by the threshold t.
In our work, the Otsu method is applied to determine optimal thresholds from the attribution maps. These thresholds are subsequently used to normalize the attribution maps, thereby improving the precision of the A C S metric. For this purpose, we adopt the following pipeline:
  • Divide the attribution map into two components: positive attributions ( $A^+$ ) and negative attributions ( $A^-$ );
  • Perform the Otsu method on $A^+$ to determine the threshold $th^+$ and identify values exceeding this threshold;
  • Generate $\tilde{A}^+$ by assigning the threshold $th^+$ to the values in $A^+$ that exceed $th^+$ ;
  • Perform the Otsu method on $A^-$ to determine the threshold $th^-$ and identify values below this threshold;
  • Generate $\tilde{A}^-$ by assigning the threshold $th^-$ to the values in $A^-$ that fall below $th^-$ ;
  • Merge $\tilde{A}^+$ and $\tilde{A}^-$ to obtain the refined attribution map $\tilde{A}$ .
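The thresholding pipeline can be sketched with a plain-NumPy Otsu implementation (in practice a library routine such as scikit-image's `threshold_otsu` could be used instead; the function names below are ours):

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: threshold maximizing the between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    total = hist.sum()
    sum_all = (hist * centers).sum()
    best_t, best_var = centers[0], -1.0
    w0, sum0 = 0.0, 0.0
    for k in range(bins - 1):
        w0 += hist[k]
        sum0 += hist[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2   # sigma_B^2(t)
        if var > best_var:
            best_var, best_t = var, centers[k]
    return best_t

def refine_attribution(attr):
    """Clip strong positives/negatives at their Otsu thresholds and merge."""
    pos = np.where(attr > 0, attr, 0.0)
    neg = np.where(attr < 0, attr, 0.0)
    out = attr.copy()
    if pos.any():
        th_pos = otsu_threshold(pos[pos > 0])
        out = np.where(out > th_pos, th_pos, out)
    if neg.any():
        th_neg = otsu_threshold(neg[neg < 0])
        out = np.where(out < th_neg, th_neg, out)
    return out
```

Clipping at the Otsu thresholds compresses the few extreme attributions that would otherwise dominate the normalization, which is what improves the precision term of A C S in our observations.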
Figure 8 illustrates the effect of applying the thresholding method to the attribution maps and its impact on the A C S metric. We observe that the quality and interpretability of the visual outputs are enhanced, and the computed A C S metric aligns well with the visual results.

3.5.7. Human Assessment

Our main objective is not only to evaluate the effectiveness of an XAI-based method, but also to determine the level of comprehension of the model (i.e., plausibility). This corresponds to measuring the ability of the model to capture meaningful structure and its alignment with human expectations or domain knowledge. The evaluation phase was divided into two parts. For the historical document datasets, the assessment was conducted by domain experts specializing in document analysis, primarily researchers and Ph.D. holders in information extraction from visually rich documents, historical document image analysis, and text-line segmentation of ancient manuscripts, including members of the IAPR community (https://iapr.org/, accessed on 8 November 2025). For the Cityscapes dataset, the evaluation was carried out by Ph.D. candidates and engineering students with expertise in image processing, particularly in aerial imagery object detection and hand gesture recognition. To ensure objectivity and eliminate potential bias, none of the authors participated in the evaluation process. Hence, we introduce an evaluation protocol for human assessment of heatmaps, assigning scores according to the specific scenario:
  • High score (✓✓): In cases where the heatmap assigns positive attribution to the pixels that represent the class of interest;
  • Medium score (✓): In cases where the heatmap does not assign positive attribution to all pixels of the class of interest or it assigns positive attribution to the pixels of the relevant class and only one additional class;
  • Low score (×): In cases where the heatmap does not assign positive attribution to the pixels that represent the target class.
Figure 9 illustrates the three human-based scoring scenarios. From an input image in the UTP-110 dataset (see Figure A10), we selected the target class body ( B D ), highlighted in yellow. In the first case (see Figure 9), the positive attributions in green are dispersed across multiple classes, resulting in a low score. In the second case (see Figure 9), most positive attributions concentrate on the B D class, although another class ( B I n i t ) also shows a noticeable response, which corresponds to a medium score. Finally, in the third case (see Figure 9), the positive attributions are strongly concentrated on the B D class, while the other classes remain neutral or negative, indicating a high score.
Among the five computed metrics, three are particularly aligned with human assessments: C H , A C S , and  P G , as they rely solely on heatmap analysis. In contrast, I n f i d and S e n s _ M a x depend on perturbations. Further experiments and analyses of these metrics are presented in Section 4.6.

4. Experiments and Results

To showcase the applicability of the proposed workflow and the different contributions, we conduct a set of thorough experiments on two types of datasets: real and artificial, which we describe in detail in the following sections.

4.1. Experimental Corpora

To analyze the performance of the four evaluated XAI-based methods on the two considered models, we conducted experiments based on both qualitative and quantitative observations derived from HDIs collected from various benchmark datasets dedicated to document layout analysis. These datasets were provided in the context of recent open competitions at the ICDAR and ICFHR conferences. Consequently, our experimental corpus consists of the following three datasets:
  • CB55 dataset (https://diuf.unifr.ch/main/hisdoc/diva-hisdb.html) (accessed on 8 November 2025): This is a freely available subset of the DIVA-HisDB dataset containing Medieval manuscripts. It is composed of 70 Latin handwritten document images digitized at 600 dpi (see Figure 10a). Four classes are defined in the ground truth: TXT (i.e., central text), HL (i.e., a special line separating paragraphs in main text), GL (i.e., text on the page sides), and BG. The CB55 dataset presents various particularities (e.g., decorations and comments written in different calligraphy styles) [19].
  • UTP-110 dataset: This is a subset of Medieval manuscripts (collection Utopia, armarium codicum bibliophilorum, Cod. 110) [20]. It contains 300 images, primarily in Latin with some sections in French (see Figure 10b). The UTP-110 images were resized to 640 × 960 pixels while preserving the original aspect ratio. Seven classes are defined in the ground truth, as shown in Figure 10: background (BG), decoration (DEC), body (BD), text line (TXTL), big initial (BInit), small initial (SInit), and filler (FIL). The UTP-110 dataset presents complex challenges for layout analysis, including various types of ornaments, decorative text, faded writing, and ink bleed-through [20].
  • Synthetic dataset: This is a collection of 150 synthetic HDIs designed with fonts and layouts that capture the key characteristics of the CB55 dataset (see Figure 10c). To ensure adequate variability, three different fonts with distinct sizes are used [12].
The two real datasets, CB55 and UTP-110, provide complementary data sources to evaluate and enhance the performance of DNNs across multiple classes and resolutions.

4.2. Experimental Protocol

In our experiments, we used the ground truth defined at pixel level to assess the performance of two deep architectures dedicated to semantic segmentation. Our experiments focus on two U-NET variants:
  • S-U-NET: This is the standard U-NET version featuring a high number of channels in each decoder block (512, 256, 128, and 64, from the bottleneck to the segmentation head), comprising over 31 million parameters [25].
  • L-U-NET: This is a lightweight version of the standard U-NET introduced by Rahal et al. [12], which has only 16 channels in each decoder block and fewer than 17 , 000 parameters.
L-U-NET and S-U-NET architectures follow the same structural design, differing only in the depth and number of channels. For instance, L-U-NET has a single convolution layer in each block, whereas S-U-NET includes two convolution layers per block.
The training process for both U-NET variants was carried out in two stages for the CB55 dataset:
  • Pre-training was performed using the synthetic dataset, during which the model with the lowest validation loss was selected;
  • Fine-tuning was performed using the best-performing model on the real CB55 dataset.
For the UTP-110 dataset, both models were trained directly on the real dataset without a pre-training phase.
In our experiments, we initially pre-trained the L-U-NET and S-U-NET models using a synthetic dataset to address the challenge posed by the limited number of training pixels in the CB55 dataset, particularly for the HL class. The synthetic dataset consists of 150 images, divided into 120, 20, and 10 for training, testing, and validation, respectively.
After pre-training, we fine-tuned the two models (L-U-NET and S-U-NET) on the CB55 dataset, which contains 70 images divided into 40, 20, and 10 for training, testing, and validation, respectively. We also fine-tuned the two deep models on the UTP-110 dataset, which comprises 100 images, divided into 60, 20, and 20 for training, testing, and validation, respectively. The two U-NET architectures were trained with a batch size of 5, using the ADAM optimizer configured with epsilon and learning rate values of $10^{-5}$ and $10^{-3}$, respectively. Cross-entropy was adopted as the loss function for training both U-NET architectures. All the experiments were performed on a Linux server equipped with 8 Tesla V100-SXM2 GPUs (32 GB memory each), 64 cores, and 755 GB RAM. Our training strategies were implemented using the PyTorch (version 2.0.1) and Captum (version 0.7.0) frameworks and executed on an Nvidia RTX 3060 GPU with 16 cores and 16 GB RAM.
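For reference, the hyperparameters and dataset splits above can be summarized in a small configuration sketch (the dictionary layout is ours, not from the released code):

```python
# Hypothetical summary of the training configuration described in the text.
TRAIN_CONFIG = {
    "batch_size": 5,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "adam_eps": 1e-5,
    "loss": "cross_entropy",
}

# (train, test, validation) image counts per dataset, as reported above.
DATASET_SPLITS = {
    "synthetic": (120, 20, 10),
    "CB55": (40, 20, 10),
    "UTP-110": (60, 20, 20),
}

def total_images(name: str) -> int:
    """Total number of images used for a given dataset."""
    return sum(DATASET_SPLITS[name])
```

The totals recover the dataset sizes quoted in the protocol (150 synthetic images, 70 CB55 images, and 100 UTP-110 images used for training).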

4.3. Explanation Parameters

To generate attribution maps, we focus on five stages of the U-NET architecture. Specifically, we analyze the decoder part, as depicted in Figure A7. The decoder blocks (Dec4 to Dec1) are paired with corresponding encoder blocks via skip connections, while the bottleneck serves as the transition between the encoder and decoder. The segmentation head at the end of the architecture, which corresponds to the final layer (FL), generates the final output.
We chose the final layer of each decoder block as it provides the most semantically informative feature representations. Earlier decoder layers often capture intermediate, task-specific transformations that are difficult to interpret independently. By combining upsampled features with skip-connected encoder outputs, the final decoder layer achieves a balance between semantic context and spatial detail. Accordingly, we focus on the final layer of each decoder block and the ultimate segmentation head.
To ensure a fair comparison of the explainability measures between the two U-NET variants, we limited our analysis to the first convolutional layer in each block.
All the attribution-based XAI methods are capable of generating both positive and negative attribution maps. Hence, in our work, we used a version of Grad-CAM without the ReLU layer, allowing negative attributions to be visualized.
Additionally, we note that each layer generates attribution maps with different spatial dimensions, as each attribution is tied to the parameters of its corresponding layer. As depicted in Table 2, lower layers have smaller dimensions compared to higher layers. For instance, the spatial dimensions of Dec1 and FL are 16× larger than those of Dec3 and 64× larger than those of Dec4. For visual comparison, we applied image interpolation to the resulting heatmaps so that they all align on the same scale.
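The rescaling step described above can be sketched as follows; the map sizes are illustrative, and simple nearest-neighbor indexing stands in for the interpolation used for the actual visualizations:

```python
import numpy as np

def upsample_heatmap(hmap, target_shape):
    """Nearest-neighbor upsampling of a 2-D attribution map to a
    common spatial scale (a stand-in for the image interpolation
    applied before visual comparison)."""
    h, w = hmap.shape
    H, W = target_shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return hmap[np.ix_(rows, cols)]

# Illustrative Dec4-level map brought up to a 64x larger FL-like scale
dec4 = np.random.randn(15, 21)
aligned = upsample_heatmap(dec4, (120, 168))
print(aligned.shape)  # (120, 168)
```

With all heatmaps on the same scale, layer-wise maps can be placed side by side, as in Figures A8 and A9.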
For attribution-based XAI evaluation, the Infid and Sens_Max metrics were computed using 10 random perturbations per image. All attribution-based methods were evaluated with batches of 3 images, except for S-U-NET with LRP, which was restricted to a batch size of 1 due to memory constraints. The input neighborhood radius r was set to 0.2, following the approach of Yeh et al. [22]. Although Yeh et al. [22] noted that varying r can affect Sens_Max outcomes, exploring this parameter within the segmentation context is computationally prohibitive, as shown in Table 7.
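As a reminder of what this evaluation computes, the infidelity of Yeh et al. [22] is a Monte-Carlo estimate of E_I[(I·Φ − (f(x) − f(x − I)))²]. A minimal numpy sketch (not the Captum implementation; the perturbation distribution and scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def infidelity(f, x, attribution, n_perturb=10, scale=0.1):
    """Monte-Carlo infidelity estimate with n_perturb Gaussian
    perturbations per input (10 per image in our experiments)."""
    vals = []
    fx = f(x)
    for _ in range(n_perturb):
        I = rng.normal(0.0, scale, size=x.shape)
        # squared gap between the attribution-predicted effect of the
        # perturbation and the actual change in the model output
        vals.append(((I * attribution).sum() - (fx - f(x - I))) ** 2)
    return float(np.mean(vals))

# Toy linear "model": its weight vector is an exact attribution,
# so the estimated infidelity should be (numerically) zero.
w = rng.random(8)
f = lambda x: float(w @ x)
x = rng.random(8)
print(infidelity(f, x, w))
```

For the segmentation models, f corresponds to the class-wise output of the U-NET, which makes each estimate far more expensive than in this toy case.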
For the perturbation mask generation, we performed a mask-size sensitivity analysis to choose the optimal mask size for the experiment. This experiment validates the choice of a 32 × 32 patch size for the infidelity metric by systematically testing different mask sizes (2 × 2, 4 × 4, 8 × 8, 16 × 16, 32 × 32, and 64 × 64) and measuring their effect on model performance using Confidence Drop, IoU Drop, DICE Drop, and Pixel Change Ratio. We fixed the L-U-NET model trained on CB55 and averaged the results over 100 iterations with 50 random patches each. The results are shown in Table 3 and Figure 11. Larger masks (64 × 64 and above) produce excessive degradation due to over-occlusion, whereas smaller ones (below 16 × 16) yield minimal response variation. The 32 × 32 mask provides an optimal trade-off between perturbation strength and attribution precision.
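The core of this sensitivity analysis can be sketched as follows, with a stand-in scoring function replacing the model's confidence (the real experiment measures the drop in softmax confidence, IoU, and DICE of the trained L-U-NET):

```python
import numpy as np

rng = np.random.default_rng(0)

def confidence(img):
    # Stand-in for the model's mean confidence on the target class;
    # here simply the mean intensity, so occlusion lowers the score.
    return img.mean()

def confidence_drop(img, mask_size, n_patches=50):
    """Average confidence drop after occluding n_patches random
    square patches of the given size (zero-filled masks)."""
    perturbed = img.copy()
    H, W = img.shape
    for _ in range(n_patches):
        y = rng.integers(0, H - mask_size + 1)
        x = rng.integers(0, W - mask_size + 1)
        perturbed[y:y + mask_size, x:x + mask_size] = 0.0
    return confidence(img) - confidence(perturbed)

img = rng.random((256, 256))
for s in (2, 4, 8, 16, 32, 64):
    print(s, round(confidence_drop(img, s), 4))
```

Even in this toy setting, the drop grows monotonically with mask size, mirroring the over-occlusion effect reported in Table 3.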

4.4. Generalization Setup

To evaluate the domain-specific behavior of explainable segmentation methods, we conducted a comparative experiment using six XAI techniques: four gradient- and decomposition-based methods (Grad-CAM, Gradient × Input, LRP, and DeepLIFT) and two perturbation-based methods (RISE and MiSuRe). The experiment was first performed on two historical document datasets, CB55 and UTP-110, focusing on the main text class to analyze how each method highlights semantically relevant regions. The main drawback of perturbation-based methods is their inability to explain intermediate layers; therefore, for this experiment, we only consider explanations of the input layer of the network. The results show that perturbation-based methods, particularly RISE, failed to produce meaningful attributions in document images due to the complex background textures and sparse textual structures, while MiSuRe yielded overly generalized activations. In contrast, the other attribution-based methods generated precise and semantically coherent explanations aligned with textual regions. To further validate the robustness and generalization of the framework, we repeated the same analysis on the Cityscapes dataset, which represents a fundamentally different visual domain. In this context, both perturbation- and attribution-based methods produced consistent and interpretable heatmaps. Table 4 summarizes the experimental configurations used for the perturbation-based methods, with parameters empirically tuned to balance mask resolution, sparsity, and computational stability across datasets.

4.5. Results

This subsection reports both predictive performance and explainability outcomes across datasets and layers. We first benchmark the two U-NET variants (L-U-NET and S-U-NET) on the CB55 (four classes) and UTP-110 (seven classes) datasets using standard segmentation metrics, including accuracy, precision, recall, F1-score, and intersection over union (IoU), which are summarized in Table 5 and detailed per class in Figure 12. We then evaluate the quality of the generated attribution maps across five layers (Dec4 → Dec1 and FL) using four different XAI metrics, including infidelity (Infid), sensitivity (Sens_Max), content heatmap (CH), and the proposed Attribution Concordance Score (ACS), aggregated as overall averages (A) and foreground-only averages (fA) in Table A1 and Table A2. Finally, we qualitatively analyze layer-wise attribution patterns with Grad-CAM and LRP on representative samples (see Figure A8 and Figure A9), relating visual trends to the quantitative scores.

4.5.1. Model Performance

The quantitative evaluation was conducted separately on the CB55 and UTP-110 datasets, which contain 4 and 7 target classes, respectively, across 5 layers (Dec4 to Dec1 and FL), as described in Section 4.3.
As shown in Table 5, S-U-NET outperforms L-U-NET in both IoU and F1-score across the CB55 and UTP-110 datasets, while L-U-NET maintains a relatively comparable performance despite having fewer parameters.
Figure 12 illustrates the comparison of performance using the IoU metric between the L-U-NET and S-U-NET models on the CB55 and UTP-110 datasets for each individual target class. For the CB55 dataset, both models achieve the same performance for the BG and TXT classes, but S-U-NET scores higher for the HL and GL classes. For the UTP-110 dataset, L-U-NET performs better on the BG, DEC, and FIL classes, while S-U-NET has a slight advantage in the other classes, particularly in the SDC class.

4.5.2. XAI Evaluation

In this work, we focus on assessing the attribution maps generated by each model. Table A1 and Table A2 compare different evaluation metrics: Infid, Sens_Max, CH, and ACS across various layers of the two U-NET models (L-U-NET and S-U-NET). In these two tables, the scores of each metric were aggregated using two techniques:
  • Average (A): This represents the average metric score across all classes, including both foreground and background;
  • Foreground average (fA): This represents the average metric score computed over the foreground classes only, excluding the background class.
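The two aggregation schemes reduce to a simple per-class average with and without the background class; the scores below are hypothetical and serve only to illustrate the computation:

```python
import numpy as np

def aggregate(scores, background="BG"):
    """Aggregate per-class metric scores as in Tables A1/A2:
    A  = average over all classes,
    fA = average over foreground classes only."""
    A = float(np.mean(list(scores.values())))
    fA = float(np.mean([v for c, v in scores.items() if c != background]))
    return A, fA

# Hypothetical per-class ACS scores on CB55 (BG, TXT, HL, GL)
scores = {"BG": 0.40, "TXT": 0.80, "HL": 0.60, "GL": 0.70}
A, fA = aggregate(scores)
print(round(A, 3), round(fA, 3))  # 0.625 0.7
```

Comparing A against fA directly exposes how much the background class inflates or deflates a given metric, which is the basis of observation (ii) below.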
Three key observations were deduced through Table A1 and Table A2:
(i) Layer-wise analysis: We observe that the best Infid, CH, and ACS scores across all XAI methods correspond to the higher layers, especially FL, while the lowest scores tend to occur in the lower layers, particularly Dec4 for ACS, and Dec4 and Dec3 for Infid. For Sens_Max, we note a similar trend for GradCAM, LRP, and G*I, with higher layers yielding better scores. However, for DeepLift, the lower layers outperform the higher ones.
(ii) Foreground vs. background impact: Analyzing the differences between A and fA provides insights into the impact of the background class on metric evaluations. For Infid, fA is consistently lower than A, indicating that the background contributes to higher Infid scores, mainly due to the inherent noise present in the background of HDIs. For Sens_Max, fA is lower than A in the lower layers, but gradually approaches A in the higher layers, eventually surpassing it for DeepLift. This suggests that DeepLift assigns greater relevance to deeper layers compared to other XAI methods. For ACS, the values do not exhibit a consistent trend, indicating that the influence of the background on this metric varies depending on the method and the layer being analyzed. This suggests that the background can have a misleading effect, as it can bias the segmentation of the foreground classes. Hence, focusing only on the foreground classes helps mitigate the background impact on the XAI metrics.
(iii) Model-wise comparison: Comparing the performance of L-U-NET and S-U-NET across different explainability metrics provides insights into the models' behavior. For Infid, L-U-NET achieves better performance than S-U-NET in terms of A. In terms of fA, L-U-NET outperforms S-U-NET for GradCAM and DeepLift, while S-U-NET achieves better results for LRP and G*I. For Sens_Max, S-U-NET generally outperforms L-U-NET, except for DeepLift on the CB55 dataset, where L-U-NET achieves superior performance. For ACS, S-U-NET consistently achieves better performance than L-U-NET across all fA evaluations, highlighting its effectiveness in maintaining stable attributions in the foreground content.

4.5.3. Qualitative Evaluation

In this section, we visually evaluate the generated heatmaps using the proposed workflow and analyze their alignment with the computed XAI metrics. We selected TXT and TXTL as target classes for the CB55 and UTP-110 datasets, respectively, as both represent textual features and thus facilitate comparison between the two datasets. We additionally selected two XAI methods, Grad-CAM and LRP, due to their fundamentally different underlying algorithms.
Figure A8 and Figure A9 present sample attribution maps obtained at different decoder layers (Dec4 to FL) from the CB55 and UTP-110 datasets, respectively. We observe differences in attribution quality and relevance distribution across the two U-NET architectures (L-U-NET and S-U-NET) and explainability methods (GradCAM vs. LRP), providing insights into the interpretability of the learned features at different layers.
In Figure A8 and Figure A9, the results of two XAI methods, GradCAM and LRP, are illustrated, with each row corresponding to one method. Each column represents the attribution map for a specific decoder layer, progressing from the deeper layers (Dec4) to the final layer (FL). The values of the Infid and ACS metrics are provided below each attribution map to quantitatively assess the quality of the attributions.
(i) CB55:
  • GradCAM: For L-U-NET, we observe that FL has strong positive attributions on the TXT pixels, with minimal noise around borders. However, intermediate layers, particularly Dec1 and Dec3, present intense noise, suggesting a negative effect caused by page borders on the model predictions. For S-U-NET, we also note similar positive attributions on the TXT pixels in FL, but with more distributed noise throughout the BG pixels. S-U-NET struggles more with noise in intermediate layers, particularly around the GL and borders, indicating challenges in distinguishing textual content from other document elements.
  • LRP: For L-U-NET, we observe that the TXT pixels have positive attributions. However, negative attributions appear only within the text lines, especially in Dec3 and Dec4, while the background shows little to no attribution. For S-U-NET, negative attributions are more pronounced within the TXT pixels compared to L-U-NET, creating gaps or holes, particularly in Dec1 and Dec2. This suggests that S-U-NET struggles to consistently recognize the main text regions, starting from Dec4 onward.
(ii) UTP-110:
  • GradCAM: For L-U-NET, we observe a progressive refinement in attribution from Dec4 to FL. In Dec4, positive attributions appear around the TXTL pixels and within the BD pixels, with negative attributions on the BG and DEC pixels. By Dec3, L-U-NET starts differentiating the FIL and DEC pixels. In Dec1, positive attributions focus on the TXTL pixels, while the FIL and DEC pixels have negative attributions. FL shows strong positive attributions for the TXTL pixels, while the FIL and BG have negative attributions, indicating the improved ability of L-U-NET to distinguish text from other document elements. For S-U-NET, we observe a similar trend to L-U-NET. In Dec4, positive attributions appear on the TXTL pixels, while the DEC pixels show strong negative attributions. By Dec3, negative attributions emerge between text lines, and in Dec2, the SDC pixels also receive negative attributions. Dec1 further reinforces positive attributions on the TXTL pixels, while keeping negative attributions for the background. FL maintains strong positive attributions for the TXTL pixels and strong negative attributions for the DEC and BG pixels. Compared to L-U-NET, S-U-NET struggles more with the DEC pixels, while L-U-NET finds the FIL pixels more challenging.
  • LRP: Both models present similar behavior. In Dec4, both models have positive attributions on the TXTL regions, with S-U-NET showing concentrated negative attributions near the FIL and DEC pixels. As we move through Dec3, Dec2, and Dec1, negative attribution fades, with increasing focus on the TXTL pixels. For FL, both models strongly attribute the TXTL regions, with earlier negative attributions disappearing. The key difference is the stronger negative attributions of S-U-NET around FIL and DEC in Dec4 compared to L-U-NET.

4.6. Human-Centric Alignment of XAI Metrics

To validate the plausibility of attributions, we introduce an evaluation protocol integrating a human assessment (HA) approach, conducted with a group of users and domain experts, as detailed in Section 3.5. The resulting human judgments are combined with the following three metrics: CH, PG, and ACS, computed across the five following layers: Dec4 → Dec1 and FL, on the CB55 and UTP-110 datasets.

4.6.1. Human-Expert Assessment

To objectively assess the attribution maps, we complement the computed XAI scores with expert evaluations provided by domain specialists experienced in analyzing historical and administrative documents. For each dataset (CB55 and UTP-110), class, layer (Dec4 → Dec1 and FL), and XAI method (GradCAM, LRP, DeepLift, and G*I), experts inspected blended overlays and rated them for semantic plausibility and focus on class-relevant regions. We use a simple three-level protocol (✓✓ for clear, unambiguous focus; ✓ for acceptable focus; and × otherwise), resolving ties by majority vote. These HA marks, reported in Table A3 and Table A4, align with the quantitative metrics (PG, CH, and ACS) and consistently favor higher layers, especially FL, where explanations concentrate on semantically meaningful regions.

4.6.2. Metric Alignment

To quantify the agreement between human assessment (HA) and the computed XAI metrics, we calculate the mean squared error (MSE) between the HA score and each metric:
MSE_m = (m - HA_score)^2
where m is the computed metric (PG, CH, or ACS).
All metrics are normalized to [ 0 , 1 ] . We map HA judgments to numeric targets as follows:
  • HA score = 1.0 (✓✓ for clear, unambiguous focus);
  • HA score = 0.6 (✓ for acceptable focus);
  • HA score = 0.0 (× for bad focus).
Lower MSE indicates closer alignment with human judgment.
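This alignment computation is straightforward; the sketch below maps the HA marks to their numeric targets and averages the squared errors over a set of layers (the ACS scores shown are hypothetical):

```python
HA_NUMERIC = {"✓✓": 1.0, "✓": 0.6, "×": 0.0}

def alignment_mse(metric_scores, ha_marks):
    """Mean squared error between a normalized XAI metric (PG, CH,
    or ACS, all in [0, 1]) and the numeric human-assessment targets."""
    assert len(metric_scores) == len(ha_marks)
    errs = [(m - HA_NUMERIC[h]) ** 2 for m, h in zip(metric_scores, ha_marks)]
    return sum(errs) / len(errs)

# Hypothetical ACS scores for Dec2, Dec1, and FL against expert marks
acs = [0.55, 0.70, 0.95]
marks = ["✓", "✓", "✓✓"]
print(round(alignment_mse(acs, marks), 4))  # 0.005
```

A metric whose scores track the expert marks closely, as ACS does on UTP-110, yields a small MSE under this mapping.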
(i) CB55: Figure 13 reports the average MSE achieved for TXT, HL, and GL across the four XAI methods using PG, CH, and ACS.
ACS aligns best for LRP and G*I. CH is strongest for DeepLift. PG excels for GradCAM in the last three layers (Dec2, Dec1, and FL). We attribute these discrepancies to dataset imbalance, especially the challenging HL class, which made human assessment of heatmaps harder and noisier.
(ii) UTP-110: Figure 14 shows the average MSE achieved for the decorator (DEC), filler (FIL), text line (TXTL), body (BD), small initial (SInit), and big initial (BInit) classes.
ACS consistently achieves the lowest MSE across methods, indicating that on a more balanced dataset ACS agrees well with human judgments of heatmap quality.

4.7. Saliency Overlay Analysis

To further investigate the interpretability of attribution maps, we select the BInit class from the UTP-110 dataset, which represents decorative glyphs with a large character on a colored background, as shown in Figure A10. We generate attribution maps using the four XAI methods across five decoder layers of S-U-NET and apply a saliency overlay restricted to the ground truth of the BInit region to visualize attribution alignment (see Figure A11).
Moving from deeper (Dec4) to shallower layers (FL) improves the quality of attribution maps, which become increasingly focused on the BInit region as noisy, off-target responses from early layers are refined.
  • For GradCAM, heatmaps in Dec4 and Dec3 show strong interference from surrounding text, but from Dec2 onward, attributions stabilize with positive activations concentrated within the BInit region.
  • LRP initially produces mixed signals in Dec4 but crucially captures the glyph’s internal “flower” symbol, and by Dec2 converges toward dense positive attribution within the ground truth area. Notably, we apply a log normalization in Equation (23) to handle extremely high attribution values in Dec1 while preserving patterns. Formally, given an attribution vector with n values
    Attr = (v_1, v_2, …, v_n),
    each value v_i is normalized as follows:
    v_{i,log} = log10(v_i) if v_i > 1; -log10(-v_i) if v_i < -1; 0 if v_i ∈ [-1, 1],
    where v_i is the i-th raw attribution value and v_{i,log} is its log-normalized form, preserving sign, compressing large magnitudes, and mapping values in [-1, 1] to zero.
  • DeepLift exhibits the most stable and consistent results, with positive attributions concentrating on the inner glyph from Dec4 and becoming densely localized on the target by Dec2.
  • In contrast, G*I fails to recognize the glyph in Dec4, producing scattered activations, but from Dec3 onward, its maps become more structured and resemble the DeepLift outputs.
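The sign-preserving log normalization of Equation (23) used for the LRP maps above is a direct element-wise transform; a numpy transcription:

```python
import numpy as np

def log_normalize(attr):
    """Sign-preserving log normalization (Equation (23)): compresses
    extreme magnitudes and maps values in [-1, 1] to zero."""
    out = np.zeros_like(attr, dtype=float)
    pos = attr > 1
    neg = attr < -1
    out[pos] = np.log10(attr[pos])      # large positives -> log scale
    out[neg] = -np.log10(-attr[neg])    # large negatives, sign kept
    return out

attr = np.array([1e4, 250.0, 0.3, -0.9, -1e3])
print(log_normalize(attr))  # [4.0, ~2.398, 0.0, 0.0, -3.0]
```

This keeps the positive/negative structure of the heatmap visible while preventing a few extreme Dec1 values from saturating the color scale.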

4.8. Computational Cost

In this section, we present the computational costs of the two U-NET models on the CB55 and UTP-110 datasets, along with the costs of the applied attribution-based XAI methods and the computed XAI metrics.
Table 6 presents the computational costs for model architectures (inference time, learning time, and number of trainable parameters) of the two investigated U-NET variants on the two datasets. We note that L-U-NET, with 17,000 trainable parameters, took 5 h to train on the CB55 dataset and 2 h on the UTP-110 dataset. On the other hand, S-U-NET, with over 31 M trainable parameters, took 10 h to train on CB55 and 3 h on UTP-110. The resolution of images in the UTP-110 dataset is 640 × 960 (615 K pixels), which is smaller than that of images in the CB55 dataset, 960 × 1344 (1.3 M pixels). This results in shorter training and inference times for the UTP-110 dataset.
Table 7 presents the computational costs (time and memory) required to generate explanations in the form of attribution maps, as well as to evaluate them with infidelity, sensitivity, ACS, and CH. It is important to point out that ACS and CH share the same values, as they have the same time and resource complexities. For instance, ACS involves generating a single attribution map per input sample and target class using a selected XAI method. It then applies cheap operations such as masking to isolate the target region and separates the attribution map into positive and negative components before computing an F1-score. Similarly, CH relies on the same attribution map and aggregates the attribution values over the target region.
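To illustrate why these post-attribution operations are cheap, the sketch below implements one simplified reading of the steps just listed: mask the map to the target region, binarize it by sign, and score the agreement via an F1-score. This is an illustration of the cost profile only, not the exact ACS formulation defined earlier in the paper:

```python
import numpy as np

def acs_sketch(attr, target_mask):
    """Simplified sketch of the cheap operations behind ACS:
    masking, sign separation, and an F1-style agreement score
    between positive attributions and the target region."""
    pred_pos = attr > 0                              # positive component
    tp = np.logical_and(pred_pos, target_mask).sum()
    fp = np.logical_and(pred_pos, ~target_mask).sum()
    fn = np.logical_and(~pred_pos, target_mask).sum()
    return 2 * tp / (2 * tp + fp + fn + 1e-12)

# Toy 2x2 map: positive attributions exactly on the target pixels
attr = np.array([[2.0, -1.0], [0.5, -3.0]])
mask = np.array([[True, False], [True, False]])
print(round(acs_sketch(attr, mask), 3))  # 1.0
```

Every operation here is a vectorized pass over one already-computed attribution map, which is why the cost of ACS (and CH) is dominated by attribution generation itself.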
The Attribution column in Table 7 refers to the computational resources used to generate the attribution maps for all the images in the test set of each dataset and all their corresponding target classes. For example, using GradCAM, the S-U-NET model took 101 s and 7.3 GB on the CB55 dataset, but only 85 s and 3.8 GB on the UTP-110 dataset. This validates that image resolution significantly impacts resource consumption. For CB55, with 4 target classes, generating a single attribution map takes approximately 1.26 s, whereas for UTP-110, with 7 target classes, it takes only 0.60 s.
From a computational perspective, the dominant cost for both metrics is attribution map generation, as this process typically involves a full forward pass through the whole model and an additional backward pass or, in the case of methods such as Grad-CAM, a partial backward pass. These steps are considerably more resource-intensive than the subsequent masking or aggregation operations, which contribute minimally to the overall time and memory requirements.
When comparing the XAI methods, LRP tends to be the slowest, while Grad-CAM is the fastest. DeepLIFT, on the other hand, has the highest memory consumption. In extreme cases, S-U-NET exceeded the 32 GB memory limit of our hardware, forcing us to switch to the CPU, which significantly increased the computation time. GradCAM and G*I have similar computation times and memory usage, primarily due to their comparable computational complexities: both rely on calculating gradients with respect to the model output via backpropagation.
Each of the computed XAI metrics uses a different algorithm and varies in resource requirements. Among them, Sens_Max is the most resource-intensive, while Infid has the highest computation time in most cases. ACS and CH have the lowest resource consumption among the computed metrics, as they rely on a single attribution map computation. Finally, among all the evaluated models, S-U-NET on CB55 consumed the most resources, while L-U-NET on UTP-110 consumed the least.

4.9. Synthesis

Based on the achieved results, a concise synthesis highlighting our key findings and observations is proposed in this section.
From an interpretability standpoint, DNNs continue to pose many challenges, even with the integration of XAI methods. For instance, attribution maps, at a superficial level, may appear to validate model decisions when observing only the final segmentation layer, but this perspective can be misleading, as earlier layers often reveal hidden biases or non-obvious decision pathways within the model. For example, on the CB55 dataset, the S-U-NET architecture exhibits a noticeable gap in attributing textual content using the LRP method (see Figure A8). Negative attributions remain ambiguous until layer Dec4, where underlying factors begin to emerge. Similarly, for L-U-NET on the same dataset, the attribution of FL suggests strong alignment with the TXT pixels while attributing all other regions negatively. However, by examining preceding layers, we observe early signs of negative attribution in the GL pixels and along the page borders. These patterns reinforce the importance of a layer-wise interpretability approach, as used in our proposed workflow, rather than relying solely on the final output layer.
Our interpretation also highlights that explanation quality improves consistently from the deeper decoder layer Dec4 to FL, as confirmed by both the Infid and ACS metrics. This reflects the decoder's role in gradually fusing abstract features with high-resolution details from the skip connections in U-NET.
We also report that S-U-NET produces more plausible and robust explanations than L-U-NET, as judged qualitatively from the heatmaps and quantitatively from its higher ACS and lower Sens_Max values. On the other hand, L-U-NET produces more faithful explanations, with low Infid values. Thus, lightweight models contribute to more predictable explanation maps that match the model prediction.
Based on the quality of the generated attributions, we draw the following conclusions regarding the attribution methods:
  • GradCAM is efficient in terms of memory consumption and capable of highlighting negative attributions through gradients in the deeper layers. However, its reliability diminishes in lower or intermediate layers due to the gradient vanishing issues.
  • LRP demands greater computational resources but produces more consistent and informative attribution maps. Its ability to detect deeper anomalies stems from its adherence to the conservation of information principle.
When comparing the two datasets, UTP-110 consistently produces coherent, high-quality heatmaps. In this balanced dataset with distinct document components, positive attributions are strongly concentrated on target classes, while negative attributions are distributed over non-target regions, which leads to higher ACS values and clearer interpretability. By contrast, the CB55 dataset is more challenging: heatmaps are harder to interpret, particularly for the HL class, which separates text paragraphs, and the GL class, which is frequently confused with the TXT class.
We conclude that the dataset composition significantly affects the model predictions as well as the interpretability of attribution maps. Selecting well-separated document components, each with distinct shapes and features, enables the model to:
  • Build more robust feature representations;
  • Assist the segmentation head in producing clearer segmentation boundaries;
  • Enhance compatibility with a wide range of XAI methods for both visual and quantitative heatmap evaluations.
While increasing the number of target classes does not inherently create new neural pathways in the network, precise ground truth annotations and well-defined class separations substantially contribute to improved model predictions and more interpretable attribution maps.

5. Generalization

This section evaluates the domain transferability of the proposed workflow and validates its generalization by extending our experiments beyond HDI. Specifically, we conduct additional experiments on Cityscapes, one of the most widely used and complex benchmarks for urban scene segmentation [32]. Cityscapes includes 5000 finely annotated and 20,000 coarsely annotated street-view photos from 50 European cities for interpreting urban scenes (see Figure 15). Captured using a car-mounted camera, Cityscapes gives pixel-level labels in a 20-class arrangement (e.g., road, building, automobile, pedestrian) and is commonly used to train and evaluate semantic segmentation for autonomous driving [53,54].
Due to the large number of samples in Cityscapes compared to the CB55 and UTP-110 datasets, only CH, PG, and ACS are computed. Both L-U-NET and S-U-NET are trained for 200 epochs, although L-U-NET suffers from underfitting due to its limited number of learnable parameters. For the Cityscapes dataset, the same XAI methods and selected layers are used as for the two HDI datasets. We also introduce a user study (see Table A5), integrating human-in-the-loop feedback, where several users manually assessed the generated heatmaps and assigned scores.
These experiments focus on evaluating how well each metric reflects an XAI method's ability to explain a target class in the input image relative to human interpretation. The end goal is to quantify inter-rater agreement among users in relation to one or more metrics.
When evaluating performance on Cityscapes, we observe that the model predicts classes missing from the ground truth. This usually happens in dense, multiclass urban datasets like Cityscapes, where one image can contain many different object types, unlike HDI datasets with only a few categories. Therefore, these issues should be taken into consideration when evaluating XAI methods: (i) dataset artifacts that can mislead explanations, (ii) model quality (under/overfitting) that yields erroneous predictions, and (iii) the capability of the explanation method itself to extract meaningful insights from the model.
To address these issues, we include two different approaches to compute the metrics. The first approach relies on computing the average metrics on the predicted target classes, which is the standard in the state-of-the-art. The second approach relies on using only the intersection between the predicted classes and the classes that are annotated in the ground truth. The motivation behind the second approach is to mitigate the impact of incorrect classes in the predicted output, which could otherwise distort the explainability measure.
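The two aggregation approaches differ only in which class set the per-class metric is averaged over; the following sketch uses hypothetical class scores to show how a spuriously predicted class can distort the standard average:

```python
def averaged_metric(per_class_scores, predicted, ground_truth, mode="predicted"):
    """Average a per-class XAI metric either over all predicted classes
    (the standard practice) or only over the intersection of predicted
    and ground-truth classes (our second approach)."""
    keep = predicted if mode == "predicted" else predicted & ground_truth
    vals = [per_class_scores[c] for c in keep]
    return sum(vals) / len(vals)

# Hypothetical Cityscapes sample: "train" is predicted but absent from
# the ground truth, and its poor score drags the average down.
scores = {"road": 0.9, "car": 0.8, "train": 0.1}
pred = {"road", "car", "train"}
gt = {"road", "car", "person"}
print(round(averaged_metric(scores, pred, gt, "predicted"), 2))     # 0.6
print(round(averaged_metric(scores, pred, gt, "intersection"), 2))  # 0.85
```

Restricting the average to the intersection removes the contribution of classes that the model hallucinated, so the resulting score reflects the explanation quality on regions that actually exist in the image.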
In Figure 16, we provide additional qualitative results using targeted heatmaps for different classes in Cityscapes. These heatmaps are generated with GradCAM and LRP applied to the last layer of the S-U-NET architecture.
The quantitative results are reported in Table A5, evaluated with different targets across the same five U-NET layers used for both the CB55 and UTP-110 datasets.
Similar to the comparative study in Section 4.6.1, we conducted additional experiments to evaluate the generalization of SegClarity on the Cityscapes dataset (see Figure 17). Across all attribution methods, ACS achieves the best performance at layer Dec2, and performs comparably to PG at the final layer (FL). However, for earlier layers (Dec4 and Dec3), both CH and ACS scored poorly. This is largely due to the low-resolution feature maps at those stages ( 32 × 64 and 64 × 128 ), where upsampling introduces significant distortion. As a result, nearly all Dec4 and Dec3 heatmaps are rated as poor by human evaluators. Therefore, for Cityscapes, higher-resolution layers (Dec2, Dec1, and FL) provide more reliable and interpretable attribution maps.
Besides the layer-wise method comparison, we also include an additional study on input-layer explanations. The qualitative results presented in Figure 18, Figure 19 and Figure 20 highlight the strong domain dependency of explainability methods in segmentation tasks. For historical document datasets (CB55 and UTP-110), gradient- and decomposition-based methods such as Grad-CAM, Gradient × Input, LRP, and DeepLIFT produced coherent and semantically meaningful attributions closely aligned with textual regions, whereas perturbation-based methods like RISE and MiSuRe failed to capture the fine-grained layout structure or produced overly diffuse activations. In contrast, on the Cityscapes dataset, all methods, including RISE and MiSuRe, generated stable and contextually relevant explanations, demonstrating their suitability for natural image domains. These observations confirm that XAI methods, although categorized as model-agnostic, must be adapted to the visual characteristics of the target domain, as document image analysis requires finer structural sensitivity and dedicated attribution strategies to achieve reliable interpretability. We can also confirm this interpretation by computing three quantitative metrics (CH, PG, and ACS) for the input-layer attributions generated on the three datasets. The results are summarized in Table 8. For the CB55 dataset, the best performance is achieved by Grad-CAM on ACS and PG, while MiSuRe yields the highest CH score. In the UTP-110 dataset, DeepLIFT obtains the best ACS and PG values, and MiSuRe again leads in CH. Finally, for the Cityscapes dataset, MiSuRe achieves the highest ACS, whereas Grad-CAM and LRP exhibit the strongest results for PG and CH, respectively.

6. Discussion

In this article, we present an explainability workflow tailored for the document analysis community and XAI practitioners working with semantic segmentation applications. Our workflow is based on attribution maps, providing a comprehensive analysis of model interpretability both visually and through a set of quantitative measures to enhance the evaluation of DNNs. By leveraging various categories of XAI methods, we demonstrated the extent of information that can be extracted from a trained DNN and how these insights can be used to interpret and compare multiple DNNs using the proposed contributions.
Our findings highlight the importance of selecting appropriate XAI methods and metrics for semantic segmentation tasks. Unlike classification, where attribution maps often align with class activation maps, segmentation poses unique challenges:
  • Attribution maps are inherently difficult to interpret, as they do not always resemble segmentation masks;
  • Applying and adapting the existing XAI methods and measures to rich datasets with pixel-wise annotations presents significant computational challenges.
In our previous work [21], we addressed these issues by introducing a framework for interpreting layout analysis models, employing three XAI methods and two evaluation metrics. The present work extends that framework by investigating the LRP method and providing a broader set of evaluation metrics. Furthermore, we demonstrated that a robust interpretability assessment should extend beyond the final output layer, as is common in the literature, and instead consider multiple levels of the network. In addition, we introduced a targeted perturbation technique for the infidelity metric and the ACS, a novel metric that aligns with human reasoning, providing mathematical justification for its formulation and showcasing its effectiveness in this context.
Throughout this work, we clearly observe the importance of carefully handling model explanations, as they provide valuable insights into model behavior. For instance, known perturbation-based methods, such as RISE, depend on both the task and the quality of the input data. As shown at the end of the generalization section, document layout analysis differs from other segmentation domains because of its density of textual pixels and its highly correlated components. MiSuRe, on the other hand, consistently produced stable results regardless of the task. We argue that MiSuRe is, at its core, suitable for a variety of data, because it includes an optimization phase that makes it more selective in its choice of saliency maps.
Visually inspecting model explanations without prior knowledge or suitable pre-processing can lead to significant misinterpretations. For instance, the LRP method outputs large values and creates a large gap in the distribution of attribution map values. Regarding the U-NET architecture, higher decoder layers tend to be more interpretable than lower layers closer to the bottleneck, because moving up the decoder allows U-NET to extract finer details from the input and produce a higher-resolution output. This observation is supported by the computed values of the ACS and Infid metrics. However, this trend is reversed for the DeepLift method when using Sens_Max. This highlights the importance of carefully selecting the appropriate XAI method based on the specific task and dataset.
As further evidence of the alignment between our metric and human perception, we conducted a human evaluation phase in which participants were asked to judge the interpretability of the generated heatmaps using the proposed workflow. Among the computed metrics (PG, CH, and ACS), ACS consistently showed the strongest agreement with human assessments. Although ACS achieved the best overall performance, it exhibited a bias toward balanced datasets, underscoring the importance of data quality and class distribution in interpretability. In addition to human evaluation, the proposed workflow was applied to a dataset beyond the field of HDI analysis. The results remained consistent, demonstrating that SegClarity can be adapted to a wide range of segmentation tasks.
The choice of layers to interpret impacts the quality of attribution. In our case, we chose the input layer, like the majority of works on XAI, as well as intermediate layers in the network. Although we only consider the decoder part of the network, other components can provide additional insight when studying such models:
  • The encoder blocks, traversed through the network in a downsampling manner, as well as the choice of encoder type to employ;
  • The skip connections, which could be studied by measuring the impact of each skip connection on its decoder block.
The results also suggest a strong correlation between model performance and interpretability, as highlighted in prior studies [15,55,56]. This work could be further developed by exploring additional explainability methods. For instance, the randomized input sampling for explanation (RISE) [48] method relies on image ablation and impact assessment to generate attribution maps; our perturbation method could be used alongside it to produce meaningful perturbations for segmentation tasks and reduce the time complexity of RISE. Another area to explore is the LRP variants, such as LRP- γ , LRP- α β , and deep Taylor decomposition (DTD). Regarding the Sens_Max metric, the perturbation radius has not yet been studied due to the high complexity and computational cost for segmentation tasks; we therefore strongly advise practitioners to seek more optimized ways of measuring sensitivity, especially for costly methods such as LRP and DeepLift.

7. Conclusions

In this work, we present a comprehensive workflow, called SegClarity, that integrates DNNs and XAI methods to advance the task of analyzing the layouts of HDIs. By addressing multiple challenges, such as the scarcity of annotated data and the complexity of document layouts, we emphasize the importance of optimized architectures and interpretable models. Our work highlights the role of XAI in enhancing the transparency and reliability of DNNs. By applying a variety of attribution- and perturbation-based XAI methods (GradCAM, DeepLift, LRP, G*I, RISE, and MiSuRe), we provide valuable insights into model behavior and decision-making processes. We also propose a custom perturbation method based on semantic annotations and tailored to pixel-wise prediction tasks. This method generates meaningful perturbations which, when combined with an evaluation metric like Infid, can assess whether the highlighted regions in an explanation effectively influence the model's predictions. The introduction of a novel XAI metric, called ACS, further strengthens the evaluation of explainability for pixel-wise segmentation, providing a robust measure for aligning model attributions with the complex layouts of HDIs. In addition to supporting existing evaluation methods, ACS offers an additional perspective on the assessment of XAI methods in terms of explanation plausibility. Through quantitative and qualitative evaluation using SegClarity, we demonstrated that layer-wise analysis reveals a deeper understanding of model behavior and enables a more effective assessment from an interpretability and explainability perspective.
A key contribution of this work is the domain independence of SegClarity and the effect of generalization on the explanations. By applying SegClarity to the Cityscapes dataset, we demonstrate its versatility in the complex task of urban scene segmentation. This generalization confirms that the layer-wise analysis, visualization tools, and ACS metric provide consistent and meaningful insights across fundamentally different applications: from detecting textual glosses in medieval manuscripts to segmenting roads, vehicles, and pedestrians in modern street scenes. These results establish SegClarity as a versatile, domain-agnostic framework for evaluating and interpreting pixel-wise prediction models. Additionally, we demonstrated that traditional perturbation methods are domain-dependent, especially in the case of RISE, and require careful consideration of their integration. These findings suggest that the document analysis domain requires specialized XAI adaptations, as not all conventional approaches are directly applicable to such structured visual data.
This work contributes to the growing field of explainability in document layout analysis, fostering trust and interpretability in DNN-based systems. Our findings serve as a benchmark for future research efforts, driving the development of advanced tools for segmenting HDIs and facilitating the preservation and dissemination of cultural heritage. Through the integration of efficient architectures and XAI methods, we aim to bridge the gap between technological innovation and practical application in the analysis of HDIs.
As future work, we plan to leverage SegClarity to actively impact model behavior during training, ensuring that the learning process is guided toward improved performance and enhanced generalization. By incorporating attribution-based insights during training, we aim to refine feature learning and reduce potential biases, leading to more robust document layout segmentation. Furthermore, we intend to extend our work by building upon the research of Achtibat et al. [57] on concept relevance propagation (CRP). Specifically, we plan to adapt and integrate CRP into our workflow, similar to the work introduced by Dreyer et al. [50]. This extension will enable us to explore concept-based attribution in HDI segmentation, further aligning model attributions with human-interpretable document structures and enhancing the overall explainability of DNNs dedicated to layout analysis. Although our work centers on document analysis as a case study, the proposed workflow and insights open promising perspectives for broader applications. In particular, they can be extended to other pixel-wise segmentation tasks, such as those in medical imaging [58,59,60] and beyond.

Author Contributions

Writing—review and editing, I.B., N.R., M.M., R.I. and N.E.B.A.; Software, I.B. and N.R.; Visualization, I.B. and N.R.; Supervision, M.M., R.I. and N.E.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the Rectorates of the University of Sousse and the University of Fribourg for supporting this work through the research scholarship program as part of a doctoral study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Attribution Generation

Appendix A.1.1. Segmentation Adaptation

XAI attribution methods are predominantly designed for image classification, where models generate a single class-score vector per image. By contrast, semantic segmentation models generate dense predictions as a tensor of size [ B , C , H , W ] , where B, C, H, and W correspond to the batch size, number of prediction classes, and the height and width of the segmentation map, respectively, with class probabilities or logits assigned independently to each pixel. Due to this mismatch in output format, classification-based XAI methods cannot be directly applied. To bridge this gap, we adapt the segmentation model output into a classification-compatible form that can be seamlessly integrated with existing XAI techniques. The core idea is to convert the per-pixel predictions into a single scalar score per class, thereby emulating the structure of a classification output. Through this adaptation, we can directly apply well-established attribution methods, eliminating the need for re-implementation.
Algorithm A1, adapted from the Captum XAI library [61] (https://captum.ai/tutorials/Segmentation_Interpret, accessed on 8 November 2025), formalizes the adaptation of segmentation outputs for integration with XAI techniques. First, we compute the model output Y and determine the original per-pixel argmax prediction, denoted pred. A one-hot mask M is then constructed to fix the predicted class at each pixel. By multiplying this mask with the original output, we retain only the logits corresponding to the predicted class for each pixel. Finally, these logits are summed over the spatial dimensions (H, W) to yield a scalar score for each class, producing a vector s ∈ R^{B×C} that mirrors a classification output. This adaptation offers several advantages. First, it preserves stability during attribution: by linking the mask to the original predictions, the target class remains consistent even when XAI techniques require input perturbations. Second, it enables the direct re-utilization of robust and optimized XAI techniques originally developed for classification tasks, ensuring that experiments are both efficient and comparable.
Algorithm A1 Segmentation Adaptation
Input: f (segmentation model), X (input batch), C (number of classes)
Output: s (class-wise scalars)
1:  Y ← f(X)  ▷ segmentation logits, Y ∈ R^{B×C×H×W}
2:  pred ← argmax_c Y  ▷ predicted class index per pixel, shape [B, 1, H, W]
3:  Construct M ∈ {0, 1}^{B×C×H×W} with M[b, c, h, w] = 1 if c = pred[b, 1, h, w], else 0
4:  Z ← Y ⊙ M  ▷ retain only logits of the predicted class at each pixel
5:  s ← Σ_{h,w} Z  ▷ aggregate over spatial dimensions; s ∈ R^{B×C}
6:  return s
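As a concrete illustration, Algorithm A1 can be sketched in a few lines of NumPy. The function name `wrap_segmentation_output` is our own choice, and a production version would typically operate on framework tensors (e.g., PyTorch) rather than arrays:

```python
import numpy as np

def wrap_segmentation_output(Y: np.ndarray) -> np.ndarray:
    """Collapse dense segmentation logits Y of shape [B, C, H, W] into
    class-wise scalars s of shape [B, C], following Algorithm A1."""
    B, C, H, W = Y.shape
    # Per-pixel argmax prediction, shape [B, H, W]
    pred = Y.argmax(axis=1)
    # One-hot mask M fixing the predicted class at each pixel
    M = (pred[:, None, :, :] == np.arange(C)[None, :, None, None]).astype(Y.dtype)
    # Keep only the logits of the predicted class, then sum over H and W
    Z = Y * M
    return Z.sum(axis=(2, 3))
```

The resulting vector mirrors a classification output and can be fed directly to classification-oriented attribution methods.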

Appendix A.1.2. Attribution Computation

Our workflow starts by generating fine-grained attribution maps using an attribution-based XAI method, an input vector x, a layer name ℓ, and a target class t to explain.
Given a segmentation model f, composed of an ordered set of layers, each layer is a parametric mapping l_i : X_{i−1} → X_i, where X_{i−1} and X_i denote the input and output spaces of the layer, respectively. The overall model is then expressed as the following composition:
f(x) = (l_n ∘ l_{n−1} ∘ ⋯ ∘ l_1)(x)
To ensure precise referencing, each layer is assigned a unique identifier or name, allowing the network to be represented as a collection of named modules. Accordingly, f is defined as follows:
f = {(name(l_1), l_1), (name(l_2), l_2), …, (name(l_n), l_n)}
Given a particular name, we denote by ℓ ∈ f the unique layer corresponding to that identifier. For example, in a U-NET segmentation model, we may refer to a decoder block by the name Dec2, so that retrieving ℓ = Dec2 gives access to its intermediate feature maps.
An attribution map is a vector of numbers that represents the method's interpretation of the model's behavior for the target class. Based on the DNN output and gradients, our workflow computes the contributions for the attribution maps, ensuring that the structure of the generated maps aligns with the shape of the corresponding layer.
First, to compute the raw prediction of a model for a given input image x, we perform a forward pass through the network f. This process yields a dense output tensor, denoted as out, which is defined as follows:
out = f(x) ∈ R^{C×H×W}
This dense, pixel-wise output is not directly suitable for attribution-based XAI methods designed for classification tasks. Therefore, we apply the adapter function (detailed in Algorithm A1) to aggregate these scores, obtaining a class-wise score vector s, defined as follows:
s = wrap(f, x, t) ∈ R^{C}
where wrap is the adapter function that adapts the segmentation output into a classification-like representation. Given the model f, input x, and target class t, it aggregates the dense prediction tensor into a vector s of length C.
We then define Attr, a generic attribution function, as follows:
Attr = g(f, s, ℓ),
where
  • f denotes the model, typically a DNN for segmentation or another prediction task;
  • s is the selected output, a scalar quantity derived from the model predictions (e.g., the logit for a specific class at a pixel, or a guided score pooled over a region of interest);
  • ℓ is the selected internal layer (input, intermediate, or output) whose contribution we aim to analyze.
This process integrates the generation of the map Attr ∈ R^{H×W} with the DNN architecture; the map is afterwards transformed into a visualization that is both informative and representative of the DNN's internal workings. The attribution map generally includes favorable (positive) attributions linked to features that support the DNN prediction, unfavorable (negative) attributions indicating deviation from the DNN judgment, and neutral attributions with limited impact.
To generate attribution maps, four state-of-the-art attribution-based XAI methods are used in our work, specifically two gradient-based techniques (Grad-CAM and G*I) and two decomposition-based approaches (LRP and DeepLIFT). Perturbation-based methods are excluded from our analysis, as we aim to introduce perturbation metrics in our assessment. Each method provides unique characteristics that contribute to achieving a more comprehensive understanding of model behavior. Grad-CAM operates through gradient propagation, while LRP relies on relevance attribution with the mathematical property of information conservation. DeepLIFT combines gradient-based analysis with reference-based comparisons, and G*I serves as a baseline method driven by neuron activation.
In the process of generating attribution maps, we have customized Grad-CAM and LRP by adapting their implementations to meet our specific application needs.
For GradCAM, it is common practice to apply a ReLU function to the generated heatmap, thereby focusing solely on the features that positively influence the prediction of the target class (https://christophm.github.io/interpretable-ml-book/pixel-attribution.html, accessed on 8 November 2025). However, to capture a more comprehensive understanding of the model behavior, including both positive and negative attributions, we have modified the Grad-CAM implementation by omitting the ReLU operation.
For LRP, it is imperative to tailor the attribution map generation function by assigning propagation rules that correspond to the specific operations of each layer. This customization ensures that the relevance scores are accurately propagated through the network. For instance, convolutional layers, which apply filters to extract spatial features, should utilize the ϵ -rule to maintain numerical stability during relevance propagation. Activation functions, such as ReLU, which introduce non-linearity by zeroing out negative inputs, are best suited to the identity rule, allowing relevance to pass through unchanged. Normalization layers, such as InstanceNorm, which standardize the input to have a specific mean and variance, should also employ the ϵ -rule to ensure consistent relevance distribution. A comprehensive research work focusing on selecting the appropriate propagation rules for various layer types in DNNs was introduced by Montavon et al. [62].
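A minimal sketch of such a per-layer rule assignment is given below. The `lrp_rule_for` helper and the string rule names are illustrative only; a real implementation would attach rule objects (e.g., an ϵ-rule or identity rule) to the corresponding network modules through the LRP library in use:

```python
def lrp_rule_for(layer_type: str) -> str:
    """Select an LRP propagation rule per layer type, mirroring the choices
    described above: epsilon-rule for convolution and normalization layers,
    identity rule for activations. Rule names here are illustrative."""
    rules = {
        "Conv2d": "epsilon",         # numerical stability for spatial filters
        "InstanceNorm2d": "epsilon", # consistent relevance through normalization
        "ReLU": "identity",          # pass relevance through unchanged
    }
    # Conservative default for layer types not listed explicitly
    return rules.get(layer_type, "epsilon")
```

In practice, one would iterate over the named modules of the network and assign the rule returned for each module's type before running the relevance propagation.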
The generated attribution maps are subjected to additional processing to improve their visual quality. This involves separating positive and negative attributions and overlaying them on the input image to generate a combined heatmap. In Figure 4, we display positive and negative attributions in distinct columns, providing valuable insights into the significance of features for model predictions. To further elucidate the model's perspective, we present a blended heatmap superimposed on the input image. This visualization integrates the highest positive and negative attribution values, overlaid with an opacity of 0.5 to enhance visual interpretability. In this example (Figure 4), a 70% threshold for both positive and negative attribution values and an opacity level of 0.5 were selected based on empirical observations. These parameters are adjustable and can be tailored to the specific method employed and the characteristics of the input image. Using both separate and blended visualizations helps to better understand model behavior and supports its refinement and validation.
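A possible NumPy sketch of this blending step is shown below. The `blend_heatmap` function and its defaults (70% quantile threshold, 0.5 opacity) mirror the empirical settings described above, but the function itself is illustrative, assuming a grayscale image in [0, 1] and a signed attribution map of the same spatial shape:

```python
import numpy as np

def blend_heatmap(image: np.ndarray, attr: np.ndarray,
                  threshold: float = 0.70, alpha: float = 0.5) -> np.ndarray:
    """Overlay the strongest positive (green) and negative (red) attributions
    on a grayscale image. `threshold` keeps only values above the given
    quantile of the positive (resp. negative) attributions; `alpha` is the
    overlay opacity."""
    rgb = np.stack([image] * 3, axis=-1).astype(float)
    # Quantile cutoffs computed separately for positive and negative values
    pos_cut = np.quantile(attr[attr > 0], threshold) if (attr > 0).any() else np.inf
    neg_cut = np.quantile(-attr[attr < 0], threshold) if (attr < 0).any() else np.inf
    overlay = rgb.copy()
    overlay[attr >= pos_cut] = [0.0, 1.0, 0.0]   # strong positive -> green
    overlay[-attr >= neg_cut] = [1.0, 0.0, 0.0]  # strong negative -> red
    # Alpha-blend the colored overlay with the original image
    return (1 - alpha) * rgb + alpha * overlay
```

Both the quantile threshold and the opacity can be tuned per method and per input image, as noted above.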

Appendix A.1.3. Post-Processing Steps

(i) Symmetric normalization is used to scale the attribution values from a range of [a, b] (where a, b ∈ R and a < b) to [−1, 1].
Given a vector V, the normalization of V such that its components v_i are scaled to the interval [−1, 1] can be performed using Equation (A2):
v_{i,normalized} = 2 × (v_i − V_min) / (V_max − V_min) − 1
where V_min and V_max are the minimum and maximum values of the components of V, respectively.
(ii) Max-abs normalization is a non-standard symmetric technique that scales the attributions by the maximum absolute value of the attribution vector, preserving the initial neutral attribution values. This ensures that the values are constrained within the range [−1, 1]. This normalization is defined according to Equation (A3):
v_{i,normalized} = v_i / max(|V|)
where |V| denotes the element-wise absolute values of the attributions.
(iii) Bipolar range normalization is a tailored normalization method used to emphasize the effect of each type of attribution (positive, negative, and neutral). A small positive value ϵ is first set to represent the neutral attributions. Positive attributions are scaled to the range [ϵ, 1], negative attributions are scaled to the range [−1, −ϵ], and near-zero values remain unchanged.
Given a vector V = (v_1, v_2, …, v_n), each component v_{i,normalized} of the normalized vector V_normalized is computed according to Equation (A4):
v_{i,normalized} =
    v_i / V_max     if v_i > ϵ,
    0               if |v_i| ≤ ϵ,
    v_i / |V_min|   if v_i < −ϵ,
where V_min and V_max are the minimum and maximum values of the components of V, respectively.
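The three normalization schemes (Equations (A2)-(A4)) can be sketched in NumPy as follows; the function names are ours, and the ϵ threshold in the bipolar variant is an assumed default:

```python
import numpy as np

def symmetric_norm(v: np.ndarray) -> np.ndarray:
    """Equation (A2): min-max scaling to the interval [-1, 1]."""
    return 2 * (v - v.min()) / (v.max() - v.min()) - 1

def max_abs_norm(v: np.ndarray) -> np.ndarray:
    """Equation (A3): divide by the maximum absolute value, preserving zeros."""
    return v / np.abs(v).max()

def bipolar_norm(v: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Equation (A4): scale positives by V_max, negatives by |V_min|,
    and zero out near-neutral values within [-eps, eps]."""
    out = np.zeros_like(v, dtype=float)
    out[v > eps] = v[v > eps] / v.max()
    out[v < -eps] = v[v < -eps] / abs(v.min())
    return out
```

Note that only the bipolar variant keeps the sign structure of the attributions intact while still mapping both polarities onto comparable ranges.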
Algorithm A2 Clipping step
Input: V (vector of attribution values), p (percentile parameter)
Output: V (post-processed vector)
1:  N ← length(V)
2:  S ← sort(V)
3:  k ← ⌊p × N⌋
4:  V_high ← S[N − k]
5:  V_low ← S[k − 1]
6:  for i ← 1 to N do
7:    if V[i] ≥ V_high then
8:      V[i] ← S[N − k − 1]
9:    else if V[i] ≤ V_low then
10:     V[i] ← S[k − 2]
11:   end if
12: end for
13: return V
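A runnable sketch of Algorithm A2 is given below. The name `clip_attributions` is our own, and the boundary indices are adapted to 0-based arrays: the low-side replacement uses the smallest retained value, rather than the pseudocode's 1-based S[k − 2]:

```python
import numpy as np

def clip_attributions(V: np.ndarray, p: float) -> np.ndarray:
    """Algorithm A2 (sketch): replace the top and bottom p-fraction of
    attribution values with the nearest retained value, damping outliers
    while preserving the overall distribution."""
    V = np.asarray(V, dtype=float).copy()
    N = V.size
    S = np.sort(V.ravel())
    k = int(p * N)  # number of values clipped on each side
    if k < 1:
        return V  # nothing to clip for very small p
    v_high, v_low = S[N - k], S[k - 1]   # clipping thresholds
    V[V >= v_high] = S[N - k - 1]        # largest retained value
    V[V <= v_low] = S[k]                 # smallest retained value
    return V
```

With p set to 0.01 or 0.05, this reproduces the 1% and 5% clipping proportions illustrated in Figure A1.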
Figure A1. Histogram analysis showcasing the distribution of the attribution values before and after applying the clipping algorithm. The histogram indicates how clipping extreme values with proportions (1% and 5%) reduces the bias caused by outliers while preserving the overall distribution.
Figure A2. Impact of the refinement step on an attribution map obtained by the LRP method. On each row, from left to right: the original attribution without normalization, after applying bipolar normalization, and after introducing the clipping algorithm with a 1% and 5% reduction of attributions.
Figure A3. A visual comparison of three normalization techniques applied to two different attribution maps (rows). Each column corresponds to the resulting heatmap before applying any normalization (first column) and after applying a specific normalization method: symmetric (see Equation (A2)), max-abs (see Equation (A3)), and bipolar range (see Equation (A4)). The fifth column shows the heatmap color bar: green for positive attributions, red for negative attributions, and yellow for neutral values.
Figure A4. Log-scaled KDE plots of attribution values for two different attributions, displayed row-wise. The first row corresponds to the first attribution, and the second row corresponds to the second attribution. The left column shows the KDE of the original attribution maps before normalization, while the right column compares KDEs of attribution values across three different normalization techniques: symmetric, max-abs, and bipolar range.
Figure A5. Comparison of the original and forged attribution maps. Despite visual differences, both attribution maps yield identical CH values, highlighting CH's failure to penalize attributions outside the target class. (a) Attribution with GradCAM, CH = 0.574, ACS = 0.010. (b) Forged attribution, CH = 0.574, ACS = 0.030.

Appendix A.2. ACS Metric

Figure A6. Limitation of the ACS metric illustrated when using the proposed workflow by setting different model layers with GradCAM and the GL target class. (a) GradCAM, Dec3, CH = 0.644, ACS = 0.240. (b) GradCAM, Dec1, CH = 0.689, ACS = 0.185. (c) GradCAM + ReLU, Dec3, CH = 0.480, ACS = 0.241. (d) GradCAM + ReLU, Dec1, CH = 0.556, ACS = 0.185.

Appendix A.3. U-NET-Model

Figure A7. Overview of the U-NET architecture, highlighting the encoder, bottleneck, and decoder blocks, along with the final layer (FL).

Appendix A.4. Explanation Results

Appendix A.4.1. Quantitative Results

Table A1. Results of the XAI evaluation metrics on the CB55 dataset. Bold values indicate the best recorded metric for each method and model. The ↑ symbol indicates that higher values are better; ↓ indicates that lower values are better.
L-U-NET | GradCAM
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     0.86 / 0.39         0.09 / 0.06            0.71 / 0.78      0.57 / 0.53
Dec3     0.85 / 0.32         0.06 / 0.05            0.72 / 0.80      0.74 / 0.67
Dec2     0.73 / 0.32         0.09 / 0.04            0.71 / 0.72      0.74 / 0.67
Dec1     0.72 / 0.32         0.03 / 0.03            0.77 / 0.77      0.64 / 0.66
FL       0.60 / 0.31         0.02 / 0.02            0.81 / 0.83      0.99 / 0.99

S-U-NET | GradCAM
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     1.13 / 0.33         0.06 / 0.06            0.61 / 0.65      0.30 / 0.33
Dec3     1.16 / 0.32         0.02 / 0.02            0.62 / 0.60      0.50 / 0.35
Dec2     1.12 / 0.31         0.03 / 0.03            0.67 / 0.65      0.73 / 0.64
Dec1     1.17 / 0.31         0.03 / 0.03            0.76 / 0.73      0.75 / 0.67
FL       1.10 / 0.31         0.01 / 0.01            0.80 / 0.83      1.00 / 1.00

L-U-NET | LRP
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     0.85 / 0.41         1.75 / 1.89            0.55 / 0.60      0.48 / 0.57
Dec3     0.85 / 0.33         1.83 / 1.98            0.61 / 0.57      0.63 / 0.77
Dec2     0.83 / 0.33         1.83 / 1.86            0.58 / 0.55      0.73 / 0.79
Dec1     0.71 / 0.35         1.76 / 1.85            0.55 / 0.51      0.73 / 0.77
FL       0.65 / 0.32         0.11 / 0.14            0.90 / 0.87      0.91 / 0.88

S-U-NET | LRP
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     1.18 / 0.33         0.78 / 0.78            0.49 / 0.44      0.52 / 0.45
Dec3     1.19 / 0.35         0.90 / 0.89            0.54 / 0.52      0.66 / 0.66
Dec2     1.16 / 0.31         0.97 / 0.98            0.52 / 0.48      0.90 / 0.91
Dec1     1.16 / 0.32         1.31 / 1.37            0.54 / 0.50      0.76 / 0.81
FL       1.10 / 0.32         0.06 / 0.08            0.92 / 0.90      0.96 / 0.95

L-U-NET | DeepLift
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     0.87 / 0.33         0.04 / 0.02            0.53 / 0.54      0.77 / 0.88
Dec3     0.84 / 0.32         0.03 / 0.02            0.55 / 0.55      0.96 / 0.98
Dec2     0.82 / 0.32         0.04 / 0.03            0.60 / 0.61      0.96 / 0.99
Dec1     0.57 / 0.31         0.04 / 0.04            0.65 / 0.65      0.98 / 0.98
FL       0.58 / 0.31         0.07 / 0.08            0.67 / 0.68      0.99 / 0.99

S-U-NET | DeepLift
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     1.16 / 0.33         0.05 / 0.03            0.39 / 0.41      0.42 / 0.47
Dec3     1.13 / 0.32         0.03 / 0.02            0.60 / 0.57      0.78 / 0.94
Dec2     1.12 / 0.31         0.04 / 0.03            0.57 / 0.54      0.83 / 1.00
Dec1     1.08 / 0.32         0.05 / 0.05            0.67 / 0.71      0.75 / 0.99
FL       1.09 / 0.32         0.10 / 0.10            0.69 / 0.74      0.76 / 1.00

L-U-NET | G*I
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     0.82 / 0.40         0.17 / 0.14            0.50 / 0.54      0.65 / 0.81
Dec3     0.80 / 0.32         0.08 / 0.05            0.59 / 0.54      0.77 / 0.98
Dec2     0.78 / 0.32         0.19 / 0.08            0.56 / 0.56      0.95 / 0.98
Dec1     0.60 / 0.32         0.11 / 0.10            0.56 / 0.53      0.96 / 0.96
FL       0.60 / 0.31         0.13 / 0.17            0.63 / 0.62      0.96 / 0.94

S-U-NET | G*I
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     1.18 / 0.33         0.11 / 0.10            0.55 / 0.48      0.59 / 0.53
Dec3     1.20 / 0.35         0.10 / 0.08            0.57 / 0.55      0.71 / 0.70
Dec2     1.16 / 0.31         0.05 / 0.04            0.54 / 0.53      0.92 / 0.94
Dec1     1.15 / 0.32         0.04 / 0.04            0.58 / 0.60      0.98 / 0.99
FL       1.10 / 0.31         0.02 / 0.03            0.58 / 0.58      1.00 / 1.00
Table A2. Results of the XAI evaluation metrics on the UTP-110 dataset. Bold values indicate the best recorded metric for each method and model. The ↑ symbol indicates that higher values are better; ↓ indicates that lower values are better.
L-U-NET | GradCAM
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     4.71 / 1.09         0.10 / 0.11            0.63 / 0.61      0.47 / 0.39
Dec3     5.78 / 1.08         0.07 / 0.08            0.67 / 0.67      0.50 / 0.43
Dec2     4.89 / 1.10         1.70 / 1.98            0.74 / 0.73      0.63 / 0.57
Dec1     1.58 / 1.06         0.04 / 0.05            0.81 / 0.79      0.63 / 0.74
FL       1.29 / 1.03         0.04 / 0.04            0.86 / 0.85      0.96 / 0.95

S-U-NET | GradCAM
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     3.25 / 1.41         0.04 / 0.04            0.64 / 0.62      0.46 / 0.47
Dec3     2.71 / 2.50         0.03 / 0.03            0.65 / 0.62      0.59 / 0.55
Dec2     3.49 / 1.21         0.03 / 0.03            0.72 / 0.72      0.52 / 0.60
Dec1     3.79 / 1.08         0.02 / 0.02            0.76 / 0.78      0.76 / 0.87
FL       1.42 / 1.09         0.01 / 0.01            0.85 / 0.83      0.96 / 0.95

L-U-NET | LRP
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     7.22 / 2.33         2.18 / 2.20            0.50 / 0.48      0.54 / 0.57
Dec3     6.93 / 2.31         2.21 / 2.21            0.44 / 0.42      0.48 / 0.53
Dec2     6.32 / 2.04         2.40 / 2.40            0.40 / 0.38      0.49 / 0.50
Dec1     3.52 / 2.39         2.51 / 2.56            0.43 / 0.40      0.39 / 0.45
FL       1.31 / 1.05         0.15 / 0.16            0.94 / 0.93      0.94 / 0.93

S-U-NET | LRP
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     4.94 / 2.34         1.12 / 1.10            0.47 / 0.42      0.59 / 0.64
Dec3     4.51 / 3.35         1.21 / 1.07            0.45 / 0.44      0.63 / 0.60
Dec2     3.77 / 1.35         1.35 / 1.26            0.50 / 0.48      0.66 / 0.71
Dec1     3.95 / 1.18         1.40 / 1.33            0.48 / 0.45      0.57 / 0.63
FL       1.45 / 1.12         0.07 / 0.07            0.93 / 0.92      0.95 / 0.94

L-U-NET | DeepLift
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     4.69 / 1.20         0.07 / 0.06            0.60 / 0.64      0.82 / 0.81
Dec3     4.98 / 1.28         0.07 / 0.06            0.63 / 0.68      0.87 / 0.87
Dec2     1.28 / 1.04         0.09 / 0.10            0.69 / 0.69      0.85 / 0.83
Dec1     1.26 / 1.01         0.09 / 0.10            0.73 / 0.73      0.89 / 0.87
FL       1.28 / 1.03         0.14 / 0.16            0.79 / 0.78      0.88 / 0.86

S-U-NET | DeepLift
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     1.77 / 1.28         0.02 / 0.02            0.53 / 0.54      0.80 / 0.77
Dec3     1.88 / 1.58         0.02 / 0.02            0.60 / 0.61      0.91 / 0.89
Dec2     1.52 / 1.16         0.03 / 0.03            0.62 / 0.63      0.92 / 0.91
Dec1     1.39 / 1.07         0.03 / 0.04            0.71 / 0.71      0.94 / 0.93
FL       1.42 / 1.10         0.05 / 0.06            0.77 / 0.76      0.96 / 0.95

L-U-NET | G*I
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     6.85 / 1.95         0.33 / 0.30            0.52 / 0.49      0.73 / 0.75
Dec3     6.50 / 1.85         0.22 / 0.19            0.54 / 0.52      0.72 / 0.79
Dec2     5.45 / 1.51         0.27 / 0.21            0.50 / 0.48      0.70 / 0.75
Dec1     2.30 / 1.87         0.28 / 0.29            0.49 / 0.47      0.72 / 0.84
FL       1.29 / 1.05         0.07 / 0.08            0.75 / 0.74      0.96 / 0.95

S-U-NET | G*I
Layer    Infid ↓ (A / fA)    Sens_Max ↓ (A / fA)    CH ↑ (A / fA)    ACS ↑ (A / fA)
Dec4     4.82 / 2.24         0.08 / 0.07            0.51 / 0.46      0.67 / 0.73
Dec3     4.10 / 3.23         0.08 / 0.07            0.52 / 0.50      0.83 / 0.81
Dec2     3.55 / 1.29         0.10 / 0.08            0.61 / 0.58      0.86 / 0.90
Dec1     3.83 / 1.09         0.11 / 0.05            0.68 / 0.66      0.89 / 0.93
FL       1.43 / 1.11         0.03 / 0.04            0.76 / 0.73      0.96 / 0.95

Appendix A.4.2. Qualitative Results

Figure A8. Attribution maps for decoder layers (Dec4 to FL) in L-U-NET and S-U-NET, evaluated on the CB55 dataset targeting main text pixels (TXT). Rows (a,b) show GradCAM; rows (c,d) show LRP. Infid and ACS are displayed below each panel to quantify attribution quality across layers and methods.
Figure A9. Attribution maps for decoder layers (Dec4 to FL) in L-U-NET and S-U-NET, evaluated on the UTP-110 dataset targeting text line pixels (TXTL). Rows (a,b) use GradCAM; rows (c,d) use LRP. Infid and ACS are shown below each panel.

Appendix A.4.3. Assessment

Table A3. Performance comparison of the GradCAM, LRP, DeepLift, and G*I methods. Each block shows the computed metrics: human assessment (HA), pointing game (PG), content heatmap (CH), and Attribution Concordance Score (ACS), across different layers for each XAI method on the CB55 dataset. The ✓✓, ✓, and × symbols indicate good, medium, and bad evaluation scores, respectively.
GradCAM

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| TXT | Dec4 | ✓ | 0.000 | 0.488 | 0.495 |
|  | Dec3 | ✓✓ | 1.000 | 0.613 | 0.984 |
|  | Dec2 | ✓✓ | 1.000 | 0.594 | 0.998 |
|  | Dec1 | ✓✓ | 1.000 | 0.618 | 0.999 |
|  | FL | ✓✓ | 1.000 | 0.715 | 0.999 |
| HL | Dec4 | ✓ | 0.308 | 0.470 | 0.000 |
|  | Dec3 | ✓ | 0.308 | 0.584 | 0.000 |
|  | Dec2 | ✓ | 0.308 | 0.608 | 0.019 |
|  | Dec1 | ✓✓ | 0.538 | 0.620 | 0.038 |
|  | FL | ✓ | 0.692 | 0.757 | 0.996 |
| GL | Dec4 | ✓ | 0.000 | 0.645 | 0.498 |
|  | Dec3 | ✓ | 0.000 | 0.511 | 0.153 |
|  | Dec2 | ✓ | 1.000 | 0.589 | 0.907 |
|  | Dec1 | ✓✓ | 1.000 | 0.607 | 0.982 |
|  | FL | ✓✓ | 1.000 | 0.712 | 0.993 |

LRP

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| TXT | Dec4 | ✓✓ | 0.100 | 0.534 | 0.518 |
|  | Dec3 | ✓✓ | 0.850 | 0.391 | 0.929 |
|  | Dec2 | ✓ | 0.700 | 0.419 | 0.922 |
|  | Dec1 | ✓ | 0.700 | 0.526 | 0.855 |
|  | FL | ✓✓ | 0.500 | 0.923 | 0.966 |
| HL | Dec4 | ✓✓ | 0.462 | 0.408 | 0.200 |
|  | Dec3 | ✓✓ | 0.615 | 0.532 | 0.818 |
|  | Dec2 | ✓✓ | 0.692 | 0.456 | 0.853 |
|  | Dec1 | ✓✓ | 0.538 | 0.335 | 0.724 |
|  | FL | ✓✓ | 0.308 | 0.890 | 0.953 |
| GL | Dec4 | ✓✓ | 0.278 | 0.440 | 0.597 |
|  | Dec3 | ✓✓ | 0.222 | 0.555 | 0.303 |
|  | Dec2 | ✓✓ | 0.833 | 0.496 | 0.899 |
|  | Dec1 | ✓✓ | 0.667 | 0.553 | 0.799 |
|  | FL | ✓✓ | 0.333 | 0.872 | 0.943 |

DeepLift

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| TXT | Dec4 | ✓✓ | 0.000 | 0.367 | 0.573 |
|  | Dec3 | ✓✓ | 1.000 | 0.523 | 0.995 |
|  | Dec2 | × | 1.000 | 0.543 | 0.999 |
|  | Dec1 | × | 1.000 | 0.712 | 0.995 |
|  | FL | × | 1.000 | 0.726 | 0.999 |
| HL | Dec4 | ✓✓ | 0.385 | 0.398 | 0.205 |
|  | Dec3 | ✓✓ | 0.692 | 0.571 | 0.988 |
|  | Dec2 | ✓✓ | 0.692 | 0.571 | 0.999 |
|  | Dec1 | ✓✓ | 0.692 | 0.755 | 0.991 |
|  | FL | ✓✓ | 0.692 | 0.789 | 0.994 |
| GL | Dec4 | ✓ | 0.000 | 0.505 | 0.618 |
|  | Dec3 | × | 0.333 | 0.568 | 0.850 |
|  | Dec2 | × | 1.000 | 0.487 | 0.993 |
|  | Dec1 | × | 1.000 | 0.665 | 0.988 |
|  | FL | × | 1.000 | 0.700 | 0.996 |

G*I

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| TXT | Dec4 | ✓✓ | 0.000 | 0.493 | 0.627 |
|  | Dec3 | ✓✓ | 0.950 | 0.497 | 0.994 |
|  | Dec2 | ✓✓ | 1.000 | 0.523 | 0.999 |
|  | Dec1 | ✓✓ | 1.000 | 0.585 | 0.998 |
|  | FL | ✓✓ | 1.000 | 0.574 | 0.999 |
| HL | Dec4 | ✓✓ | 0.462 | 0.486 | 0.314 |
|  | Dec3 | ✓✓ | 0.692 | 0.539 | 0.777 |
|  | Dec2 | ✓✓ | 0.692 | 0.511 | 0.843 |
|  | Dec1 | ✓✓ | 0.692 | 0.592 | 0.973 |
|  | FL | ✓✓ | 0.692 | 0.564 | 0.996 |
| GL | Dec4 | ✓✓ | 0.278 | 0.500 | 0.636 |
|  | Dec3 | ✓✓ | 0.111 | 0.527 | 0.399 |
|  | Dec2 | ✓✓ | 1.000 | 0.524 | 0.987 |
|  | Dec1 | ✓✓ | 1.000 | 0.607 | 0.991 |
|  | FL | ✓✓ | 1.000 | 0.575 | 0.996 |
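The PG and CH columns above can be sketched as follows: the pointing game counts a hit whenever the attribution maximum lands inside the target-class region, and the content heatmap measures the share of (non-negative) attribution mass falling inside that region. The NumPy sketch below is illustrative (the exact definitions used in the paper follow Zhang et al. [24] and Pillai and Pirsiavash [23]; the toy arrays are placeholders):

```python
import numpy as np

def pointing_game_hit(attr, target_mask):
    """Hit if the global attribution maximum falls inside the target region."""
    idx = np.unravel_index(np.argmax(attr), attr.shape)
    return bool(target_mask[idx])

def content_heatmap(attr, target_mask):
    """Fraction of total non-negative attribution mass inside the target region."""
    a = np.clip(attr, 0, None)
    total = a.sum()
    return float(a[target_mask].sum() / total) if total > 0 else 0.0

# Toy example: attribution concentrated on a 2x2 target block,
# plus a small off-target response in one corner.
attr = np.zeros((4, 4)); attr[1:3, 1:3] = 1.0; attr[0, 0] = 0.5
mask = np.zeros((4, 4), dtype=bool); mask[1:3, 1:3] = True
print(pointing_game_hit(attr, mask))   # True
print(content_heatmap(attr, mask))     # 4.0 / 4.5 ≈ 0.889
```

Averaging the hit indicator over a set of images yields the PG scores reported in the tables.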
Table A4. Performance comparison of the GradCAM, LRP, DeepLift, and G*I methods. Each panel shows the computed metrics: human assessment (HA), pointing game (PG), content heatmap (CH), and Attribution Concordance Score (ACS), across layers on UTP-110. The ✓✓, ✓, and × symbols indicate good, medium, and bad evaluation scores, respectively.
GradCAM

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| DEC | Dec4 | ✓ | 0.737 | 0.610 | 0.938 |
|  | Dec3 | ✓✓ | 0.947 | 0.670 | 0.991 |
|  | Dec2 | ✓✓ | 0.947 | 0.742 | 1.000 |
|  | Dec1 | ✓✓ | 0.947 | 0.761 | 0.999 |
|  | FL | ✓✓ | 0.947 | 0.870 | 0.999 |
| FIL | Dec4 | × | 0.059 | 0.672 | 0.166 |
|  | Dec3 | × | 0.000 | 0.596 | 0.149 |
|  | Dec2 | ✓ | 0.882 | 0.633 | 0.261 |
|  | Dec1 | ✓ | 1.000 | 0.681 | 0.785 |
|  | FL | ✓✓ | 1.000 | 0.814 | 0.998 |
| TXTL | Dec4 | ✓ | 0.650 | 0.670 | 0.817 |
|  | Dec3 | ✓✓ | 1.000 | 0.509 | 0.967 |
|  | Dec2 | ✓✓ | 1.000 | 0.696 | 0.997 |
|  | Dec1 | ✓ | 1.000 | 0.709 | 0.984 |
|  | FL | ✓✓ | 1.000 | 0.797 | 0.999 |
| BD | Dec4 | ✓ | 0.950 | 0.583 | 0.686 |
|  | Dec3 | ✓✓ | 0.800 | 0.623 | 0.928 |
|  | Dec2 | ✓✓ | 1.000 | 0.671 | 0.949 |
|  | Dec1 | ✓✓ | 1.000 | 0.746 | 0.983 |
|  | FL | ✓✓ | 1.000 | 0.820 | 0.986 |
| SInit | Dec4 | × | 0.632 | 0.584 | 0.150 |
|  | Dec3 | × | 0.158 | 0.550 | 0.078 |
|  | Dec2 | ✓ | 0.842 | 0.593 | 0.175 |
|  | Dec1 | ✓ | 0.895 | 0.693 | 0.903 |
|  | FL | ✓✓ | 0.895 | 0.809 | 0.997 |
| BInit | Dec4 | × | 0.632 | 0.476 | 0.159 |
|  | Dec3 | ✓ | 0.526 | 0.622 | 0.255 |
|  | Dec2 | ✓ | 0.895 | 0.623 | 0.535 |
|  | Dec1 | ✓✓ | 0.895 | 0.672 | 0.947 |
|  | FL | ✓✓ | 0.895 | 0.775 | 0.995 |

LRP

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| DEC | Dec4 | ✓ | 0.579 | 0.777 | 0.586 |
|  | Dec3 | ✓✓ | 0.474 | 0.467 | 0.468 |
|  | Dec2 | ✓✓ | 0.526 | 0.511 | 0.575 |
|  | Dec1 | ✓ | 0.474 | 0.458 | 0.504 |
|  | FL | ✓✓ | 0.368 | 0.994 | 0.996 |
| FIL | Dec4 | ✓✓ | 0.882 | 0.415 | 0.802 |
|  | Dec3 | ✓✓ | 0.882 | 0.317 | 0.851 |
|  | Dec2 | ✓✓ | 0.941 | 0.452 | 0.917 |
|  | Dec1 | ✓✓ | 0.824 | 0.534 | 0.919 |
|  | FL | ✓✓ | 0.353 | 0.961 | 0.975 |
| TXTL | Dec4 | ✓✓ | 0.700 | 0.349 | 0.915 |
|  | Dec3 | ✓✓ | 0.950 | 0.371 | 0.892 |
|  | Dec2 | ✓✓ | 0.750 | 0.545 | 0.995 |
|  | Dec1 | ✓✓ | 0.450 | 0.346 | 0.993 |
|  | FL | ✓✓ | 0.300 | 0.985 | 0.987 |
| BD | Dec4 | ✓ | 0.550 | 0.303 | 0.623 |
|  | Dec3 | ✓ | 0.700 | 0.542 | 0.510 |
|  | Dec2 | ✓✓ | 0.300 | 0.641 | 0.772 |
|  | Dec1 | ✓✓ | 0.400 | 0.476 | 0.435 |
|  | FL | ✓✓ | 0.050 | 0.952 | 0.967 |
| SInit | Dec4 | ✓✓ | 0.789 | 0.350 | 0.620 |
|  | Dec3 | ✓✓ | 0.579 | 0.630 | 0.563 |
|  | Dec2 | ✓✓ | 0.632 | 0.447 | 0.812 |
|  | Dec1 | ✓✓ | 0.632 | 0.364 | 0.737 |
|  | FL | ✓✓ | 0.526 | 0.986 | 0.980 |
| BInit | Dec4 | ✓✓ | 0.684 | 0.498 | 0.647 |
|  | Dec3 | ✓✓ | 0.579 | 0.507 | 0.653 |
|  | Dec2 | ✓✓ | 0.474 | 0.524 | 0.610 |
|  | Dec1 | ✓✓ | 0.579 | 0.514 | 0.542 |
|  | FL | ✓✓ | 0.316 | 0.883 | 0.978 |

DeepLift

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| DEC | Dec4 | ✓ | 0.842 | 0.797 | 0.906 |
|  | Dec3 | ✓✓ | 0.789 | 0.734 | 0.884 |
|  | Dec2 | ✓✓ | 0.895 | 0.600 | 0.993 |
|  | Dec1 | ✓✓ | 0.947 | 0.640 | 0.995 |
|  | FL | ✓✓ | 0.947 | 0.748 | 0.996 |
| FIL | Dec4 | ✓✓ | 1.000 | 0.620 | 0.946 |
|  | Dec3 | ✓✓ | 1.000 | 0.576 | 1.000 |
|  | Dec2 | ✓✓ | 1.000 | 0.720 | 0.997 |
|  | Dec1 | ✓✓ | 1.000 | 0.752 | 0.997 |
|  | FL | ✓✓ | 1.000 | 0.836 | 0.999 |
| TXTL | Dec4 | ✓✓ | 1.000 | 0.670 | 0.913 |
|  | Dec3 | ✓✓ | 1.000 | 0.477 | 0.999 |
|  | Dec2 | ✓✓ | 1.000 | 0.651 | 0.998 |
|  | Dec1 | ✓✓ | 1.000 | 0.761 | 0.999 |
|  | FL | ✓✓ | 1.000 | 0.814 | 1.000 |
| BD | Dec4 | ✓ | 0.850 | 0.373 | 0.598 |
|  | Dec3 | ✓✓ | 0.850 | 0.664 | 0.918 |
|  | Dec2 | ✓✓ | 1.000 | 0.651 | 0.981 |
|  | Dec1 | ✓✓ | 1.000 | 0.767 | 0.984 |
|  | FL | ✓✓ | 1.000 | 0.802 | 0.990 |
| SInit | Dec4 | ✓✓ | 0.895 | 0.464 | 0.996 |
|  | Dec3 | ✓✓ | 0.895 | 0.677 | 0.997 |
|  | Dec2 | ✓✓ | 0.895 | 0.755 | 0.996 |
|  | Dec1 | ✓✓ | 0.895 | 0.788 | 0.995 |
|  | FL | ✓✓ | 0.895 | 0.833 | 0.997 |
| BInit | Dec4 | ✓✓ | 0.895 | 0.481 | 0.995 |
|  | Dec3 | ✓✓ | 0.895 | 0.664 | 0.999 |
|  | Dec2 | ✓✓ | 0.895 | 0.692 | 0.997 |
|  | Dec1 | ✓✓ | 0.895 | 0.727 | 0.997 |
|  | FL | ✓✓ | 0.895 | 0.747 | 0.999 |

G*I

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| DEC | Dec4 | ✓ | 0.737 | 0.823 | 0.821 |
|  | Dec3 | ✓✓ | 0.579 | 0.458 | 0.891 |
|  | Dec2 | ✓✓ | 0.789 | 0.577 | 0.983 |
|  | Dec1 | ✓✓ | 0.842 | 0.604 | 0.999 |
|  | FL | ✓✓ | 0.947 | 0.833 | 0.999 |
| FIL | Dec4 | ✓✓ | 1.000 | 0.477 | 0.936 |
|  | Dec3 | ✓✓ | 1.000 | 0.391 | 0.955 |
|  | Dec2 | ✓✓ | 1.000 | 0.622 | 0.995 |
|  | Dec1 | ✓✓ | 1.000 | 0.724 | 0.998 |
|  | FL | ✓✓ | 1.000 | 0.792 | 0.999 |
| TXTL | Dec4 | ✓✓ | 0.800 | 0.511 | 0.981 |
|  | Dec3 | ✓✓ | 1.000 | 0.437 | 0.907 |
|  | Dec2 | ✓✓ | 1.000 | 0.573 | 0.998 |
|  | Dec1 | ✓✓ | 1.000 | 0.747 | 0.997 |
|  | FL | ✓✓ | 1.000 | 0.758 | 1.000 |
| BD | Dec4 | ✓ | 0.600 | 0.283 | 0.602 |
|  | Dec3 | ✓✓ | 0.900 | 0.632 | 0.673 |
|  | Dec2 | ✓✓ | 1.000 | 0.694 | 0.995 |
|  | Dec1 | ✓✓ | 1.000 | 0.719 | 0.991 |
|  | FL | ✓✓ | 1.000 | 0.744 | 0.991 |
| SInit | Dec4 | ✓✓ | 0.789 | 0.402 | 0.685 |
|  | Dec3 | ✓✓ | 0.789 | 0.731 | 0.866 |
|  | Dec2 | ✓✓ | 0.842 | 0.605 | 0.991 |
|  | Dec1 | ✓✓ | 0.737 | 0.730 | 0.992 |
|  | FL | ✓✓ | 0.895 | 0.770 | 0.998 |
| BInit | Dec4 | ✓✓ | 0.737 | 0.485 | 0.740 |
|  | Dec3 | ✓✓ | 0.789 | 0.652 | 0.917 |
|  | Dec2 | ✓✓ | 0.684 | 0.556 | 0.919 |
|  | Dec1 | ✓✓ | 0.895 | 0.634 | 0.988 |
|  | FL | ✓✓ | 0.895 | 0.685 | 0.999 |

Appendix A.5. Saliency Overlay Experiment

Figure A10. UTP-110 sample with cropped region of the BInit class (highlighted in red).

Appendix A.6. Generalization Section

Table A5. Comparison of attribution methods: GradCAM, LRP, DeepLift, and G*I. Each block shows the computed metrics—human assessment (HA), pointing game (PG), content heatmap (CH), and Attribution Concordance Score (ACS)—across layers for each XAI method on the Cityscapes dataset. The ✓✓, ✓, and × symbols indicate good, medium, and bad evaluation scores, respectively.
GradCAM

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| road | Dec4 | × | 0.905 | 0.585 | 0.933 |
|  | Dec3 | × | 0.905 | 0.725 | 0.944 |
|  | Dec2 | ✓✓ | 0.905 | 0.700 | 0.943 |
|  | Dec1 | ✓✓ | 0.905 | 0.740 | 0.947 |
|  | FL | ✓✓ | 0.905 | 0.801 | 0.945 |
| sidewalk | Dec4 | × | 0.609 | 0.560 | 0.668 |
|  | Dec3 | × | 0.739 | 0.643 | 0.756 |
|  | Dec2 | ✓ | 0.783 | 0.650 | 0.816 |
|  | Dec1 | ✓ | 0.696 | 0.685 | 0.802 |
|  | FL | ✓ | 0.696 | 0.697 | 0.800 |
| person | Dec4 | × | 0.261 | 0.542 | 0.045 |
|  | Dec3 | × | 0.435 | 0.505 | 0.191 |
|  | Dec2 | × | 0.261 | 0.549 | 0.080 |
|  | Dec1 | × | 0.304 | 0.482 | 0.176 |
|  | FL | ✓ | 0.609 | 0.576 | 0.562 |
| car | Dec4 | × | 0.850 | 0.546 | 0.682 |
|  | Dec3 | × | 0.850 | 0.535 | 0.709 |
|  | Dec2 | ✓ | 0.900 | 0.594 | 0.724 |
|  | Dec1 | ✓ | 0.900 | 0.573 | 0.854 |
|  | FL | ✓✓ | 0.950 | 0.626 | 0.906 |
| bus | Dec4 | × | 0.500 | 0.576 | 0.500 |
|  | Dec3 | × | 0.667 | 0.529 | 0.959 |
|  | Dec2 | × | 0.667 | 0.627 | 0.910 |
|  | Dec1 | × | 0.500 | 0.654 | 0.761 |
|  | FL | ✓ | 0.667 | 0.684 | 0.999 |

LRP

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| road | Dec4 | × | 0.857 | 0.491 | 0.935 |
|  | Dec3 | × | 0.905 | 0.621 | 0.944 |
|  | Dec2 | ✓✓ | 0.810 | 0.626 | 0.943 |
|  | Dec1 | ✓✓ | 0.762 | 0.782 | 0.946 |
|  | FL | ✓✓ | 0.333 | 0.971 | 0.937 |
| sidewalk | Dec4 | × | 0.696 | 0.392 | 0.792 |
|  | Dec3 | × | 0.783 | 0.481 | 0.848 |
|  | Dec2 | ✓ | 0.391 | 0.443 | 0.855 |
|  | Dec1 | ✓ | 0.391 | 0.479 | 0.862 |
|  | FL | ✓ | 0.217 | 0.749 | 0.831 |
| person | Dec4 | × | 0.435 | 0.509 | 0.359 |
|  | Dec3 | × | 0.609 | 0.410 | 0.463 |
|  | Dec2 | ✓ | 0.522 | 0.379 | 0.597 |
|  | Dec1 | ✓ | 0.348 | 0.358 | 0.624 |
|  | FL | ✓ | 0.261 | 0.571 | 0.601 |
| car | Dec4 | × | 0.700 | 0.403 | 0.798 |
|  | Dec3 | × | 0.700 | 0.418 | 0.815 |
|  | Dec2 | ✓ | 0.350 | 0.415 | 0.782 |
|  | Dec1 | ✓ | 0.250 | 0.407 | 0.766 |
|  | FL | ✓✓ | 0.300 | 0.863 | 0.843 |
| bus | Dec4 | × | 0.667 | 0.414 | 0.933 |
|  | Dec3 | × | 0.667 | 0.308 | 0.997 |
|  | Dec2 | ✓ | 0.500 | 0.356 | 0.992 |
|  | Dec1 | ✓ | 0.667 | 0.218 | 0.942 |
|  | FL | ✓ | 0.333 | 0.631 | 0.985 |

DeepLift

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| road | Dec4 | × | 0.905 | 0.340 | 0.893 |
|  | Dec3 | × | 0.905 | 0.361 | 0.902 |
|  | Dec2 | ✓ | 0.905 | 0.374 | 0.927 |
|  | Dec1 | ✓ | 0.810 | 0.431 | 0.928 |
|  | FL | ✓✓ | 0.810 | 0.472 | 0.931 |
| sidewalk | Dec4 | × | 0.783 | 0.329 | 0.809 |
|  | Dec3 | × | 0.783 | 0.404 | 0.860 |
|  | Dec2 | ✓ | 0.783 | 0.406 | 0.880 |
|  | Dec1 | ✓ | 0.739 | 0.457 | 0.885 |
|  | FL | ✓ | 0.783 | 0.468 | 0.879 |
| person | Dec4 | ✓ | 0.348 | 0.463 | 0.321 |
|  | Dec3 | × | 0.565 | 0.349 | 0.563 |
|  | Dec2 | ✓ | 0.609 | 0.297 | 0.647 |
|  | Dec1 | ✓ | 0.565 | 0.343 | 0.648 |
|  | FL | ✓ | 0.565 | 0.389 | 0.618 |
| car | Dec4 | × | 0.850 | 0.357 | 0.839 |
|  | Dec3 | × | 0.900 | 0.391 | 0.909 |
|  | Dec2 | ✓✓ | 0.950 | 0.495 | 0.927 |
|  | Dec1 | ✓✓ | 0.950 | 0.563 | 0.922 |
|  | FL | ✓✓ | 0.950 | 0.593 | 0.925 |
| bus | Dec4 | × | 0.667 | 0.394 | 0.980 |
|  | Dec3 | × | 0.667 | 0.318 | 0.998 |
|  | Dec2 | ✓ | 0.667 | 0.427 | 0.998 |
|  | Dec1 | ✓ | 0.667 | 0.473 | 0.996 |
|  | FL | ✓ | 0.667 | 0.494 | 0.998 |

G*I

| Class | Layer | HA | PG | CH | ACS |
|---|---|---|---|---|---|
| road | Dec4 | × | 0.905 | 0.498 | 0.938 |
|  | Dec3 | × | 0.905 | 0.604 | 0.944 |
|  | Dec2 | ✓ | 0.905 | 0.609 | 0.944 |
|  | Dec1 | ✓✓ | 0.905 | 0.660 | 0.946 |
|  | FL | ✓✓ | 0.905 | 0.775 | 0.947 |
| sidewalk | Dec4 | × | 0.696 | 0.364 | 0.799 |
|  | Dec3 | × | 0.783 | 0.446 | 0.865 |
|  | Dec2 | ✓ | 0.739 | 0.420 | 0.886 |
|  | Dec1 | ✓ | 0.739 | 0.477 | 0.890 |
|  | FL | ✓ | 0.696 | 0.521 | 0.878 |
| person | Dec4 | × | 0.391 | 0.502 | 0.357 |
|  | Dec3 | × | 0.609 | 0.401 | 0.477 |
|  | Dec2 | ✓ | 0.522 | 0.352 | 0.605 |
|  | Dec1 | ✓ | 0.565 | 0.352 | 0.652 |
|  | FL | ✓ | 0.609 | 0.333 | 0.646 |
| car | Dec4 | × | 0.900 | 0.371 | 0.829 |
|  | Dec3 | × | 0.900 | 0.409 | 0.904 |
|  | Dec2 | ✓✓ | 0.850 | 0.486 | 0.921 |
|  | Dec1 | ✓✓ | 0.850 | 0.505 | 0.921 |
|  | FL | ✓✓ | 0.950 | 0.559 | 0.933 |
| bus | Dec4 | × | 0.667 | 0.414 | 0.920 |
|  | Dec3 | × | 0.667 | 0.278 | 0.954 |
|  | Dec2 | ✓ | 0.667 | 0.357 | 0.998 |
|  | Dec1 | ✓ | 0.667 | 0.456 | 0.992 |
|  | FL | ✓ | 0.667 | 0.436 | 0.998 |
Figure A11. Evolution of the selected heatmap for the BInit class on the UTP-110 dataset using saliency overlay of the target input region.

References

  1. Baird, H.S.; Govindaraju, V.; Lopresti, D.P. Document analysis systems for digital libraries: Challenges and opportunities. In Proceedings of the Document Analysis Systems VI: 6th International Workshop, DAS 2004, Florence, Italy, 8–10 September 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 1–16.
  2. Lombardi, F.; Marinai, S. Deep learning for historical document analysis and recognition—A survey. J. Imaging 2020, 6, 110.
  3. Ma, W.; Zhang, H.; Jin, L.; Wu, S.; Wang, J.; Wang, Y. Joint layout analysis, character detection and recognition for historical document digitization. In Proceedings of the 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 31–36.
  4. Wahyudi, M.I.; Fauzi, I.; Atmojo, D. Robust Image Watermarking Based on Hybrid IWT-DCT-SVD. IJACI Int. J. Adv. Comput. Inform. 2025, 1, 89–98.
  5. Kusuma, M.R.; Panggabean, S. Robust Digital Image Watermarking Using DWT, Hessenberg, and SVD for Copyright Protection. IJACI Int. J. Adv. Comput. Inform. 2026, 2, 41–52.
  6. Amrullah, A.; Aminuddin, A. Tamper Localization and Content Restoration in Fragile Image Watermarking: A Review. IJACI Int. J. Adv. Comput. Inform. 2026, 2, 62–74.
  7. Chen, K.; Seuret, M.; Hennebert, J.; Ingold, R. Convolutional neural networks for page segmentation of historical document images. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–12 November 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 965–970.
  8. Oliveira, S.A.; Seguin, B.; Kaplan, F. dhSegment: A generic deep-learning approach for document segmentation. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Niagara Falls, NY, USA, 5–8 August 2018; pp. 7–12.
  9. Grüning, T.; Leifert, G.; Strauß, T.; Michael, J.; Labahn, R. A two-stage method for text line detection in historical documents. Int. J. Doc. Anal. Recognit. 2019, 22, 285–302.
  10. Renton, G.; Soullard, Y.; Chatelain, C.; Adam, S.; Kermorvant, C.; Paquet, T. Fully convolutional network with dilated convolutions for handwritten text line segmentation. Int. J. Doc. Anal. Recognit. (IJDAR) 2018, 21, 177–186.
  11. Rahal, N.; Vögtlin, L.; Ingold, R. Historical document image analysis using controlled data for pre-training. Int. J. Doc. Anal. Recognit. 2023, 26, 241–254.
  12. Rahal, N.; Vögtlin, L.; Ingold, R. Layout analysis of historical document images using a light fully convolutional network. In Proceedings of the International Conference on Document Analysis and Recognition, San José, CA, USA, 21–26 August 2023; pp. 325–341.
  13. Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
  14. Cheng, Z.; Wu, Y.; Li, Y.; Cai, L.; Ihnaini, B. A Comprehensive Review of Explainable Artificial Intelligence (XAI) in Computer Vision. Sensors 2025, 25, 4166.
  15. Sujatha Ravindran, A.; Contreras-Vidal, J. An empirical comparison of deep learning explainability approaches for EEG using simulated ground truth. Sci. Rep. 2023, 13, 17709.
  16. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 2015, 10, e0130140.
  17. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3145–3153.
  18. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  19. Simistira, F.; Seuret, M.; Eichenberger, N.; Garz, A.; Liwicki, M.; Ingold, R. DIVA-HisDB: A precisely annotated large dataset of challenging medieval manuscripts. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Shenzhen, China, 23–26 October 2016; pp. 471–476.
  20. Rahal, N.; Vögtlin, L.; Ingold, R. Approximate ground truth generation for semantic labeling of historical documents with minimal human effort. Int. J. Doc. Anal. Recognit. 2024, 27, 335–347.
  21. Brini, I.; Mehri, M.; Ingold, R.; Essoukri Ben Amara, N. An End-to-End Framework for Evaluating Explainable Deep Models: Application to Historical Document Image Segmentation. In Proceedings of the Computational Collective Intelligence, Hammamet, Tunisia, 28–30 September 2022; pp. 106–119.
  22. Yeh, C.K.; Hsieh, C.Y.; Suggala, A.; Inouye, D.I.; Ravikumar, P.K. On the (in)fidelity and sensitivity of explanations. Adv. Neural Inf. Process. Syst. 2019, 32.
  23. Pillai, V.; Pirsiavash, H. Explainable models with consistent interpretations. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2431–2439.
  24. Zhang, J.; Bargal, S.A.; Lin, Z.; Brandt, J.; Shen, X.; Sclaroff, S. Top-down neural attention by excitation backprop. Int. J. Comput. Vis. 2018, 126, 1084–1102.
  25. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
  26. Boillet, M.; Kermorvant, C.; Paquet, T. Multiple document datasets pre-training improves text line detection with deep neural networks. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2134–2141.
  27. Da, C.; Luo, C.; Zheng, Q.; Yao, C. Vision grid transformer for document layout analysis. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 19462–19472.
  28. Binmakhashen, G.M.; Mahmoud, S.A. Document layout analysis: A comprehensive survey. ACM Comput. Surv. 2019, 52, 109.
  29. Hassija, V.; Chamola, V.; Mahapatra, A.; Singal, A.; Goel, D.; Huang, K.; Scardapane, S.; Spinelli, I.; Mahmud, M.; Hussain, A. Interpreting black-box models: A review on explainable artificial intelligence. Cogn. Comput. 2024, 16, 45–74.
  30. Gipiškis, R.; Tsai, C.W.; Kurasova, O. Explainable AI (XAI) in image segmentation in medicine, industry, and beyond: A survey. ICT Express 2024, 10, 1331–1354.
  31. Vinogradova, K.; Dibrov, A.; Myers, G. Towards interpretable semantic segmentation via gradient-weighted class activation mapping (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13943–13944.
  32. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
  33. Hasany, S.N.; Mériaudeau, F.; Petitjean, C. MiSuRe is all you need to explain your image segmentation. arXiv 2024, arXiv:2406.12173.
  34. Riva, M.; Gori, P.; Yger, F.; Bloch, I. Is the U-NET Directional-Relationship Aware? In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3391–3395.
  35. Iglesias, J.; Styner, M.; Langerak, T.; Landman, B.; Xu, Z.; Klein, A. MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proceedings of the MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, Munich, Germany, 5–9 October 2015.
  36. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
  37. Poppi, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Revisiting the Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 20–25 June 2021; pp. 2299–2304.
  38. Shrikumar, A.; Greenside, P.; Shcherbina, A.; Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv 2016, arXiv:1605.01713.
  39. Monnier, T.; Aubry, M. docExtractor: An off-the-shelf historical document element extraction. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Dortmund, Germany, 8–10 September 2020; pp. 91–96.
  40. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  42. Brini, I.; Rahal, N.; Mehri, M.; Ingold, R.; Essoukri Ben Amara, N. DocXAI-Pruner: Optimizing Semantic Segmentation Models for Document Layout Analysis Via Explainable AI-Driven Pruning. In Proceedings of the Horizons of AI: Ethical Considerations and Interdisciplinary Engagements; Springer: Singapore, 2025; pp. 333–348.
  43. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
  44. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
  45. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M.A. Striving for Simplicity: The All Convolutional Net. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
  46. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
  47. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774.
  48. Petsiuk, V.; Das, A.; Saenko, K. RISE: Randomized Input Sampling for Explanation of Black-box Models. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018; p. 151.
  49. Lapuschkin, S.; Wäldchen, S.; Binder, A.; Montavon, G.; Samek, W.; Müller, K.R. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 2019, 10, 1096.
  50. Dreyer, M.; Achtibat, R.; Wiegand, T.; Samek, W.; Lapuschkin, S. Revealing Hidden Context Bias in Segmentation and Object Detection Through Concept-Specific Explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 17–24 June 2023; pp. 3829–3839.
  51. Wang, C.; Liu, Y.; Chen, Y.; Liu, F.; Tian, Y.; McCarthy, D.; Frazer, H.; Carneiro, G. Learning Support and Trivial Prototypes for Interpretable Image Classification. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2062–2072.
  52. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
  53. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
  54. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299.
  55. Espinosa Zarlenga, M.; Barbiero, P.; Ciravegna, G.; Marra, G.; Giannini, F.; Diligenti, M.; Shams, Z.; Precioso, F.; Melacci, S.; Weller, A.; et al. Concept embedding models: Beyond the accuracy-explainability trade-off. Adv. Neural Inf. Process. Syst. 2022, 35, 21400–21413.
  56. Crabbé, J.; van der Schaar, M. Evaluating the robustness of interpretability methods through explanation invariance and equivariance. Adv. Neural Inf. Process. Syst. 2023, 36, 71393–71429.
  57. Achtibat, R.; Dreyer, M.; Eisenbraun, I.; Bosse, S.; Wiegand, T.; Samek, W.; Lapuschkin, S. From attribution maps to human-understandable explanations through concept relevance propagation. Nat. Mach. Intell. 2023, 5, 1006–1019.
  58. Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711.
  59. Ayoob, M.; Nettasinghe, O.; Sylvester, V.; Bowala, H.; Mohideen, H. Peering into the Heart: A Comprehensive Exploration of Semantic Segmentation and Explainable AI on the MnMs-2 Cardiac MRI Dataset. Appl. Comput. Syst. 2025, 30, 12–20.
  60. Hassan, M.; Fateh, A.A.; Lin, J.; Zhuang, Y.; Lin, G.; Xiong, H.; You, Z.; Qin, P.; Zeng, H. Unfolding Explainable AI for Brain Tumor Segmentation. Neurocomputing 2024, 599, 128058.
  61. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv 2020, arXiv:2009.07896.
  62. Montavon, G.; Binder, A.; Lapuschkin, S.; Samek, W.; Müller, K.R. Layer-wise relevance propagation: An overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer: Cham, Switzerland, 2019; pp. 193–209.
Figure 2. RISE explanation (image source: https://eclique.github.io/rep-imgs/RISE/rise-overview.png, accessed on 8 November 2025): the input image (I) is masked multiple times with random binary masks (M_i), generating occluded versions (I ⊙ M_i). Each masked image is fed through the black-box model (f) to obtain class probabilities P[shark]. The final saliency map (S) is computed as a weighted sum of all masks, where the weights correspond to their respective prediction scores.
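The procedure in Figure 2 can be sketched in a few lines of NumPy. This is an illustrative simplification, not the reference RISE implementation: the mask count, grid resolution, and toy scalar "model" are placeholders, and the low-resolution masks are upsampled with nearest-neighbour replication rather than the bilinear interpolation and random shifts used in the original method.

```python
import numpy as np

def rise_saliency(model, image, n_masks=500, grid=4, p=0.5, seed=0):
    """RISE-style saliency: accumulate random binary masks weighted by the
    model's score on each masked image, then normalise by n_masks * p."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    sal = np.zeros((H, W))
    for _ in range(n_masks):
        small = (rng.random((grid, grid)) < p).astype(float)
        # nearest-neighbour upsampling of the low-resolution mask
        mask = np.kron(small, np.ones((H // grid, W // grid)))
        score = model(image * mask)      # P[class] on the occluded image
        sal += score * mask
    return sal / (n_masks * p)

# Toy "model": the score is the mean intensity of one fixed 4x4 patch,
# so the saliency map should peak on exactly that patch.
def model(img):
    return float(img[4:8, 4:8].mean())

img = np.ones((16, 16))
S = rise_saliency(model, img)
print(S[4:8, 4:8].mean() > S[12:16, 12:16].mean())   # True
```

Because the only region that drives the toy score is the patch at rows 4–7, its accumulated weight is roughly twice that of uninvolved regions, which is the behaviour the figure illustrates.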
Figure 3. Overview of the proposed workflow. The workflow consists of five main steps: (1) Attribution map generation using a selected XAI method—Grad-CAM, LRP, DeepLIFT, Gradient × Input, or the perturbation-based methods RISE and MiSuRe—applied to a DNN with a chosen input image, target class, and selected input and intermediate layer in the network; (2) Post-processing to enhance the visual quality and interpretability of the generated attribution maps; (3) Qualitative evaluation through visual inspection and expert analysis; (4) Quantitative evaluation using objective metrics to assess consistency and reliability; and (5) Hybrid evaluation, combining both visual and metric-based assessments for comprehensive interpretability analysis.
Figure 5. Example of a random square mask perturbation and its impact on segmentation performance. (a): Zoomed-in region of the input image; (b): Ground truth segmentation mask; (c): Model prediction on the original image; (d): Perturbed image with a randomly placed square occlusion; (e): Prediction on the perturbed image.
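The random square occlusion of Figure 5 amounts to overwriting one randomly placed patch with a constant fill value. A minimal sketch (patch size, fill value, and the grayscale test image are illustrative choices, not the paper's exact settings):

```python
import numpy as np

def random_square_occlusion(image, size=32, fill=0.0, seed=None):
    """Occlude a randomly placed size x size square with a constant value."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    y = rng.integers(0, H - size + 1)    # top-left corner, fully inside
    x = rng.integers(0, W - size + 1)
    out = image.copy()
    out[y:y + size, x:x + size] = fill
    return out, (y, x)

img = np.ones((128, 128))
pert, (y, x) = random_square_occlusion(img, size=32, seed=1)
print(pert.sum())   # 128*128 - 32*32 pixels remain at 1.0, i.e. 15360.0
```

Feeding `pert` through the segmentation model and comparing against the unperturbed prediction gives the degradation visualised in panels (d) and (e).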
Figure 6. Illustration of the targeted perturbation method applied to HDIs for the main text area. (a): Original input image; (b): Randomly selected square regions for applying targeted perturbations; (c): Resulting perturbed image obtained by assigning a black RGB value to the TXT pixels within the randomly selected square regions.
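Unlike the blind square occlusion above, the targeted perturbation of Figure 6 blackens only the pixels of the selected class (here TXT) that fall inside the randomly chosen squares, leaving every other pixel untouched. A sketch of that intersection logic (square count, size, and the toy mask are illustrative assumptions):

```python
import numpy as np

def targeted_perturbation(image, class_mask, n_squares=3, size=32, seed=0):
    """Black out only the target-class pixels inside random square regions."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    out = image.copy()
    for _ in range(n_squares):
        y = rng.integers(0, H - size + 1)
        x = rng.integers(0, W - size + 1)
        region = np.zeros((H, W), dtype=bool)
        region[y:y + size, x:x + size] = True
        out[region & class_mask] = 0.0    # intersection: TXT pixels only
    return out

img = np.full((64, 64), 0.8)                       # toy grayscale page
txt = np.zeros((64, 64), dtype=bool)
txt[10:50, 10:50] = True                           # toy TXT region
pert = targeted_perturbation(img, txt)
print(pert[~txt].min())   # 0.8: pixels outside the TXT mask never change
```

This class-conditioned masking is what makes the perturbation "targeted": the degradation it causes can be attributed to the chosen class rather than to arbitrary context.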
Figure 7. Illustration of the ACS metric computation. It begins with the original input image (a), followed by the annotated ground truth (b) and the model predictions (c). An attribution map is then generated (d), which is subsequently overlaid on the target regions (TXT) (e). Finally, the attribution is disentangled from non-target regions, specifically the BG, GL, and HL classes (f).
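The disentangling step of Figure 7 can be approximated by comparing the attribution mass on target versus non-target pixels. The sketch below is only an illustration of that idea; the exact ACS definition is the one given in the main text, and the ratio, the non-negativity clipping, and the toy arrays here are simplifying assumptions.

```python
import numpy as np

def attribution_concordance(attr, target_mask):
    """Illustrative concordance: mean positive attribution on target pixels
    relative to the combined target and non-target means. 1.0 means all
    attribution mass sits on the target class; 0.5 means no preference."""
    a = np.clip(attr, 0, None)
    on = a[target_mask].mean() if target_mask.any() else 0.0
    off = a[~target_mask].mean() if (~target_mask).any() else 0.0
    return float(on / (on + off)) if (on + off) > 0 else 0.0

attr = np.zeros((4, 4)); attr[:2, :2] = 1.0        # all mass on target
mask = np.zeros((4, 4), dtype=bool); mask[:2, :2] = True
print(attribution_concordance(attr, mask))          # 1.0
```

A uniform attribution map would instead score 0.5, reflecting the lack of any target/non-target separation.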
Figure 8. Comparative visualizations of XAI methods after applying the thresholding method on attribution maps. (a) GradCAM, Dec3, CH = 0.430, ACS = 0.914. (b) GradCAM, Dec1, CH = 0.503, ACS = 0.969. (c) GradCAM + ReLU, Dec3, CH = 0.430, ACS = 0.914. (d) GradCAM + ReLU, Dec1, CH = 0.503, ACS = 0.969.
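The thresholding step behind Figure 8 can be sketched as a ReLU followed by an automatic threshold; the version below assumes an Otsu-style threshold [52] computed from the histogram of positive attribution values, which is one common choice rather than necessarily the paper's exact post-processing (the bin count and the synthetic bimodal map are placeholders).

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold maximising between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    hist = hist.astype(float)
    total = hist.sum()
    w0 = np.cumsum(hist)                     # pixels at or below each bin
    w1 = total - w0                          # pixels above each bin
    centers = (edges[:-1] + edges[1:]) / 2
    cum = np.cumsum(hist * centers)
    mu0 = np.where(w0 > 0, cum / np.maximum(w0, 1), 0)
    mu1 = np.where(w1 > 0, (cum[-1] - cum) / np.maximum(w1, 1), 0)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def binarize_attribution(attr):
    """ReLU then Otsu threshold: keep only strong positive evidence."""
    a = np.clip(attr, 0, None)
    return a >= otsu_threshold(a.ravel())

# Synthetic bimodal attribution map: weak noise near 0, strong mode near 1.
attr = np.concatenate([np.random.default_rng(0).normal(0, 0.05, 500),
                       np.random.default_rng(1).normal(1, 0.05, 500)]).reshape(50, 20)
mask = binarize_attribution(attr)
print(mask.mean())   # roughly half the pixels survive the threshold
```

The binarized map is what the CH and ACS values in the figure are computed on, which is why (a)/(c) and (b)/(d) share identical scores.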
Figure 9. Illustration of the human-based scoring system. (a): input image from UTP-110; (b): selected target class (BD) in yellow; (c): low score with dispersed attributions; (d): medium score with attributions on BD and partly on BInit; (e): high score with attributions concentrated on BD.
Figure 10. Examples of historical document images and their corresponding ground truth masks used in our experiments. (a) CB55 dataset, (b) UTP-110 dataset, and (c) Synthetic dataset. In the CB55 and Synthetic ground truth masks, black, yellow, cyan, and magenta represent background (BG), main text (TXT), glosses (GL), and highlight (HL), respectively. In the UTP-110 ground truth, black, red, yellow, blue, cyan, magenta, and green represent background (BG), decoration (DEC), body (BD), text line (TXTL), big initial (BInit), small initial (SInit), and filler (FIL).
Figure 11. Impact of perturbation mask size on infidelity sensitivity. Six patch sizes (2 × 2 to 64 × 64) were tested across 50 random patches and 100 iterations.
Figure 12. Comparison of performance using the IoU metric between the L-U-NET and S-U-NET models on the CB55 and UTP-110 datasets for each individual target class.
Figure 13. Comparative results on the CB55 dataset.
Figure 14. Comparative results on the UTP-110 dataset.
Figure 15. Cityscapes examples showing input images, corresponding ground truth, and predictions.
Figure 16. Comparison of two XAI methods (GradCAM and LRP) across four different target classes: road, sidewalk, person, and car.
Figure 17. Comparative results on Cityscapes.
Figure 18. Qualitative comparison of attribution maps for the main text class on the CB55 historical document dataset using six attribution methods.
Figure 19. Qualitative comparison of attribution maps for the main text class on the UTP-110 historical document dataset using six attribution methods.
Figure 20. Qualitative cross-domain evaluation on the Cityscapes dataset for the car class. An input image (a) with its ground truth mask (b) containing a car instance (c) is assessed using a decomposition-based method, DeepLift (d), and two perturbation-based methods, RISE (e) and MiSuRe (f).
Table 1. Comparison of the six XAI methods evaluated in this study. The methods are grouped into three categories: gradient-based (Grad-CAM, Gradient × Input), decomposition-based (LRP, DeepLIFT), and perturbation-based (RISE, MiSuRe). Each method is characterized by its underlying principle and by whether it supports input-level or intermediate-layer attributions, marked as supported (✓) or unsupported (×).
Method     Type                  Input Layer   Intermediate Layer
Grad-CAM   Gradient-based        ✓             ✓
G × I      Gradient-based        ✓             ✓
LRP        Decomposition-based   ✓             ✓
DeepLIFT   Decomposition-based   ✓             ✓
RISE       Perturbation-based    ✓             ×
MiSuRe     Perturbation-based    ✓             ×
Table 2. Spatial dimensions (height × width) of the decoder layers in the CB55 and UTP-110 datasets.
Layer   CB55         UTP-110
Dec4    168 × 120    120 × 80
Dec3    336 × 240    240 × 160
Dec2    672 × 480    480 × 320
Dec1    1344 × 960   960 × 640
FL      1344 × 960   960 × 640
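The resolutions in Table 2 follow a simple halving pattern: Dec1 runs at full resolution and each deeper decoder level halves both dimensions. A tiny helper (the function name is ours) reproduces the table's values:

```python
def decoder_dims(full_h, full_w, levels=4):
    """Dec1 at full resolution; each deeper decoder level halves
    height and width, matching the values listed in Table 2."""
    return {f"Dec{i}": (full_h >> (i - 1), full_w >> (i - 1))
            for i in range(1, levels + 1)}

print(decoder_dims(1344, 960))
# {'Dec1': (1344, 960), 'Dec2': (672, 480), 'Dec3': (336, 240), 'Dec4': (168, 120)}
```

The same call with `decoder_dims(960, 640)` yields the UTP-110 column; FL shares Dec1's full resolution.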
Table 3. Mask size sensitivity analysis for the infidelity perturbation metric. Each value represents the mean ± standard deviation over 100 iterations and 50 random patches. The fifth row (bold values) indicates the configuration selected for our perturbation experiments.
Patch Size   Confidence Drop    IoU Drop          DICE Drop         Pixel Change Ratio
2 × 2        0.0000 ± 0.0001    0.0036 ± 0.0084   0.0028 ± 0.0085   0.0006 ± 0.0008
4 × 4        0.0000 ± 0.0001    0.0077 ± 0.0136   0.0064 ± 0.0144   0.0013 ± 0.0009
8 × 8        0.0001 ± 0.0001    0.0138 ± 0.0138   0.0107 ± 0.0134   0.0023 ± 0.0012
16 × 16      0.0001 ± 0.0001    0.0284 ± 0.0261   0.0240 ± 0.0290   0.0043 ± 0.0017
32 × 32      0.0002 ± 0.0002    0.0540 ± 0.0317   0.0445 ± 0.0345   0.0088 ± 0.0022
64 × 64      0.0001 ± 0.0003    0.1095 ± 0.0364   0.0944 ± 0.0436   0.0174 ± 0.0033
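The sensitivity sweep behind Table 3 can be sketched as follows. Here `predict` is a stand-in for the segmentation model, and the IoU-drop definition (one minus the IoU between the clean and the perturbed prediction) is our simplification of the reported metric:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_iou_drop(predict, image, patch, n_patches=50):
    """Mean IoU drop between the clean prediction and predictions on
    copies of the image with one random patch zeroed out."""
    base = predict(image)
    h, w = image.shape[:2]
    drops = []
    for _ in range(n_patches):
        y = int(rng.integers(0, h - patch + 1))
        x = int(rng.integers(0, w - patch + 1))
        pert = image.copy()
        pert[y:y + patch, x:x + patch] = 0.0  # zero-perturb one patch
        out = predict(pert)
        union = np.logical_or(base, out).sum()
        inter = np.logical_and(base, out).sum()
        drops.append(1.0 - (inter / union if union else 1.0))
    return float(np.mean(drops))

# Toy stand-in "model": threshold the input itself.
img = np.ones((64, 64))
model = lambda x: x > 0.5
print(mean_iou_drop(model, img, 8))  # 0.015625 (= 64 / 4096)
```

On this toy setup the drop grows with the patch area, mirroring the monotone trend of the IoU Drop column in Table 3.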
Table 4. Experimental setup for perturbation-based methods (RISE and MiSuRe) across document datasets (CB55, UTP-110) and Cityscapes. Parameter values were empirically selected to maximize method performance and ensure stable convergence.
Parameter                       CB55/UTP-110                            Cityscapes
Metric                          DICE                                    DICE
Patch size (w, h)               32 × 32                                 8 × 8
Number of masks                 500                                     500
Batch size                      1                                       8
Probability threshold (p_th)    0.5                                     0.5
Total variation weight (γ_tv)   0.001                                   0.001
Score mode                      dice                                    dice
Sparsity weight (λ_s)           0.01                                    0.01
Learning rate                   0.1                                     0.1
Foreground weight (α_fg)        2                                       2
Background weight (α_bg)        1                                       1
Temperature (τ)                 0.9                                     0.9
Mask size                       (960, 1344) CB55; (640, 960) UTP-110    (256, 512)
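The RISE-style mask generation configured in Table 4 can be sketched as below. Nearest-neighbor upsampling (`np.kron`) replaces the bilinear interpolation of the original RISE method for brevity, and the helper name is ours; `p` corresponds to the probability threshold and `cell` to the patch size:

```python
import numpy as np

rng = np.random.default_rng(0)

def rise_masks(n_masks, size, cell, p=0.5):
    """RISE-style random masks: a coarse binary grid (cells kept with
    probability p) is upsampled to image size and randomly shifted.
    The original method uses bilinear upsampling for smooth masks."""
    gh, gw = size[0] // cell[0] + 1, size[1] // cell[1] + 1
    masks = np.empty((n_masks, *size), dtype=np.float32)
    for i in range(n_masks):
        grid = (rng.random((gh + 1, gw + 1)) < p).astype(np.float32)
        up = np.kron(grid, np.ones(cell, dtype=np.float32))  # upsample
        dy = int(rng.integers(0, cell[0]))                   # random shift
        dx = int(rng.integers(0, cell[1]))
        masks[i] = up[dy:dy + size[0], dx:dx + size[1]]
    return masks

# Table 4's document setup uses 500 masks with 32 × 32 cells at full page
# resolution; a smaller configuration is shown here so the sketch runs fast.
m = rise_masks(8, size=(96, 64), cell=(32, 32))
```

Each mask is then multiplied with the input, and the DICE score of the masked prediction weights that mask's contribution to the final saliency map.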
Table 5. Comparison of performance using standard semantic segmentation metrics (accuracy, precision, recall, F1-score, and IoU) between the L-U-NET and S-U-NET models on the CB55 and UTP-110 datasets. The bold values indicate the best recorded metrics for each model on each dataset.
Dataset   Model     Accuracy   Precision   Recall   F1-Score   IoU
CB55      L-U-NET   0.914      0.926       0.914    0.920      0.855
CB55      S-U-NET   0.935      0.925       0.935    0.929      0.870
UTP-110   L-U-NET   0.951      0.912       0.951    0.930      0.870
UTP-110   S-U-NET   0.946      0.935       0.946    0.940      0.889
Table 6. Computational cost comparison of the L-U-NET and S-U-NET models across the CB55 and UTP-110 datasets.
Model     #Params   Dataset   Inference Time   Training Time
L-U-NET   17K       CB55      4.82 s           5 h
L-U-NET   17K       UTP-110   3.33 s           2 h
S-U-NET   31M       CB55      4.81 s           10 h
S-U-NET   31M       UTP-110   3.53 s           3 h
Table 7. Computational costs required to generate explanations in the form of attribution maps using four different XAI-based methods, as well as their evaluation, after applying L-U-NET and S-U-NET on the images of the test sets of the CB55 and UTP-110 datasets. The cells highlighted in red indicate the largest values among all the methods.
                               Attribution Maps       Infid                  Sens_Max               CH and ACS
Dataset   Model     Method     Time (s)  Memory (GB)  Time (s)  Memory (GB)  Time (s)  Memory (GB)  Time (s)  Memory (GB)
CB55      L-U-NET   GradCAM    22        2.2          480       5.9          300       7.5          35        3.2
CB55      L-U-NET   LRP        51        5.5          540       6.5          600       19           104       6.6
CB55      L-U-NET   DeepLift   36        11.4         480       8            480       26.5         50        8.8
CB55      L-U-NET   G*I        22        2.2          480       5.6          300       7.5          36        3.2
CB55      S-U-NET   GradCAM    101       7.3          2040      21.5         1200      27.3         129       9.7
CB55      S-U-NET   LRP        235       23           2400      27.6         2760      29.9         285       24
CB55      S-U-NET   DeepLift   ~5400     36.6         ~50,400   32           ~56,300   98.6         ~5400     30
CB55      S-U-NET   G*I        100       7.3          2040      21.5         1260      27.3         159       9.7
UTP-110   L-U-NET   GradCAM    20        1.2          900       3.5          240       4.3          33        2.2
UTP-110   L-U-NET   LRP        43        3            1020      3.8          540       9.9          57        3.8
UTP-110   L-U-NET   DeepLift   32        3.5          900       4.4          420       13.3         44        5
UTP-110   L-U-NET   G*I        20        1.2          900       3.5          240       4.4          32        2.2
UTP-110   S-U-NET   GradCAM    85        3.8          ~3600     11           ~1080     13.8         157       5.4
UTP-110   S-U-NET   LRP        196       11           ~3600     13.9         ~2280     16.5         266       12.1
UTP-110   S-U-NET   DeepLift   159       23.3         ~93,600   15.9         ~48,500   47.7         172       15
UTP-110   S-U-NET   G*I        85        3.8          ~3600     11           ~1080     13.8         160       5.4
Table 8. Quantitative comparison of XAI methods across datasets (CB55, UTP-110, and Cityscapes) for input-layer explanations. Reported metrics include content heatmap (CH), pixel grouping (PG), and Attribution Concordance Score (ACS). The ACS values correspond to the Otsu-thresholded variant. The ↑ symbol indicates that higher values are better. The green cells mark the best-scoring method on each dataset.
Dataset            Method             CH ↑    PG ↑    ACS ↑
CB55 (Text)        DeepLIFT           0.415   1.000   0.630
CB55 (Text)        Grad-CAM           0.569   1.000   0.738
CB55 (Text)        Input × Gradient   0.554   0.000   0.391
CB55 (Text)        LRP                0.637   0.000   0.025
CB55 (Text)        MiSuRe             0.692   0.000   0.544
CB55 (Text)        RISE               0.558   0.000   0.181
UTP-110 (Text)     DeepLIFT           0.451   1.000   0.832
UTP-110 (Text)     Grad-CAM           0.553   1.000   0.735
UTP-110 (Text)     Input × Gradient   0.421   0.000   0.588
UTP-110 (Text)     LRP                0.306   0.000   0.647
UTP-110 (Text)     MiSuRe             0.807   0.000   0.601
UTP-110 (Text)     RISE               0.448   1.000   0.249
Cityscapes (Car)   DeepLIFT           0.474   0.000   0.627
Cityscapes (Car)   Grad-CAM           0.661   1.000   0.793
Cityscapes (Car)   Input × Gradient   0.637   0.000   0.748
Cityscapes (Car)   LRP                0.601   1.000   0.678
Cityscapes (Car)   MiSuRe             0.480   0.000   0.910
Cityscapes (Car)   RISE               0.557   1.000   0.721