SegClarity: An Attribution-Based XAI Workflow for Evaluating Historical Document Layout Models
Abstract
1. Introduction

- Broader XAI Method Integration: SegClarity extends beyond the limited set of attribution-based methods used in DocSegExp by incorporating four XAI techniques: Grad-CAM, DeepLIFT, LRP, and Gradient × Input.
- Compatibility with Perturbation-Based XAI: SegClarity is fully compatible with perturbation-based methods such as RISE and MiSuRe, revealing that document layout analysis, unlike other segmentation domains, supports only a subset of XAI techniques that can effectively capture its complex structural characteristics (see Section 5).
- Layer-Wise Attribution Analysis: Unlike DocSegExp, which generated attributions solely from the final layer, SegClarity performs multilayer attribution analysis across five U-Net decoder layers, enabling a deeper understanding of model behavior throughout the feature hierarchy.
- Expanded Evaluation Metrics: While DocSegExp relied primarily on ADCC and wADCC, SegClarity introduces a more comprehensive evaluation suite, including Infidelity [22], Sensitivity [22], content heatmap (CH) [23], pointing game (PG) [24], and our novel Attribution Concordance Score (ACS) for assessing both robustness and interpretability.
- Dual Evaluation Framework: SegClarity introduces a combined quantitative and qualitative evaluation strategy. Quantitative metrics measure faithfulness and stability, while qualitative visualizations enhanced through post-processing and normalization improve interpretability and human alignment.
- Enhanced and Diverse Experimental Validation: DocSegExp was evaluated only on a synthetic dataset, whereas SegClarity includes two real-world historical document datasets (CB55 and UTP-110), offering a more diverse and robust validation.
- Cross-Domain Generalization: Finally, beyond document layout analysis, SegClarity demonstrates domain transferability by successfully adapting the workflow to urban scene segmentation using the Cityscapes dataset, confirming its generalizability across semantic segmentation domains.
- We propose SegClarity, a comprehensive workflow dedicated to evaluating segmentation-based applications through advanced visualization and normalization tools and a broad set of evaluation metrics. SegClarity contributes toward human-understandable DNNs tailored to pixel-wise semantic segmentation of HDI.
- We adapt and evaluate six state-of-the-art XAI methods and four metrics for HDI, enabling a rigorous assessment of model explanation quality.
- We present a targeted perturbation technique that leverages semantic annotations to generate meaningful perturbations, specifically designed to enable the accurate evaluation of interpretability in pixel-wise prediction tasks with faithfulness-based metrics.
- We introduce a metric, called the Attribution Concordance Score (ACS), designed to enhance the robustness of explainability assessments. This metric is tailored to evaluate the alignment of model attributions with the semantic and structural characteristics of HDI layouts. We demonstrate the effectiveness of our metric both quantitatively, by comparison with two state-of-the-art metrics, and qualitatively, through expert and non-expert human assessments across three different datasets.
- We evaluate the domain transferability of the proposed workflow and validate its generalization by extending our experiments beyond HDI analysis to include Cityscapes, one of the most widely used and complex benchmarks in the state-of-the-art for urban scene segmentation.
2. Literature Review
2.1. Explainable Semantic Segmentation Models
2.2. XAI Method Categorization
- Gradient-based methods: These are a class of XAI techniques that leverage gradients to identify the input features most influential on a model's predictions. By analyzing how small changes in the input affect the output, these methods highlight critical image regions for interpretability. One of the most widely used techniques is Grad-CAM, which generates a localization map by using gradients within a CNN to highlight critical regions for a target concept [18]. Grad-CAM uses the gradients of the target class score with respect to the feature map activations of a convolutional layer. The importance weight for each feature map is computed as follows:

  $\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$,

  where c is the target class, k is the filter index of a convolutional layer, i and j denote the spatial coordinates of the feature map, and Z is the number of pixels in the feature map. The Grad-CAM heatmap is then obtained as follows:

  $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$.

  While Grad-CAM effectively identifies discriminative regions, it often produces coarse explanations due to its dependence on high-level feature maps. To overcome this limitation, Guided Grad-CAM [18] combines Grad-CAM with Guided Backpropagation [45], which preserves fine-grained details from lower convolutional layers. Guided Backpropagation modifies the backward ReLU operation by allowing only positive gradients to flow:

  $R^l = \mathbb{1}(a^l > 0) \cdot \mathbb{1}(R^{l+1} > 0) \cdot R^{l+1}$,

  where $R^l$ represents the relevance map at layer l, and $a^l$ denotes the activations. The final Guided Grad-CAM map is obtained by element-wise multiplying the high-resolution Guided Backpropagation map with the coarse Grad-CAM map:

  $L^c_{\text{Guided Grad-CAM}} = R^0 \odot L^c_{\text{Grad-CAM}}$.

  G*I (Gradient × Input) is a different gradient method that multiplies the gradient of the output with respect to the input by the input itself, producing a saliency map that identifies the pixels most influential on the outcome [38].
The attributions for each input pixel in x are given by the following:

  $A_{ij} = x_{ij} \cdot \frac{\partial f(x)}{\partial x_{ij}}$,

where i and j denote the spatial coordinates of the input image. G*I provides an intuitive measure of how changes in the input directly affect the output prediction, highlighting influential regions in the image.
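To make the G*I computation concrete, here is a minimal numpy sketch. The `gradient_x_input` helper and the toy linear scorer are illustrative assumptions: the gradient is supplied analytically (for a linear score $s(x) = \langle w, x \rangle$ it is simply $w$), whereas in practice it would come from backpropagation through the network.

```python
import numpy as np

def gradient_x_input(grad, x):
    """Gradient x Input: element-wise product of d(score)/dx with x itself."""
    return grad * x

# Toy differentiable scorer s(x) = <w, x>; its gradient w.r.t. x is simply w.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))      # plays the role of an input "image"
w = rng.normal(size=(4, 4))      # plays the role of d s / d x
attr = gradient_x_input(w, x)

assert attr.shape == x.shape
# Pixels where either the input or the gradient is zero receive zero attribution.
assert gradient_x_input(np.zeros_like(w), x).sum() == 0.0
```

The element-wise product means a pixel is deemed influential only when both its value and the model's sensitivity to it are non-zero.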
- Perturbation-based methods: These assess feature importance by systematically modifying input data and analyzing the model's response to these alterations. Local interpretable model-agnostic explanations (LIME) perturbs input features and fits a locally interpretable model to approximate the behavior of complex models, highlighting the contribution of individual features [46]. Similarly, Shapley additive explanations (SHAP) leverages Shapley values from cooperative game theory, systematically perturbing input features to quantify their influence on model predictions [47]. In the same vein, the RISE method [48] generates a large number of random binary masks that selectively occlude parts of the input image. Each mask's contribution to the model prediction is measured, and the importance of each pixel is estimated as the weighted average over these random perturbations, as described in Figure 2. Unlike LIME or SHAP, RISE is model-agnostic and gradient-free, making it applicable even to non-differentiable models. While RISE uses random perturbations to estimate pixel importance, MiSuRe [33] takes a more targeted approach by providing a dual explanation framework. MiSuRe systematically identifies both the minimal set of features sufficient for maintaining the current prediction (sufficiency) and the minimal modifications required to alter the prediction to a different class (necessity). Through an optimization-based formulation that balances explanation compactness and prediction confidence, MiSuRe delivers complementary perspectives on feature importance via sufficient and counterfactual explanations.
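The RISE estimate described above can be sketched in a few lines of numpy. The function name `rise_saliency`, the keep probability `p_keep`, and the toy region-scoring model are illustrative assumptions, not the authors' implementation; a real setup would also upsample low-resolution masks as in the original RISE paper.

```python
import numpy as np

def rise_saliency(model, image, n_masks=500, p_keep=0.5, seed=0):
    """RISE-style saliency: average random binary masks weighted by the
    model's score on each masked image (gradient-free, model-agnostic)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    sal = np.zeros((h, w))
    for _ in range(n_masks):
        mask = (rng.random((h, w)) < p_keep).astype(float)
        sal += model(image * mask) * mask
    return sal / (n_masks * p_keep)   # normalize by expected mask coverage

# Toy "model": scores the total intensity of the top-left 2x2 region.
def model(img):
    return float(img[:2, :2].sum())

img = np.ones((8, 8))
sal = rise_saliency(model, img)
# Pixels inside the scored region should receive higher importance on average.
assert sal[:2, :2].mean() > sal[2:, 2:].mean()
```

Because importance is read off from score changes alone, the same sketch works for any black-box scorer, which is what makes RISE attractive for non-differentiable pipelines.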
- Decomposition-based methods: These focus on breaking down a model's decision-making process into interpretable components by redistributing relevance scores across input features. Unlike perturbation-based methods, which remove or modify input regions, decomposition-based techniques directly trace the flow of information through a neural network to determine how different parts of the input contribute to the output. A prominent example is LRP, which assigns importance scores to input pixels by backpropagating relevance from the output layer to the input, ensuring that the total relevance is conserved across layers [16]. LRP redistributes relevance scores R, layer by layer, using the following propagation rule:

  $R_p^{(l)} = \sum_q \frac{z_{pq}}{\sum_{p'} z_{p'q}} R_q^{(l+1)}$,

  where $z_{pq}$ represents the contribution of neuron p to neuron q, and $R_q^{(l+1)}$ is the relevance of neuron q in the upper layer. The redistribution follows conservation principles, ensuring relevance is neither created nor lost. Another known method is DeepLIFT, which calculates contribution scores by comparing neuron activations to a reference, enhancing traditional gradient methods by considering neuron reference activations [17]. DeepLIFT defines the contribution score $C_{\Delta n \Delta y}$ of an input neuron n to an output neuron y through the summation-to-delta property

  $\sum_n C_{\Delta n \Delta y} = \Delta y$,

  where $n^0$ is the reference activation, and $\Delta n = n - n^0$ is the difference between the actual and reference activation of neuron n. DeepLIFT ensures that small variations in the input, which do not significantly affect the output, receive small attributions.
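As an illustration of the conservation principle, here is a minimal sketch of an LRP ε-rule for a single linear layer; the function name, the stabilizer value, and the deterministic toy weights are assumptions for this example only, not the paper's implementation.

```python
import numpy as np

def lrp_epsilon_linear(a, W, b, R_out, eps=1e-6):
    """LRP epsilon rule for one linear layer z = W a + b: redistribute the
    upper-layer relevance R_out onto the inputs a, approximately conserving
    the total relevance."""
    z = W @ a + b                          # forward pre-activations
    s = R_out / (z + eps * np.sign(z))     # stabilized relevance per output
    return a * (W.T @ s)                   # R_p = a_p * sum_q W_qp s_q

a = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, -0.5]])
R_out = np.array([1.0, -0.5])
R_in = lrp_epsilon_linear(a, W, np.zeros(2), R_out)

# Conservation: with b = 0, the total relevance is preserved (up to eps).
assert abs(R_in.sum() - R_out.sum()) < 1e-4
```

The assertion checks exactly the property the text highlights: relevance is neither created nor lost as it flows back through the layer.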
3. Proposed Attribution-Based XAI Workflow
- Attribution map generation has two complementary stages:
- (a)
- Segmentation adaptation reshapes dense segmentation outputs into a classification-like form (via the adapter in Algorithm A1);
- (b)
- Attribution computation, applying four different XAI methods to generate layer-wise maps.
- Post-processing improves readability through a clipping step to suppress outliers and a normalization step that separates positive/negative evidence.
- Quantitative evaluation measures explanation quality using Infidelity with targeted perturbations, Sensitivity_max, content heatmap (CH), pointing game (PG), and our novel metric, the Attribution Concordance Score (ACS).
- Qualitative evaluation visually inspects positive/negative and blended overlays on the input images to contextualize the maps.
- Hybrid evaluation combines the quantitative and qualitative phases by leveraging human assessments of the generated heatmaps to examine how well the evaluation metrics align with human interpretation.
3.1. Attribution Map Generation
- Segmentation adaptation: We reshape the dense outputs of segmentation networks into a classification-like format, enabling the direct use of established attribution methods.
- Attribution computation: We adapt and apply attribution methods to produce fine-grained attribution maps from selected layers of the network.
3.2. Post-Processing
- Noise and artifacts: Attribution maps may include noise or irrelevant artifacts inherited from the dataset [49] that can dominate the visualization if not addressed. Therefore, applying normalization and post-processing steps (e.g., smoothing, denoising, or thresholding) helps to highlight the most relevant regions and suppress irrelevant details.
- Dynamic range and scaling: Attribution maps often have values in a wide or inconsistent range or very small floating-point values depending on the method and the software implementation used. Ignoring normalization or scaling can obscure patterns, as raw values may not map well to a perceivable range of colors or intensities.
3.2.1. Clipping Step
3.2.2. Normalization Step
- Clear distinction between positive and negative values;
- Representation of the relationship between target- and non-target-class attribution pixels, for comparison with the ground truth;
- Improved visuals for qualitative inspection.
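One plausible implementation of a polarity-preserving normalization with the properties listed above is sketched below; the per-sign scaling rule is an assumption for illustration, not the paper's exact formula.

```python
import numpy as np

def signed_normalize(attr):
    """Normalize positive and negative attributions independently to [-1, 1],
    so both polarities stay visible regardless of their absolute scale."""
    out = np.zeros_like(attr, dtype=float)
    pos, neg = attr > 0, attr < 0
    if pos.any():
        out[pos] = attr[pos] / attr[pos].max()        # positives -> (0, 1]
    if neg.any():
        out[neg] = attr[neg] / np.abs(attr[neg]).max()  # negatives -> [-1, 0)
    return out

a = np.array([[4.0, -2.0],
              [0.0, -8.0]])
n = signed_normalize(a)
assert n.max() == 1.0 and n.min() == -1.0 and n[1, 0] == 0.0
```

Scaling each polarity by its own maximum keeps weak negative evidence from being crushed by a few large positive values, which matches the goal of comparing target against non-target attributions.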
3.3. Quantitative Evaluation
3.4. Qualitative Evaluation
3.5. Hybrid Evaluation
3.5.1. Infidelity
- $\Phi(x)$ is the explanation map for input x;
- $I$ denotes a perturbation of the input x, sampled from a distribution $D$;
- $f(x)$ represents the model output on x;
- $f(x - I)$ represents the model output on the perturbed image;
- $\mathbb{E}_{I \sim D}$ denotes the expected value over $I \sim D$.

- $\Phi_c(x)$ is the explanation map specific to class label $c$ for input image x;
- $I$ denotes a perturbation that depends on the input x and the label $c$, sampled from a distribution $D$;
- $f_c(x)$ represents the model output for label $c$ at input x, producing a per-pixel confidence score or probability map (i.e., selecting the probability map of the target label $c$ only);
- $I \cdot \Phi_c(x)$ denotes the element-wise dot product, reflecting the predicted change in the model output according to the explanation $\Phi_c(x)$;
- $\mathbb{E}_{I \sim D}$ denotes the expected value over $I \sim D$.
| Algorithm 1 Procedure |
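A Monte-Carlo sketch of the Infidelity definition above, using Gaussian perturbations; the sampling scheme, the function name, and the toy linear model (whose exact explanation is its own weight vector) are assumptions for illustration.

```python
import numpy as np

def infidelity(model, attr, x, n_samples=200, sigma=0.1, seed=0):
    """Monte-Carlo infidelity: squared gap between the explanation's
    predicted output change (I . attr) and the actual model change
    f(x) - f(x - I), under random perturbations I ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    fx, err = model(x), 0.0
    for _ in range(n_samples):
        I = rng.normal(scale=sigma, size=x.shape)
        err += (np.sum(I * attr) - (fx - model(x - I))) ** 2
    return err / n_samples

# For a linear model f(x) = <w, x>, the perfectly faithful explanation is w,
# so its infidelity should be numerically zero.
w = np.array([1.0, -2.0, 3.0])
model = lambda x: float(w @ x)
x = np.array([0.5, 0.1, -0.3])
assert infidelity(model, w, x) < 1e-12
assert infidelity(model, np.zeros(3), x) > 0.0
```

The linear sanity check makes the metric's intent visible: a faithful explanation predicts exactly how the output moves under perturbations, and any deviation is penalized quadratically.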
3.5.2. Sensitivity
- $\delta$ denotes a perturbation vector applied to the input x, with $\|\delta\| \le r$ constraining it to a neighborhood of radius r around x;
- $\Phi(f, x)$ is the explanation map generated by the explanation functional $\Phi$ for the model f at the input x;
- $\Phi(f, x + \delta)$ is the explanation produced for the perturbed input $x + \delta$, providing insight into the model's sensitivity to local changes;
- $\|\cdot\|$ represents a norm (commonly the $\ell_2$-norm), reflecting the magnitude of the difference in the explanation values.
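A sampling-based sketch of Sensitivity_max follows; the uniform sampling inside the radius-r ball and the function name are assumptions (the metric itself only requires $\|\delta\| \le r$), so this is an estimate of the supremum, not its exact value.

```python
import numpy as np

def max_sensitivity(explain, x, radius=0.1, n_samples=50, seed=0):
    """Monte-Carlo estimate of Sensitivity_max: the largest change in the
    explanation under perturbations within a small ball around the input."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius, size=x.shape)
        worst = max(worst, float(np.linalg.norm(explain(x + delta) - base)))
    return worst

x = np.zeros(4)
# A constant explanation function has zero sensitivity by construction.
assert max_sensitivity(lambda v: np.ones(4), x) == 0.0
# An explanation equal to the input itself changes with every perturbation.
assert max_sensitivity(lambda v: v, x) > 0.0
```

Lower values indicate a more stable explainer: tiny input changes should not flip which regions the heatmap highlights.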
3.5.3. Pointing Game
- $PG_c$ represents the score for class $c$;
- $A(i, j)$ is the attribution value at pixel $(i, j)$;
- $M_c(i, j)$ is a binary mask indicating whether pixel $(i, j)$ belongs to class $c$.
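A minimal sketch of a pointing-game hit test for one attribution map and one class mask; the function name is an assumption, and in practice the hit rate is averaged over many images to yield the reported score.

```python
import numpy as np

def pointing_game(attr, mask):
    """Pointing game hit: 1 if the maximum-attribution pixel falls inside
    the binary ground-truth mask of the class, else 0."""
    idx = np.unravel_index(np.argmax(attr), attr.shape)
    return int(mask[idx] == 1)

attr = np.zeros((4, 4)); attr[1, 2] = 5.0
mask = np.zeros((4, 4), dtype=int); mask[1, 2] = 1
assert pointing_game(attr, mask) == 1      # peak lands on the class
assert pointing_game(attr, 1 - mask) == 0  # peak lands off the class
```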
3.5.4. Content Heatmap
- $CH_c$ represents the value for class $c$;
- $A(i, j)$ is the attribution value at pixel $(i, j)$;
- $M_c(i, j)$ is a binary mask indicating whether pixel $(i, j)$ belongs to class $c$.
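A minimal sketch of the content-heatmap ratio, assuming it is computed as the share of positive attribution mass inside the class mask (the small `eps` guard is an implementation assumption).

```python
import numpy as np

def content_heatmap(attr, mask, eps=1e-12):
    """Content heatmap: fraction of total positive attribution mass that
    falls inside the binary ground-truth mask of the class."""
    pos = np.clip(attr, 0, None)
    return float((pos * mask).sum() / (pos.sum() + eps))

attr = np.ones((2, 2))
mask = np.array([[1, 1],
                 [0, 0]])
# Half of the uniform positive mass lies on the class region.
assert abs(content_heatmap(attr, mask) - 0.5) < 1e-9
```

Unlike the pointing game, which only checks the single strongest pixel, this ratio rewards explanations whose entire positive mass concentrates on the class region.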
3.5.5. Proposed ACS Metric
- Attribution on target mask, defined as $A_T = \sum_{i,j} A(i,j)\, \mathbb{1}_c(i,j)$, where $\mathbb{1}_c$ is the indicator function defined in Equation (14).
- Attribution on non-target mask, defined as $A_{NT} = \sum_{i,j} A(i,j)\, \big(1 - \mathbb{1}_c(i,j)\big)$.
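Since the paper's exact combination of the target and non-target terms is given in its own equations, the sketch below assumes a simple ratio of mean positive attribution on the target versus non-target masks; both the function name and this combination rule are illustrative.

```python
import numpy as np

def acs(attr, target_mask, eps=1e-12):
    """Sketch of an Attribution Concordance Score: mean positive attribution
    on the target mask versus the non-target mask, combined into a [0, 1]
    ratio (the paper's exact combination rule may differ)."""
    pos = np.clip(attr, 0, None)
    on_t = pos[target_mask == 1].mean() if (target_mask == 1).any() else 0.0
    on_nt = pos[target_mask == 0].mean() if (target_mask == 0).any() else 0.0
    return float(on_t / (on_t + on_nt + eps))

attr = np.array([[2.0, 2.0],
                 [0.0, 0.0]])
mask = np.array([[1, 1],
                 [0, 0]])
# All positive evidence lies on the target: concordance approaches 1.
assert acs(attr, mask) > 0.99
```

The score rises toward 1 when positive evidence concentrates on the target class and toward 0 when it leaks onto non-target regions, which is the alignment the ACS is designed to measure.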
3.5.6. Adaptive ACS Metric
- $\sigma_B^2(t)$ is the between-class variance at threshold $t$;
- $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by the threshold $t$;
- $\mu_0(t)$ and $\mu_1(t)$ are the means of the two classes separated by the threshold $t$.
- Divide the attribution map $A$ into two components: positive attributions ($A^+$) and negative attributions ($A^-$);
- Perform the Otsu method on $A^+$ to determine the threshold $t^+$ and retain values exceeding this threshold;
- Generate $\hat{A}^+$ by assigning the threshold $t^+$ to the values in $A^+$ that exceed $t^+$;
- Perform the Otsu method on $A^-$ to determine the threshold $t^-$ and retain values less than this threshold;
- Generate $\hat{A}^-$ by assigning the threshold $t^-$ to the values in $A^-$ that fall below $t^-$;
- Merge the refined attribution maps $\hat{A}^+$ and $\hat{A}^-$ to obtain the refined attribution map $\hat{A}$.
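The Otsu-based refinement above can be sketched as follows. The histogram binning, the function names, and the choice to zero-out (rather than clamp) sub-threshold values are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np

def otsu_threshold(values, bins=64):
    """Otsu's method: pick the threshold maximizing the between-class
    variance sigma_B^2(t) = w0(t) * w1(t) * (mu0(t) - mu1(t))^2."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / max(hist.sum(), 1)
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = p[:i].sum(), p[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:i] * centers[:i]).sum() / w0
        mu1 = (p[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

def refine_attributions(attr):
    """Adaptive refinement sketch: Otsu-threshold the positive and negative
    parts separately, keep only the strong evidence, and merge."""
    pos, neg = np.clip(attr, 0, None), np.clip(attr, None, 0)
    t_pos = otsu_threshold(pos[pos > 0]) if (pos > 0).any() else 0.0
    t_neg = -otsu_threshold(-neg[neg < 0]) if (neg < 0).any() else 0.0
    refined = np.zeros_like(attr)
    refined[pos > t_pos] = attr[pos > t_pos]
    refined[neg < t_neg] = attr[neg < t_neg]
    return refined

a = np.array([0.05, 0.06, 0.9, 1.0, -0.04, -0.8])
r = refine_attributions(a)
# Weak evidence is suppressed; strong positive and negative values survive.
assert r[2] == 0.9 and r[5] == -0.8 and r[0] == 0.0
```

Thresholding each polarity separately is what makes the procedure adaptive: the cut-off is derived from the map's own value distribution instead of a fixed constant.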
3.5.7. Human Assessment
- High score (✓✓): In cases where the heatmap assigns positive attribution to the pixels that represent the class of interest;
- Medium score (✓): In cases where the heatmap does not assign positive attribution to all pixels of the class of interest or it assigns positive attribution to the pixels of the relevant class and only one additional class;
- Low score (×): In cases where the heatmap does not assign positive attribution to the pixels that represent the target class.
4. Experiments and Results
4.1. Experimental Corpora
- CB55 dataset (https://diuf.unifr.ch/main/hisdoc/diva-hisdb.html) (accessed on 8 November 2025): This is a freely available subset of the DIVA-HisDB dataset containing Medieval manuscripts. It is composed of 70 Latin handwritten document images digitized at 600 dpi (see Figure 10a). Four classes are defined in the ground truth: TXT (i.e., central text), HL (i.e., a special line separating paragraphs in the main text), GL (i.e., text on the page sides), and BG (i.e., background). The CB55 dataset presents various particularities (e.g., decorations and comments written in different calligraphy styles) [19].
- UTP-110 dataset: This is a subset of Medieval manuscripts (collection Utopia, armarium codicum bibliophilorum, Cod. 110) [20]. It contains 300 images, primarily in Latin with some sections in French (see Figure 10b). The UTP-110 images were resized to pixels while preserving the original aspect ratio. Seven classes are defined in the ground truth, as shown in Figure 10: background (BG), decoration (DEC), body (BD), text line (TXTL), big initial (BInit), small initial (SInit), and filler (FIL). The UTP-110 dataset presents complex challenges for layout analysis, including various types of ornaments, decorative text, faded writing, and ink bleed-through [20].
4.2. Experimental Protocol
- S-U-NET: This is the standard U-NET version featuring a high number of channels in each decoder block (512, 256, 128, and 64, from the bottleneck to the segmentation head), comprising over 31 million parameters [25].
- L-U-NET: This is a lightweight version of the standard U-NET introduced by Rahal et al. [12], which has only 16 channels in each decoder block and fewer than parameters.
- Pre-training was performed using the synthetic dataset, during which the model with the lowest validation loss was selected;
- Fine-tuning of the best-performing pre-trained model was then performed on the real CB55 dataset.
4.3. Explanation Parameters
4.4. Generalization Setup
4.5. Results
4.5.1. Model Performance
4.5.2. XAI Evaluation
- Average (A): This represents the average metric score across all classes, including both foreground and background;
- Foreground average (fA): This represents the average metric score computed over the foreground classes only, excluding the background class.
4.5.3. Qualitative Evaluation
- GradCAM: For L-U-NET, we observe that FL has strong positive attributions on the TXT pixels, with minimal noise around borders. However, intermediate layers, particularly Dec1 and Dec3, present intense noise, suggesting a negative effect caused by page borders on the model predictions. For S-U-NET, we also note similar positive attributions on the TXT pixels in FL, but with more distributed noise throughout the BG pixels. S-U-NET struggles more with noise in intermediate layers, particularly around the GL and borders, indicating challenges in distinguishing textual content from other document elements.
- LRP: For L-U-NET, we observe that the TXT pixels have positive attributions; however, negative attributions appear only within the text lines, especially in Dec3 and Dec4, while the background shows little to no attribution. For S-U-NET, we observe that negative attributions are more pronounced within the TXT pixels compared to L-U-NET, creating gaps or holes, particularly in Dec1 and Dec2. This suggests that S-U-NET struggles to consistently recognize the main text regions, starting from Dec4 onward.
- GradCAM: For L-U-NET, we observe a progressive refinement in attribution from Dec4 to FL. In Dec4, positive attributions appear around the TXTL pixels and within the BD pixels, with negative attributions on the BG and DEC pixels. By Dec3, L-U-NET starts differentiating the FIL and DEC pixels. In Dec1, positive attributions focus on the TXTL pixels, while the FIL and DEC pixels have negative attributions. FL shows strong positive attributions for the TXTL pixels, while the FIL and BG have negative attributions, indicating the improved ability of L-U-NET to distinguish text from other document elements. For S-U-NET, we observe a similar trend to L-U-NET. In Dec4, positive attributions appear on the TXTL pixels, while the DEC pixels show strong negative attributions. By Dec3, negative attributions emerge between text lines, and in Dec2, the SDC pixels also receive negative attributions. Dec1 further reinforces positive attributions on the TXTL pixels, while keeping negative attributions for the background. FL maintains strong positive attributions for the TXTL pixels and strong negative attributions for the DEC and BG pixels. Compared to L-U-NET, S-U-NET struggles more with the DEC pixels, while L-U-NET finds the FIL pixels more challenging.
- LRP: Both models present similar behavior. In Dec4, both models have positive attributions on the TXTL regions, with S-U-NET showing concentrated negative attributions near the FIL and DEC pixels. As we move through Dec3, Dec2, and Dec1, negative attribution fades, with increasing focus on the TXTL pixels. For FL, both models strongly attribute the TXTL regions, with earlier negative attributions disappearing. The key difference is the stronger negative attributions of S-U-NET around FIL and DEC in Dec4 compared to L-U-NET.
4.6. Human-Centric Alignment of XAI Metrics
4.6.1. Human-Expert Assessment
4.6.2. Metric Alignment
- (✓✓ for clear, unambiguous focus);
- (✓ for acceptable focus);
- (× for bad focus).
4.7. Saliency Overlay Analysis
- For GradCAM, heatmaps in Dec4 and Dec3 show strong interference from surrounding text, but from Dec2 onward, attributions stabilize with positive activations concentrated within the BInit region.
- LRP initially produces mixed signals in Dec4 but crucially captures the glyph’s internal “flower” symbol, and by Dec2 converges toward dense positive attribution within the ground truth area. Notably, we apply a log normalization in Equation (23) to handle extremely high attribution values in Dec1 while preserving patterns. Formally, given an attribution vector $a = (a_1, \ldots, a_n)$ with n values, each value is normalized as follows:

  $\hat{a}_i = \mathrm{sign}(a_i)\, \log(1 + |a_i|)$,

  where $a_i$ is the raw attribution value and $\hat{a}_i$ is its log-normalized form, preserving sign, compressing large magnitudes, and mapping zero values to zero.
- DeepLift exhibits the most stable and consistent results, with positive attributions concentrating on the inner glyph from Dec4 and becoming densely localized on the target by Dec2.
- In contrast, G*I fails to recognize the glyph in Dec4, producing scattered activations, but from Dec3 onward, its maps become more structured and resemble the DeepLift outputs.
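The sign-preserving log normalization used for the Dec1 maps above can be sketched in one line of numpy (the function name is an assumption):

```python
import numpy as np

def signed_log_normalize(attr):
    """Sign-preserving log compression: large magnitudes are squashed while
    the sign of each attribution and the zero point are preserved."""
    return np.sign(attr) * np.log1p(np.abs(attr))

a = np.array([0.0, 10.0, -1000.0])
n = signed_log_normalize(a)
assert n[0] == 0.0 and n[1] > 0 and n[2] < 0
# Compression: the magnitude ratio shrinks from 100x to under 3x.
assert abs(n[2]) / n[1] < 3.0
```

This keeps a handful of extreme attributions from saturating the color map while the spatial pattern of weaker evidence remains visible.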
4.8. Computational Cost
4.9. Synthesis
- GradCAM is efficient in terms of memory consumption and capable of highlighting negative attributions through gradients in the deeper layers. However, its reliability diminishes in lower or intermediate layers due to gradient vanishing issues.
- LRP demands greater computational resources but produces more consistent and informative attribution maps. Its ability to detect deeper anomalies stems from its adherence to the conservation of information principle.
- Build more robust feature representations;
- Assist the segmentation head in producing clearer segmentation boundaries;
- Enhance compatibility with a wide range of XAI methods for both visual and quantitative heatmap evaluations.
5. Generalization
6. Discussion
- They are inherently difficult to interpret, as attribution maps do not always resemble segmentation masks;
- Applying and adapting the existing XAI methods and measures to rich datasets with pixel-wise annotations presents significant computational challenges.
- The encoder blocks, as the network progresses in a downsampling manner, as well as the choice of encoder type to employ;
- The skip connections where an interesting study can be included by measuring the impact of each skip connection on the decoder block.
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1. Attribution Generation
Appendix A.1.1. Segmentation Adaptation
| Algorithm A1 Segmentation Adaptation |
Input: f (segmentation model), X (input batch), C (number of classes)
Output: $s \in \mathbb{R}^C$ (class-wise scalars)
1: $S \leftarrow f(X)$ (segmentation logits, shape $B \times C \times H \times W$)
2: $P \leftarrow \arg\max_c S$ (predicted class index per pixel, shape $B \times H \times W$)
3: Construct mask $M$ with $M_{b,c,i,j} = 1$ if $P_{b,i,j} = c$, else 0
4: $S' \leftarrow S \odot M$ (retain only the logits of the predicted class at each pixel)
5: $s_c \leftarrow \sum_{b,i,j} S'_{b,c,i,j}$ (aggregate over the spatial dimensions)
6: return $s$
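A numpy sketch of Algorithm A1's adapter, with the batch dimension omitted for brevity; the function name is an assumption, and a real pipeline would operate on framework tensors so gradients can flow back through the scalar.

```python
import numpy as np

def segmentation_adapter(logits):
    """Collapse dense segmentation logits (C, H, W) into C class-wise
    scalars by summing, at each pixel, only the logit of the predicted
    class (cf. Algorithm A1)."""
    C = logits.shape[0]
    pred = logits.argmax(axis=0)                           # class per pixel
    mask = (pred[None] == np.arange(C)[:, None, None])     # one-hot over C
    return (logits * mask).sum(axis=(1, 2))                # scalar per class

logits = np.zeros((2, 2, 2))
logits[0] = [[5.0, 5.0], [0.0, 0.0]]   # class 0 wins the top row
logits[1] = [[0.0, 0.0], [3.0, 3.0]]   # class 1 wins the bottom row
s = segmentation_adapter(logits)
assert s.tolist() == [10.0, 6.0]
```

The resulting vector looks like a classification output, which is exactly what lets off-the-shelf attribution methods (built for classifiers) be applied to a dense segmentation model.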
Appendix A.1.2. Attribution Computation
- f denotes the model, typically a DNN for segmentation or another prediction task;
- s is the selected output, a scalar quantity derived from the model predictions (e.g., the logit for a specific class at a pixel, or a guided score pooled over a region of interest);
- ℓ is the selected internal layer (input, intermediate, or output) whose contribution we aim to analyze.
Appendix A.1.3. Post-Processing Steps
| Algorithm A2 Clipping step |
Input: $a$ (vector of attribution values), $p$ (percentile parameter)
Output: $\hat{a}$ (post-processed vector)
1: $N \leftarrow \mathrm{length}(a)$
2: $a_{\mathrm{sorted}} \leftarrow \mathrm{sort}(a)$
3: $v_{\text{low}} \leftarrow \mathrm{Percentile}(a_{\mathrm{sorted}}, p)$
4: $v_{\text{high}} \leftarrow \mathrm{Percentile}(a_{\mathrm{sorted}}, 100 - p)$
5: $\hat{a} \leftarrow \mathrm{copy}(a)$
6: for $i \leftarrow 1$ to $N$ do
7:   if $\hat{a}_i < v_{\text{low}}$ then
8:     $\hat{a}_i \leftarrow v_{\text{low}}$
9:   else if $\hat{a}_i > v_{\text{high}}$ then
10:    $\hat{a}_i \leftarrow v_{\text{high}}$
11:  end if
12: end for
13: return $\hat{a}$
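The clipping step of Algorithm A2 reduces to a percentile lookup plus a clamp; the numpy sketch below assumes `np.percentile` in place of the algorithm's explicit loop.

```python
import numpy as np

def percentile_clip(attr, p=1.0):
    """Percentile clipping (cf. Algorithm A2): clamp values below the p-th
    and above the (100 - p)-th percentile so outliers cannot dominate the
    color scale of the visualized map."""
    lo, hi = np.percentile(attr, [p, 100 - p])
    return np.clip(attr, lo, hi)

# Two extreme outliers among 98 well-behaved values get clamped.
a = np.concatenate([np.linspace(-1, 1, 98), [100.0, -100.0]])
c = percentile_clip(a, p=2.0)
assert c.max() < 100.0 and c.min() > -100.0
assert c.max() <= np.percentile(a, 98.0) and c.min() >= np.percentile(a, 2.0)
```

Without this step, a single extreme attribution would compress the rest of the map into a near-uniform color, hiding exactly the structure the qualitative evaluation needs.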





Appendix A.2. ACS Metric

Appendix A.3. U-NET-Model

Appendix A.4. Explanation Results
Appendix A.4.1. Quantitative Results
| L-U-NET|GradCAM | ||||||||
|---|---|---|---|---|---|---|---|---|
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 0.86 | 0.39 | 0.09 | 0.06 | 0.71 | 0.78 | 0.57 | 0.53 |
| Dec3 | 0.85 | 0.32 | 0.06 | 0.05 | 0.72 | 0.80 | 0.74 | 0.67 |
| Dec2 | 0.73 | 0.32 | 0.09 | 0.04 | 0.71 | 0.72 | 0.74 | 0.67 |
| Dec1 | 0.72 | 0.32 | 0.03 | 0.03 | 0.77 | 0.77 | 0.64 | 0.66 |
| FL | 0.60 | 0.31 | 0.02 | 0.02 | 0.81 | 0.83 | 0.99 | 0.99 |
| S-U-NET|GradCAM | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 1.13 | 0.33 | 0.06 | 0.06 | 0.61 | 0.65 | 0.30 | 0.33 |
| Dec3 | 1.16 | 0.32 | 0.02 | 0.02 | 0.62 | 0.60 | 0.50 | 0.35 |
| Dec2 | 1.12 | 0.31 | 0.03 | 0.03 | 0.67 | 0.65 | 0.73 | 0.64 |
| Dec1 | 1.17 | 0.31 | 0.03 | 0.03 | 0.76 | 0.73 | 0.75 | 0.67 |
| FL | 1.10 | 0.31 | 0.01 | 0.01 | 0.80 | 0.83 | 1.00 | 1.00 |
| L-U-NET|LRP | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 0.85 | 0.41 | 1.75 | 1.89 | 0.55 | 0.60 | 0.48 | 0.57 |
| Dec3 | 0.85 | 0.33 | 1.83 | 1.98 | 0.61 | 0.57 | 0.63 | 0.77 |
| Dec2 | 0.83 | 0.33 | 1.83 | 1.86 | 0.58 | 0.55 | 0.73 | 0.79 |
| Dec1 | 0.71 | 0.35 | 1.76 | 1.85 | 0.55 | 0.51 | 0.73 | 0.77 |
| FL | 0.65 | 0.32 | 0.11 | 0.14 | 0.90 | 0.87 | 0.91 | 0.88 |
| S-U-NET|LRP |
| ↓ | ↓ | ↑ | ↑ |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 1.18 | 0.33 | 0.78 | 0.78 | 0.49 | 0.44 | 0.52 | 0.45 |
| Dec3 | 1.19 | 0.35 | 0.90 | 0.89 | 0.54 | 0.52 | 0.66 | 0.66 |
| Dec2 | 1.16 | 0.31 | 0.97 | 0.98 | 0.52 | 0.48 | 0.90 | 0.91 |
| Dec1 | 1.16 | 0.32 | 1.31 | 1.37 | 0.54 | 0.50 | 0.76 | 0.81 |
| FL | 1.10 | 0.32 | 0.06 | 0.08 | 0.92 | 0.90 | 0.96 | 0.95 |
| L-U-NET|DeepLift | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 0.87 | 0.33 | 0.04 | 0.02 | 0.53 | 0.54 | 0.77 | 0.88 |
| Dec3 | 0.84 | 0.32 | 0.03 | 0.02 | 0.55 | 0.55 | 0.96 | 0.98 |
| Dec2 | 0.82 | 0.32 | 0.04 | 0.03 | 0.60 | 0.61 | 0.96 | 0.99 |
| Dec1 | 0.57 | 0.31 | 0.04 | 0.04 | 0.65 | 0.65 | 0.98 | 0.98 |
| FL | 0.58 | 0.31 | 0.07 | 0.08 | 0.67 | 0.68 | 0.99 | 0.99 |
| S-U-NET|DeepLift | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 1.16 | 0.33 | 0.05 | 0.03 | 0.39 | 0.41 | 0.42 | 0.47 |
| Dec3 | 1.13 | 0.32 | 0.03 | 0.02 | 0.60 | 0.57 | 0.78 | 0.94 |
| Dec2 | 1.12 | 0.31 | 0.04 | 0.03 | 0.57 | 0.54 | 0.83 | 1.00 |
| Dec1 | 1.08 | 0.32 | 0.05 | 0.05 | 0.67 | 0.71 | 0.75 | 0.99 |
| FL | 1.09 | 0.32 | 0.10 | 0.10 | 0.69 | 0.74 | 0.76 | 1.00 |
| L-U-NET|G*I | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 0.82 | 0.40 | 0.17 | 0.14 | 0.50 | 0.54 | 0.65 | 0.81 |
| Dec3 | 0.80 | 0.32 | 0.08 | 0.05 | 0.59 | 0.54 | 0.77 | 0.98 |
| Dec2 | 0.78 | 0.32 | 0.19 | 0.08 | 0.56 | 0.56 | 0.95 | 0.98 |
| Dec1 | 0.60 | 0.32 | 0.11 | 0.10 | 0.56 | 0.53 | 0.96 | 0.96 |
| FL | 0.60 | 0.31 | 0.13 | 0.17 | 0.63 | 0.62 | 0.96 | 0.94 |
| S-U-NET|G*I | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 1.18 | 0.33 | 0.11 | 0.10 | 0.55 | 0.48 | 0.59 | 0.53 |
| Dec3 | 1.20 | 0.35 | 0.10 | 0.08 | 0.57 | 0.55 | 0.71 | 0.70 |
| Dec2 | 1.16 | 0.31 | 0.05 | 0.04 | 0.54 | 0.53 | 0.92 | 0.94 |
| Dec1 | 1.15 | 0.32 | 0.04 | 0.04 | 0.58 | 0.60 | 0.98 | 0.99 |
| FL | 1.10 | 0.31 | 0.02 | 0.03 | 0.58 | 0.58 | 1.00 | 1.00 |
| L-U-NET|GradCAM | ||||||||
|---|---|---|---|---|---|---|---|---|
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 4.71 | 1.09 | 0.10 | 0.11 | 0.63 | 0.61 | 0.47 | 0.39 |
| Dec3 | 5.78 | 1.08 | 0.07 | 0.08 | 0.67 | 0.67 | 0.50 | 0.43 |
| Dec2 | 4.89 | 1.10 | 1.70 | 1.98 | 0.74 | 0.73 | 0.63 | 0.57 |
| Dec1 | 1.58 | 1.06 | 0.04 | 0.05 | 0.81 | 0.79 | 0.63 | 0.74 |
| FL | 1.29 | 1.03 | 0.04 | 0.04 | 0.86 | 0.85 | 0.96 | 0.95 |
| S-U-NET|GradCAM | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 3.25 | 1.41 | 0.04 | 0.04 | 0.64 | 0.62 | 0.46 | 0.47 |
| Dec3 | 2.71 | 2.50 | 0.03 | 0.03 | 0.65 | 0.62 | 0.59 | 0.55 |
| Dec2 | 3.49 | 1.21 | 0.03 | 0.03 | 0.72 | 0.72 | 0.52 | 0.60 |
| Dec1 | 3.79 | 1.08 | 0.02 | 0.02 | 0.76 | 0.78 | 0.76 | 0.87 |
| FL | 1.42 | 1.09 | 0.01 | 0.01 | 0.85 | 0.83 | 0.96 | 0.95 |
| L-U-NET|LRP | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 7.22 | 2.33 | 2.18 | 2.20 | 0.50 | 0.48 | 0.54 | 0.57 |
| Dec3 | 6.93 | 2.31 | 2.21 | 2.21 | 0.44 | 0.42 | 0.48 | 0.53 |
| Dec2 | 6.32 | 2.04 | 2.40 | 2.40 | 0.40 | 0.38 | 0.49 | 0.50 |
| Dec1 | 3.52 | 2.39 | 2.51 | 2.56 | 0.43 | 0.40 | 0.39 | 0.45 |
| FL | 1.31 | 1.05 | 0.15 | 0.16 | 0.94 | 0.93 | 0.94 | 0.93 |
| S-U-NET|LRP | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 4.94 | 2.34 | 1.12 | 1.10 | 0.47 | 0.42 | 0.59 | 0.64 |
| Dec3 | 4.51 | 3.35 | 1.21 | 1.07 | 0.45 | 0.44 | 0.63 | 0.60 |
| Dec2 | 3.77 | 1.35 | 1.35 | 1.26 | 0.50 | 0.48 | 0.66 | 0.71 |
| Dec1 | 3.95 | 1.18 | 1.40 | 1.33 | 0.48 | 0.45 | 0.57 | 0.63 |
| FL | 1.45 | 1.12 | 0.07 | 0.07 | 0.93 | 0.92 | 0.95 | 0.94 |
| L-U-NET|DeepLift | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 4.69 | 1.20 | 0.07 | 0.06 | 0.60 | 0.64 | 0.82 | 0.81 |
| Dec3 | 4.98 | 1.28 | 0.07 | 0.06 | 0.63 | 0.68 | 0.87 | 0.87 |
| Dec2 | 1.28 | 1.04 | 0.09 | 0.10 | 0.69 | 0.69 | 0.85 | 0.83 |
| Dec1 | 1.26 | 1.01 | 0.09 | 0.10 | 0.73 | 0.73 | 0.89 | 0.87 |
| FL | 1.28 | 1.03 | 0.14 | 0.16 | 0.79 | 0.78 | 0.88 | 0.86 |
| S-U-NET|DeepLift | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 1.77 | 1.28 | 0.02 | 0.02 | 0.53 | 0.54 | 0.80 | 0.77 |
| Dec3 | 1.88 | 1.58 | 0.02 | 0.02 | 0.60 | 0.61 | 0.91 | 0.89 |
| Dec2 | 1.52 | 1.16 | 0.03 | 0.03 | 0.62 | 0.63 | 0.92 | 0.91 |
| Dec1 | 1.39 | 1.07 | 0.03 | 0.04 | 0.71 | 0.71 | 0.94 | 0.93 |
| FL | 1.42 | 1.10 | 0.05 | 0.06 | 0.77 | 0.76 | 0.96 | 0.95 |
| L-U-NET|G*I | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 6.85 | 1.95 | 0.33 | 0.30 | 0.52 | 0.49 | 0.73 | 0.75 |
| Dec3 | 6.50 | 1.85 | 0.22 | 0.19 | 0.54 | 0.52 | 0.72 | 0.79 |
| Dec2 | 5.45 | 1.51 | 0.27 | 0.21 | 0.50 | 0.48 | 0.70 | 0.75 |
| Dec1 | 2.30 | 1.87 | 0.28 | 0.29 | 0.49 | 0.47 | 0.72 | 0.84 |
| FL | 1.29 | 1.05 | 0.07 | 0.08 | 0.75 | 0.74 | 0.96 | 0.95 |
| S-U-NET|G*I | ||||||||
| ↓ | ↓ | ↑ | ↑ | |||||
| Layer | A | fA | A | fA | A | fA | A | fA |
| Dec4 | 4.82 | 2.24 | 0.08 | 0.07 | 0.51 | 0.46 | 0.67 | 0.73 |
| Dec3 | 4.10 | 3.23 | 0.08 | 0.07 | 0.52 | 0.50 | 0.83 | 0.81 |
| Dec2 | 3.55 | 1.29 | 0.10 | 0.08 | 0.61 | 0.58 | 0.86 | 0.90 |
| Dec1 | 3.83 | 1.09 | 0.11 | 0.05 | 0.68 | 0.66 | 0.89 | 0.93 |
| FL | 1.43 | 1.11 | 0.03 | 0.04 | 0.76 | 0.73 | 0.96 | 0.95 |
Appendix A.4.2. Qualitative Results




Appendix A.4.3. Assessment
| GradCAM | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| TXT | Dec4 | ✓ | 0.000 | 0.488 | 0.495 |
| | Dec3 | ✓✓ | 1.000 | 0.613 | 0.984 |
| | Dec2 | ✓✓ | 1.000 | 0.594 | 0.998 |
| | Dec1 | ✓✓ | 1.000 | 0.618 | 0.999 |
| | FL | ✓✓ | 1.000 | 0.715 | 0.999 |
| HL | Dec4 | ✓ | 0.308 | 0.470 | 0.000 |
| | Dec3 | ✓ | 0.308 | 0.584 | 0.000 |
| | Dec2 | ✓ | 0.308 | 0.608 | 0.019 |
| | Dec1 | ✓✓ | 0.538 | 0.620 | 0.038 |
| | FL | ✓ | 0.692 | 0.757 | 0.996 |
| GL | Dec4 | ✓ | 0.000 | 0.645 | 0.498 |
| | Dec3 | ✓ | 0.000 | 0.511 | 0.153 |
| | Dec2 | ✓ | 1.000 | 0.589 | 0.907 |
| | Dec1 | ✓✓ | 1.000 | 0.607 | 0.982 |
| | FL | ✓✓ | 1.000 | 0.712 | 0.993 |

| LRP | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| TXT | Dec4 | ✓✓ | 0.100 | 0.534 | 0.518 |
| | Dec3 | ✓✓ | 0.850 | 0.391 | 0.929 |
| | Dec2 | ✓ | 0.700 | 0.419 | 0.922 |
| | Dec1 | ✓ | 0.700 | 0.526 | 0.855 |
| | FL | ✓✓ | 0.500 | 0.923 | 0.966 |
| HL | Dec4 | ✓✓ | 0.462 | 0.408 | 0.200 |
| | Dec3 | ✓✓ | 0.615 | 0.532 | 0.818 |
| | Dec2 | ✓✓ | 0.692 | 0.456 | 0.853 |
| | Dec1 | ✓✓ | 0.538 | 0.335 | 0.724 |
| | FL | ✓✓ | 0.308 | 0.890 | 0.953 |
| GL | Dec4 | ✓✓ | 0.278 | 0.440 | 0.597 |
| | Dec3 | ✓✓ | 0.222 | 0.555 | 0.303 |
| | Dec2 | ✓✓ | 0.833 | 0.496 | 0.899 |
| | Dec1 | ✓✓ | 0.667 | 0.553 | 0.799 |
| | FL | ✓✓ | 0.333 | 0.872 | 0.943 |

| DeepLift | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| TXT | Dec4 | ✓✓ | 0.000 | 0.367 | 0.573 |
| | Dec3 | ✓✓ | 1.000 | 0.523 | 0.995 |
| | Dec2 | × | 1.000 | 0.543 | 0.999 |
| | Dec1 | × | 1.000 | 0.712 | 0.995 |
| | FL | × | 1.000 | 0.726 | 0.999 |
| HL | Dec4 | ✓✓ | 0.385 | 0.398 | 0.205 |
| | Dec3 | ✓✓ | 0.692 | 0.571 | 0.988 |
| | Dec2 | ✓✓ | 0.692 | 0.571 | 0.999 |
| | Dec1 | ✓✓ | 0.692 | 0.755 | 0.991 |
| | FL | ✓✓ | 0.692 | 0.789 | 0.994 |
| GL | Dec4 | ✓ | 0.000 | 0.505 | 0.618 |
| | Dec3 | × | 0.333 | 0.568 | 0.850 |
| | Dec2 | × | 1.000 | 0.487 | 0.993 |
| | Dec1 | × | 1.000 | 0.665 | 0.988 |
| | FL | × | 1.000 | 0.700 | 0.996 |

| G*I | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| TXT | Dec4 | ✓✓ | 0.000 | 0.493 | 0.627 |
| | Dec3 | ✓✓ | 0.950 | 0.497 | 0.994 |
| | Dec2 | ✓✓ | 1.000 | 0.523 | 0.999 |
| | Dec1 | ✓✓ | 1.000 | 0.585 | 0.998 |
| | FL | ✓✓ | 1.000 | 0.574 | 0.999 |
| HL | Dec4 | ✓✓ | 0.462 | 0.486 | 0.314 |
| | Dec3 | ✓✓ | 0.692 | 0.539 | 0.777 |
| | Dec2 | ✓✓ | 0.692 | 0.511 | 0.843 |
| | Dec1 | ✓✓ | 0.692 | 0.592 | 0.973 |
| | FL | ✓✓ | 0.692 | 0.564 | 0.996 |
| GL | Dec4 | ✓✓ | 0.278 | 0.500 | 0.636 |
| | Dec3 | ✓✓ | 0.111 | 0.527 | 0.399 |
| | Dec2 | ✓✓ | 1.000 | 0.524 | 0.987 |
| | Dec1 | ✓✓ | 1.000 | 0.607 | 0.991 |
| | FL | ✓✓ | 1.000 | 0.575 | 0.996 |
| GradCAM | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| DEC | Dec4 | ✓ | 0.737 | 0.610 | 0.938 |
| | Dec3 | ✓✓ | 0.947 | 0.670 | 0.991 |
| | Dec2 | ✓✓ | 0.947 | 0.742 | 1.000 |
| | Dec1 | ✓✓ | 0.947 | 0.761 | 0.999 |
| | FL | ✓✓ | 0.947 | 0.870 | 0.999 |
| FIL | Dec4 | × | 0.059 | 0.672 | 0.166 |
| | Dec3 | × | 0.000 | 0.596 | 0.149 |
| | Dec2 | ✓ | 0.882 | 0.633 | 0.261 |
| | Dec1 | ✓ | 1.000 | 0.681 | 0.785 |
| | FL | ✓✓ | 1.000 | 0.814 | 0.998 |
| TXTL | Dec4 | ✓ | 0.650 | 0.670 | 0.817 |
| | Dec3 | ✓✓ | 1.000 | 0.509 | 0.967 |
| | Dec2 | ✓✓ | 1.000 | 0.696 | 0.997 |
| | Dec1 | ✓ | 1.000 | 0.709 | 0.984 |
| | FL | ✓✓ | 1.000 | 0.797 | 0.999 |
| BD | Dec4 | ✓ | 0.950 | 0.583 | 0.686 |
| | Dec3 | ✓✓ | 0.800 | 0.623 | 0.928 |
| | Dec2 | ✓✓ | 1.000 | 0.671 | 0.949 |
| | Dec1 | ✓✓ | 1.000 | 0.746 | 0.983 |
| | FL | ✓✓ | 1.000 | 0.820 | 0.986 |
| SInit | Dec4 | × | 0.632 | 0.584 | 0.150 |
| | Dec3 | × | 0.158 | 0.550 | 0.078 |
| | Dec2 | ✓ | 0.842 | 0.593 | 0.175 |
| | Dec1 | ✓ | 0.895 | 0.693 | 0.903 |
| | FL | ✓✓ | 0.895 | 0.809 | 0.997 |
| BInit | Dec4 | × | 0.632 | 0.476 | 0.159 |
| | Dec3 | ✓ | 0.526 | 0.622 | 0.255 |
| | Dec2 | ✓ | 0.895 | 0.623 | 0.535 |
| | Dec1 | ✓✓ | 0.895 | 0.672 | 0.947 |
| | FL | ✓✓ | 0.895 | 0.775 | 0.995 |

| LRP | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| DEC | Dec4 | ✓ | 0.579 | 0.777 | 0.586 |
| | Dec3 | ✓✓ | 0.474 | 0.467 | 0.468 |
| | Dec2 | ✓✓ | 0.526 | 0.511 | 0.575 |
| | Dec1 | ✓ | 0.474 | 0.458 | 0.504 |
| | FL | ✓✓ | 0.368 | 0.994 | 0.996 |
| FIL | Dec4 | ✓✓ | 0.882 | 0.415 | 0.802 |
| | Dec3 | ✓✓ | 0.882 | 0.317 | 0.851 |
| | Dec2 | ✓✓ | 0.941 | 0.452 | 0.917 |
| | Dec1 | ✓✓ | 0.824 | 0.534 | 0.919 |
| | FL | ✓✓ | 0.353 | 0.961 | 0.975 |
| TXTL | Dec4 | ✓✓ | 0.700 | 0.349 | 0.915 |
| | Dec3 | ✓✓ | 0.950 | 0.371 | 0.892 |
| | Dec2 | ✓✓ | 0.750 | 0.545 | 0.995 |
| | Dec1 | ✓✓ | 0.450 | 0.346 | 0.993 |
| | FL | ✓✓ | 0.300 | 0.985 | 0.987 |
| BD | Dec4 | ✓ | 0.550 | 0.303 | 0.623 |
| | Dec3 | ✓ | 0.700 | 0.542 | 0.510 |
| | Dec2 | ✓✓ | 0.300 | 0.641 | 0.772 |
| | Dec1 | ✓✓ | 0.400 | 0.476 | 0.435 |
| | FL | ✓✓ | 0.050 | 0.952 | 0.967 |
| SInit | Dec4 | ✓✓ | 0.789 | 0.350 | 0.620 |
| | Dec3 | ✓✓ | 0.579 | 0.630 | 0.563 |
| | Dec2 | ✓✓ | 0.632 | 0.447 | 0.812 |
| | Dec1 | ✓✓ | 0.632 | 0.364 | 0.737 |
| | FL | ✓✓ | 0.526 | 0.986 | 0.980 |
| BInit | Dec4 | ✓✓ | 0.684 | 0.498 | 0.647 |
| | Dec3 | ✓✓ | 0.579 | 0.507 | 0.653 |
| | Dec2 | ✓✓ | 0.474 | 0.524 | 0.610 |
| | Dec1 | ✓✓ | 0.579 | 0.514 | 0.542 |
| | FL | ✓✓ | 0.316 | 0.883 | 0.978 |

| DeepLift | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| DEC | Dec4 | ✓ | 0.842 | 0.797 | 0.906 |
| | Dec3 | ✓✓ | 0.789 | 0.734 | 0.884 |
| | Dec2 | ✓✓ | 0.895 | 0.600 | 0.993 |
| | Dec1 | ✓✓ | 0.947 | 0.640 | 0.995 |
| | FL | ✓✓ | 0.947 | 0.748 | 0.996 |
| FIL | Dec4 | ✓✓ | 1.000 | 0.620 | 0.946 |
| | Dec3 | ✓✓ | 1.000 | 0.576 | 1.000 |
| | Dec2 | ✓✓ | 1.000 | 0.720 | 0.997 |
| | Dec1 | ✓✓ | 1.000 | 0.752 | 0.997 |
| | FL | ✓✓ | 1.000 | 0.836 | 0.999 |
| TXTL | Dec4 | ✓✓ | 1.000 | 0.670 | 0.913 |
| | Dec3 | ✓✓ | 1.000 | 0.477 | 0.999 |
| | Dec2 | ✓✓ | 1.000 | 0.651 | 0.998 |
| | Dec1 | ✓✓ | 1.000 | 0.761 | 0.999 |
| | FL | ✓✓ | 1.000 | 0.814 | 1.000 |
| BD | Dec4 | ✓ | 0.850 | 0.373 | 0.598 |
| | Dec3 | ✓✓ | 0.850 | 0.664 | 0.918 |
| | Dec2 | ✓✓ | 1.000 | 0.651 | 0.981 |
| | Dec1 | ✓✓ | 1.000 | 0.767 | 0.984 |
| | FL | ✓✓ | 1.000 | 0.802 | 0.990 |
| SInit | Dec4 | ✓✓ | 0.895 | 0.464 | 0.996 |
| | Dec3 | ✓✓ | 0.895 | 0.677 | 0.997 |
| | Dec2 | ✓✓ | 0.895 | 0.755 | 0.996 |
| | Dec1 | ✓✓ | 0.895 | 0.788 | 0.995 |
| | FL | ✓✓ | 0.895 | 0.833 | 0.997 |
| BInit | Dec4 | ✓✓ | 0.895 | 0.481 | 0.995 |
| | Dec3 | ✓✓ | 0.895 | 0.664 | 0.999 |
| | Dec2 | ✓✓ | 0.895 | 0.692 | 0.997 |
| | Dec1 | ✓✓ | 0.895 | 0.727 | 0.997 |
| | FL | ✓✓ | 0.895 | 0.747 | 0.999 |

| G*I | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| DEC | Dec4 | ✓ | 0.737 | 0.823 | 0.821 |
| | Dec3 | ✓✓ | 0.579 | 0.458 | 0.891 |
| | Dec2 | ✓✓ | 0.789 | 0.577 | 0.983 |
| | Dec1 | ✓✓ | 0.842 | 0.604 | 0.999 |
| | FL | ✓✓ | 0.947 | 0.833 | 0.999 |
| FIL | Dec4 | ✓✓ | 1.000 | 0.477 | 0.936 |
| | Dec3 | ✓✓ | 1.000 | 0.391 | 0.955 |
| | Dec2 | ✓✓ | 1.000 | 0.622 | 0.995 |
| | Dec1 | ✓✓ | 1.000 | 0.724 | 0.998 |
| | FL | ✓✓ | 1.000 | 0.792 | 0.999 |
| TXTL | Dec4 | ✓✓ | 0.800 | 0.511 | 0.981 |
| | Dec3 | ✓✓ | 1.000 | 0.437 | 0.907 |
| | Dec2 | ✓✓ | 1.000 | 0.573 | 0.998 |
| | Dec1 | ✓✓ | 1.000 | 0.747 | 0.997 |
| | FL | ✓✓ | 1.000 | 0.758 | 1.000 |
| BD | Dec4 | ✓ | 0.600 | 0.283 | 0.602 |
| | Dec3 | ✓✓ | 0.900 | 0.632 | 0.673 |
| | Dec2 | ✓✓ | 1.000 | 0.694 | 0.995 |
| | Dec1 | ✓✓ | 1.000 | 0.719 | 0.991 |
| | FL | ✓✓ | 1.000 | 0.744 | 0.991 |
| SInit | Dec4 | ✓✓ | 0.789 | 0.402 | 0.685 |
| | Dec3 | ✓✓ | 0.789 | 0.731 | 0.866 |
| | Dec2 | ✓✓ | 0.842 | 0.605 | 0.991 |
| | Dec1 | ✓✓ | 0.737 | 0.730 | 0.992 |
| | FL | ✓✓ | 0.895 | 0.770 | 0.998 |
| BInit | Dec4 | ✓✓ | 0.737 | 0.485 | 0.740 |
| | Dec3 | ✓✓ | 0.789 | 0.652 | 0.917 |
| | Dec2 | ✓✓ | 0.684 | 0.556 | 0.919 |
| | Dec1 | ✓✓ | 0.895 | 0.634 | 0.988 |
| | FL | ✓✓ | 0.895 | 0.685 | 0.999 |
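The PG and CH columns above follow the standard definitions: the pointing game [24] checks whether the attribution maximum falls inside the ground-truth region, and the content heatmap [23] measures the share of attribution mass inside that region. A simplified NumPy sketch (binary masks, positive attributions only; not the authors' exact implementation):

```python
import numpy as np

def pointing_game(attr, gt_mask):
    """1 if the highest-attribution pixel lies inside the ground-truth region, else 0."""
    idx = np.unravel_index(np.argmax(attr), attr.shape)
    return int(gt_mask[idx] > 0)

def content_heatmap(attr, gt_mask, eps=1e-8):
    """Fraction of positive attribution mass falling inside the ground-truth region."""
    pos = np.clip(attr, 0, None)
    return float(pos[gt_mask > 0].sum() / (pos.sum() + eps))

attr = np.array([[0.0, 0.1], [0.2, 0.7]])
mask = np.array([[0, 0], [1, 1]])
hit = pointing_game(attr, mask)   # maximum (0.7) lies inside the mask
ch = content_heatmap(attr, mask)  # 0.9 of the attribution mass is inside
```

In practice PG is averaged over all test images of a class, which yields the fractional values (e.g., 0.692 = 9/13 hits) seen in the tables.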
Appendix A.5. Saliency Overlay Experiment

Appendix A.6. Generalization Section
| GradCAM | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| road | Dec4 | × | 0.905 | 0.585 | 0.933 |
| | Dec3 | × | 0.905 | 0.725 | 0.944 |
| | Dec2 | ✓✓ | 0.905 | 0.700 | 0.943 |
| | Dec1 | ✓✓ | 0.905 | 0.740 | 0.947 |
| | FL | ✓✓ | 0.905 | 0.801 | 0.945 |
| sidewalk | Dec4 | × | 0.609 | 0.560 | 0.668 |
| | Dec3 | × | 0.739 | 0.643 | 0.756 |
| | Dec2 | ✓ | 0.783 | 0.650 | 0.816 |
| | Dec1 | ✓ | 0.696 | 0.685 | 0.802 |
| | FL | ✓ | 0.696 | 0.697 | 0.800 |
| person | Dec4 | × | 0.261 | 0.542 | 0.045 |
| | Dec3 | × | 0.435 | 0.505 | 0.191 |
| | Dec2 | × | 0.261 | 0.549 | 0.080 |
| | Dec1 | × | 0.304 | 0.482 | 0.176 |
| | FL | ✓ | 0.609 | 0.576 | 0.562 |
| car | Dec4 | × | 0.850 | 0.546 | 0.682 |
| | Dec3 | × | 0.850 | 0.535 | 0.709 |
| | Dec2 | ✓ | 0.900 | 0.594 | 0.724 |
| | Dec1 | ✓ | 0.900 | 0.573 | 0.854 |
| | FL | ✓✓ | 0.950 | 0.626 | 0.906 |
| bus | Dec4 | × | 0.500 | 0.576 | 0.500 |
| | Dec3 | × | 0.667 | 0.529 | 0.959 |
| | Dec2 | × | 0.667 | 0.627 | 0.910 |
| | Dec1 | × | 0.500 | 0.654 | 0.761 |
| | FL | ✓ | 0.667 | 0.684 | 0.999 |

| LRP | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| road | Dec4 | × | 0.857 | 0.491 | 0.935 |
| | Dec3 | × | 0.905 | 0.621 | 0.944 |
| | Dec2 | ✓✓ | 0.810 | 0.626 | 0.943 |
| | Dec1 | ✓✓ | 0.762 | 0.782 | 0.946 |
| | FL | ✓✓ | 0.333 | 0.971 | 0.937 |
| sidewalk | Dec4 | × | 0.696 | 0.392 | 0.792 |
| | Dec3 | × | 0.783 | 0.481 | 0.848 |
| | Dec2 | ✓ | 0.391 | 0.443 | 0.855 |
| | Dec1 | ✓ | 0.391 | 0.479 | 0.862 |
| | FL | ✓ | 0.217 | 0.749 | 0.831 |
| person | Dec4 | × | 0.435 | 0.509 | 0.359 |
| | Dec3 | × | 0.609 | 0.410 | 0.463 |
| | Dec2 | ✓ | 0.522 | 0.379 | 0.597 |
| | Dec1 | ✓ | 0.348 | 0.358 | 0.624 |
| | FL | ✓ | 0.261 | 0.571 | 0.601 |
| car | Dec4 | × | 0.700 | 0.403 | 0.798 |
| | Dec3 | × | 0.700 | 0.418 | 0.815 |
| | Dec2 | ✓ | 0.350 | 0.415 | 0.782 |
| | Dec1 | ✓ | 0.250 | 0.407 | 0.766 |
| | FL | ✓✓ | 0.300 | 0.863 | 0.843 |
| bus | Dec4 | × | 0.667 | 0.414 | 0.933 |
| | Dec3 | × | 0.667 | 0.308 | 0.997 |
| | Dec2 | ✓ | 0.500 | 0.356 | 0.992 |
| | Dec1 | ✓ | 0.667 | 0.218 | 0.942 |
| | FL | ✓ | 0.333 | 0.631 | 0.985 |

| DeepLift | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| road | Dec4 | × | 0.905 | 0.340 | 0.893 |
| | Dec3 | × | 0.905 | 0.361 | 0.902 |
| | Dec2 | ✓ | 0.905 | 0.374 | 0.927 |
| | Dec1 | ✓ | 0.810 | 0.431 | 0.928 |
| | FL | ✓✓ | 0.810 | 0.472 | 0.931 |
| sidewalk | Dec4 | × | 0.783 | 0.329 | 0.809 |
| | Dec3 | × | 0.783 | 0.404 | 0.860 |
| | Dec2 | ✓ | 0.783 | 0.406 | 0.880 |
| | Dec1 | ✓ | 0.739 | 0.457 | 0.885 |
| | FL | ✓ | 0.783 | 0.468 | 0.879 |
| person | Dec4 | ✓ | 0.348 | 0.463 | 0.321 |
| | Dec3 | × | 0.565 | 0.349 | 0.563 |
| | Dec2 | ✓ | 0.609 | 0.297 | 0.647 |
| | Dec1 | ✓ | 0.565 | 0.343 | 0.648 |
| | FL | ✓ | 0.565 | 0.389 | 0.618 |
| car | Dec4 | × | 0.850 | 0.357 | 0.839 |
| | Dec3 | × | 0.900 | 0.391 | 0.909 |
| | Dec2 | ✓✓ | 0.950 | 0.495 | 0.927 |
| | Dec1 | ✓✓ | 0.950 | 0.563 | 0.922 |
| | FL | ✓✓ | 0.950 | 0.593 | 0.925 |
| bus | Dec4 | × | 0.667 | 0.394 | 0.980 |
| | Dec3 | × | 0.667 | 0.318 | 0.998 |
| | Dec2 | ✓ | 0.667 | 0.427 | 0.998 |
| | Dec1 | ✓ | 0.667 | 0.473 | 0.996 |
| | FL | ✓ | 0.667 | 0.494 | 0.998 |

| G*I | | | | | |
|---|---|---|---|---|---|
| Class | Layer | HA | PG | CH | AC |
| road | Dec4 | × | 0.905 | 0.498 | 0.938 |
| | Dec3 | × | 0.905 | 0.604 | 0.944 |
| | Dec2 | ✓ | 0.905 | 0.609 | 0.944 |
| | Dec1 | ✓✓ | 0.905 | 0.660 | 0.946 |
| | FL | ✓✓ | 0.905 | 0.775 | 0.947 |
| sidewalk | Dec4 | × | 0.696 | 0.364 | 0.799 |
| | Dec3 | × | 0.783 | 0.446 | 0.865 |
| | Dec2 | ✓ | 0.739 | 0.420 | 0.886 |
| | Dec1 | ✓ | 0.739 | 0.477 | 0.890 |
| | FL | ✓ | 0.696 | 0.521 | 0.878 |
| person | Dec4 | × | 0.391 | 0.502 | 0.357 |
| | Dec3 | × | 0.609 | 0.401 | 0.477 |
| | Dec2 | ✓ | 0.522 | 0.352 | 0.605 |
| | Dec1 | ✓ | 0.565 | 0.352 | 0.652 |
| | FL | ✓ | 0.609 | 0.333 | 0.646 |
| car | Dec4 | × | 0.900 | 0.371 | 0.829 |
| | Dec3 | × | 0.900 | 0.409 | 0.904 |
| | Dec2 | ✓✓ | 0.850 | 0.486 | 0.921 |
| | Dec1 | ✓✓ | 0.850 | 0.505 | 0.921 |
| | FL | ✓✓ | 0.950 | 0.559 | 0.933 |
| bus | Dec4 | × | 0.667 | 0.414 | 0.920 |
| | Dec3 | × | 0.667 | 0.278 | 0.954 |
| | Dec2 | ✓ | 0.667 | 0.357 | 0.998 |
| | Dec1 | ✓ | 0.667 | 0.456 | 0.992 |
| | FL | ✓ | 0.667 | 0.436 | 0.998 |

References
- Baird, H.S.; Govindaraju, V.; Lopresti, D.P. Document analysis systems for digital libraries: Challenges and opportunities. In Proceedings of the Document Analysis Systems VI: 6th International Workshop, DAS 2004, Florence, Italy, 8–10 September 2004; Proceedings 6. Springer: Berlin/Heidelberg, Germany, 2004; pp. 1–16. [Google Scholar]
- Lombardi, F.; Marinai, S. Deep learning for historical document analysis and recognition—A survey. J. Imaging 2020, 6, 110. [Google Scholar] [CrossRef] [PubMed]
- Ma, W.; Zhang, H.; Jin, L.; Wu, S.; Wang, J.; Wang, Y. Joint layout analysis, character detection and recognition for historical document digitization. In Proceedings of the 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 31–36. [Google Scholar]
- Wahyudi, M.I.; Fauzi, I.; Atmojo, D. Robust Image Watermarking Based on Hybrid IWT-DCT-SVD. IJACI Int. J. Adv. Comput. Inform. 2025, 1, 89–98. [Google Scholar] [CrossRef]
- Kusuma, M.R.; Panggabean, S. Robust Digital Image Watermarking Using DWT, Hessenberg, and SVD for Copyright Protection. IJACI Int. J. Adv. Comput. Inform. 2026, 2, 41–52. [Google Scholar]
- Amrullah, A.; Aminuddin, A. Tamper Localization and Content Restoration in Fragile Image Watermarking: A Review. IJACI Int. J. Adv. Comput. Inform. 2026, 2, 62–74. [Google Scholar] [CrossRef]
- Chen, K.; Seuret, M.; Hennebert, J.; Ingold, R. Convolutional neural networks for page segmentation of historical document images. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–12 November 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 965–970. [Google Scholar]
- Oliveira, S.A.; Seguin, B.; Kaplan, F. dhSegment: A generic deep-learning approach for document segmentation. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Niagara Falls, NY, USA, 5–8 August 2018; pp. 7–12. [Google Scholar]
- Grüning, T.; Leifert, G.; Strauß, T.; Michael, J.; Labahn, R. A two-stage method for text line detection in historical documents. Int. J. Doc. Anal. Recognit. 2019, 22, 285–302. [Google Scholar] [CrossRef]
- Renton, G.; Soullard, Y.; Chatelain, C.; Adam, S.; Kermorvant, C.; Paquet, T. Fully convolutional network with dilated convolutions for handwritten text line segmentation. Int. J. Doc. Anal. Recognit. (IJDAR) 2018, 21, 177–186. [Google Scholar] [CrossRef]
- Rahal, N.; Vögtlin, L.; Ingold, R. Historical document image analysis using controlled data for pre-training. Int. J. Doc. Anal. Recognit. 2023, 26, 241–254. [Google Scholar] [CrossRef]
- Rahal, N.; Vögtlin, L.; Ingold, R. Layout analysis of historical document images using a light fully convolutional network. In Proceedings of the International Conference on Document Analysis and Recognition, San José, CA, USA, 21–26 August 2023; pp. 325–341. [Google Scholar]
- Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
- Cheng, Z.; Wu, Y.; Li, Y.; Cai, L.; Ihnaini, B. A Comprehensive Review of Explainable Artificial Intelligence (XAI) in Computer Vision. Sensors 2025, 25, 4166. [Google Scholar] [CrossRef]
- Sujatha Ravindran, A.; Contreras-Vidal, J. An empirical comparison of deep learning explainability approaches for EEG using simulated ground truth. Sci. Rep. 2023, 13, 17709. [Google Scholar] [CrossRef]
- Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 2015, 10, e0130140. [Google Scholar]
- Shrikumar, A.; Greenside, P.; Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3145–3153. [Google Scholar]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via Gradient-based localization. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
- Simistira, F.; Seuret, M.; Eichenberger, N.; Garz, A.; Liwicki, M.; Ingold, R. Diva-hisdb: A precisely annotated large dataset of challenging medieval manuscripts. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Shenzhen, China, 23–26 October 2016; pp. 471–476. [Google Scholar]
- Rahal, N.; Vögtlin, L.; Ingold, R. Approximate ground truth generation for semantic labeling of historical documents with minimal human effort. Int. J. Doc. Anal. Recognit. 2024, 27, 335–347. [Google Scholar] [CrossRef]
- Brini, I.; Mehri, M.; Ingold, R.; Essoukri Ben Amara, N. An End-to-End Framework for Evaluating Explainable Deep Models: Application to Historical Document Image Segmentation. In Proceedings of the Computational Collective Intelligence, Hammamet, Tunisia, 28–30 September 2022; pp. 106–119. [Google Scholar]
- Yeh, C.K.; Hsieh, C.Y.; Suggala, A.; Inouye, D.I.; Ravikumar, P.K. On the (in)fidelity and sensitivity of explanations. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef]
- Pillai, V.; Pirsiavash, H. Explainable models with consistent interpretations. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2431–2439. [Google Scholar]
- Zhang, J.; Bargal, S.A.; Lin, Z.; Brandt, J.; Shen, X.; Sclaroff, S. Top-down neural attention by excitation backprop. Int. J. Comput. Vis. 2018, 126, 1084–1102. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Boillet, M.; Kermorvant, C.; Paquet, T. Multiple document datasets pre-training improves text line detection with deep neural networks. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2134–2141. [Google Scholar]
- Da, C.; Luo, C.; Zheng, Q.; Yao, C. Vision grid transformer for document layout analysis. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 19462–19472. [Google Scholar]
- Binmakhashen, G.M.; Mahmoud, S.A. Document layout analysis: A comprehensive survey. ACM Comput. Surv. 2019, 52, 109. [Google Scholar] [CrossRef]
- Hassija, V.; Chamola, V.; Mahapatra, A.; Singal, A.; Goel, D.; Huang, K.; Scardapane, S.; Spinelli, I.; Mahmud, M.; Hussain, A. Interpreting black-box models: A review on explainable artificial intelligence. Cogn. Comput. 2024, 16, 45–74. [Google Scholar]
- Gipiškis, R.; Tsai, C.W.; Kurasova, O. Explainable AI (XAI) in image segmentation in medicine, industry, and beyond: A survey. ICT Express 2024, 10, 1331–1354. [Google Scholar] [CrossRef]
- Vinogradova, K.; Dibrov, A.; Myers, G. Towards interpretable semantic segmentation via gradient-weighted class activation mapping (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13943–13944. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Hasany, S.N.; Mériaudeau, F.; Petitjean, C. MiSuRe is all you need to explain your image segmentation. arXiv 2024, arXiv:2406.12173. [Google Scholar]
- Riva, M.; Gori, P.; Yger, F.; Bloch, I. Is the U-NET Directional-Relationship Aware? In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3391–3395. [Google Scholar]
- Iglesias, J.; Styner, M.; Langerak, T.; Landman, B.; Xu, Z.; Klein, A. MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proceedings of the MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, Munich, Germany, 5–9 October 2015. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Poppi, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Revisiting the Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 20–25 June 2021; pp. 2299–2304. [Google Scholar]
- Shrikumar, A.; Greenside, P.; Shcherbina, A.; Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv 2016, arXiv:1605.01713. [Google Scholar]
- Monnier, T.; Aubry, M. docExtractor: An off-the-shelf historical document element extraction. In Proceedings of the International Conference on Frontiers in Handwriting Recognition, Dortmund, Germany, 8–10 September 2020; pp. 91–96. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Brini, I.; Rahal, N.; Mehri, M.; Ingold, R.; Essoukri Ben Amara, N. DocXAI-Pruner: Optimizing Semantic Segmentation Models for Document Layout Analysis Via Explainable AI-Driven Pruning. In Proceedings of the Horizons of AI: Ethical Considerations and Interdisciplinary Engagements; Springer: Singapore, 2025; pp. 333–348. [Google Scholar]
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
- Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
- Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M.A. Striving for Simplicity: The All Convolutional Net. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774. [Google Scholar]
- Petsiuk, V.; Das, A.; Saenko, K. RISE: Randomized Input Sampling for Explanation of Black-box Models. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018; p. 151. [Google Scholar]
- Lapuschkin, S.; Wäldchen, S.; Binder, A.; Montavon, G.; Samek, W.; Müller, K.R. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 2019, 10, 1096. [Google Scholar] [CrossRef]
- Dreyer, M.; Achtibat, R.; Wiegand, T.; Samek, W.; Lapuschkin, S. Revealing Hidden Context Bias in Segmentation and Object Detection Through Concept-Specific Explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 17–24 June 2023; pp. 3829–3839. [Google Scholar]
- Wang, C.; Liu, Y.; Chen, Y.; Liu, F.; Tian, Y.; McCarthy, D.; Frazer, H.; Carneiro, G. Learning Support and Trivial Prototypes for Interpretable Image Classification. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2062–2072. [Google Scholar]
- Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man, Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- Espinosa Zarlenga, M.; Barbiero, P.; Ciravegna, G.; Marra, G.; Giannini, F.; Diligenti, M.; Shams, Z.; Precioso, F.; Melacci, S.; Weller, A.; et al. Concept embedding models: Beyond the accuracy-explainability trade-off. Adv. Neural Inf. Process. Syst. 2022, 35, 21400–21413. [Google Scholar]
- Crabbé, J.; van der Schaar, M. Evaluating the robustness of interpretability methods through explanation invariance and equivariance. Adv. Neural Inf. Process. Syst. 2023, 36, 71393–71429. [Google Scholar]
- Achtibat, R.; Dreyer, M.; Eisenbraun, I.; Bosse, S.; Wiegand, T.; Samek, W.; Lapuschkin, S. From attribution maps to human-understandable explanations through concept relevance propagation. Nat. Mach. Intell. 2023, 5, 1006–1019. [Google Scholar] [CrossRef]
- Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711. [Google Scholar] [CrossRef] [PubMed]
- Ayoob, M.; Nettasinghe, O.; Sylvester, V.; Bowala, H.; Mohideen, H. Peering into the Heart: A Comprehensive Exploration of Semantic Segmentation and Explainable AI on the MnMs-2 Cardiac MRI Dataset. Appl. Comput. Syst. 2025, 30, 12–20. [Google Scholar] [CrossRef]
- Hassan, M.; Fateh, A.A.; Lin, J.; Zhuang, Y.; Lin, G.; Xiong, H.; You, Z.; Qin, P.; Zeng, H. Unfolding Explainable AI for Brain Tumor Segmentation. Neurocomputing 2024, 599, 128058. [Google Scholar] [CrossRef]
- Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for pytorch. arXiv 2020, arXiv:2009.07896. [Google Scholar] [CrossRef]
- Montavon, G.; Binder, A.; Lapuschkin, S.; Samek, W.; Müller, K.R. Layer-wise relevance propagation: An overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer: Cham, Switzerland, 2019; pp. 193–209. [Google Scholar]

| Method | Type | Input Layer | Intermediate Layer |
|---|---|---|---|
| Grad-CAM | Gradient-based | ✓ | ✓ |
| G × I | Gradient-based | ✓ | ✓ |
| LRP | Decomposition-based | ✓ | ✓ |
| DeepLIFT | Decomposition-based | ✓ | ✓ |
| RISE | Perturbation-based | ✓ | × |
| MiSuRe | Perturbation-based | ✓ | × |
| Layer | Spatial Dimensions | |
|---|---|---|
| | CB55 | UTP-110 |
| Dec4 | | |
| Dec3 | | |
| Dec2 | | |
| Dec1 | | |
| FL | | |
| Patch Size | Confidence Drop | IoU Drop | DICE Drop | Pixel Change Ratio |
|---|---|---|---|---|
| 2 × 2 | 0.0000 ± 0.0001 | 0.0036 ± 0.0084 | 0.0028 ± 0.0085 | 0.0006 ± 0.0008 |
| 4 × 4 | 0.0000 ± 0.0001 | 0.0077 ± 0.0136 | 0.0064 ± 0.0144 | 0.0013 ± 0.0009 |
| 8 × 8 | 0.0001 ± 0.0001 | 0.0138 ± 0.0138 | 0.0107 ± 0.0134 | 0.0023 ± 0.0012 |
| 16 × 16 | 0.0001 ± 0.0001 | 0.0284 ± 0.0261 | 0.0240 ± 0.0290 | 0.0043 ± 0.0017 |
| 32 × 32 | 0.0002 ± 0.0002 | 0.0540 ± 0.0317 | 0.0445 ± 0.0345 | 0.0088 ± 0.0022 |
| 64 × 64 | 0.0001 ± 0.0003 | 0.1095 ± 0.0364 | 0.0944 ± 0.0436 | 0.0174 ± 0.0033 |
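The table above reports how segmentation quality degrades as occluded patches grow. The procedure can be sketched as follows; the thresholding "segmenter" is a stand-in for the trained model, and the zero-fill and non-overlapping grid scan are illustrative assumptions:

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """DICE coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def occlude_patch(img, top, left, size, fill=0.0):
    """Return a copy of img with a size x size patch set to fill."""
    out = img.copy()
    out[top:top + size, left:left + size] = fill
    return out

def mean_dice_drop(img, gt, segment, size):
    """Average DICE drop over a grid of occluded size x size patches."""
    base = dice(segment(img), gt)
    drops = []
    for top in range(0, img.shape[0], size):
        for left in range(0, img.shape[1], size):
            pred = segment(occlude_patch(img, top, left, size))
            drops.append(base - dice(pred, gt))
    return float(np.mean(drops))

# Stand-in segmenter: foreground = pixels above 0.5
segment = lambda x: x > 0.5
img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0
gt = img > 0.5
drop = mean_dice_drop(img, gt, segment, 4)
```

Larger patches remove more of the evidence the model relies on, which is why the reported IoU and DICE drops grow monotonically with patch size.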
| Parameter | CB55/UTP-110 | Cityscapes |
|---|---|---|
| Metric | DICE | DICE |
| Patch size | | |
| Number of masks | 500 | 500 |
| Batch size | 1 | 8 |
| Probability threshold | | |
| Total variation weight | | |
| Score mode | dice | dice |
| Sparsity weight | | |
| Learning rate | | |
| Foreground weight | 2 | 2 |
| Background weight | 1 | 1 |
| Temperature | | |
| Mask size | (CB55), (UTP) | |
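The RISE parameters above (500 masks, DICE scoring) correspond to a random-masking procedure along these lines. This is a simplified sketch: it uses nearest-neighbour grid upsampling instead of the bilinear upsampling with random shifts of the original RISE, and a stand-in thresholding segmenter rather than the trained model:

```python
import numpy as np

def rise_saliency(img, gt, segment, n_masks=500, cell=4, p=0.5, seed=0):
    """RISE-style saliency for segmentation: weight each random binary mask
    by the DICE score of the prediction on the masked input."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    sal = np.zeros((h, w))
    total = 0.0
    for _ in range(n_masks):
        grid = rng.random((cell, cell)) < p
        mask = np.kron(grid, np.ones((h // cell, w // cell)))  # upsample cell grid
        pred = segment(img * mask)
        inter = np.logical_and(pred, gt).sum()
        score = 2 * inter / (pred.sum() + gt.sum() + 1e-8)     # DICE of masked run
        sal += score * mask
        total += score
    return sal / (total + 1e-8)

segment = lambda x: x > 0.5
img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0
gt = img > 0.5
sal = rise_saliency(img, gt, segment)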
| Dataset | Model | Accuracy | Precision | Recall | F1-Score | IoU |
|---|---|---|---|---|---|---|
| CB55 | L-U-NET | 0.914 | 0.926 | 0.914 | 0.920 | 0.855 |
| | S-U-NET | 0.935 | 0.925 | 0.935 | 0.929 | 0.870 |
| UTP-110 | L-U-NET | 0.951 | 0.912 | 0.951 | 0.930 | 0.870 |
| | S-U-NET | 0.946 | 0.935 | 0.946 | 0.940 | 0.889 |
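The per-class scores in this table follow the standard pixel-wise definitions. A minimal binary-mask sketch (the published figures are averaged over multi-class outputs, which this toy example does not reproduce):

```python
import numpy as np

def seg_scores(pred, gt):
    """Pixel-wise accuracy, precision, recall, F1 and IoU for a binary mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    acc = (tp + tn) / pred.size
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-8)
    iou = tp / max(tp + fp + fn, 1)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "iou": iou}

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
scores = seg_scores(pred, gt)
```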
| Model | #Params | Dataset | Inference Time | Training Time |
|---|---|---|---|---|
| L-U-NET | 17K | CB55 | 4.82 s | 5 h |
| | | UTP-110 | 3.33 s | 2 h |
| S-U-NET | 31M | CB55 | 4.81 s | 10 h |
| | | UTP-110 | 3.53 s | 3 h |
| Dataset | Model | Method | Attribution Maps | | and | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | Time (s) | Memory (GB) | Time (s) | Memory (GB) | Time (s) | Memory (GB) | Time (s) | Memory (GB) |
| CB55 | L-U-NET | GradCAM | 22 | 2.2 | 480 | 5.9 | 300 | 7.5 | 35 | 3.2 |
| | | LRP | 51 | 5.5 | 540 | 6.5 | 600 | 19 | 104 | 6.6 |
| | | DeepLift | 36 | 11.4 | 480 | 8 | 480 | 26.5 | 50 | 8.8 |
| | | G*I | 22 | 2.2 | 480 | 5.6 | 300 | 7.5 | 36 | 3.2 |
| | S-U-NET | GradCAM | 101 | 7.3 | 2040 | 21.5 | 1200 | 27.3 | 129 | 9.7 |
| | | LRP | 235 | 23 | 2400 | 27.6 | 2760 | 29.9 | 285 | 24 |
| | | DeepLift | ~5400 | 36.6 | ~50,400 | 32 | ~56,300 | 98.6 | ~5400 | 30 |
| | | G*I | 100 | 7.3 | 2040 | 21.5 | 1260 | 27.3 | 159 | 9.7 |
| UTP-110 | L-U-NET | GradCAM | 20 | 1.2 | 900 | 3.5 | 240 | 4.3 | 33 | 2.2 |
| | | LRP | 43 | 3 | 1020 | 3.8 | 540 | 9.9 | 57 | 3.8 |
| | | DeepLift | 32 | 3.5 | 900 | 4.4 | 420 | 13.3 | 44 | 5 |
| | | G*I | 20 | 1.2 | 900 | 3.5 | 240 | 4.4 | 32 | 2.2 |
| | S-U-NET | GradCAM | 85 | 3.8 | ~3600 | 11 | ~1080 | 13.8 | 157 | 5.4 |
| | | LRP | 196 | 11 | ~3600 | 13.9 | ~2280 | 16.5 | 266 | 12.1 |
| | | DeepLift | 159 | 23.3 | ~93,600 | 15.9 | ~48,500 | 47.7 | 172 | 15 |
| | | G*I | 85 | 3.8 | ~3600 | 11 | ~1080 | 13.8 | 160 | 5.4 |
| Dataset | Method | CH ↑ | PG ↑ | ACS ↑ |
|---|---|---|---|---|
| CB55 (Text) | DeepLIFT | 0.415 | 1.000 | 0.630 |
| CB55 (Text) | Grad-CAM | 0.569 | 1.000 | 0.738 |
| CB55 (Text) | Input × Gradient | 0.554 | 0.000 | 0.391 |
| CB55 (Text) | LRP | 0.637 | 0.000 | 0.025 |
| CB55 (Text) | MiSuRe | 0.692 | 0.000 | 0.544 |
| CB55 (Text) | RISE | 0.558 | 0.000 | 0.181 |
| UTP-110 (Text) | DeepLIFT | 0.451 | 1.000 | 0.832 |
| UTP-110 (Text) | Grad-CAM | 0.553 | 1.000 | 0.735 |
| UTP-110 (Text) | Input × Gradient | 0.421 | 0.000 | 0.588 |
| UTP-110 (Text) | LRP | 0.306 | 0.000 | 0.647 |
| UTP-110 (Text) | MiSuRe | 0.807 | 0.000 | 0.601 |
| UTP-110 (Text) | RISE | 0.448 | 1.000 | 0.249 |
| Cityscapes (Car) | DeepLIFT | 0.474 | 0.000 | 0.627 |
| Cityscapes (Car) | Grad-CAM | 0.661 | 1.000 | 0.793 |
| Cityscapes (Car) | Input × Gradient | 0.637 | 0.000 | 0.748 |
| Cityscapes (Car) | LRP | 0.601 | 1.000 | 0.678 |
| Cityscapes (Car) | MiSuRe | 0.480 | 0.000 | 0.910 |
| Cityscapes (Car) | RISE | 0.557 | 1.000 | 0.721 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Brini, I.; Rahal, N.; Mehri, M.; Ingold, R.; Essoukri Ben Amara, N. SegClarity: An Attribution-Based XAI Workflow for Evaluating Historical Document Layout Models. J. Imaging 2025, 11, 424. https://doi.org/10.3390/jimaging11120424

