Diagnosing Shortcut Learning in CNN-Based Photovoltaic Fault Recognition from RGB Images: A Multi-Method Explainability Audit

Diaconu, Bogdan Marian

doi:10.3390/ai7030094

by

Bogdan Marian Diaconu

Faculty of Engineering, “Constantin Brancusi” University of Targu Jiu, Calea Eroilor 30, 210135 Targu Jiu, Romania

AI 2026, 7(3), 94; https://doi.org/10.3390/ai7030094

Submission received: 14 January 2026 / Revised: 18 February 2026 / Accepted: 25 February 2026 / Published: 4 March 2026

(This article belongs to the Topic AI and Data-Driven Advancements in Industry 4.0, 2nd Edition)

Abstract

Convolutional neural networks (CNNs) can achieve high accuracy in photovoltaic (PV) fault recognition from RGB imagery, yet their decisions may rely on shortcut cues induced by heterogeneous backgrounds, viewpoints, and class imbalance. This work presents a multi-method explainability audit on the Kaggle PV Panel Defect Dataset (six classes), comparing five architectures (Baseline CNN, VGG16, ResNet50, InceptionV3, EfficientNetB0). Explanations are obtained with LIME superpixel surrogates (reported together with kernel-weighted surrogate fidelity), occlusion sensitivity (quantified via IoU@Top10% against consistent proxy masks, Shannon entropy, and Hoyer sparsity), and Integrated Gradients evaluated by deletion–insertion faithfulness and a Faithfulness Gap. While ResNet50 yields the best predictive performance, EfficientNetB0 shows the most consistent faithfulness evidence and stable panel-centered attributions. The analysis highlights class-dependent vulnerability to context cues, especially for the Clean and damaged classes, and supports using quantitative explainability diagnostics during model selection and dataset curation to mitigate shortcuts in vision-based PV monitoring.

Keywords:

explainable AI; shortcut learning; transfer learning; photovoltaic panel; fault detection

1. Introduction

Explainable artificial intelligence (XAI) has gained increasing attention as deep learning models achieve remarkable predictive accuracy while remaining largely opaque. Despite the proliferation of explainability techniques, no universal taxonomy or standardized classification currently exists to encompass all XAI methods. The literature reveals a wide range of frameworks that differ substantially in how explanations are categorized and applied. Cação et al. [1] proposed a unified classification integrating practical applicability and industrial relevance, while Tanzib Hosain et al. [2] highlighted the growing deployment of XAI in domains such as healthcare, finance, autonomous vehicles, and energy management. Fault detection and diagnosis in renewable energy, particularly photovoltaics (PV), has become increasingly important as PV installations exhibit heterogeneous fault signatures, including cracks, discoloration, hotspots, and partial shading, which are difficult to detect through electrical parameters alone. Machine learning and computer vision techniques have therefore emerged as promising tools for automated visual fault analysis. Awedat et al. [3] enhanced the U-Net architecture by incorporating residual blocks, atrous spatial pyramid pooling (ASPP), and attention mechanisms, significantly reducing false positives in segmentation-based PV fault detection. Sairam et al. [4] proposed an explainable three-component diagnostic framework combining a physical irradiance model, XGBoost classification, and XAI-based interpretability for each fault instance. Rico Espinosa et al. [5] introduced a CNN-based two-stage pipeline coupling semantic segmentation for panel localization with a classification network distinguishing breakage, shadows, dust, and no-fault conditions. Despite the small dataset, their method achieved reliable detection with approximately 70% accuracy, illustrating the feasibility of vision-based PV monitoring. 
Performance degradation in PV modules arises from both intrinsic faults—such as delamination, cell cracks, or interconnection failures—and extrinsic soiling, including dust, bird droppings, snow, and industrial particulates. These factors reduce irradiance capture, induce thermal gradients, and accelerate local degradation. Traditional inspection methods (infrared thermography, electroluminescence, I-V tracing) remain accurate but are expensive and not scalable for large arrays. Vision-based machine learning offers a low-cost and scalable alternative capable of identifying both fault and soiling patterns in RGB imagery. Wan et al. [6] provided a comprehensive review of dust deposition mechanisms and monitoring approaches, covering both sensor-based and AI-driven systems. Restrepo-Cuestas et al. [7] presented an experimental dataset combining electrical parameters and thermographic imaging under various fault conditions, demonstrating significant power losses due to cracking and shading. From the perspective of real-time fault recognition and on-device feasibility, Ling et al. [8] addressed recognition and real-time limitations in intelligent PV cleaning robots by improving YOLOv9t with three major innovations: integration of AOD-Net for dehazing, Spatial–Depth Conversion Convolution (SPD–Conv) to reduce computational cost, and an Inverted Residual Mobile Block–Efficient Multi-Scale Attention (iRMB–EMA) mechanism to improve robustness. Their approach increased mAP by 5.83% and reduced model size by 18.21% relative to baselines. Collectively, these studies confirm the potential of deep learning for PV fault analysis but also underscore a key limitation, which is the opacity of CNNs and their reliance on dataset-specific context rather than intrinsic fault cues.

Quantitative evaluation of explanations remains non-trivial and is increasingly recognized as necessary beyond visual saliency inspection. Systematic reviews highlight that a substantial fraction of XAI papers still rely on qualitative evidence and advocate broader adoption of quantitative criteria and test protocols (Nauta et al. [9]). In particular, insertion–deletion measures (used here to summarize IG faithfulness) have been critically analyzed with respect to their assumptions and sensitivity to perturbation design (Gomez et al. [10]), and randomized sanity checks have shown that visually plausible saliency maps may be insensitive to model parameters if not properly validated (Adebayo et al. [11]). Recent benchmarking efforts further support the need for standardized metric suites when comparing attribution methods across models and datasets (Li et al. [12]).

1.1. Objectives of the Study

This study aims to systematically evaluate the interpretability and reliability of CNNs in the classification of PV panel faults through complementary explainability techniques. The specific objectives are: (i) to compare the explanatory behavior of multiple CNN architectures (Baseline CNN, ResNet50, InceptionV3, EfficientNetB0, and VGG16) across diverse PV fault classes; (ii) to integrate three complementary XAI approaches, namely LIME (surrogate modeling), occlusion sensitivity (perturbation-based causality), and Integrated Gradients (gradient-based attribution), to obtain a multifaceted understanding of model reasoning; and (iii) to identify cases of spurious or context-driven correlations that inflate model accuracy but compromise robustness and generalization.

Through these objectives, the study bridges the methodological gap between conventional performance metrics and interpretability-driven evaluation, advancing the deployment of explainable computer vision in renewable energy systems.

1.2. Novelty and Contributions

This work contributes:

(1): A multi-method explainability audit for PV fault recognition combining surrogate-based (LIME), perturbation-based (occlusion), and gradient-based (IG) explanations in a unified protocol;
(2): Quantitative reliability reporting for explanations, including kernel-weighted LIME surrogate fidelity, occlusion-based localization and concentration metrics (IoU@Top10%, entropy, Hoyer sparsity), and IG deletion–insertion faithfulness with a Faithfulness Gap;
(3): A class-level performance–faithfulness coupling analysis to highlight categories prone to context-driven shortcuts despite high accuracy;
(4): Practical guidance for dataset curation and model selection in vision-based PV monitoring.

This use of “audit” does not imply a standards-based compliance procedure or universal pass/fail thresholds; rather, it provides a transparent set of checks and risk signals that help determine whether high predictive accuracy is plausibly grounded in panel-intrinsic evidence or may be driven by contextual correlations.

2. Materials and Methodology

2.1. Dataset and Preprocessing

The experiments were conducted on the publicly available Kaggle dataset “PV Panel Defect Dataset” [13], which contains six classes: Clean, Dusty, Bird-drop, Electrical damage, Physical damage, and Snow-covered. A subset of 875 images was selected in this work, with a marked imbalance across classes, as shown in Figure 1. In particular, the Physical damage class contains only 70 images, while Dusty and Clean exceed 190 images each. Such imbalance can bias models towards majority classes and motivates the need for an explainability analysis that goes beyond conventional accuracy metrics. The dataset is highly heterogeneous. Representative samples from each class are presented in Figure 2 (resized to 224 × 224) to illustrate the heterogeneity in viewpoint, scale, and background context. Images were collected from diverse online sources, resulting in large variations in resolution (ranging from ~100 × 100 to >1000 × 1000 pixels), aspect ratios, lighting conditions, and viewing angles. Some samples include cluttered backgrounds (soil patches, vegetation, clear sky), while others focus closely on the photovoltaic module surface. This heterogeneity introduces noise and increases the likelihood that deep networks will learn associative context features (e.g., soil patches as indicators of dust) rather than intrinsic object features. To ensure comparability, all images were resized to a fixed input size of 224 × 224 pixels (299 × 299 in the case of InceptionV3) and converted to RGB color space. Pixel intensities were normalized to the [0, 1] range by scaling each channel by 1/255. No additional color space conversions (e.g., HSV, grayscale) were employed in the baseline experiments to remain consistent with the pretrained models used (EfficientNetB0, InceptionV3, ResNet50, and VGG16), which expect RGB inputs.
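The resizing and scaling step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the nearest-neighbour `resize_nearest` helper is a hypothetical stand-in for whatever library resize was actually used, and the dictionary keys are illustrative names. Only the input sizes and the 1/255 scaling come from the text.

```python
import numpy as np

# Input side length per architecture, as stated in the text.
INPUT_SIZE = {
    "baseline_cnn": 224, "vgg16": 224, "resnet50": 224,
    "efficientnetb0": 224, "inceptionv3": 299,
}

def resize_nearest(img, size):
    """Nearest-neighbour resize to (size, size); illustrative stand-in only."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows][:, cols]

def preprocess(img_uint8, arch):
    """Resize an RGB uint8 image and scale each channel by 1/255 to [0, 1]."""
    size = INPUT_SIZE[arch]
    resized = resize_nearest(img_uint8, size)
    return resized.astype(np.float32) / 255.0
```

For example, `preprocess(img, "inceptionv3")` returns a 299 × 299 × 3 float array, while every other architecture key yields 224 × 224 × 3.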

For both the custom CNN baseline and transfer learning experiments, no data augmentation was applied. This deliberate choice allows a direct assessment of each model’s intrinsic ability to generalize under limited and heterogeneous data conditions, without introducing artificial variability. Given the relatively small dataset size and the uncontrolled diversity of the original images, spanning a wide range of lighting conditions, angles, and backgrounds, further augmentation could have (i) introduced biases and non-physical variability in the real statistical distribution of visual features (e.g., altering soil color cues or cropping out small defect evidence), (ii) changed the coupling between background and label in unpredictable ways (potentially creating or removing shortcuts), and (iii) confounded explainability comparisons by modifying the input distribution differently across models. The absence of augmentation therefore ensures that model behavior, including any reliance on contextual artifacts, reflects genuine dataset characteristics rather than artifacts introduced during preprocessing. Because our objective is not only accuracy but also a quantitative audit of explanation faithfulness and localization, we keep the data distribution fixed to ensure that differences in the quantitative metrics of LIME, OS and IG can be attributed primarily to model/architecture behavior rather than augmentation artifacts. Controlled augmentation remains valuable for deployment, but in this work, it is treated as a follow-up axis that should be evaluated together with explanation stability and background-leakage controls.

The dataset was partitioned into training and validation subsets with an 80/20 split, stratified by class to preserve the imbalanced distribution. The number of images and the mean resolution per class are presented in Figure 2. While the stratified 80/20 split approach provides a baseline estimate of model performance, it does not fully mitigate the imbalance problem. Therefore, the integration of explainability methods such as LIME becomes crucial: high reported accuracy may conceal the fact that models base their decisions on contextual or spurious features instead of the actual physical faults in the panels.
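A stratified 80/20 split of this kind can be sketched in a few lines. The function name, the seed, and the rounding rule for the per-class validation count are assumptions made for illustration; the text only states that the split is 80/20 and stratified by class.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with val_fraction of every class held out."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        # Shuffle the indices of this class, then cut off the validation share.
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_val = int(round(val_fraction * len(idx)))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.sort(train_idx), np.sort(val_idx)
```

Because each class is cut separately, a 70-image class contributes 14 validation images. The split therefore preserves the imbalance rather than correcting it, which is the limitation noted above.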

2.2. Architectural Framework of the Deep Learning Models

Five CNN architectures were employed in this study: Baseline CNN, VGG16, ResNet50, InceptionV3, and EfficientNetB0. They were selected to provide a controlled comparison spanning distinct CNN design paradigms under identical data and training conditions. The Baseline CNN serves as a low-capacity reference trained from scratch to expose dataset-driven shortcuts under limited feature learning. VGG16 represents a classic deep sequential design with homogeneous 3 × 3 convolutions, widely used for image classification tasks due to its simplicity and effectiveness (Deb and Rahman [14]). ResNet50 introduces residual connections that stabilize optimization and promote feature reuse, improving classification accuracy by addressing the vanishing gradient problem (He et al. [15]). InceptionV3 (Szegedy et al. [16]) captures multi-scale patterns via parallel branches with different receptive fields. Its core building block, the inception module, carries out multiple convolutions of varying sizes in parallel and concatenates their outputs, which enables the network to capture both local and global information (Khan et al. [17]). EfficientNetB0 implements compound scaling, jointly adjusting network depth, width, and input resolution to balance accuracy and efficiency and thereby addressing the inefficiencies of conventional network scaling (VanBerlo et al. [18]). This set therefore probes how architectural inductive biases affect both performance and explanation behavior on heterogeneous PV imagery.

The structural overviews of the selected architectures are presented in Figure 3. The Baseline CNN, trained from scratch, consists of two convolutional layers followed by normalization, pooling, and fully connected stages, serving as a control for generalization under limited data. VGG16 represents an early deep architecture composed of uniform 3 × 3 convolutional blocks and max-pooling operations, forming a sequential and interpretable hierarchy. ResNet50 introduces residual connections that enable information flow across layers, mitigating gradient vanishing and improving feature reuse through bottleneck blocks. InceptionV3 employs a multi-branch design with parallel convolutions of varying receptive-field sizes, capturing both local and global patterns within the same layer depth. EfficientNetB0 exemplifies compound scaling, balancing network depth, width, and resolution through MBConv blocks optimized for efficiency. All pretrained models were truncated before their classification heads and extended with a Global Average Pooling layer—an operation that averages each feature map spatially to a single representative value—followed by dense layers and dropout regularization. This standardized structure facilitates direct comparison of feature extraction behavior across architectures under identical training conditions.
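To make the Global Average Pooling step concrete, the sketch below reduces a feature tensor to one value per channel and passes it through a single dense softmax layer. The layer sizes, the random weights, and the omission of dropout are assumptions for illustration; the text does not specify the head's dimensions.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each (H, W) feature map to one value: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

def dense_softmax_head(pooled, weights, bias):
    """Single dense layer followed by a numerically stable softmax."""
    logits = pooled @ weights + bias
    logits = logits - logits.max()      # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Hypothetical backbone output (7 x 7 spatial grid, 8 channels) and 6 classes.
rng = np.random.default_rng(0)
features = rng.random((7, 7, 8))
pooled = global_average_pooling(features)
probs = dense_softmax_head(pooled, rng.normal(size=(8, 6)), np.zeros(6))
```

The pooled vector has the same length regardless of the spatial size of the backbone output, which is what allows one head design to be reused across all four pretrained backbones.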

Despite their structural differences, all CNNs share the same opacity in internal reasoning, making it difficult to determine whether predictions rely on genuine fault-related features or on spurious contextual cues such as background textures or lighting patterns. To address this, three complementary explainability techniques were applied—LIME, occlusion sensitivity, and Integrated Gradients—each providing a distinct perspective on the spatial and functional basis of model predictions. This approach enables a multi-faceted interpretation of CNN behavior, offering both visual and analytical insights into the decision mechanisms underlying PV fault classification. The complete workflow from the image dataset to the results and their presentation is illustrated in Figure 4.

3. Explainability Framework

Section 3 describes the explainability framework using three complementary lenses. First, LIME provides surrogate-based local explanations, and we quantify their reliability via kernel-weighted surrogate fidelity. Second, occlusion sensitivity offers an intervention-based probe of functional relevance and enables dataset-level interpretability metrics (localization and concentration). Third, Integrated Gradients is evaluated with deletion–insertion tests to quantify faithfulness, i.e., whether attribution rankings are causally coupled to the model score. Each subsection ends with a short practical takeaway to support readers less familiar with XAI.

3.1. LIME-Based Explainability for Image Classification Models

To provide local, instance-level interpretability for the five CNN architectures, we implemented an image-adapted LIME procedure based on superpixel-level perturbations and a kernel-weighted linear surrogate. For each test image, the RGB input is first resized to the model-specific resolution (224 × 224 for Baseline CNN/EfficientNetB0/ResNet50/VGG16 and 299 × 299 for InceptionV3) and preprocessed using the corresponding normalization pipeline. The image is then partitioned into perceptually coherent regions using Quickshift superpixels (Vedaldi and Soatto [19]), implemented using scikit-image (with the parameters kernel_size = 2, max_dist = 10, ratio = 0.01), and the resulting segment labels are relabeled to a contiguous index set $\{0, \dots, K - 1\}$. This superpixel formulation is motivated by both interpretability and perturbation realism: superpixels act as human-meaningful explanation units (capturing local edges, textures, and homogeneous PV-surface areas) and avoid pixel-wise masking artifacts that are difficult to interpret and can generate out-of-distribution inputs in heterogeneous outdoor scenes. Local neighborhoods are constructed by sampling $N = 1200$ binary activation vectors $z \in \{0, 1\}^{K}$ (with $P(z_{k} = 1) = p_{keep} = 0.5$), where each perturbation masks entire superpixels rather than individual pixels; masked regions are replaced with a per-image mean RGB baseline, yielding perturbed images $I(z)$ that remain visually plausible compared with hard dropout. For each perturbed sample, we query the trained black-box architecture and record the probability of the predicted class of the unperturbed image ($c_{pred}$), i.e., $y(z) = p_{\theta}(c_{pred} \mid I(z))$. Perturbations are weighted by a locality kernel based on cosine distance to the all-superpixels-on reference $z_{0} = \mathbf{1}$: $d(z, z_{0}) = 1 - \cos(z, z_{0})$, and $w(z) = \sqrt{\exp(-d(z, z_{0})^{2} / \sigma^{2})}$ with $\sigma = 0.25$. A Ridge regression surrogate (weighted by $w$, $\alpha = 1.0$) is fitted to approximate the local mapping from superpixel inclusion to predicted-class probability; its coefficients provide a ranked attribution over superpixels, and we visualize the Top-3 positively contributing regions with consistent rank encoding (Red/Green/Blue) and a matched coefficient bar plot. Importantly, because superpixel explanations trade spatial precision for perturbation plausibility and locality, superpixel choice and segmentation granularity can influence the fitted surrogate; therefore, we report per-image kernel-weighted surrogate fidelity ($R_{w}^{2}$, $MSE_{w}$) to avoid over-interpreting explanations when the local linear approximation is unstable.
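The perturbation, weighting, and surrogate-fitting steps can be sketched with NumPy alone. This is an illustrative reconstruction under stated assumptions, not the author's code. The superpixel label map `segments` is taken as an input (in the study it comes from scikit-image Quickshift), `predict_fn` is a hypothetical callable returning the predicted-class probability for one image, and the weighted ridge is solved in closed form with an unpenalised intercept.

```python
import numpy as np

def lime_superpixel_surrogate(predict_fn, image, segments, n_samples=1200,
                              p_keep=0.5, sigma=0.25, alpha=1.0, seed=0):
    """Fit a kernel-weighted ridge surrogate over superpixel on/off masks."""
    rng = np.random.default_rng(seed)
    K = int(segments.max()) + 1
    # Sample binary masks; row 0 is the all-on reference z0.
    Z = (rng.random((n_samples, K)) < p_keep).astype(float)
    Z[0] = 1.0
    # Repair degenerate all-zero masks by switching on one random superpixel.
    rows = np.flatnonzero(Z.sum(axis=1) == 0)
    Z[rows, rng.integers(0, K, size=rows.size)] = 1.0
    mean_rgb = image.reshape(-1, image.shape[-1]).mean(axis=0)
    y = np.empty(n_samples)
    for i, z in enumerate(Z):
        keep = z[segments].astype(bool)                  # (H, W) pixel mask
        perturbed = np.where(keep[..., None], image, mean_rgb)
        y[i] = predict_fn(perturbed)
    # Cosine distance to the all-ones vector, then the LIME locality kernel.
    active = Z.sum(axis=1)
    d = 1.0 - active / (np.sqrt(active) * np.sqrt(K))
    w = np.sqrt(np.exp(-(d ** 2) / sigma ** 2))
    # Weighted ridge regression in closed form (intercept not penalised).
    X = np.hstack([np.ones((n_samples, 1)), Z])
    penalty = alpha * np.eye(K + 1)
    penalty[0, 0] = 0.0
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X + penalty, XtW @ y)
    return beta[1:], beta[0], Z, y, w
```

The returned coefficients rank the superpixels, so `np.argsort(coefs)[::-1][:3]` gives the Top-3 positively contributing regions when their coefficients are positive. The returned `Z`, `y`, and `w` are the quantities needed to compute the surrogate fidelity.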

Kernel-Weighted $R_{w}^{2}$ for LIME Surrogate Fidelity

Let $z_{i} \in \{0, 1\}^{K}$ denote the $i$-th LIME perturbation vector over $K$ superpixels, and let $x(z_{i})$ be the corresponding perturbed image obtained by masking the inactive superpixels. For the class explained (typically the predicted class on the unperturbed image), we define the model response:

$$y_{i} = p_{\hat{c}}(x(z_{i})) \tag{1}$$

and the surrogate prediction:

$$\hat{y}_{i} = g(z_{i}), \tag{2}$$

where $g$ is the fitted linear surrogate (e.g., ridge/linear regression) trained with locality weights $w_{i}$.

Distances are computed between each perturbation $z_{i}$ and the all-ones vector $z_{0} = \mathbf{1}$ (i.e., the unmasked configuration), using cosine distance:

$$d_{i} = d_{\cos}(z_{i}, z_{0}). \tag{3}$$

The kernel weight for the perturbation with index $i$ is given by:

$$w_{i} = \sqrt{\exp\left(-\frac{d_{i}^{2}}{\sigma^{2}}\right)}, \tag{4}$$

where $\sigma$ is the kernel width (here, $\sigma = 0.25$).

We define the weighted mean:

$$\bar{y}_{w} = \frac{\sum_{i = 1}^{N} w_{i} y_{i}}{\sum_{i = 1}^{N} w_{i}} \tag{5}$$

and the weighted sums of squares (error and total):

$$SSE_{w} = \sum_{i = 1}^{N} w_{i} (y_{i} - \hat{y}_{i})^{2}, \qquad SST_{w} = \sum_{i = 1}^{N} w_{i} (y_{i} - \bar{y}_{w})^{2}$$

The kernel-weighted coefficient of determination is calculated with:

$$R_{w}^{2} = 1 - \frac{SSE_{w}}{SST_{w} + \varepsilon} \tag{6}$$

where $\varepsilon$ is a small constant for numerical stability (the value $\varepsilon = 10^{-12}$ was used). The values of $R_{w}^{2}$ lie in the interval $(-\infty, 1]$, with values close to 1 indicating that the linear surrogate accurately approximates the model’s local response over the neighborhood emphasized by the LIME kernel.

The weighted MSE is given by:

$$MSE_{w} = \frac{\sum_{i = 1}^{N} w_{i} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{N} w_{i}} \tag{7}$$
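The fidelity measures of Equations (5) to (7) translate directly into a few lines of NumPy. The function name is illustrative. Following the usual definition of the coefficient of determination, the total sum of squares measures the weighted spread of the model responses around their weighted mean.

```python
import numpy as np

def weighted_fidelity(y, y_hat, w, eps=1e-12):
    """Kernel-weighted R^2 and MSE of a surrogate on LIME perturbations."""
    y, y_hat, w = (np.asarray(a, dtype=float) for a in (y, y_hat, w))
    y_bar = np.sum(w * y) / np.sum(w)            # weighted mean, Eq. (5)
    sse = np.sum(w * (y - y_hat) ** 2)           # weighted error sum of squares
    sst = np.sum(w * (y - y_bar) ** 2)           # weighted total sum of squares
    r2 = 1.0 - sse / (sst + eps)                 # Eq. (6)
    mse = sse / np.sum(w)                        # Eq. (7)
    return r2, mse
```

A surrogate that reproduces the responses exactly gives $R_{w}^{2} = 1$; one that only predicts the weighted mean gives $R_{w}^{2} \approx 0$; a surrogate worse than the mean gives a negative value.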

To quantify the reliability of LIME explanations beyond qualitative inspection, we report the kernel-weighted surrogate fidelity of the local linear model fitted to the LIME perturbation samples. Specifically, for each explained image, we compute the weighted coefficient of determination $R_{w}^{2}$ (and the corresponding weighted MSE) between the architecture’s predicted-class probability on perturbed inputs and the surrogate’s predictions, using the same locality weights as the LIME kernel. Low $R_{w}^{2}$ values indicate locally non-linear decision behavior that is poorly captured by a linear surrogate; in such cases, coefficient-based superpixel attributions should be interpreted cautiously. Table 1 summarizes, for each architecture, both the central tendency and the tail behavior of the LIME surrogate fit under the exact perturbation protocol used to generate explanations. It reports the following:

(i): $n$, the number of explained validation images per model;
(ii): ${\hat{R}}_{w}^{2}$ (the mean value of $R_{w}^{2}$) and $R_{w - p10}^{2}$ (the 10th percentile of $R_{w}^{2}$), where $R_{w}^{2}$ is the kernel-weighted coefficient of determination between the architecture’s response $y_{i} = p_{\hat{c}}(x(z_{i}))$ on perturbations and the surrogate prediction ${\hat{y}}_{i}$, computed with the LIME locality weights $w_{i}$ (thus capturing fidelity in the local neighborhood emphasized by LIME);
(iii): $\bar{p}(\hat{y})$, the mean predicted-class probability on the unperturbed images, included to contextualize surrogate fidelity with respect to model confidence;
(iv): $\hat{K}$, the mean number of superpixels produced by the chosen segmentation settings (Quickshift), serving as a proxy for explanation granularity and complexity of the surrogate feature space;
(v): $f_{R^{2} <}$, the fraction of a model’s instances falling below the global low-fidelity threshold (defined as the bottom decile of $R_{w}^{2}$ across all image $\times$ model pairs), indicating how often LIME explanations for that architecture enter a regime where linear surrogates are unreliable; and
(vi): $f_{K >}$, the fraction of instances exceeding the global high-fragmentation threshold (top decile of $K$), indicating how frequently segmentation produces highly fragmented partitions that can destabilize coefficient-based attributions.
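The per-architecture statistics listed above can be computed from per-image fidelity values as sketched below. The dictionary layout and key names are hypothetical, the mean confidence column is omitted for brevity, and strict inequalities at the decile thresholds are an assumption, since the text does not say how ties are handled.

```python
import numpy as np

def summarize_fidelity(r2_by_model, k_by_model):
    """Summarize LIME surrogate fidelity per model with global decile thresholds."""
    # Global thresholds are computed over all image x model pairs.
    all_r2 = np.concatenate([np.asarray(v, float) for v in r2_by_model.values()])
    all_k = np.concatenate([np.asarray(v, float) for v in k_by_model.values()])
    r2_low = np.percentile(all_r2, 10)    # bottom decile of R_w^2
    k_high = np.percentile(all_k, 90)     # top decile of K
    rows = {}
    for name in r2_by_model:
        r2 = np.asarray(r2_by_model[name], float)
        k = np.asarray(k_by_model[name], float)
        rows[name] = {
            "n": int(r2.size),
            "r2_mean": float(r2.mean()),
            "r2_p10": float(np.percentile(r2, 10)),
            "k_mean": float(k.mean()),
            "frac_low_r2": float(np.mean(r2 < r2_low)),
            "frac_high_k": float(np.mean(k > k_high)),
        }
    return rows
```

Because the thresholds are global rather than per-model, an architecture whose surrogate fits are systematically poor ends up with a fraction well above 0.1 in the low-fidelity column, while a well-behaved one stays near zero.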

3.2. Occlusion Sensitivity Quantitative Analysis

To complement the qualitative inspection of LIME visualizations and to establish a more objective basis for evaluating model interpretability, we developed a quantitative occlusion sensitivity pipeline implemented in TensorFlow and OpenCV. This procedure systematically perturbs the input image by masking localized square patches and records the resulting variation in prediction confidence. The resulting response maps quantify the causal influence of each spatial region on the model’s decision, enabling reproducible comparison across architectures and damage categories. In contrast to gradient-based techniques, occlusion sensitivity directly probes the decision surface of the trained model, thereby providing a measure of functional relevance rather than correlation.

Occlusion sensitivity assigns each spatial region a functional influence score by measuring the change in predicted-class confidence under localized interventions. For each image $I$, we define the reference class $c^{\star}$ as the model’s top-1 prediction on the unoccluded input and compute, for each patch location $(y, x)$, the log-probability drop $\Delta(y, x) = \log(p_{c^{\star}}(I) + \varepsilon) - \log(p_{c^{\star}}(I^{(y, x)}) + \varepsilon)$ with $\varepsilon = 10^{-8}$, where $I^{(y, x)}$ is obtained by replacing the patch with the per-image mean RGB value. Negative impacts (occlusion increasing confidence) are clamped to zero to retain evidence-decreasing contributions. The resulting grid map is upsampled bicubically and resized to a common $224 \times 224$ resolution for cross-architecture comparability, then max-normalized to $[0, 1]$. We quantify (i) localization via IoU@Top10%, computed between the top 10% most influential pixels (percentile-thresholded) and a consistently generated automatic proxy mask, (ii) dispersion via Shannon entropy computed on the normalized relevance mass $p_{i} = h_{i} / \sum_{j} h_{j}$, and (iii) compactness via Hoyer sparsity based on the $\ell_{1} / \ell_{2}$ ratio of the vectorized map.
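The three map-level metrics can be sketched as follows. Several details are assumptions rather than statements from the text: the natural logarithm for the entropy, the standard Hoyer normalization $(\sqrt{n} - \ell_{1}/\ell_{2}) / (\sqrt{n} - 1)$, and the use of the 90th percentile with a greater-or-equal comparison to select the top 10% of pixels. The proxy mask is taken as given.

```python
import numpy as np

def iou_top10(heatmap, proxy_mask):
    """IoU between the top-10% most influential pixels and a binary proxy mask."""
    thr = np.percentile(heatmap, 90)
    top = heatmap >= thr          # note: ties at the threshold are all included
    mask = proxy_mask.astype(bool)
    union = np.logical_or(top, mask).sum()
    return float(np.logical_and(top, mask).sum() / union) if union else 0.0

def shannon_entropy(heatmap, eps=1e-12):
    """Entropy of the normalized relevance mass p_i = h_i / sum_j h_j."""
    h = heatmap.ravel()
    p = h / (h.sum() + eps)
    p = p[p > 0]                  # 0 * log(0) is treated as 0
    return float(-(p * np.log(p)).sum())

def hoyer_sparsity(heatmap, eps=1e-12):
    """Hoyer sparsity in [0, 1]: 0 for a uniform map, 1 for a single active pixel."""
    h = np.abs(heatmap.ravel())
    n = h.size
    ratio = h.sum() / (np.sqrt((h ** 2).sum()) + eps)
    return float((np.sqrt(n) - ratio) / (np.sqrt(n) - 1))
```

Read together, a panel-centered explanation should show high IoU, low entropy, and high sparsity. A diffuse map spread over background regions shows the opposite pattern on all three.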

3.3. Integrated Gradients

3.3.1. General Theory of Integrated Gradients

To complement the perturbation-based occlusion sensitivity analysis, Integrated Gradients (IG) was employed to derive gradient-integrated attribution maps that capture the cumulative influence of each pixel on the predicted class. Unlike occlusion sensitivity, which measures confidence variation under explicit masking, IG integrates the gradient of the model response along a continuous interpolation path from a neutral baseline to the actual image, providing a smooth and analytically grounded estimate of functional relevance.

Integrated Gradients is an attribution method that explains the prediction of a model by measuring how changes in each input pixel (or feature) affect the output along a continuous path from a baseline input to the actual input.

For a model $F(x)$, an input $x$, and a baseline $x'$, the IG attribution for input dimension $i$ is defined as:

$$IG_{i}(x) = (x_{i} - x_{i}') \int_{0}^{1} \frac{\partial F(x' + \alpha (x - x'))}{\partial x_{i}} \, d\alpha \tag{8}$$

In practice, this path integral is approximated via a Riemann sum over a finite number of interpolation steps, $m$:

$$IG_{i}(x) \approx \frac{x_{i} - x_{i}'}{m} \sum_{k = 1}^{m} \frac{\partial F(x' + \frac{k}{m} (x - x'))}{\partial x_{i}} \tag{9}$$

The resulting attribution scores $IG_{i}(x)$ form a pixel-wise relevance map that approximately satisfies the completeness property, i.e., $\sum_{i} IG_{i}(x) \approx F(x) - F(x')$, allowing the prediction to be decomposed into a sum of feature contributions.
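The Riemann-sum approximation of Equation (9) can be illustrated without a deep learning framework by supplying the gradient as a function. Here `grad_fn` is a hypothetical callable returning $\partial F / \partial x$ at a given point; in practice it would be obtained by automatic differentiation of the network output for the predicted class. The toy model at the end is an assumption used only to show the completeness check.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, m=50):
    """Riemann-sum approximation of Integrated Gradients, as in Eq. (9)."""
    alphas = np.arange(1, m + 1) / m                     # k/m for k = 1..m
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)           # (x - x') / m * sum

# Toy differentiable model F(x) = sum(x^2), whose gradient is 2x.
F = lambda v: float(np.sum(v ** 2))
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(lambda v: 2.0 * v, x, baseline, m=50)
completeness_error = abs(attributions.sum() - (F(x) - F(baseline)))
```

With a finite number of steps the completeness property holds only approximately. For this toy model the attributions sum to slightly more than $F(x) - F(x')$, and the gap shrinks as $m$ grows.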

For the qualitative interpretability analysis, Integrated Gradients was computed for all five trained classifiers (Baseline CNN, EfficientNetB0, ResNet50, InceptionV3, VGG16). For each model and class, two correctly classified test images were selected, and IG attributions were obtained along a straight-line path from a zero-valued (black) baseline image to the actual input, using 50 interpolation steps. At each step, gradients were taken with respect to the output score of the predicted class; the gradients were then averaged over the interpolation path and scaled by the difference between the input and the baseline, as in Equation (9), to obtain a pixel-wise attribution map. The IG maps were then converted to relevance heatmaps by taking the absolute values, averaging across color channels, and applying min–max normalization to the [0, 1] range, after which they were overlaid on the original RGB images using a fixed blending factor to facilitate visual inspection.
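The conversion from a signed, three-channel attribution tensor to a displayable heatmap can be sketched as follows. The blending factor of 0.5 and the grayscale overlay (no colormap) are assumptions; the text states only that a fixed blending factor was used.

```python
import numpy as np

def ig_to_heatmap(attributions, eps=1e-12):
    """Absolute value, channel average, then min-max normalization to [0, 1]."""
    h = np.abs(attributions).mean(axis=-1)               # (H, W, 3) -> (H, W)
    return (h - h.min()) / (h.max() - h.min() + eps)

def overlay(image, heatmap, blend=0.5):
    """Blend a [0, 1] RGB image with a grayscale rendering of the heatmap."""
    return (1.0 - blend) * image + blend * heatmap[..., None]
```

Taking absolute values discards the sign of each attribution, so the heatmap shows where the model is sensitive but not whether a pixel supports or opposes the predicted class.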

3.3.2. Faithfulness of IG Explanations (Deletion-Insertion)

Beyond visual inspection, the faithfulness of IG explanations was quantified using the Deletion–Insertion framework. For each architecture, we computed the area under the confidence–perturbation curves (AUC) for progressive pixel removal (Deletion) and re-insertion (Insertion) and defined the Faithfulness Gap as:

$$\Delta = AUC_{ins} - AUC_{del} \tag{10}$$

Positive values of $\Delta$ indicate that the confidence of the model decreases when highly attributed pixels are removed and recovers when they are reintroduced, reflecting a causal rather than merely correlative relationship between the explanation and the decision.
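A minimal sketch of the Deletion–Insertion computation is given below. The number of perturbation steps, the replacement baseline, and the trapezoidal integration over the fraction of perturbed pixels are assumptions, since the text does not specify them. `predict_fn` is a hypothetical callable returning the predicted-class confidence for one image.

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal area under the curve y(x)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def faithfulness_gap(predict_fn, image, attribution, baseline, n_steps=20):
    """Return (gap, auc_ins, auc_del) for one image and one attribution map."""
    H, W = attribution.shape
    order = np.argsort(attribution.ravel())[::-1]        # most relevant first
    fractions = np.linspace(0.0, 1.0, n_steps + 1)
    del_scores, ins_scores = [], []
    for f in fractions:
        k = int(round(f * order.size))
        top = np.zeros(order.size, dtype=bool)
        top[order[:k]] = True
        top = top.reshape(H, W)[..., None]
        # Deletion: remove the top-k pixels from the original image.
        del_scores.append(predict_fn(np.where(top, baseline, image)))
        # Insertion: reveal the top-k pixels on top of the baseline.
        ins_scores.append(predict_fn(np.where(top, image, baseline)))
    auc_del = _trapezoid(del_scores, fractions)
    auc_ins = _trapezoid(ins_scores, fractions)
    return auc_ins - auc_del, auc_ins, auc_del
```

If the attribution ranking is uninformative, the deletion and insertion curves look alike and the gap is close to zero. A ranking that puts the truly decisive pixels last yields a negative gap.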

3.4. XAI Hyperparameter Choices and Rationale

Across all explainability methods, we fixed hyperparameters wherever possible to support fair cross-architecture comparison and selected values using a consistent set of criteria: minimizing masking artifacts (mean vs. zero baselines where applicable), ensuring locality (for LIME), balancing spatial resolution against runtime (for OS/IG), and enforcing reproducibility.

3.4.1. LIME Parameters

Hyperparameters were selected to (i) preserve architectural comparability via model-specific preprocessing, (ii) ensure locality around the original sample, (iii) avoid out-of-distribution masking artifacts, and (iv) provide stable surrogate coefficients under a reasonable computational budget. Quickshift was used for superpixelization because it produces compact, boundary-respecting regions. We generated LIME explanations using a manual, batch-safe implementation that preserves each architecture’s native input pipeline. Images were segmented into superpixels with Quickshift (kernel_size = 2, max_dist = 10, ratio = 0.01) computed on RGB values normalized to [0, 1], which yields compact, boundary-respecting regions suitable for localized occlusion. We used 1000 perturbation samples as a trade-off between coefficient stability and runtime and set the keep probability to p = 0.5 to balance mask sparsity and diversity while maintaining good conditioning of the linear surrogate; we also prevented degenerate all-zero masks and explicitly included the all-on reference mask. Masked superpixels were replaced by the image mean color (“mean” baseline) to avoid out-of-distribution artifacts (e.g., artificial black patches) that can bias CNN predictions. Locality weighting followed the standard LIME principle by measuring cosine distance between each perturbation mask and the all-on mask, using a kernel width of 0.25 to emphasize perturbations that are most similar to the original sample. The local surrogate was a weighted Ridge regression (α = 1.0) to improve numerical stability under correlated superpixel features. For clarity and comparability across models, we report the top 3 contributing superpixels (largest positive coefficient; fallback to largest magnitude when needed), computed with architecture-specific preprocessing (EfficientNet/ResNet/Inception/VGG) and batched inference (batch size 32).

3.4.2. OS Parameters

OS parameters were selected to balance spatial resolution, cross-architecture comparability, and computational feasibility while minimizing masking-induced artifacts. We computed Occlusion Sensitivity maps with an architecture-aware pipeline to ensure comparability across the five CNN backbones. For each model, the input image was resized to the network’s native input resolution and preprocessed with the corresponding application-specific function (EfficientNet/ResNet/Inception/VGG) or simple min–max scaling for the baseline CNN. We defined the target class as the model’s top-1 prediction on the unoccluded image and quantified sensitivity by sliding a square occlusion window over the image on a regular grid. Each occluded region was replaced with the global mean RGB color of the image (mean baseline), which reduces the out-of-distribution artifacts that can occur with black/zero masking. For each window position, we measured the drop in the model’s confidence for the target class in log-space, using:

$$\Delta = \log\left(p_{\mathrm{base}} + 10^{-8}\right) - \log\left(p_{\mathrm{occ}} + 10^{-8}\right)$$ (11)

where $p_{\mathrm{base}}$ is the original probability and $p_{\mathrm{occ}}$ is the probability after occlusion. Using log-probabilities improves numerical stability and yields a more informative sensitivity score, especially when probabilities are very small or close to saturation, where raw probability differences can become difficult to interpret. Patch size and stride were selected to preserve a similar relative spatial scale across input resolutions while balancing localization and computational cost: 24 × 24 pixels with stride 12 for 224 × 224 inputs and 32 × 32 with stride 16 for larger inputs (e.g., 299 × 299), maintaining ~50% overlap and thus comparable sampling density. The resulting impact values were assembled into a coarse grid, min–max normalized, and smoothly upsampled (bicubic interpolation) to the original resolution for visualization as a heatmap overlay.
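The sliding-window grid and the score of Equation (11) can be sketched as follows. The helper names are hypothetical, and the border handling (windows that would cross the image edge are skipped) is an assumption, since the text does not state how the last partial window is treated.

```python
import math

EPS = 1e-8  # stabilizing constant from Equation (11)

def window_origins(size, patch, stride):
    """Top-left coordinates of occlusion windows along one image axis."""
    # Assumption: windows that would cross the image border are skipped.
    return list(range(0, size - patch + 1, stride))

def log_confidence_drop(p_base, p_occ, eps=EPS):
    """Equation (11): drop in log-confidence of the target class."""
    return math.log(p_base + eps) - math.log(p_occ + eps)

def min_max_normalize(grid):
    """Rescale a 2D grid of impact values to [0, 1]."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]

origins_224 = window_origins(224, patch=24, stride=12)
origins_299 = window_origins(299, patch=32, stride=16)
print(len(origins_224), len(origins_299))  # prints "17 17"
```

Under this border convention both settings produce a 17 × 17 grid, which is one way to read the statement that the two configurations give comparable sampling density.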

3.4.3. IG Parameters

We computed IG attributions using an architecture-aware pipeline to ensure consistent interpretability across the five CNN backbones. For each model, images were robustly loaded as 3-channel RGB, resized to the network’s native input resolution (224 × 224 for Baseline/EfficientNet/ResNet/VGG and 299 × 299 for Inception), and preprocessed with the corresponding application-specific function (EfficientNet/ResNet/Inception/VGG) or min–max scaling for the baseline CNN. IG was computed for the model’s top-1 predicted class on the original input, using a zero-baseline defined in the same preprocessed input space and a uniform integration path with 50 steps (i.e., 51 interpolation points) between baseline and input. This choice represents a practical accuracy–runtime compromise commonly used in CNN attribution; qualitatively, this granularity was sufficient to produce smooth, stable maps without excessive cost. Gradients of the target class score with respect to the interpolated inputs were averaged along the path (Riemann-sum approximation), and attributions were calculated using Equation (9). Non-finite values were sanitized to zero to improve numerical robustness. For visualization, absolute attributions were aggregated across channels, min–max normalized to [0,1], and rendered as both grayscale heatmaps and colored overlays blended with the original image (α = 0.5). Representative qualitative results were generated by randomly sampling two images per class under a fixed seed (42), balancing coverage and computational cost.
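The Riemann-sum approximation described above can be sketched on a toy differentiable score. The quadratic score function, its weights, and the input vector are hypothetical stand-ins for a CNN class score, and the attribution formula used here (input minus baseline, times the path-averaged gradient) is the standard IG form, assumed to match Equation (9).

```python
import math

def integrated_gradients(x, grad_fn, baseline=None, steps=50):
    """Riemann-sum IG: (x - baseline) times the mean gradient over steps + 1 path points."""
    if baseline is None:
        baseline = [0.0] * len(x)  # zero baseline in the preprocessed input space
    grad_sum = [0.0] * len(x)
    for k in range(steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        grad_sum = [s + g for s, g in zip(grad_sum, grads)]
    mean_grad = [s / (steps + 1) for s in grad_sum]
    attributions = [(xi - b) * g for xi, b, g in zip(x, baseline, mean_grad)]
    # Sanitize non-finite values to zero for numerical robustness.
    return [a if math.isfinite(a) else 0.0 for a in attributions]

# Hypothetical stand-in for a class score: f(x) = sum_i w_i * x_i^2
WEIGHTS = [0.5, -1.0, 2.0]

def score(x):
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))

def score_grad(x):
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]

x = [1.0, 2.0, -0.5]
attributions = integrated_gradients(x, score_grad)
completeness_error = abs(sum(attributions) - (score(x) - score([0.0] * len(x))))
```

For this toy score the attributions sum to f(x) − f(baseline), the completeness property that motivates checking whether 50 steps are sufficient for a real network.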

4. Results

4.1. Performance

4.1.1. Metrics

The performance metrics for all architectures computed over the validation set are presented in Supplementary File S1, Tables S1–S5. The quantitative evaluation highlights a clear performance gap between the baseline CNN and the transfer-learning architectures, with the former achieving only moderate accuracy (63.8%) and macro-F1 (0.618), indicative of limited class-wise generalization. In contrast, pretrained models substantially improve both metrics, with ResNet50 yielding the strongest overall performance (accuracy = 82.3%, macro-F1 = 0.825). While visually distinctive classes such as Snow-covered and Electrical-damage are consistently detected with high recall across architectures, Physical-damage remains the most challenging category, exhibiting persistently lower recall even for the best-performing model. The corresponding confusion patterns indicate that residual errors are dominated by misclassifications among visually and structurally similar fault types rather than confusion with the Clean class. These results provide an essential quantitative baseline for the subsequent analysis of occlusion-based explainability maps, where we investigate whether the spatial attribution patterns reflect these systematic strengths and failure modes. The baseline CNN exhibits pronounced class imbalance effects, with strong precision but low recall for certain fault types (e.g., Physical-damage), resulting in substantial off-diagonal entries in the confusion matrix. Transfer-learning models markedly reduce these errors, achieving improved diagonal dominance, particularly for Electrical-damage and Snow-covered samples. Nevertheless, consistent misclassification patterns persist across architectures for fault classes sharing similar texture and structural characteristics, confirming that remaining errors are systematic rather than random. 
These supplementary results substantiate the comparative performance claims made in the main text and serve as a quantitative reference point for interpreting the model-specific occlusion sensitivity and XAI visualizations discussed later throughout the paper.

4.1.2. Cross Validation and Robustness to Partitioning

To assess the sensitivity of model performance to the particular Train/Validation split, we performed repeated stratified hold-out validation within the predefined training set. Specifically, the Train set was randomly partitioned 30 times into 80% sub-train and 20% sub-validation subsets, preserving the class proportions in each repetition (stratification). For every repetition, each architecture was re-initialized and trained from scratch using the same preprocessing and training hyperparameters as in the main experiments (including class-weighting and early stopping) and evaluated on the corresponding sub-validation subset. We report the resulting distributions as well as the mean ± standard deviation of accuracy, macro-averaged precision/recall/F1, and per-class F1. The original external Test set was kept untouched and was reserved for the main Train/Test evaluation and explainability analyses. Figure 5 illustrates the distribution of macro-averaged F1 scores obtained across 30 repeated stratified train/validation splits for all five architectures. Each violin summarizes the empirical distribution of Macro-F1 values for one architecture, while individual points correspond to independent repetitions with different random partitions. This representation highlights not only the average performance but also the stability with respect to data partitioning, revealing clear differences in robustness across models. In particular, ResNet50 exhibits both higher central tendency and reduced dispersion, indicating superior and more stable performance compared to the remaining architectures.
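The repeated stratified hold-out protocol can be sketched as follows. The function name, the toy label counts, and the use of the repetition index as the random seed are assumptions for illustration; model training and evaluation on each split are omitted.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, val_fraction=0.2, seed=0):
    """Split sample indices into sub-train/sub-validation, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy label vector (class counts are illustrative, not the dataset's)
labels = ["Clean"] * 50 + ["Dusty"] * 40 + ["Bird-drop"] * 30

# 30 repetitions, each with a different random partition
splits = [stratified_holdout(labels, seed=rep) for rep in range(30)]
```

Each repetition would then re-initialize and train every architecture on the sub-train indices and score it on the sub-validation indices, giving 30 values per metric and model.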

To further examine class-specific sensitivity to data partitioning, Figure 6 reports the distribution of per-class F1 scores for the Electrical damage category across the same 30 repeated stratified splits. This class was selected due to its higher variability and lower separability observed in the aggregate metrics. As in Figure 5, each point corresponds to one independent run, and the violin shape reflects the density of observed F1 score values. The figure reveals substantial differences in robustness at the class level, with some architectures exhibiting pronounced variability (Baseline CNN), thereby complementing the global analysis and supporting the discussion on partition-dependent behavior and potential shortcut learning effects.

While Figure 5 summarizes overall robustness at the model level, Figure 6 provides a finer-grained, class-specific perspective on partition sensitivity.

4.2. LIME Explainability

4.2.1. LIME Quantitative Results

The Baseline CNN exhibits consistently high surrogate fidelity (mean $R_w^2 = 0.915$, 10th percentile $= 0.853$), indicating that its output varies approximately linearly under superpixel masking; however, it also shows the lowest average predicted-class confidence (0.60), emphasizing that a faithful local surrogate does not necessarily imply strong or physically grounded evidence. At the other extreme, VGG16 yields the weakest surrogate fidelity (mean $R_w^2 = 0.289$, 10th percentile $= 0.171$) while maintaining high confidence ($\bar{p}(\hat{y}) \approx 0.884$), revealing locally complex score–perturbation relationships where LIME’s linear approximation becomes unreliable despite confident predictions. EfficientNetB0 shows intermediate-to-good fidelity (mean $R_w^2 = 0.559$, 10th percentile $= 0.457$), whereas ResNet50 and InceptionV3 are lower (mean $R_w^2 = 0.415$ and $0.476$, respectively). Notably, InceptionV3 produces a markedly larger number of superpixels on average ($K \approx 279$, versus $K \approx 159$ for 224 × 224 models), reflecting a more fragmented interpretable partition that can increase explanation variance and dilute coefficient interpretability. A class-model worst-case inspection (Supplementary File S2) further confirms that the lowest surrogate fits concentrate in visually diffuse regimes (e.g., Snow-covered), with extreme cases such as VGG16 × Snow-covered reaching $R_w^2 \approx 0.10$ at near-unit confidence, reinforcing that confidence and “visually clean” explanations alone are insufficient without surrogate-fidelity context.

To further characterize where LIME becomes unreliable, we performed a class–model worst-tail audit and provided the complete listings in Supplementary File S2 (worst_cases_by_class_model.csv). For each architecture and each PV-fault class, we selected the six instances with the highest “badness” score (dominated by low kernel-weighted surrogate fidelity $R_w^2$, and optionally amplified by low confidence or high superpixel fragmentation), thereby isolating regimes where the local linear surrogate is least able to approximate the model response under superpixel masking. The resulting worst-tail patterns are summarized by two compact heatmaps: Figure 7 reports the mean $R_w^2$ within the worst-tail subset for each class × model pair, while Figure 8 reports the corresponding minimum $R_w^2$, highlighting the most extreme failure modes. These maps reveal that LIME fragility is not uniform across categories: visually diffuse and globally distributed phenomena (most notably Snow-covered) concentrate the lowest-fidelity cases, and architecture-specific weaknesses emerge (e.g., VGG16 exhibits particularly low worst-tail $R_w^2$ for Snow-covered despite high prediction confidence). By explicitly reporting both average and extreme surrogate-fidelity behavior in the worst tail, we complement the global summary of Table 1 and provide a transparent, reproducible basis for selecting representative examples and for interpreting LIME coefficients cautiously in locally non-linear regimes.

The two all-sample heatmaps are provided in the Supplementary File S3 (Figures S1 and S2) to document the global baseline of LIME surrogate fidelity across the entire validation set, thereby complementing the main-text worst-tail analysis (which is intentionally focused on failure modes) and enabling the distinction between typical behavior and tail-risk extremes. Taken together, the four heatmaps show a consistent hierarchy in LIME surrogate fidelity: Baseline CNN remains highly linear under masking (high mean and high minima); EfficientNetB0/InceptionV3 occupy a stable intermediate regime, whereas ResNet50 and especially VGG16 exhibit markedly lower fidelity with the strongest tail risk (lowest minima), and this degradation concentrates in visually diffuse regimes such as Snow-covered, confirming that LIME reliability is jointly driven by architecture and class-specific visual structure and motivating explicit reporting of both global and worst-tail surrogate-fit diagnostics.

Supplementary File S4 (Zenodo, doi:10.5281/zenodo.18233689) provides the complete set of LIME visualization outputs for the full test set, preserved in the original class-wise folder structure. For each test image, four PNG files are included: (i) the resized original, (ii) the superpixel segmentation overlay, (iii) the LIME RGB overlay highlighting the top-3 ranked superpixels (red/green/blue), and (iv) a bar plot of the corresponding top-3 Ridge surrogate coefficients (with the predicted label of the model and probability reported in the plot title).

4.2.2. Selection Policy for Representative and Failure-Mode LIME Examples

For the LIME examples reported in the main manuscript, we adopted an explicit best–worst per architecture curation policy intended to showcase both representative explanatory behavior and “tail-risk” failure modes. For each architecture (Baseline CNN, EfficientNetB0, ResNet50, InceptionV3, VGG16), we selected exactly two test images using surrogate fidelity as the primary criterion. Fidelity was quantified by the kernel-weighted coefficient of determination $R_w^2$, computed between (i) the black-box model’s predicted-class probability on the LIME perturbed samples and (ii) the corresponding predictions of the locally fitted Ridge surrogate, using LIME’s locality weights derived from the exponential kernel. The resulting curated set is summarized in Table 2, which documents for each selected case the architecture, extremal type (pick, BEST/WORST), the directory ground-truth label, surrogate fidelity ($R_w^2$) and weighted error ($\mathrm{MSE}_w$), as well as the model’s predicted label and confidence on the unperturbed image (Predicted, Prob) to provide decision-regime traceability.
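A minimal sketch of the two fidelity quantities follows, assuming the usual weighted definitions: a weighted residual sum of squares relative to the weighted variance of the black-box outputs for the coefficient of determination, and the weight-normalized residual for the error. The function name and argument layout are hypothetical.

```python
def weighted_fidelity(y_model, y_surrogate, weights):
    """Return (R2_w, MSE_w) between black-box outputs and surrogate predictions."""
    w_total = sum(weights)
    # Weighted mean of the black-box outputs on the perturbed samples
    y_bar = sum(w * y for w, y in zip(weights, y_model)) / w_total
    ss_res = sum(w * (y - s) ** 2 for w, y, s in zip(weights, y_model, y_surrogate))
    ss_tot = sum(w * (y - y_bar) ** 2 for w, y in zip(weights, y_model))
    mse_w = ss_res / w_total
    # R2_w is undefined when the black-box output does not vary under masking
    r2_w = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return r2_w, mse_w

# A surrogate that reproduces the model exactly has R2_w = 1 and MSE_w = 0
perfect = weighted_fidelity([0.9, 0.5, 0.1], [0.9, 0.5, 0.1], [1.0, 0.6, 0.2])
# A surrogate that only predicts the (weighted) mean has R2_w = 0
constant = weighted_fidelity([0.9, 0.5, 0.1], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0])
```

Values near 1 indicate that the linear surrogate tracks the model under masking, while values near 0 (or below) indicate that the reported coefficients should be read with caution.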

Selection was performed within each architecture. The BEST example was defined as the case with the highest available $R_w^2$, while preferentially restricting the search to robust operating conditions: whenever the required fields were available, we prioritized images for which the unperturbed prediction was high-confidence (e.g., $p_{\max} \geq 0.8$) and, when possible, correct (predicted label matches the ground truth), so that high-fidelity explanations are demonstrated under the intended regime of correct decisions. Conversely, the WORST example was defined as the case with the lowest available $R_w^2$, with priority given to high-confidence cases when available, because low surrogate fidelity under high confidence is most diagnostic of LIME limitations, namely strong local nonlinearity of the classifier, sensitivity to superpixel segmentation, or instability of the local perturbation neighborhood. If a preferred subset was empty for a given architecture (i.e., no candidates satisfied the confidence and/or correctness preference), we applied a deterministic fallback and selected the global maximum (BEST) or global minimum (WORST) $R_w^2$ among all test images available for that architecture.

To make the extremal nature of each choice auditable, Table 2 additionally reports r2_rank (the rank of the selected case when all test samples for that architecture are sorted by $R_w^2$ in descending order) and r2_p, the percentile of $R_w^2$, indicating whether the selected sample is a strict extreme or a near-extreme due to preference constraints. Thumbnail previews of the selected BEST/WORST exemplars, together with the corresponding Top-3 LIME superpixels, are provided in Supplementary File S5 for visual traceability.
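One possible reading of the selection policy and its audit fields is sketched below. The record layout, the toy values, the intermediate fallback from “confident and correct” to “confident only”, and the percentile convention (share of cases with fidelity less than or equal to the selected one) are assumptions for illustration.

```python
def pick_extremes(cases, min_conf=0.8):
    """Select BEST/WORST cases by R2_w with confidence/correctness preferences."""
    confident = [c for c in cases if c["prob"] >= min_conf]
    # BEST: prefer confident and correct, then confident, then all cases
    best_pool = [c for c in confident if c["correct"]] or confident or cases
    # WORST: prefer confident cases, otherwise fall back to all cases
    worst_pool = confident or cases
    best = max(best_pool, key=lambda c: c["r2"])
    worst = min(worst_pool, key=lambda c: c["r2"])
    return best, worst

def r2_rank_and_percentile(cases, case_id):
    """Rank in descending R2_w order (1 = highest) and percentile of the selected R2_w."""
    selected = next(c for c in cases if c["id"] == case_id)
    rank = 1 + sum(1 for c in cases if c["r2"] > selected["r2"])
    percentile = 100.0 * sum(1 for c in cases if c["r2"] <= selected["r2"]) / len(cases)
    return rank, percentile

# Toy records for one architecture (values are illustrative only)
cases = [
    {"id": "a", "r2": 0.95, "prob": 0.55, "correct": True},
    {"id": "b", "r2": 0.90, "prob": 0.92, "correct": True},
    {"id": "c", "r2": 0.40, "prob": 0.99, "correct": False},
    {"id": "d", "r2": 0.10, "prob": 0.97, "correct": True},
    {"id": "e", "r2": 0.05, "prob": 0.60, "correct": True},
]
best, worst = pick_extremes(cases)
best_audit = r2_rank_and_percentile(cases, best["id"])
worst_audit = r2_rank_and_percentile(cases, worst["id"])
```

In this toy set the preference constraints select near-extremes rather than strict extremes (ranks 2 and 4 out of 5), which is the situation the r2_rank and r2_p fields are meant to expose.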

4.3. Functional Interpretability Through Occlusion Sensitivity

4.3.1. Occlusion Sensitivity Maps

Occlusion sensitivity analysis (Figure 9) provides an intervention-based view of model evidence by quantifying the change in predicted-class confidence when local image patches are masked. Across the representative occlusion maps in Figure 9, the architectures exhibit distinct “evidence utilization” modes that are class dependent. For context-rich scenes (Dusty and the second Bird-drop exemplar), the Baseline CNN frequently assigns functional relevance to peripheral structures and background terrain, indicating a context-driven decision pathway consistent with shortcut learning under dataset heterogeneity. EfficientNetB0, in contrast, more consistently anchors relevance on the PV surface and along physically plausible transitions (e.g., soiling gradients or snow boundaries), yielding a balanced pattern that remains informative both for diffuse phenomena (Dusty, Snow-covered) and compact faults (Bird-drop). ResNet50 often produces multi-island relevance distributions, suggesting partial structural awareness but weaker selectivity. VGG16 tends to generate highly peaked maps (few intense hotspots), which aligns with its high sparsity/low entropy profile but also implies vulnerability to point-cue reliance when the hotspot does not coincide with the true defect. Finally, InceptionV3 frequently yields spatially diffuse heatmaps—consistent with high entropy and near-zero sparsity—so that coarse overlap with defect masks may occur (notably for large-area phenomena such as Snow-covered) without providing fine-grained localization. Overall, Figure 9 reinforces that interpretability cannot be inferred from accuracy alone: models may achieve correct predictions while relying on markedly different—and sometimes non-physical—evidence sources.

These complementary measures separate where evidence is concentrated (IoU) from how it is distributed (entropy/sparsity) and are interpreted alongside qualitative overlays (Figure 9). Formal definitions of the occlusion impact map and quantitative metrics (IoU@Top10%, entropy, Hoyer sparsity), including numerical-stability constants and units, are provided in Supplementary File S6.
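For orientation, the three metrics can be sketched with their commonly used definitions. The formal definitions are those of Supplementary File S6; the stabilizing constant, the natural-log entropy, and the tie handling in the top-10% selection used here are assumptions.

```python
import math

def iou_top_fraction(impact, mask, fraction=0.10):
    """IoU between the top-`fraction` impact pixels and a binary proxy mask."""
    flat = [v for row in impact for v in row]
    flat_mask = [bool(m) for row in mask for m in row]
    k = max(1, round(fraction * len(flat)))
    # Indices of the k most influential pixels (ties keep raster order)
    top = set(sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k])
    inter = sum(1 for i in top if flat_mask[i])
    union = len(top) + sum(flat_mask) - inter
    return inter / union if union else 0.0

def shannon_entropy(impact, eps=1e-12):
    """Entropy (in nats) of the impact map normalized to a probability distribution."""
    flat = [abs(v) for row in impact for v in row]
    total = sum(flat) + eps
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)

def hoyer_sparsity(impact, eps=1e-12):
    """Hoyer sparsity: 0 for a uniform map, 1 for a single non-zero pixel."""
    flat = [abs(v) for row in impact for v in row]
    n = len(flat)
    l1 = sum(flat)
    l2 = math.sqrt(sum(v * v for v in flat))
    return (math.sqrt(n) - l1 / (l2 + eps)) / (math.sqrt(n) - 1)

# Toy 10 x 10 maps: uniform relevance versus a single hotspot
uniform = [[1.0] * 10 for _ in range(10)]
one_hot = [[1.0 if (r, c) == (0, 0) else 0.0 for c in range(10)] for r in range(10)]
```

The two toy maps mark the ends of the scale: the uniform map has maximal entropy and zero sparsity, and the single-hotspot map has zero entropy and unit sparsity, independently of whether the hotspot overlaps the proxy mask.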

Occlusion metrics were computed on the full validation set under identical occlusion hyperparameters for all architectures; the evaluation is balanced across models and classes (equal number of images per class–model pair). Table 3 summarizes the quantitative occlusion-sensitivity metrics aggregated over the validation set, while Figure 9 provides representative qualitative examples. Overall, VGG16 exhibits the strongest functional localization, achieving the highest IoU@Top10% (0.172 ± 0.145), the lowest entropy (8.391 ± 2.700), and the highest Hoyer sparsity (0.520 ± 0.277), indicating comparatively compact and mask-aligned relevance distributions under perturbation. EfficientNetB0 shows similarly high sparsity (0.449 ± 0.146) but lower overlap with the defect masks (IoU@Top10% = 0.096 ± 0.051), suggesting that its relevance is often concentrated yet not consistently aligned with the automatically derived fault regions. ResNet50 attains intermediate IoU (0.130 ± 0.114) but markedly lower sparsity (0.183 ± 0.252), consistent with broader relevance spread across the image. InceptionV3 yields the weakest localization signal, with near-zero sparsity (0.013 ± 0.094) and the highest entropy (10.321 ± 2.209), consistent with the visually diffuse occlusion maps observed in Figure 9. Taken together, these results indicate that architectural differences in functional relevance are measurable at the dataset scale and that high-confidence predictions (PredProb) do not necessarily imply spatially faithful or mask-aligned evidence.

4.3.2. Model-Level Interpretability Metrics

The updated model-level occlusion metrics (Figure 10, Figure 11 and Figure 12 with the error bars denoting standard deviation across N = 141 test images) reveal marked differences in how architectures distribute functional relevance, and they also clarify that concentration and localization are not interchangeable. VGG16 exhibits the lowest mean entropy (8.391 ± 2.700) and the highest Hoyer sparsity (0.520 ± 0.277), indicating highly concentrated occlusion maps. Consistently, it also achieves the highest IoU@Top10% (0.172 ± 0.145), i.e., the top 10% most influential pixels overlap most with the automatically generated defect masks. However, this “best” IoU–sparsity combination should be interpreted cautiously: strong concentration can inflate overlap when masks are compact or imperfect, and qualitative inspections (occlusion/LIME) still show that VGG16 may lock onto small, high-contrast hotspots that are not necessarily the true fault evidence. In contrast, InceptionV3 yields the highest entropy (10.321 ± 2.209) and near-zero sparsity (0.013 ± 0.094), confirming diffuse relevance and an over-smoothing tendency, while still producing a mid-range IoU@Top10% (0.111 ± 0.064), consistent with coarse but weakly localized alignment. ResNet50 attains the second-highest IoU@Top10% (0.130 ± 0.114) with moderate entropy (9.321 ± 3.157) but low sparsity (0.183 ± 0.252), suggesting broader evidence integration rather than compact localization. EfficientNetB0 shows a more stable profile—moderate IoU@Top10% (0.096 ± 0.051), relatively low entropy (9.760 ± 0.555), and high sparsity (0.449 ± 0.146)—which aligns with the interpretation of a balanced, consistently focused attention mechanism. Finally, the Baseline CNN yields the lowest IoU@Top10% (0.083 ± 0.030) alongside intermediate sparsity and entropy, reflecting weaker and less reliable localization. 
Overall, these results emphasize that high IoU can coexist with low faithfulness (as we will show in the next section by IG-based metrics) and that robust interpretability requires jointly considering localization, concentration, and faithfulness rather than any single metric in isolation.

4.3.3. Per-Class Interpretability Analysis

Figure 13, Figure 14 and Figure 15 summarize how occlusion-derived interpretability metrics vary across defect categories and architectures, revealing that the same model can exhibit significantly different explanation behavior depending on the visual structure of the class. Importantly, these class-wise patterns refine the model-level trends discussed above: high localization (IoU@Top10%) can coincide with either concentrated or distributed saliency, and concentration (high Hoyer/low entropy) should not be conflated with semantic correctness. The qualitative overlays in Figure 9 provide the necessary visual anchor for interpreting these metrics, showing that architectures differ not only in “how much” relevance they assign but also in where and how coherently relevance is distributed over the PV surface versus contextual background.

From a localization standpoint (Figure 13), VGG16 achieves the highest IoU@Top10% in five out of six classes (Bird-drop, Clean, Dusty, Electrical damage, and Physical damage), indicating that its most influential occluded regions frequently overlap the automatically derived defect masks. However, this consistent “best” overlap must be read together with Figure 13, Figure 14 and Figure 15 and with Figure 9: VGG16 also exhibits strongly concentrated maps (high Hoyer sparsity across most classes) and low entropy relative to other architectures, implying that overlap is often driven by a small number of high-impact hotspots. This aligns with the qualitative behavior in Figure 9, where VGG16 tends to lock onto compact, high-contrast regions. Such concentration can be advantageous when the defect truly forms compact salient structures (e.g., localized bird-drop patterns or sharp electrical damage cues), but it also raises the risk of over-reliance on spurious, high-contrast artifacts, i.e., high overlap does not necessarily guarantee physically meaningful evidence.

The concentration–diffusion separation becomes explicit in Figure 14 (Hoyer sparsity) and Figure 15 (entropy). InceptionV3 systematically yields near-zero sparsity across all classes and the highest entropy, confirming highly dispersed relevance distributions that are weakly localized—again matching the almost uniform heatmaps observed in Figure 9. EfficientNetB0 shows stable, moderate-to-high sparsity across classes, often ranking second after VGG16, while simultaneously maintaining comparatively moderate entropy (Figure 15). This combination supports the qualitative impression from Figure 9 that EfficientNetB0 tends to anchor evidence on the PV surface and preserve coherent relevance transitions rather than spreading attribution over the whole scene. ResNet50 occupies an intermediate regime: its sparsity is lower than EfficientNetB0 and substantially below VGG16, while entropy is also typically moderate, consistent with the “multi-island” attribution patterns observed in Figure 9—suggesting evidence integration from multiple spatial cues rather than a single hotspot. The Baseline CNN generally remains less distinctive than the pretrained backbones, showing intermediate sparsity and relatively high entropy in several classes, consistent with a greater sensitivity to context and background structures.

Finally, the class dependency of these metrics is particularly informative for understanding where localization is intrinsically harder. The Snow-covered class consistently exhibits the weakest IoU@Top10% across architectures (Figure 13), reflecting the fact that snow coverage often reduces contrast and introduces large, diffuse regions with soft boundaries, making precise localization difficult for occlusion-based maps. In this regime, models that produce compact hotspots (e.g., VGG16) may fail to match extended masks despite strong concentration, whereas a diffuse model (InceptionV3) can obtain relatively better overlap simply by distributing relevance broadly over the scene. Taken together, Figure 13, Figure 14 and Figure 15 and the qualitative evidence of Figure 9 support a key methodological point: interpretability assessment should be multi-metric—localization (IoU), concentration (Hoyer), and dispersion (entropy) capture complementary properties, and only their joint interpretation can distinguish “focused but potentially brittle” explanations (e.g., VGG16) from “diffuse and weakly localized” explanations (InceptionV3) and more balanced behaviors (EfficientNetB0/ResNet50).

4.4. Integrated Gradients

Based on the definition of the faithfulness metrics described in Section 3.3.2, Table 4 summarizes $\widehat{\mathrm{AUC}}_{\mathrm{del}}$, $\widehat{\mathrm{AUC}}_{\mathrm{ins}}$, and the faithfulness gap $\Delta$.
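As a reminder of how these quantities relate, the sketch below builds deletion and insertion curves for a toy linear score and computes the gap as insertion AUC minus deletion AUC. This sign convention, the trapezoidal normalization, the zero replacement value, and the toy score are assumptions for illustration; the definitions used in this work are those of Section 3.3.2.

```python
def perturbation_curve(values, attributions, score_fn, mode, baseline=0.0):
    """Confidence curve when pixels are deleted or inserted in decreasing attribution order."""
    order = sorted(range(len(values)), key=lambda i: attributions[i], reverse=True)
    current = list(values) if mode == "deletion" else [baseline] * len(values)
    curve = [score_fn(current)]
    for i in order:
        current[i] = baseline if mode == "deletion" else values[i]
        curve.append(score_fn(current))
    return curve

def trapezoid_auc(curve):
    """Normalized area under a curve sampled at equally spaced fractions in [0, 1]."""
    n = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2.0 for i in range(n)) / n

def faithfulness_gap(insertion_curve, deletion_curve):
    """Gap = AUC_ins - AUC_del (assumed sign convention)."""
    return trapezoid_auc(insertion_curve) - trapezoid_auc(deletion_curve)

# Toy "model": confidence is a weighted sum of three pixel values
W = [0.6, 0.3, 0.1]

def toy_score(x):
    return sum(w * v for w, v in zip(W, x))

pixels = [1.0, 1.0, 1.0]
deletion = perturbation_curve(pixels, W, toy_score, mode="deletion")
insertion = perturbation_curve(pixels, W, toy_score, mode="insertion")
gap = faithfulness_gap(insertion, deletion)
```

When the attribution ranking matches what actually drives the score, as in this toy case, confidence collapses quickly under deletion and recovers quickly under insertion, so the gap is positive; a ranking unrelated to the model's evidence yields a gap near zero or negative.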

Figure 16 shows a consistent global ranking in terms of the mean faithfulness gap $\Delta$ over the evaluation set (N = 141). EfficientNetB0 yields the largest mean gap ($\Delta = 0.0192$), followed by ResNet50 ($\Delta = 0.0153$) and the Baseline CNN ($\Delta = 0.0106$), whereas VGG16 ($\Delta = 0.0077$) and especially InceptionV3 ($\Delta = 0.0049$) show weaker separation between insertion and deletion curves. Notably, the Baseline CNN combines the lowest deletion AUC (rapid confidence degradation under removal) with limited recovery under insertion, producing only an intermediate $\Delta$.

The global comparison presented in Figure 16 shows the mean faithfulness gap aggregated over all six classes for each architecture. EfficientNetB0 achieves the highest average gap ($\Delta \approx 0.019$), followed by ResNet50 ($\Delta \approx 0.015$) and the Baseline CNN ($\Delta \approx 0.011$). InceptionV3 and VGG16 obtain substantially lower values ($\Delta \approx 0.005$ and $0.008$, respectively). This ranking suggests that, among the tested architectures, EfficientNetB0 and ResNet50 produce IG maps whose highlighted regions are most tightly coupled to the classifier’s confidence, whereas the explanations of InceptionV3 and VGG16 are less faithful in the insertion–deletion sense, despite their competitive classification performance.

The Faithfulness Gap heatmap across all architectures is presented in Figure 17. Because the faithfulness gap is computed as a mean over images, positive global gaps (Figure 16) do not preclude negative class-level gaps (Figure 17) when sign reversals occur in specific categories. At the aggregate level, all architectures show positive mean Faithfulness Gaps (Figure 16), with EfficientNetB0 exhibiting the largest average separation (Δ ≈ 0.019) and ResNet50 also strongly positive (Δ ≈ 0.015), while InceptionV3 and VGG16 yield smaller mean gaps (Δ ≈ 0.005–0.008). However, Figure 17 shows that faithfulness is class-dependent and can reverse sign: Baseline CNN becomes slightly negative on Dusty (−0.001) and Snow-covered (−0.005), InceptionV3 is negative on Clean (−0.004) and Physical damage (−0.014), and VGG16 is negative on Clean (−0.004) and Electrical damage (−0.010). These negative class-level means can coexist with a positive global mean because Figure 16 pools all images across classes, allowing strongly positive regimes to outweigh weaker or negative regimes. Overall, the results indicate that high predictive performance can coincide with weak pixel-level faithfulness in specific categories, particularly where evidence is diffuse, low-contrast, or easily confounded by contextual correlations.
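The coexistence of a positive pooled mean with a negative class-level mean is simple arithmetic, illustrated below with hypothetical per-image gaps for two classes (the values are invented for the example and are not taken from Figure 17).

```python
from statistics import mean

# Hypothetical per-image faithfulness gaps (illustrative values only)
gaps_by_class = {
    "Clean": [-0.02, -0.01],
    "Dusty": [0.05, 0.03, 0.04],
}

# Class-level means, as displayed in a per-class heatmap
class_means = {name: mean(vals) for name, vals in gaps_by_class.items()}

# Pooled mean over all images, as displayed in a global bar chart
pooled_mean = mean([v for vals in gaps_by_class.values() for v in vals])
```

Here the Clean mean is negative (−0.015) while the pooled mean remains positive (0.018), because the pooled average weights every image equally and the positive class contributes more images and larger magnitudes.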

In addition, we performed an exploratory analysis of the relationship between validation accuracy and faithfulness gap at class level. For each defect category, we computed Pearson and Spearman correlations between the per-model validation accuracy and the corresponding mean Δ across the five architectures. No consistent positive coupling emerged: some classes (e.g., Bird-drop, Snow-covered) exhibited weak to moderate positive trends, whereas others (e.g., Clean, Electrical damage) showed negative trends, confirming that higher predictive performance does not necessarily imply more faithful pixel-level explanations. Full correlation coefficients and per-class scatterplots are reported in Supplementary File S7, Table S6 and Figure S3.

To obtain a compact view of how accuracy and interpretability co-vary across classes, we defined an accuracy-faithfulness consistency score by averaging the Pearson and Spearman correlation coefficients between per-model validation accuracy and the mean faithfulness gap Δ for each class:

$$\mathrm{Consistency} = \frac{1}{2}\left(r_{\mathrm{Pearson}} + \rho_{\mathrm{Spearman}}\right)$$ (12)

Positive scores indicate that models that are more accurate on a given class also tend to show larger faithfulness gaps (i.e., more faithful IG maps in the insertion–deletion sense), whereas negative scores indicate an inverse relationship.
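Equation (12) can be sketched for a single class as follows. The five accuracy and gap values are hypothetical placeholders (one per architecture), and the tie handling in the rank computation (average ranks) is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def consistency(accuracy, gap):
    """Equation (12): mean of the Pearson and Spearman coefficients."""
    return 0.5 * (pearson(accuracy, gap) + spearman(accuracy, gap))

# Hypothetical per-model values for one class (five architectures)
class_accuracy = [0.60, 0.70, 0.80, 0.85, 0.90]
class_gap = [0.010, 0.012, 0.015, 0.019, 0.020]
score_aligned = consistency(class_accuracy, class_gap)
score_inverse = consistency(class_accuracy, class_gap[::-1])
```

With only five points per class, a single architecture can flip the sign of either coefficient, which is why the score is better read as a descriptive ranking than as a test statistic.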

As shown in Supplementary File S7, Table S6, Bird-drop, Snow-covered, and, to a lesser extent, Dusty obtain the highest consistency scores, meaning that for these defect types, improvements in accuracy generally go hand in hand with more faithful attributions. In contrast, Physical damage, Clean and especially Electrical damage show the lowest (negative) consistency scores, indicating that in these categories the most accurate architectures are often those with the weakest or even contradictory insertion–deletion behavior (small or negative Δ). These classes are therefore prime candidates for deeper inspection, e.g., via qualitative IG maps, occlusion sensitivity, or dataset audit, to rule out spurious cues or annotation issues. Because the consistency score is derived from correlations computed over only five architectures, this report should be interpreted as a descriptive ranking rather than a formal statistical test; its main purpose is to highlight which classes exhibit robust alignment between predictive performance and explanation quality and which ones do not.

Overall, the IG and Deletion–Insertion analyses show that faithfulness is a complementary dimension of model quality that is only partially aligned with standard accuracy. EfficientNetB0 and ResNet50 combine strong predictive performance with consistently positive and comparatively large faithfulness gaps, indicating that their IG maps concentrate on pixels that genuinely drive the confidence of the model. By contrast, InceptionV3, VGG16, and, for some classes, the Baseline CNN achieve competitive accuracies while exhibiting small or even negative gaps on specific fault types, revealing explanations that are only weakly coupled to the underlying decision logic. When the per-class heatmap is combined with the accuracy–faithfulness consistency scores, a clearer picture emerges: Bird-drop, Snow-covered, and, to a lesser extent, Dusty are the defect types where improvements in accuracy generally go hand in hand with more faithful attributions, whereas Electrical damage, Physical damage and Clean remain vulnerable to shortcut learning and spurious cues. These results reinforce the conclusion that reliable photovoltaic fault diagnosis requires not only accurate models but also architectures and datasets that promote stable, causally grounded pixel-level attributions.

5. Discussion

5.1. Performance-Interpretability Coupling Across Architectures

Across the five architectures (Tables S1–S5), performance differences are reflected not only in aggregate scores but also in the structure of the errors, which concentrate primarily within the visually overlapping soiling-related classes (Clean, Dusty, Bird-drop). In contrast, classes with more distinctive cues (notably Electrical damage and, depending on acquisition conditions, Snow-covered) tend to exhibit fewer systematic confusions. This error pattern aligns with the explainability results: LIME explanations frequently identify small, localized regions whose location becomes less stable as scene complexity increases, and for the most ambiguous categories, the Top-1 superpixel often shifts toward boundaries, background textures, or other contextual proxies rather than remaining anchored to panel-intrinsic evidence. Consequently, the dominant confusions observed in the confusion matrices occur precisely in those regimes where shortcut learning is most plausible and where local explanations show reduced invariance across exemplars.

Occlusion sensitivity provides a functional complement by probing decision relevance under explicit masking (Figure 9) and by enabling quantitative comparison through localization and concentration metrics (Table 3; Figure 10, Figure 11 and Figure 12). Importantly, the occlusion-derived measures separate different aspects of interpretability: IoU@Top10% captures spatial alignment between the most influential occluded regions and a consistently generated automatic proxy mask, whereas entropy and Hoyer sparsity reflect dispersion and compactness of the relevance distribution. The results show that these properties do not necessarily co-vary—models with more concentrated maps (lower entropy/higher sparsity) are not automatically better aligned under IoU@Top10%, and diffuse relevance can still yield competitive overlap in classes dominated by large-area cues (most notably Snow-covered). This distinction is consistent with the qualitative occlusion maps, where some architectures exhibit diffuse, scene-level sensitivity (particularly in classes with low-contrast or globally distributed cues), while others produce sharper but not always defect-centered activations. Hence, occlusion analysis supports the conclusion that “visually clean” heatmaps are not sufficient evidence of physically meaningful decision-making and that localization and concentration must be interpreted jointly.
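
The distinction between localization and concentration can be made concrete with a short sketch of the three occlusion-derived metrics. The formal definitions used in this study are given in Supplementary File S6; the code below assumes the common textbook forms (top-10% thresholding for IoU, base-2 Shannon entropy of the normalized absolute map, and Hoyer's L1/L2 sparsity), so the logarithm base and normalization are assumptions and absolute values need not match Table 3.

```python
import numpy as np

def iou_top_fraction(relevance, mask, fraction=0.10):
    """IoU between the top-`fraction` most relevant pixels and a binary mask."""
    rel = np.asarray(relevance, dtype=float).ravel()
    m = np.asarray(mask, dtype=bool).ravel()
    k = max(1, int(round(fraction * rel.size)))
    top = np.zeros(rel.size, dtype=bool)
    top[np.argsort(rel)[-k:]] = True          # indices of the k largest values
    union = np.logical_or(top, m).sum()
    return float(np.logical_and(top, m).sum() / union) if union else 0.0

def shannon_entropy(relevance):
    """Entropy (bits) of the map normalized to a probability distribution."""
    r = np.abs(np.asarray(relevance, dtype=float)).ravel()
    total = r.sum()
    if total <= 0:
        return 0.0
    p = r[r > 0] / total
    return float(-(p * np.log2(p)).sum())

def hoyer_sparsity(relevance):
    """Hoyer sparsity in [0, 1]: 0 for a uniform map, 1 for a one-hot map."""
    r = np.abs(np.asarray(relevance, dtype=float)).ravel()
    l2 = np.sqrt((r ** 2).sum())
    if l2 == 0 or r.size < 2:
        return 0.0
    return float((np.sqrt(r.size) - r.sum() / l2) / (np.sqrt(r.size) - 1))

# Toy 10x10 map: a sharp peak that lies OUTSIDE the proxy mask.
relevance = np.full((10, 10), 0.01)
relevance[0, :] = 1.0                         # concentrated on the top row
mask = np.zeros((10, 10), dtype=bool)
mask[5:, :] = True                            # proxy mask covers the lower half
print(iou_top_fraction(relevance, mask), hoyer_sparsity(relevance))
```

In this toy case the map is compact (high Hoyer sparsity) yet has zero overlap with the mask, which mirrors the observation that a visually sharp heatmap is not evidence of correct localization.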

Integrated Gradients adds a further layer by quantifying explanation faithfulness through deletion–insertion behavior (Figure 16 and Figure 17). At an aggregate level, all architectures exhibit positive mean Faithfulness Gaps (Table 4), with EfficientNetB0 showing the largest average separation between insertion and deletion AUCs and ResNet50 also consistently positive, while InceptionV3 and VGG16 yield smaller average gaps. This indicates that EfficientNetB0 and ResNet50 more often highlight pixels whose removal reduces confidence and whose reintroduction restores it, i.e., attributions that are more causally coupled to the output under the adopted perturbation protocol. However, the class-wise analysis (Figure 17) reveals that the strength, and even the sign, of this coupling is class-dependent, with some model-class combinations approaching zero or becoming negative, implying that IG may occasionally prioritize pixels that are weakly causal or counter-indicative for the predicted score. These class-level sign reversals can coexist with positive global means because Table 4 aggregates across all images, allowing strongly positive regimes (e.g., Bird-drop) to outweigh weaker or negative regimes in specific categories. Taken together, the combined evidence suggests a more complex relationship between performance and interpretability: the top-performing architectures (notably ResNet50 and EfficientNetB0) tend to exhibit stronger average faithfulness, yet the most frequent classification confusions persist in the very classes where all three XAI analyses indicate higher vulnerability to contextual shortcuts and reduced explanation stability.
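
The deletion–insertion logic can be sketched as follows. This is an illustrative implementation under stated assumptions, not the exact protocol of this study: pixels are ranked by attribution, the baseline is an all-zero image, the curves are sampled at evenly spaced fractions, and the gap is taken as Δ = AUC_ins − AUC_del (the sign convention implied by Table 4). The toy scoring function stands in for a trained classifier.

```python
import numpy as np

def curve_auc(fractions, scores):
    """Trapezoidal area under a score curve over fractions in [0, 1]."""
    f = np.asarray(fractions, dtype=float)
    s = np.asarray(scores, dtype=float)
    return float(np.sum((f[1:] - f[:-1]) * (s[1:] + s[:-1]) / 2.0))

def faithfulness_gap(score_fn, image, attribution, baseline=None, steps=16):
    """Return (AUC_del, AUC_ins, gap) for one image and one attribution map."""
    x = np.asarray(image, dtype=float)
    base = np.zeros_like(x) if baseline is None else np.asarray(baseline, dtype=float)
    flat_x, flat_b = x.ravel(), base.ravel()
    order = np.argsort(-np.asarray(attribution, dtype=float).ravel())  # most important first
    cuts = np.linspace(0, order.size, steps + 1).astype(int)
    del_scores, ins_scores = [], []
    for c in cuts:
        top = order[:c]
        deleted = flat_x.copy()
        deleted[top] = flat_b[top]            # remove the top-ranked pixels
        inserted = flat_b.copy()
        inserted[top] = flat_x[top]           # reveal only the top-ranked pixels
        del_scores.append(score_fn(deleted.reshape(x.shape)))
        ins_scores.append(score_fn(inserted.reshape(x.shape)))
    fractions = cuts / order.size
    auc_del = curve_auc(fractions, del_scores)
    auc_ins = curve_auc(fractions, ins_scores)
    return auc_del, auc_ins, auc_ins - auc_del

# Toy stand-in for a classifier: logistic score over fixed pixel weights.
rng = np.random.default_rng(0)
weights = rng.random((8, 8))
image = np.ones((8, 8))

def toy_score(img):
    z = (np.sum(weights * img) - 0.5 * weights.sum()) / 4.0
    return 1.0 / (1.0 + np.exp(-z))

faithful = faithfulness_gap(toy_score, image, weights * image)     # correct ranking
misleading = faithfulness_gap(toy_score, image, -weights * image)  # inverted ranking
print(faithful[2] > 0, misleading[2] < 0)
```

An attribution map that ranks pixels in the wrong order produces a negative gap in this toy setting, which is the behavior interpreted above as counter-indicative attribution.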

5.2. Overall XAI Practical Implications

The interpretability analyses conducted with LIME, OS, and IG converge on a consistent conclusion: on heterogeneous PV imagery, correct classification can be supported by evidence that is not physically tied to the fault mechanism. LIME (Top-1 superpixel) highlights limited explanation invariance under scene complexity, with dominant attributions frequently shifting from panel-intrinsic regions to contextual proxies (roof/ground textures, boundaries, edges, and occasional acquisition artifacts) when non-PV content becomes class-correlated. Occlusion sensitivity strengthens this observation by directly testing functional relevance under masking, revealing that map “sharpness” (low entropy/high sparsity) is not equivalent to correct localization (IoU@Top10% against a consistent automatic proxy mask) and that several architectures exhibit broad, scene-driven sensitivity for visually diffuse categories such as Dusty and Snow-covered (Figure 9), consistent with the class-wise patterns in Figure 13, Figure 14 and Figure 15. IG evaluated via deletion–insertion faithfulness further shows that attribution reliability is architecture- and class-dependent: EfficientNetB0 and ResNet50 are the only models with uniformly positive per-class mean gaps across all six classes (Figure 17), whereas Baseline CNN, InceptionV3, and VGG16 exhibit negative gaps in specific categories (e.g., InceptionV3 on Physical damage, VGG16 on Electrical damage), indicating that the highest-attributed pixels can be weakly causal or even counter-indicative under the adopted perturbation protocol. 
Collectively, these results indicate that accuracy alone is insufficient for trustworthy PV fault diagnostics; robust deployment requires (i) dataset auditing to reduce context–label coupling (e.g., PV-centric cropping, removal of overlays/markings, and control of background leakage), and (ii) multi-method XAI evaluation that pairs qualitative maps with perturbation-based faithfulness tests to verify that the model’s decisions are grounded in physically relevant evidence.

5.3. Rationale for a Quantitative, Multi-Method XAI Audit

To contextualize our results, we compare our three-family explainability audit, which combines surrogate-based (LIME), perturbation-based (OS), and gradient-based (IG) methods, with widely used attribution baselines from the CAM/Grad-CAM family, emphasizing assumptions and failure modes. CAM/Grad-CAM methods are popular due to their speed and visually intuitive class-discriminative heatmaps; however, they largely depend on late convolutional feature maps and typically yield coarse localization, which can appear plausible even when predictions are influenced by contextual shortcuts (background, borders, illumination artifacts) rather than panel-relevant evidence. In contrast, our protocol explicitly quantifies explanation reliability and concentration through complementary mechanisms: LIME provides superpixel-based local surrogates and reports kernel-weighted surrogate fidelity, enabling us to detect when the explanation is a weak approximation of the model; occlusion sensitivity probes causal sensitivity by measuring the impact of localized perturbations and is summarized with IoU@Top10% (localization against consistent proxy masks) and concentration statistics (Shannon entropy and Hoyer sparsity), while acknowledging dependence on masking strategy and patch size; and IG offers axiomatically motivated attributions whose practical credibility is assessed via deletion–insertion faithfulness, summarized through a Faithfulness Gap. This triangulation strengthens interpretability claims by requiring consistency across methods with different assumptions, which is particularly important in heterogeneous PV imagery where shortcut learning can inflate performance without reflecting physically meaningful fault cues.
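
To illustrate what kernel-weighted surrogate fidelity measures, the sketch below fits a weighted linear surrogate on binary superpixel indicators and reports a weighted $R_{w}^{2}$. It is a simplified stand-in for the LIME pipeline: the perturbation sampling, the exponential kernel on the fraction of switched-off superpixels, the kernel width of 0.25, and the two synthetic "model" responses are all assumptions made for illustration.

```python
import numpy as np

def weighted_r2(y, y_hat, w):
    """Weighted coefficient of determination of a surrogate's predictions."""
    y, y_hat, w = (np.asarray(a, dtype=float) for a in (y, y_hat, w))
    y_bar = np.average(y, weights=w)
    ss_res = np.sum(w * (y - y_hat) ** 2)
    ss_tot = np.sum(w * (y - y_bar) ** 2)
    return float(1.0 - ss_res / ss_tot)

def fit_weighted_linear(Z, y, w):
    """Weighted least-squares linear surrogate; returns in-sample predictions."""
    A = np.column_stack([np.ones(len(Z)), Z])   # intercept + superpixel indicators
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return A @ coef

rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(300, 8)).astype(float)   # 1 = superpixel kept
off_fraction = 1.0 - Z.mean(axis=1)
w = np.exp(-(off_fraction ** 2) / 0.25 ** 2)           # assumed proximity kernel

# Two synthetic model responses: one additive, one driven by an interaction.
y_linear = 0.4 + Z @ np.linspace(-0.2, 0.3, 8)
y_nonlinear = 0.4 + Z[:, 0] * Z[:, 1] * Z[:, 2]

r2_linear = weighted_r2(y_linear, fit_weighted_linear(Z, y_linear, w), w)
r2_nonlinear = weighted_r2(y_nonlinear, fit_weighted_linear(Z, y_nonlinear, w), w)
print(f"additive: {r2_linear:.3f}, interaction: {r2_nonlinear:.3f}")
```

When the response depends on interactions between superpixels, the linear surrogate explains less of the weighted variance, so the resulting superpixel ranking should be treated with more caution.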

5.4. Scope of Architectural Comparison and Outlook Toward ViT-Based Models

We clarify that the architectural comparison in this study is intentionally scoped to a controlled set of widely used CNN backbones (Baseline CNN, VGG16, ResNet50, InceptionV3, EfficientNetB0) trained under the same data splits, preprocessing, and optimization protocol in order to isolate how different CNN inductive biases affect both predictive performance and the proposed XAI diagnostics. These architectures represent complementary design paradigms—sequential depth, residual learning, multi-branch receptive fields, and compound scaling—while remaining sufficiently established and reproducible for PV monitoring pipelines. Extending the comparison to additional families (e.g., Vision Transformers, ViTs) is a natural next step, but it introduces supplementary factors that are particularly relevant in our setting: PV imagery is heterogeneous (viewpoint, scale, background), and the effective training size per class can be limited by class imbalance, which may amplify sensitivity to spurious correlations. In such regimes, ViTs often benefit from large-scale pretraining and careful regularization (e.g., strong augmentation, stochastic depth, dropout, label smoothing, or data-efficient variants such as DeiT) to avoid overfitting and to stabilize attention patterns. Moreover, attention maps are not inherently faithful explanations of model decisions, so ViT extensions should be accompanied by the same type of quantitative faithfulness checks used here. For these reasons, we focus the present work on a reproducible CNN baseline set while framing ViT-based extensions as future work that should be evaluated under stronger data curation and regularization constraints.

5.5. Mechanistic Interpretation of Shortcut Vulnerability in Clean and Electrical-Damage

The class-wise audit suggests that Clean and Electrical-damage are particularly exposed to shortcut learning, but for different reasons. Clean is, by definition, characterized by the absence of localized fault evidence; consequently, the classifier can achieve separability by exploiting weak but consistent proxies such as framing regularities (panel borders, mounting rails), acquisition context, and background textures, rather than learning “cleanliness” as a panel-intrinsic concept (see Supplementary File S4). This is consistent with cases where LIME’s top-ranked superpixels and occlusion relevance shift toward borders/background and where faithfulness becomes less aligned with accuracy. For Electrical-damage, visible cues in RGB imagery can be subtle or spatially limited, which increases reliance on high-contrast edges, reflections, and contextual co-occurrences (e.g., scene-specific backgrounds or acquisition artifacts) when the dataset contains label–context coupling. Across these classes, the combined evidence (local surrogate explanations, functional relevance under masking, and IG deletion–insertion behavior) supports a practical implication: robust PV fault recognition should incorporate dataset curation that reduces background leakage (PV-centric cropping, removal of overlays/markers, and acquisition-condition normalization) and should validate models using quantitative faithfulness and localization diagnostics, not accuracy alone.

5.6. Limitations and Future Work

Despite providing a cross-method evaluation of explainability, several limitations constrain the generality of the findings.

Architecture complementarity. Future work could test validation-calibrated weighted ensembles or gated model selection to exploit potential complementarity between models that better capture localized defects versus distributed degradations, with the aim of improving robustness and explanation stability across classes.

Dataset heterogeneity and potential label–context coupling. The dataset contains substantial variability in viewpoint, background, and acquisition conditions. Some classes co-occur with characteristic contexts (e.g., rooftop textures for Dusty, snow/roof structures for Snow-covered), and certain images may include markings/overlays. These factors can encourage shortcut learning and can confound attribution analyses by making non-PV regions predictive. Future work should incorporate stricter dataset curation (PV-centric cropping, overlay removal, and controlled background leakage) and/or evaluate robustness under explicit background randomization.

Ground-truth masks and localization assumptions. Quantitative localization metrics (e.g., IoU@Top10%) depend on the quality and scope of defect masks. For diffuse phenomena such as Dusty or Snow-covered, the notion of a compact “fault region” is inherently ambiguous, and mask definitions can bias both IoU and derived conclusions. Follow-up studies should consider uncertainty-aware masks, multi-annotator agreement, and alternative evaluation objectives for diffuse classes (e.g., region-level or global-shift descriptors rather than pixel-tight localization).

Sensitivity to XAI hyperparameters and baselines. LIME explanations depend on segmentation granularity and sampling; occlusion sensitivity depends on patch size/stride and masking value; IG depends on the baseline choice and number of integration steps. Although fixed hyperparameters enabled fair inter-model comparison, different choices may change attribution morphology. Future work should conduct sensitivity analyses across parameter ranges and evaluate baseline choices more systematically (e.g., blurred baselines or dataset-mean baselines for IG).
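
The dependence of IG on the baseline and on the number of integration steps can be seen on a toy model with an analytic gradient. The sketch below approximates the path integral with a midpoint Riemann sum; the logistic model, its weights, and the two baselines are hypothetical and serve only to illustrate the completeness check (attributions should sum to the score difference between input and baseline).

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)

# Hypothetical toy model: f(x) = sigmoid(w . x), with a closed-form gradient.
w = np.array([2.0, -1.0, 0.5])

def f(x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def grad_f(x):
    s = f(x)
    return s * (1.0 - s) * w

x = np.array([1.0, 2.0, 3.0])
zero_baseline = np.zeros(3)
alt_baseline = np.array([1.0, 1.0, 1.0])

# Completeness error: |sum(IG) - (f(x) - f(baseline))| shrinks with more steps.
for steps in (2, 200):
    ig = integrated_gradients(grad_f, x, zero_baseline, steps)
    error = abs(ig.sum() - (f(x) - f(zero_baseline)))
    print(steps, error)

# A different baseline changes the attributions, not just their scale.
ig_zero = integrated_gradients(grad_f, x, zero_baseline, 200)
ig_alt = integrated_gradients(grad_f, x, alt_baseline, 200)
print(ig_zero, ig_alt)
```

With the alternative baseline the first feature receives zero attribution because the input and baseline agree on it, even though the model weights it most heavily. This is the kind of baseline effect a sensitivity analysis would need to quantify.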

Direct tests via context manipulation. While attribution analyses provide convergent evidence about where models focus, shortcut learning hypotheses can be tested more directly through controlled context interventions. Future work should therefore incorporate PV-centric cropping or segmentation (restricting inputs to the panel region) and systematic background ablations such as blurring/replacing pixels outside the panel or background randomization while preserving the panel content. A robust evaluation should track not only changes in accuracy and class-wise confusion patterns but also shifts in localization/concentration and faithfulness measures (IoU@Top10%, entropy/sparsity, and deletion–insertion Faithfulness Gap). A consistent degradation of performance under background removal—together with increased panel-centered attributions—would provide stronger causal evidence that the original model relied on context cues.
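
A background ablation of this kind could be sketched as below, assuming a binary panel mask is available for each image (for instance from a segmentation step, which is not part of the present study). The function name, the ablation modes, and the synthetic image are illustrative assumptions.

```python
import numpy as np

def ablate_background(image, panel_mask, mode="mean", rng=None):
    """Replace pixels outside the panel mask while keeping panel content intact.

    image: H x W x C float array; panel_mask: H x W boolean array (True = panel).
    mode: "mean" (background mean colour), "zero", or "noise" (uniform random).
    """
    img = np.asarray(image, dtype=float)
    mask = np.asarray(panel_mask, dtype=bool)
    out = img.copy()
    background = ~mask
    if not background.any():
        return out                                # nothing to ablate
    if mode == "mean":
        out[background] = img[background].mean(axis=0)
    elif mode == "zero":
        out[background] = 0.0
    elif mode == "noise":
        if rng is None:
            rng = np.random.default_rng(0)
        shape = (int(background.sum()), img.shape[2])
        out[background] = rng.uniform(img.min(), img.max(), size=shape)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

# Synthetic 6x6 RGB image with a centred 2x2 "panel" region.
rng = np.random.default_rng(1)
image = rng.random((6, 6, 3))
panel_mask = np.zeros((6, 6), dtype=bool)
panel_mask[2:4, 2:4] = True

ablated = ablate_background(image, panel_mask, mode="mean")
randomized = ablate_background(image, panel_mask, mode="noise", rng=rng)
```

Re-scoring a trained model on the ablated images, and recomputing the localization and faithfulness metrics on them, would give the before/after comparison described above.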

Faithfulness evaluation scope. The deletion–insertion framework probes causal relevance under a specific perturbation protocol, but it does not fully characterize human interpretability or deployment risk. Complementary tests—such as randomized sanity checks, model parameter randomization, or counterfactual generation—could better assess whether explanations remain stable under distribution shifts and whether highlighted features genuinely encode fault mechanisms.

Data decentralization and federated perspectives. In practical PV monitoring, image data are often decentralized across plants, operators, and maintenance providers, and data sharing may be constrained by privacy, commercial sensitivity, and heterogeneous acquisition conditions. This motivates federated learning and collaborative adaptation strategies that can train global models without centralizing raw data. Recent work (Yang et al. [20]) has proposed balance recovery and collaborative adaptation mechanisms for federated fault diagnosis under inconsistent client distributions and machine groups, highlighting the importance of mitigating cross-site heterogeneity. An important implication for explainability is that attention/attribution patterns should be audited not only globally but also across participating sites, since shortcuts can become client-specific. Future work should therefore combine federated training with the present quantitative XAI audit, e.g., by tracking faithfulness/localization metrics per site and enforcing explanation stability under domain shift.

Dataset size and external validity. Although the dataset is heterogeneous, the total sample size (875 images) limits the statistical generalizability of performance and interpretability trends. Our conclusions should therefore be read as diagnostic evidence on shortcut learning risk under a low-data, real-world heterogeneous regime. Future work will validate the proposed explainability audit on larger, curated, and multi-site PV datasets and will include robustness checks such as repeated splits/cross-validation and explicit context/background manipulation experiments.

External validity. Results were obtained on a public dataset and a limited set of architectures. Deployment-grade PV monitoring involves additional sensing variability (camera type, compression, weather), class imbalance, and site-specific conditions. Future work should validate the conclusions on field data, incorporate domain adaptation, and report calibration/uncertainty measures alongside interpretability.

6. Conclusions

This study examined the interpretability and functional faithfulness of convolutional classifiers trained for photovoltaic (PV) fault recognition on a heterogeneous image dataset comprising six operational conditions (Clean, Dusty, Bird-drop, Electrical damage, Physical damage, Snow-covered). Beyond reporting predictive performance, we systematically compared five CNN architectures using three complementary explainability families: surrogate-based local explanations (LIME, Top-1 superpixel), perturbation-based functional relevance (occlusion sensitivity), and gradient-based attributions validated through deletion–insertion faithfulness (Integrated Gradients).

Across methods, the results show that correct classification does not guarantee physically grounded evidence. LIME revealed limited explanation invariance under scene complexity, with dominant attributions frequently shifting from PV surface cues to contextual proxies (roof/ground textures, borders, and acquisition artifacts). Occlusion sensitivity confirmed that map concentration is not equivalent to localization: architectures producing compact relevance patterns were not necessarily better aligned with defect masks, while diffuse responses could still yield comparable overlap metrics. Finally, IG faithfulness analysis demonstrated marked architecture- and class-dependence: EfficientNetB0 and ResNet50 achieved the most consistent deletion–insertion behavior, whereas other models exhibited class-specific near-zero or negative faithfulness gaps, indicating partial reliance on non-causal or counter-indicative pixels.

We deliberately structured the analysis around four complementary layers, namely (1) classification performance, (2) LIME, (3) occlusion sensitivity, and (4) Integrated Gradients, because no single indicator can establish trustworthy evidence attribution on heterogeneous PV imagery. Performance metrics quantify what the model gets right or wrong, but they do not reveal why, nor whether correct predictions rely on panel-intrinsic cues or on spurious contextual correlations. LIME was included as a model-agnostic, instance-level explanation that translates high-dimensional decisions into a small set of interpretable superpixel regions, making it well-suited for qualitative inspection and for diagnosing instability under scene complexity; however, its reliance on local linear surrogates and segmentation motivates the additional fidelity checks reported in the Supplementary Materials. OS complements LIME by providing a direct perturbation-based probe of functional relevance under explicit masking, enabling quantitative separation between localization (IoU@Top10% against a consistently generated proxy mask) and concentration (entropy and Hoyer sparsity), thereby preventing “visually sharp” heatmaps from being over-interpreted as physically meaningful. Finally, IG with deletion–insertion faithfulness was added to test whether attribution rankings are causally coupled to the predicted score under a standardized perturbation protocol, a property that can vary by class and can exhibit sign reversals even when global averages remain positive. Taken together, this layered design ensures that interpretability claims are supported simultaneously by (i) outcome-level evidence (performance), (ii) human-interpretable local explanations (LIME), and (iii) two perturbation-based checks that quantify functional relevance and faithfulness (OS and IG), reducing the risk of drawing conclusions from any single, method-specific artifact.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7030094/s1. Supplementary File S1. Performance metrics for all architectures. Table S1. Baseline architecture performance metrics. Table S2. EfficientNet architecture performance metrics. Table S3. InceptionV3 architecture performance metrics. Table S4. ResNet50 architecture performance metrics. Table S5. VGG16 architecture performance metrics. Supplementary File S2. Worst-case by-class model. Supplementary File S3. Surrogate fidelity for all classes and architectures. Figure S1. Global LIME surrogate fidelity (mean $R_{w}^{2}$) across classes and architectures. Figure S2. Global LIME surrogate fidelity (minimum $R_{w}^{2}$) across classes and architectures. Supplementary File S4. (Zenodo, doi:10.5281/zenodo.18233689) Full set of LIME visualization outputs for the test set. Supplementary File S5. Visual preview of the LIME BEST/WORST exemplars and their Top-3 superpixels. Supplementary File S6. Formal definition of occlusion-based metrics. Supplementary File S7. Relationship between classification accuracy and IG faithfulness. Table S6. Correlation between per-model test accuracy and IG faithfulness gap for each class. Figure S3. Per-class relationship between test accuracy and IG faithfulness gap.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This work used the public PV Panel Defect Dataset (Kaggle, 2025), available online at https://www.kaggle.com/datasets/alicjalena/pv-panel-defect-dataset (accessed on 10 October 2025). The full set of LIME visualization outputs is available at Zenodo, doi:10.5281/zenodo.18233689.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Cação, J.; Santos, J.; Antunes, M. Explainable AI for industrial fault diagnosis: A systematic review. J. Ind. Inf. Integr. 2025, 47, 100905.
2. Hosain, M.T.; Jim, J.R.; Mridha, M.F.; Kabir, M.M. Explainable AI approaches in deep learning: Advancements, applications and challenges. Comput. Electr. Eng. 2024, 117, 109246.
3. Awedat, K.; Comert, G.; Ayad, M.; Mrebit, A. Advanced fault detection in photovoltaic panels using enhanced U-Net architectures. Mach. Learn. Appl. 2025, 20, 100636.
4. Sairam, S.; Seshadri, S.; Marafioti, G.; Srinivasan, S.; Mathisen, G.; Bekiroglu, K. Edge-based explainable fault detection systems for photovoltaic panels on edge nodes. Renew. Energy 2022, 185, 1425–1440.
5. Rico Espinosa, A.; Bressan, M.; Giraldo, L.F. Failure signature classification in solar photovoltaic plants using RGB images and convolutional neural networks. Renew. Energy 2020, 162, 249–256.
6. Wan, L.; Zhao, L.; Xu, W.; Guo, F.; Jiang, X. Dust deposition on the photovoltaic panel: A comprehensive survey on mechanisms, effects, mathematical modeling, cleaning methods, and monitoring systems. Sol. Energy 2024, 268, 112300.
7. Restrepo-Cuestas, B.J.; Guarnizo-Lemus, C.; Montoya-Marín, J.A.; Montano, J. Dataset of photovoltaic panel performance under different fault conditions cracks, discoloration, and shading effects. Data Brief 2025, 59, 111392.
8. Ling, M.; Zhu, J.; Yang, Y.; Li, H.; Yi, J.; Gao, J.; Wang, L. Study on an enhanced YOLOv9 algorithm for detecting stains and damage in photovoltaic panels. Renew. Energy 2026, 256, 124540.
9. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; Seifert, C. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ACM Comput. Surv. 2023, 55, 1–42.
10. Gomez, T.; Fréour, T.; Mouchère, H. Metrics for Saliency Map Evaluation of Deep Learning Explanation Methods. arXiv 2022, arXiv:2201.13291.
11. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity Checks for Saliency Maps. arXiv 2018, arXiv:1810.03292.
12. Li, X.; Du, M.; Chen, J.; Chai, Y.; Lakkaraju, H.; Xiong, H. M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023.
13. Lenarczyk, A. PV Panel Defect Dataset. Kaggle. 2025. Available online: https://www.kaggle.com/datasets/alicjalena/pv-panel-defect-dataset (accessed on 8 February 2026).
14. Deb, N.; Rahman, T. An efficient VGG16-based deep learning model for automated potato pest detection. Smart Agric. Technol. 2025, 12, 101409.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778.
16. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 2818–2826.
17. Khan, M.N.; Das, S.; Liu, J. Predicting pedestrian-involved crash severity using inception-v3 deep learning model. Accid. Anal. Prev. 2024, 197, 107457.
18. VanBerlo, B.; Wu, D.; Li, B.; Rahman, M.A.; Hogg, G.; VanBerlo, B.; Tschirhart, J.; Ford, A.; Ho, J.; McCauley, J.; et al. Accurate assessment of the lung sliding artefact on lung ultrasonography using a deep learning approach. Comput. Biol. Med. 2022, 148, 105953.
19. Vedaldi, A.; Soatto, S. Quick shift and kernel methods for mode seeking. In Proceedings of the 10th European Conference on Computer Vision—ECCV 2008, Marseille, France, 12–18 October 2008; Forsyth, D., Torr, P., Zisserman, A., Eds.; Lecture Notes in Computer Science, 5305; Springer: Berlin/Heidelberg, Germany, 2008; pp. 705–718.
20. Yang, B.; Lei, Y.; Li, N.; Li, X.; Si, X.; Chen, C. Balance recovery and collaborative adaptation approach for federated fault diagnosis of inconsistent machine groups. Knowl.-Based Syst. 2025, 317, 113480.

Figure 1. The structure of the dataset (left) and mean resolution of images (right).

Figure 2. Representative RGB samples from the six classes (Clean, Dusty, Bird-drop, Electrical-damage, Physical-damage, Snow-covered).

Figure 3. Block-level representation of the convolutional neural network (CNN) architectures used in this study: Baseline CNN, EfficientNetB0, ResNet50, InceptionV3, and VGG16.

Figure 4. The overall experimental workflow.

Figure 5. Distribution of macro-F1 scores across 30 repeated stratified train/validation splits.

Figure 6. Distribution of F1 score across 30 repeated stratified train/validation splits for the Electrical-damage class.

Figure 7. Worst-tail LIME surrogate fidelity (mean $R_{w}^{2}$) across classes and architectures.

Figure 8. Worst-tail LIME surrogate fidelity (minimum $R_{w}^{2}$) across classes and architectures.

Figure 9. Occlusion sensitivity for the Dusty, Bird-drop, Electrical-damage, Physical-damage, and Snow-covered classes. Warmer colors indicate regions whose occlusion produces a larger decrease in the predicted-class score (higher functional relevance under masking).

Figure 10. Mean occlusion-map entropy for each architecture.

Figure 11. Mean Hoyer sparsity across architectures.

Figure 12. Mean IoU@Top10% between model saliency and defect masks.

Figure 13. Per-class IoU@Top10% for all architectures.

Figure 14. Per-class Hoyer sparsity for all architectures.

Figure 15. Per-class entropy of occlusion sensitivity maps.

Figure 16. Faithfulness gap averaged over all six classes for each architecture.

Figure 17. Per-class Faithfulness Gap heatmap across all architectures.

Table 1. LIME surrogate-fidelity summary.

Architecture	${\hat{R}}^{2}$	$R_{p 10}^{2}$	$\hat{p}$	$f_{R^{2} <}$	$\hat{K}$	$f_{K >}$
VGG16	0.289	0.171	0.88	0.425	159	0
ResNet50	0.4159	0.293	0.81	0.078	159	0
InceptionV3	0.476	0.386	0.69	0	279	0.51
EfficientNetB0	0.558	0.457	0.67	0	159	0
Baseline_CNN	0.915	0.853	0.60	0	159	0

Table 2. Best–worst LIME exemplar set per architecture (surrogate-fidelity curation).

Model	Pick	Ground Truth	$R_{w}^{2}$	$\mathrm{MSE}_{w}$	Prob	Predicted	r2_Rank	r2_p
1	B	Clean	0.974	1.01 × 10⁻⁴	0.812	Clean	2	0.993
1	W	Snow-covered	0.772	8.06 × 10⁻⁴	0.811	Snow-covered	139	0.014
2	B	Clean	0.727	1.28 × 10⁻³	0.909	Clean	1	1.000
2	W	Snow-covered	0.427	6.40 × 10⁻⁴	0.944	Snow-covered	134	0.050
3	B	Snow-covered	0.623	7.10 × 10⁻³	0.862	Snow-covered	3	0.986
3	W	Bird-drop	0.366	9.33 × 10⁻³	0.935	Bird-drop	135	0.043
4	B	Clean	0.620	1.04 × 10⁻²	0.858	Clean	3	0.986
4	W	Snow-covered	0.185	4.75 × 10⁻⁴	1.000	Snow-covered	140	0.007
5	B	Bird-drop	0.492	6.27 × 10⁻²	0.956	Bird-drop	1	1.000
5	W	Snow-covered	0.101	1.14 × 10⁻⁶	1.000	Snow-covered	141	0.000

1—Baseline CNN; 2—EfficientNetB0; 3—ResNet50; 4—InceptionV3; 5—VGG16; B—Best case; W—Worst case.

Table 3. Quantitative occlusion-sensitivity interpretability metrics aggregated over the validation set ($\mu \pm \sigma$).

Model	VGG16	ResNet50	InceptionV3	EfficientNetB0	Baseline_CNN
No. of images	141	141	141	141	141
IoU@Top10%	0.172 ± 0.145	0.130 ± 0.114	0.111 ± 0.064	0.096 ± 0.051	0.083 ± 0.030
Entropy	8.391 ± 2.700	9.321 ± 3.157	10.321 ± 2.209	9.760 ± 0.555	9.994 ± 0.368
HoyerSparsity	0.520 ± 0.277	0.183 ± 0.252	0.013 ± 0.094	0.449 ± 0.146	0.385 ± 0.115
PredProb	0.887 ± 0.165	0.804 ± 0.186	0.674 ± 0.195	0.658 ± 0.197	0.550 ± 0.202
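
The three map-level metrics in Table 3 can be illustrated with a short sketch. It assumes the textbook definitions: Hoyer sparsity from the L1/L2 norm ratio, Shannon entropy of the map normalized to sum to one (natural logarithm assumed), and IoU between the top 10% most salient pixels and a binary mask. The function names and the choice of logarithm base are assumptions for illustration and may differ from the article's implementation.

```python
import numpy as np

def hoyer_sparsity(saliency):
    """Hoyer sparsity: 0 for a uniform map, 1 for a single non-zero pixel."""
    x = np.abs(np.asarray(saliency, dtype=float)).ravel()
    n = x.size
    l1 = x.sum()
    l2 = np.sqrt((x ** 2).sum())
    if n < 2 or l2 == 0.0:
        return 0.0
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))

def shannon_entropy(saliency):
    """Entropy of the map treated as a probability distribution (nats)."""
    x = np.abs(np.asarray(saliency, dtype=float)).ravel()
    total = x.sum()
    if total == 0.0:
        return 0.0
    p = x / total
    p = p[p > 0]                      # skip zero entries: 0 * log(0) := 0
    return float(-(p * np.log(p)).sum())

def iou_top_fraction(saliency, mask, fraction=0.10):
    """IoU between the top-`fraction` salient pixels and a binary mask."""
    s = np.asarray(saliency, dtype=float)
    m = np.asarray(mask, dtype=bool)
    k = max(1, int(round(fraction * s.size)))
    top_idx = np.argpartition(s.ravel(), -k)[-k:]   # indices of the k largest values
    top = np.zeros(s.size, dtype=bool)
    top[top_idx] = True
    top = top.reshape(s.shape)
    union = np.logical_or(top, m).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(top, m).sum() / union)
```

Read this way, a diffuse map gives high entropy and low Hoyer sparsity together, which is the pattern the InceptionV3 column shows.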

Table 4. $AUC_{del}$, $AUC_{ins}$, and faithfulness gap.

Model	$\widehat{AUC}_{del}$	$\widehat{AUC}_{ins}$	$\Delta$
Baseline_CNN	0.22594	0.23654	0.0106
EfficientNetB0	0.22772	0.24688	0.0192
ResNet50	0.25782	0.27314	0.0153
InceptionV3	0.24359	0.24853	0.0049
VGG16	0.26182	0.26949	0.0077
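
The $\Delta$ column in Table 4 matches the difference between the insertion and deletion AUC values to the reported precision. The sketch below assumes that this difference is the definition of the faithfulness gap; it recomputes the gap from the tabulated means and ranks the architectures. The dictionary `table4` and the helper `faithfulness_gap` are illustrative names only.

```python
# Mean deletion / insertion AUC values transcribed from Table 4.
table4 = {
    "Baseline_CNN":   (0.22594, 0.23654),
    "EfficientNetB0": (0.22772, 0.24688),
    "ResNet50":       (0.25782, 0.27314),
    "InceptionV3":    (0.24359, 0.24853),
    "VGG16":          (0.26182, 0.26949),
}

def faithfulness_gap(auc_del, auc_ins):
    """Assumed definition: insertion AUC minus deletion AUC."""
    return auc_ins - auc_del

# Gap per architecture, rounded to the four decimals used in the table.
gaps = {name: round(faithfulness_gap(d, i), 4) for name, (d, i) in table4.items()}

# Architectures ordered from largest to smallest gap.
ranking = sorted(gaps, key=gaps.get, reverse=True)

for name in ranking:
    print(f"{name:15s} {gaps[name]:.4f}")
```

With this reading, a larger gap means the attributed pixels restore the prediction faster on insertion than they destroy it on deletion, and EfficientNetB0 has the largest gap of the five models.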

