1. Introduction
Skin cancer ranks among the most prevalent malignancies, and its incidence continues to rise. Melanoma in particular is responsible for the majority of deaths attributed to skin cancer. For this reason, the development of automated methods that improve diagnostic accuracy and reduce subjective interpretation remains a priority [1]. Skin lesion symmetry is a pivotal feature in dermatological assessment. Clinicians often employ the ABCDE rule (Asymmetry, Border irregularity, Color variegation, Diameter, Evolution) as a heuristic for early melanoma detection, where A stands for the asymmetry of the lesion [2,3]. In practice, benign nevi tend to exhibit symmetric shape and color, whereas malignant lesions such as melanoma are frequently asymmetric between halves. This emphasis on symmetry in clinical diagnostics underlines the importance of developing computational methods that can recognize and quantify lesion symmetry from images.
Although deep learning models, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved dermatologist-level accuracy in skin lesion classification [4], their “black box” nature hinders clinical trust. To address this, eXplainable Artificial Intelligence (XAI) techniques such as saliency maps are employed to visualize the image regions influencing the decision of the model. However, the analysis of these maps has been predominantly qualitative [5]. This work hypothesizes that the morphological characteristics of saliency maps, such as their symmetry and spatial distribution, contain latent signals about model behavior. We propose that these features differ systematically between correct and incorrect predictions, thereby offering a novel quantitative approach to evaluating the reliability of assisted diagnoses.
To investigate this, we present a systematic analysis of saliency map morphometrics in the context of skin lesion classification. Our study first benchmarks five state-of-the-art deep learning architectures, encompassing both convolutional and transformer-based models, on a challenging nine-class skin lesion dataset. For each prediction, we generate visual explanations using two advanced saliency techniques, Grad-CAM++ and LayerCAM, to identify the regions deemed important by the model. Subsequently, we compute a suite of five quantitative morphometric metrics, such as entropy, radial symmetry, and compactness, to characterize the spatial structure of these explanations. This methodology concludes with a statistical comparison of these metrics between correctly classified and misclassified cases to test our hypothesis that the morphology of a saliency map is associated with prediction correctness. Our findings confirm this relationship, revealing statistically significant differences which suggest that the morphometric properties of explanation maps could serve as valuable indicators of model reliability in skin lesion classification.
The remainder of this paper is organized as follows. Section 3 describes the dataset, preprocessing pipeline, model architectures, and saliency techniques employed in this study. Section 4 presents the classification performance of all models and analyzes the statistical behavior of the proposed symmetry-aware morphometric metrics across correct and incorrect predictions. Section 5 discusses the implications of the observed patterns for model trustworthiness and future explainability strategies. Finally, Section 6 concludes the paper by summarizing the key findings and outlining directions for future work.
2. Background
In recent years, deep learning (DL) has revolutionized automated skin lesion classification. CNNs in particular have achieved dermatologist-level accuracy in identifying malignant lesions from dermoscopic images [4]. The versatility of these DL paradigms has been proven in other complex medical domains, such as multi-label classification systems for radiological images, highlighting their broad applicability [6]. Subsequent advances leveraging larger datasets and model ensembles have further improved performance on challenging multiclass skin lesion tasks [7]. Modern scaling strategies, exemplified by EfficientNet, enhance feature-extraction efficiency while preserving robust generalization across diverse lesion types [8]. Recent CNN variants such as ConvNeXt, particularly when augmented with attention mechanisms, further boost accuracy by jointly capturing local and global context and selectively emphasizing diagnostically salient regions [9,10]. This approach is not unique to medicine: sophisticated segmentation architectures that leverage complex contextual information and innovative network topologies are continuously proposed in other demanding fields, such as infrared target detection and shadow detection, offering paradigms with clear potential for transfer to medical diagnostics [11,12]. More recently, ViT architectures [13] have been introduced to this domain, offering improved modeling of long-range dependencies and global context. Early studies indicate that transformer-based models can match or exceed CNN performance in skin cancer classification [14]. These developments demonstrate the state of the art in lesion classification, but they also bring into focus a critical issue: the interpretability of such complex models.
Although DL models achieve high accuracy, they are frequently described as black boxes, particularly in applications where decisions have significant consequences, such as medicine [15]. Techniques from XAI aim to mitigate this opacity by rendering the reasoning of the models more transparent. In medical image analysis, a prevalent XAI strategy involves the generation of saliency maps or class activation maps that highlight the regions of an image that most strongly influence a model prediction [16]. When studying skin lesions, these maps can reveal whether a CNN or ViT directs its attention toward the suspicious area of a mole or instead toward irrelevant artifacts. Recent methods, including Grad-CAM++ and LayerCAM, have proved effective at producing such visual explanations. Grad-CAM++ refines the localization of important features by incorporating higher-order gradients [17]. LayerCAM propagates class activations across several convolutional layers to create a more comprehensive highlight of relevant regions [18]. By applying these techniques, clinicians and researchers can determine whether the focus of a model corresponds to patterns that are meaningful in clinical practice, for example, concentration on the irregular pigmented portion of a lesion. This level of interpretability is essential for fostering trust in AI-assisted diagnosis [19].
To address this black box problem, significant research has focused on integrating XAI techniques into the diagnostic pipeline, often leveraging public benchmark datasets. For instance, [20] developed an optimized CNN for seven-class skin cancer classification; their work not only optimized the model by testing various activation and optimization functions but also incorporated Grad-CAM and Grad-CAM++ to interpret the model’s decisions, finding that Grad-CAM++ provided more detailed and accurate visual explanations, thereby enhancing model transparency. A common strategy involves applying transfer learning with pretrained architectures. Ref. [21] successfully employed a ResNet-18 model on the extensive ISIC 2019 dataset to classify eight different lesion types. To move beyond simple prediction, they utilized the Local Interpretable Model-agnostic Explanations (LIME) framework to generate visual explanations, demonstrating how such interpretations can increase trust and safety in a clinical setting by rationalizing the model’s outputs.
Building on these foundations, more sophisticated systems have been proposed. The Skin-CAD system [22] represents an advanced approach that fuses features from multiple deep layers—specifically the pooling and fully connected layers—of an ensemble of four distinct CNNs. By applying Principal Component Analysis for dimensionality reduction along with feature selection techniques, the system achieved very high classification accuracy on both binary and multiclass tasks. The integration of LIME provided explainability for this complex ensemble, validating the decisions of the multi-faceted model. In contrast with end-to-end image-based classification, some studies explore the utility of preextracted clinical features. Ref. [23] utilized features such as asymmetry and pigment networks to train an XGBoost model. They applied model-agnostic methods, such as SHapley Additive exPlanations (SHAP), to explain the predictions, confirming that the model’s most influential features align with established dermatological criteria, thereby bridging the gap between machine-learned patterns and clinical expertise. A systematic review by [24] observed that XAI is often applied superficially as a sanity check rather than being rigorously evaluated. The review highlighted a critical gap: the lack of studies systematically assessing the impact of these explanations on the diagnostic performance and confidence of dermatologists, which remains a crucial area for future research.
However, most existing studies rely on saliency maps solely for qualitative assessment, leaving a clear gap in the quantitative analysis of their properties [5,15,24]. Earlier research seldom evaluated measurable aspects of explanation maps, such as symmetry, shape, or the spatial distribution of salient regions, or investigated how these aspects might relate to the accuracy of model predictions. Recent surveys focusing on transformer-based dermatology AI have likewise observed that explainability is often underreported or neglected [14]. We argue that these maps go beyond simple visual aids and embed latent signals of model behavior. Our central hypothesis is that the morphology of a saliency map, particularly characteristics associated with symmetry, varies noticeably between correct and incorrect predictions. For instance, a model that accurately classifies a lesion may produce a saliency map with a concentrated and relatively symmetric focus, suggesting attention that is more clinically meaningful, whereas an incorrect prediction may be accompanied by a map that is more chaotic or irregularly dispersed. Confirming this hypothesis could deliver valuable insight into the reliability of a model decision and its susceptibility to error.
3. Materials and Methods
This section outlines the experimental setup and methodological framework adopted in this study. We begin in Section 3.1 by describing the dermoscopic image dataset and the preprocessing pipeline used to ensure a clinically realistic and reproducible evaluation. We then detail, in Section 3.2, the deep learning models selected to cover a diverse range of architectural paradigms and present the metrics used for their performance assessment in Section 3.3. Subsequently, we introduce the post hoc explainability techniques in Section 3.4 and the suite of morphological metrics designed to quantify their spatial characteristics in Section 3.5. Finally, we describe the complete experimental protocol, including the training configuration, in Section 3.6.
Using these methodological components, the overall experimental pipeline is depicted in Figure 1. The process begins with the ISIC dataset, which is preprocessed and subsequently fed into a set of deep learning models representing diverse architectural paradigms: ResNet-50 and EfficientNetV2-S as established CNN baselines, ConvNeXt-Tiny as a modern CNN inspired by transformer design principles, and Swin-Tiny plus MaxViT-Tiny as hierarchical vision transformers. This set was deliberately chosen to span complementary inductive biases (local convolutional priors vs. long-range attention) while keeping the computational scope tractable. After training, the pipeline branches into two complementary evaluation paths: one focused on predictive performance and the other on post hoc explainability. The performance branch evaluates the models using standard classification metrics and conducts a quantitative analysis of the results. The explainability branch generates saliency maps via gradient-based attribution methods (Grad-CAM++ and LayerCAM), which are then subjected to morphological analysis using symmetry-aware descriptors. In the last stage, for each model, we statistically compare the morphometric descriptors between correctly and incorrectly classified samples to assess whether explanation structure is systematically associated with prediction correctness, thus providing insight into model reliability and diagnostic behavior.
3.1. Dataset Description
In this study, we employ a publicly available dataset curated by Ahuja [25], derived from the International Skin Imaging Collaboration archive. The dataset comprises 2357 high-resolution dermoscopic RGB images, each annotated by board-certified dermatologists with one of nine diagnostic labels. As illustrated in Figure 2, these categories encompass a broad spectrum of skin lesion types, ranging from benign to malignant. The benign categories are Pigmented benign keratosis (478 images), Nevus (373), Seborrheic keratosis (80), Dermatofibroma (111), and Vascular lesion (142). These lesions are non-cancerous and usually do not call for aggressive intervention. Malignant lesions include Melanoma (454), Basal cell carcinoma (392), and Squamous cell carcinoma (197). Their invasiveness and potential for metastasis differ, with melanoma being the most aggressive and lethal. Additionally, Actinic keratosis (130), a lesion induced by chronic ultraviolet exposure, is included as a premalignant condition due to its potential progression to squamous cell carcinoma [26]. This clinically diverse and expertly annotated dataset is well suited for benchmarking multi-class classification models aimed at automated skin lesion diagnosis.
To establish a rigorous and clinically relevant learning setup, we implemented a preprocessing pipeline tailored for dermoscopic image analysis. The images were first combined into a single corpus and then divided into three mutually exclusive subsets through stratified sampling: 60% for training, 20% for validation, and 20% for testing. During training, we applied a set of conservative data augmentation techniques that aim to improve generalization while retaining diagnostically relevant detail. These augmentations include random rotations (applied with a probability of 50% and implemented with border padding), horizontal and vertical flips (each with a probability of 50%), adjustments of brightness and contrast (50% probability), subtle shifts in hue, saturation, and value (each with a probability of 30%), and Gaussian noise injection (20% probability). Each operation was selected to reproduce the clinical variability observed in practice without altering lesion morphology. For the validation and test sets, no augmentations are applied, ensuring consistent performance evaluation. All images were resized to 224 × 224 pixels to match the input resolution expected by the chosen model architectures and were normalized with ImageNet statistics. This preprocessing scheme balances the need for broader data-driven generalization with the preservation of clinically meaningful visual characteristics.
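To make this pipeline concrete, the sketch below shows one plausible realization using the albumentations library; `paths` and `labels` are assumed variables, and the magnitude limits (e.g., `limit=30`, `hue_shift_limit=10`) are illustrative placeholders rather than the exact values used in our experiments.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.model_selection import train_test_split
import cv2

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

# Stratified 60/20/20 split; `paths` and `labels` hold image paths and class ids.
train_x, rest_x, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.4, stratify=labels, random_state=0)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=0)

# Training-time augmentations; magnitude limits below are illustrative placeholders.
train_transform = A.Compose([
    A.Rotate(limit=30, border_mode=cv2.BORDER_REFLECT, p=0.5),  # rotation with border padding
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=10, p=0.3),
    A.GaussNoise(p=0.2),
    A.Resize(224, 224),
    A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ToTensorV2(),
])

# Validation/test images: resizing and normalization only.
eval_transform = A.Compose([
    A.Resize(224, 224),
    A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ToTensorV2(),
])
```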
3.2. Deep Learning Models
This section describes the set of DL architectures evaluated in our study. We selected five representative backbones that span the spectrum from purely convolutional to fully attention-based processing—namely, ResNet-50 and EfficientNetV2-S (classical/modern CNNs), ConvNeXt-Tiny (a CNN informed by transformer design principles), and Swin-Tiny plus MaxViT-Tiny (hierarchical vision transformers). Each architecture embodies a distinct inductive bias, thus offering complementary strengths for skin lesion classification, such as capturing fine-grained texture, modeling spatial relationships, or incorporating long-range contextual information. All five backbones have a broadly comparable parameter count (in the order of 20–30 million) and memory footprint, which helps us disentangle architectural inductive biases from sheer model capacity differences.
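For reference, the snippet below shows one way to instantiate these five backbones with ImageNet-pretrained weights and a nine-class head using the timm library; the model identifiers are assumptions and may differ across timm versions.

```python
import timm

NUM_CLASSES = 9

# Assumed timm identifiers for the five backbones (names may vary by timm version).
BACKBONES = {
    "resnet50": "resnet50",
    "efficientnetv2_s": "tf_efficientnetv2_s",
    "convnext_tiny": "convnext_tiny",
    "swin_tiny": "swin_tiny_patch4_window7_224",
    "maxvit_tiny": "maxvit_tiny_tf_224",
}

# Load each backbone with ImageNet weights and a freshly initialized 9-class head.
models = {
    name: timm.create_model(ident, pretrained=True, num_classes=NUM_CLASSES)
    for name, ident in BACKBONES.items()
}
```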
3.2.1. ResNet
ResNet [27] is a classical convolutional architecture that introduces identity skip connections to improve gradient propagation within deep networks. These residual links allow the model to learn adjustments to the identity mapping:

$$y = x + f(x; \theta),$$

where $f$ denotes a residual transformation composed of a three-layer bottleneck ($1{\times}1 \rightarrow 3{\times}3 \rightarrow 1{\times}1$) with BatchNorm and ReLU activations, and $\theta$ represents its trainable parameters. Here, $x$ and $y$ denote the input and output feature maps of the residual block.
In our experiments, we adopt the ResNet-50 variant, which offers a balance between depth and computational cost. Its hierarchical structure supports the extraction of both fine-grained textures and high-level morphological patterns, features that are essential for distinguishing benign from malignant skin lesions in dermoscopic images.
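For illustration, a minimal PyTorch sketch of such a bottleneck residual block (assuming equal input and output channel counts, so the identity shortcut needs no projection; ResNet applies a final ReLU after the addition) is:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: y = x + f(x), with f = 1x1 -> 3x3 -> 1x1 convolutions."""
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.f(x))  # identity skip connection
```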
3.2.2. EfficientNet
EfficientNetV2 [28] is a convolutional architecture optimized through compound scaling, a principled method for jointly tuning network depth, width, and resolution. Instead of using fixed scaling rules, it leverages neural architecture search to balance accuracy and computational efficiency across multiple deployment scenarios. The core building block is the fused-MBConv, which combines depthwise convolution with pointwise projection and incorporates squeeze-and-excitation (SE) for channel attention.
The functional form of a fused-MBConv block with SE can be expressed as:

$$y = x + \mathrm{SE}\big(\sigma(\mathrm{Conv}(x; W))\big),$$

where $\mathrm{Conv}$ denotes the depthwise and pointwise convolutions with learnable weights $W$, $\mathrm{SE}(\cdot)$ applies squeeze-and-excitation channel attention, and $\sigma$ is a non-linear activation function such as Swish or ReLU.
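To make the SE operator concrete, the following is a minimal PyTorch sketch; the channel-reduction ratio of 4 is an assumed, typical value:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE channel attention: reweight channels by globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 4):  # reduction is an assumed value
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3), keepdim=True)                  # squeeze: global average pool
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))  # excite: per-channel gates
        return x * w                                          # rescale input channels
```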
In our experiments, we adopt the EfficientNetV2-S variant, which offers a strong trade-off between model compactness and representational power. Its design enables precise modeling of both fine structures—such as pigment dots or globules—and broader spatial arrangements, which are critical for differentiating benign nevi from malignant lesions under dermoscopic imaging.
3.2.3. ConvNeXt
ConvNeXt [29] is a convolutional architecture that integrates key design principles from ViTs while preserving the inductive biases of traditional CNNs. Its core building block adopts an inverted bottleneck design and incorporates modern components such as Layer Normalization, GELU activations, and large convolutional kernels. Specifically, each block applies a depthwise convolution followed by a pointwise projection and non-linear transformation:

$$y = x + \sigma\big(\mathrm{Conv}_{pw}(\mathrm{Conv}_{dw}(x; W_d); W_p)\big),$$

where $\mathrm{Conv}_{dw}$ and $\mathrm{Conv}_{pw}$ denote depthwise and pointwise convolutions with weights $W_d$ and $W_p$, respectively, and $\sigma$ is a GELU activation function.
We adopt the ConvNeXt-Tiny variant, which offers a compact architecture with enhanced representational capacity. The use of large depthwise convolutional kernels (7 × 7) increases the effective receptive field, allowing the model to capture long-range spatial patterns without sacrificing local detail. This is particularly beneficial for dermoscopic analysis, where lesion characteristics such as irregular borders, pigment asymmetry, and large-scale structure must be jointly modeled for reliable classification.
3.2.4. Swin Transformer
Swin Transformer [30] is a hierarchical ViT designed to efficiently model both local and global patterns in visual data. Its architecture divides the image into non-overlapping windows, within which multi-head self-attention (MHSA) is applied. To promote contextual exchange across regions, the windows are shifted between consecutive layers—a design known as shifted window attention.
Let $x$ denote the input feature map to a Swin block. The block applies window-based self-attention followed by a feed-forward network (FFN) with residual connections:

$$z = x + \mathrm{MHSA}\big(\mathrm{LN}(x); W_{att}\big), \qquad y = z + \mathrm{FFN}\big(\mathrm{LN}(z); W_{ffn}\big),$$

where $W_{att}$ and $W_{ffn}$ are the learned projection weights of the attention heads and FFN layers, and $\mathrm{LN}$ denotes Layer Normalization.
In our experiments, we use the Swin-Tiny (patch4–window7) variant, which provides a lightweight yet expressive representation hierarchy. Its ability to preserve high-resolution local details while modeling a coarse-scale context makes it particularly suitable for dermoscopic images, where both fine-grained texture and global asymmetry can be clinically informative.
3.2.5. MaxViT
MaxViT [31] merges the advantages of convolutional and transformer-based designs within a unified architecture. It interleaves MBConv blocks with two forms of self-attention, namely grid attention and window attention. This hybrid arrangement enables the model to process visual information along both spatial axes and within localized regions. Each MaxViT block takes an input feature map $x$ and applies the following sequence:

$$y = \mathrm{GridAtt}\Big(\mathrm{WinAtt}\big(\mathrm{MBConv}(x; W_c); W_w\big); W_g\Big),$$

where $W_c$, $W_w$, and $W_g$ denote the trainable weights for the convolutional, window attention, and grid attention components.
In this study, the MaxViT-Tiny configuration, optimized for 224 × 224 input sizes, was employed. The layered integration of local and global attention pathways supplies strong spatial reasoning capacity, allowing the model to identify elongated structures, peripheral asymmetry, and complex pigmentation patterns, which are key indicators in melanoma diagnosis.
3.3. Evaluation Metrics
To provide a comprehensive and robust assessment of model performance, we evaluate the predictions on the held-out test set using a standard set of metrics. These metrics are derived from the multi-class confusion matrix, whose structure is shown in Table 1. In this $N \times N$ matrix, where $N$ is the number of classes, an element $M_{ij}$ quantifies the number of instances of the actual class $i$ that were predicted as class $j$ [10].
For a multi-class problem, metrics are calculated using a one-vs-rest (OvR) approach. For any given class $i$, the fundamental components are defined as follows; a brief code sketch follows this list.
True Positives ($TP_i$): The number of samples of class $i$ correctly predicted as class $i$. This corresponds to the diagonal element $M_{ii}$.
False Positives ($FP_i$): The number of samples from other classes incorrectly predicted as class $i$. This is the sum of all values in column $i$, excluding the diagonal element ($FP_i = \sum_{j \neq i} M_{ji}$).
False Negatives ($FN_i$): The number of samples of class $i$ incorrectly predicted as any other class. This is the sum of all values in row $i$, excluding the diagonal element ($FN_i = \sum_{j \neq i} M_{ij}$).
True Negatives ($TN_i$): The number of samples not belonging to class $i$ that were correctly not predicted as class $i$. It is the sum of all elements not in row $i$ or column $i$.
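A minimal numpy sketch of these one-vs-rest quantities, assuming a confusion matrix `M` with rows as actual classes and columns as predicted classes:

```python
import numpy as np

def ovr_components(M: np.ndarray):
    """Per-class TP/FP/FN/TN from a confusion matrix M (rows = actual, cols = predicted)."""
    tp = np.diag(M)                  # M_ii
    fp = M.sum(axis=0) - tp          # column sums minus the diagonal
    fn = M.sum(axis=1) - tp          # row sums minus the diagonal
    tn = M.sum() - tp - fp - fn      # everything outside row i and column i
    return tp, fp, fn, tn
```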
Based on these components, we compute the following metrics [10]. Accuracy measures the overall fraction of correct predictions:

$$\text{Accuracy} = \frac{\sum_{i} TP_i}{\sum_{i,j} M_{ij}}.$$

Sensitivity, also known as Recall or the True Positive Rate (TPR), measures the proportion of actual positives that are correctly identified:

$$\text{Sensitivity}_i = \frac{TP_i}{TP_i + FN_i}.$$

Specificity, or the True Negative Rate (TNR), measures the proportion of actual negatives that are correctly identified:

$$\text{Specificity}_i = \frac{TN_i}{TN_i + FP_i}.$$

Precision measures the proportion of positive predictions that were actually correct:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}.$$

F1-Score is the harmonic mean of Precision and Sensitivity, providing a single score that balances both metrics:

$$\text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Sensitivity}_i}{\text{Precision}_i + \text{Sensitivity}_i}.$$

For metrics other than Accuracy, we compute the macro-averaged version by calculating the metric for each class independently and then taking the unweighted mean [32]. This approach treats all classes equally, regardless of imbalance. For example, Macro-F1 is calculated as:

$$\text{Macro-F1} = \frac{1}{N} \sum_{i=1}^{N} \text{F1}_i.$$
Cohen’s Kappa ($\kappa$) measures the agreement between predictions and ground-truth labels while correcting for agreement that could occur by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed accuracy and $p_e$ is the expected agreement by chance.
Macro-AUROC is the Area Under the Receiver Operating Characteristic (AUROC) curve, averaged across classes. The ROC curve is a graphical plot that illustrates the diagnostic ability of a classifier. Specifically, it plots the Sensitivity (Recall) on the y-axis against the False Positive Rate (FPR) on the x-axis, where the FPR is defined as $\mathrm{FPR}_i = \frac{FP_i}{FP_i + TN_i} = 1 - \text{Specificity}_i$. For this multi-class problem, the final Macro-AUROC is the unweighted mean of the Area Under the Curve (AUC) computed for each class using the one-vs-rest strategy:

$$\text{Macro-AUROC} = \frac{1}{N} \sum_{i=1}^{N} \text{AUC}_i.$$
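In practice, these aggregate scores can be obtained directly from scikit-learn; in the sketch below, `y_true`, `y_pred`, and `y_prob` are assumed arrays of ground-truth labels, predicted labels, and predicted class probabilities:

```python
from sklearn.metrics import f1_score, cohen_kappa_score, roc_auc_score

# y_true: (n,) ground-truth labels; y_pred: (n,) predicted labels;
# y_prob: (n, N) predicted class probabilities for the N classes.
macro_f1 = f1_score(y_true, y_pred, average="macro")
kappa = cohen_kappa_score(y_true, y_pred)
macro_auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```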
3.4. Explainability Methods
To achieve clinically significant predictions in skin lesion classification, it is essential to understand both what a model predicts and the reasons for those predictions. This section introduces the post hoc explainability techniques adopted in the present study, Grad-CAM++ [17] and LayerCAM [18]. These methods create saliency maps that highlight the image regions most influential to the decision of the model. Both techniques are architecture agnostic and can be applied directly to every deep learning model evaluated in this study without modifying the architecture. They are computationally efficient and generate high-resolution visual explanations, enabling clinicians to check whether the network focuses on medically relevant features such as lesion borders or pigmentation patterns, or whether it is affected by irrelevant artifacts. Throughout this section, the rectified linear unit is denoted by $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
3.4.1. Grad-CAM++
Grad-CAM++ is an extension of the original Grad-CAM [33]. It incorporates higher-order gradients that refine the localization of distinct salient regions. Let $S^c$ denote the logit score for class $c$, and let $A^k$ be the $k$-th feature map of a selected convolutional layer, with activation $A^k_{ij}$ at spatial position $(i,j)$. The relevance weight $\alpha^c_k$ for feature map $k$ is then computed as

$$\alpha^c_k = \sum_{i,j} \frac{\dfrac{\partial^2 S^c}{(\partial A^k_{ij})^2}}{2\dfrac{\partial^2 S^c}{(\partial A^k_{ij})^2} + \sum_{a,b} A^k_{ab} \dfrac{\partial^3 S^c}{(\partial A^k_{ij})^3}} \, \mathrm{ReLU}\!\left(\frac{\partial S^c}{\partial A^k_{ij}}\right).$$

These weights modulate a linear combination of the feature maps, followed by a ReLU activation:

$$L^c = \mathrm{ReLU}\!\left(\sum_k \alpha^c_k A^k\right).$$
The ReLU operation suppresses negative relevance and retains only regions that make a positive contribution to the target class, a characteristic that enhances interpretability when lesions present fragmented or spatially distributed discriminative patterns.
3.4.2. LayerCAM
LayerCAM departs from the single-layer assumption by constructing saliency maps at multiple levels of abstraction. For each selected layer $l$, let $A^{(l)}$ be the activation tensor and $A^{(l)}_{ij}$ the activation vector at location $(i,j)$. The elementwise (Hadamard) product between this vector and its rectified gradient yields a locationwise relevance vector,

$$r^{(l)}_{ij} = A^{(l)}_{ij} \odot \mathrm{ReLU}\!\left(\frac{\partial S^c}{\partial A^{(l)}_{ij}}\right).$$

Reducing the channel dimension gives a 2D relevance map for the layer,

$$M^{(l)}_{ij} = \mathrm{ReLU}\!\left(\sum_k \big[r^{(l)}_{ij}\big]_k\right),$$

which is subsequently upsampled and aggregated across a predefined set of layers $\mathcal{L}$:

$$L^c = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \mathrm{Up}\!\left(M^{(l)}\right).$$
This hierarchical aggregation allows LayerCAM to combine fine-grained visual cues, such as local texture or pigment irregularities, from shallow layers with high-level structural features captured in deeper layers. The formulation above is fully differentiable and applies equally to convolutional and attention-based modules, making LayerCAM naturally compatible with all architectures considered in this study.
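Both methods are available in off-the-shelf attribution packages. The sketch below, assuming the pytorch-grad-cam package, a ResNet-50 backbone, and predefined `model` and `input_tensor` variables, illustrates how such maps can be generated for the predicted class:

```python
from pytorch_grad_cam import GradCAMPlusPlus, LayerCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model.eval()
pred = model(input_tensor).argmax(dim=1).item()   # class to explain: the prediction
targets = [ClassifierOutputTarget(pred)]

# For CNNs we attribute to the last convolutional block (ResNet-50 shown here);
# transformer backbones instead use a final normalization layer plus a reshape transform.
target_layers = [model.layer4[-1]]

cam_pp = GradCAMPlusPlus(model=model, target_layers=target_layers)
map_pp = cam_pp(input_tensor=input_tensor, targets=targets)[0]    # (H, W), scaled to [0, 1]

cam_layer = LayerCAM(model=model, target_layers=target_layers)
map_layer = cam_layer(input_tensor=input_tensor, targets=targets)[0]
```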
3.5. Morphology Metrics
To understand the spatial structure of the explanatory maps, we compute the five morphological descriptors detailed in this section. These descriptors were chosen to capture different facets of the saliency distribution: global concentration (entropy, Gini coefficient), local dispersion, and rotation-invariant shape regularity (radial symmetry, compactness). This ensures that our analysis remains relevant even after the geometric data augmentations applied during training.
These metrics collectively offer a concise and interpretable representation of saliency map morphology, which we then relate to model behavior. All calculations are performed on a normalized saliency map $S \in [0,1]^{H \times W}$, where $H$ and $W$ represent the image height and width, respectively. Each pixel $p$ (where $p = 1, \dots, HW$) is defined by its saliency value $s_p$ and Cartesian coordinates $(x_p, y_p)$, measured from the upper-left image corner.
Shannon entropy [34] quantifies the overall dispersion of saliency values:

$$E = -\sum_{p=1}^{HW} \tilde{s}_p \log \tilde{s}_p, \qquad \tilde{s}_p = \frac{s_p}{\sum_q s_q}.$$

Low entropy reflects a focused map where attention is confined to a small region, whereas high entropy suggests a diffuse or ambiguous explanation.
Gini coefficient [35] measures how unevenly the saliency mass is distributed. With saliency values sorted in ascending order, $s_{(1)} \leq \dots \leq s_{(n)}$ and $n = HW$:

$$G = \frac{\sum_{p=1}^{n} (2p - n - 1)\, s_{(p)}}{n \sum_{p=1}^{n} s_{(p)}}.$$

Values close to one indicate that only a few pixels carry most of the explanatory weight, while values near zero correspond to nearly uniform maps.
Dispersion [36] is defined as the coefficient of variation of the saliency values:

$$D = \frac{\sigma_s}{\mu_s},$$

where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of the saliency values. This index captures local irregularity: high dispersion is associated with granular or noisy maps, whereas low values indicate spatially coherent attention.
Radial Symmetry describes the extent to which a distribution remains unchanged when rotated around its center; a perfectly radially symmetric pattern appears identical from every viewpoint. In a saliency map, this property measures how evenly the saliency mass is positioned at equal radial distances from the centroid, weighted by saliency. Formally, let $(\bar{x}, \bar{y})$ be the saliency-weighted centroid of the map,

$$\bar{x} = \frac{\sum_p s_p x_p}{\sum_p s_p}, \qquad \bar{y} = \frac{\sum_p s_p y_p}{\sum_p s_p},$$

and define the radial distance $r_p = \sqrt{(x_p - \bar{x})^2 + (y_p - \bar{y})^2}$ with saliency-weighted mean $\mu_r$ and standard deviation $\sigma_r$. The radial symmetry index [37] is given by

$$RS = 1 - \frac{\sigma_r}{\mu_r}.$$

A high score indicates that saliency is distributed uniformly around the center of mass, matching the symmetry often expected in well-focused lesion representations.
Compactness evaluates the regularity of a shape. The saliency map $S$ is converted into a binary support $B$ by applying the Otsu thresholding method [38]. Formally, let $A$ and $P$ denote the area and perimeter of $B$, respectively. Compactness [39] is defined as:

$$C = \frac{P^2}{4\pi A} - 1.$$

This expression equals zero for a perfect disk and increases with elongation, fragmentation, or the presence of disconnected salient regions, that is, patterns that often reflect attention to spurious artifacts.
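Taken together, the five descriptors can be computed in a few lines. The sketch below uses numpy and scikit-image and follows the formulas as reconstructed in this section (in particular the radial symmetry and compactness forms):

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import perimeter

def morphometrics(S: np.ndarray) -> dict:
    """Five descriptors of a normalized saliency map S in [0, 1]^(H x W)."""
    s = S.ravel().astype(np.float64)
    eps = 1e-12

    # Shannon entropy of the saliency mass treated as a distribution.
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))

    # Gini coefficient of the saliency values.
    sorted_s = np.sort(s)
    n = sorted_s.size
    idx = np.arange(1, n + 1)
    gini = np.sum((2 * idx - n - 1) * sorted_s) / (n * sorted_s.sum() + eps)

    # Dispersion: coefficient of variation.
    dispersion = s.std() / (s.mean() + eps)

    # Radial symmetry: 1 minus the CV of saliency-weighted radial distances.
    H, W = S.shape
    ys, xs = np.mgrid[0:H, 0:W]
    w = S / (S.sum() + eps)
    cx, cy = (w * xs).sum(), (w * ys).sum()
    r = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    mu_r = (w * r).sum()
    sigma_r = np.sqrt((w * (r - mu_r) ** 2).sum())
    radial_symmetry = 1.0 - sigma_r / (mu_r + eps)

    # Compactness: P^2 / (4*pi*A) - 1 on the Otsu-thresholded support.
    B = S > threshold_otsu(S)
    area = B.sum()
    perim = perimeter(B)
    compactness = perim ** 2 / (4 * np.pi * area + eps) - 1.0

    return {"entropy": entropy, "gini": gini, "dispersion": dispersion,
            "radial_symmetry": radial_symmetry, "compactness": compactness}
```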
3.6. Experimental Protocol
To ensure full reproducibility, this paragraph details the key hyperparameters used for fine-tuning all models. All models were trained for 300 epochs with a mini-batch size of 32 using the Adam optimizer with default parameters. Weight decay and dropout regularization were omitted, and all weights were fine-tuned from ImageNet pretraining without freezing any layers. To address the class imbalance inherent in the dataset, we employed a weighted cross-entropy loss in which class weights were set to the inverse frequency of each class in the training set. The learning rate was modulated with a cosine annealing warm restart scheduler with a fixed initial cycle length, cycle extension factor, and minimum learning rate. During training, we applied the conservative data augmentation described in Section 3.1 to improve generalization, including random rotations, horizontal and vertical flips, mild brightness and contrast shifts, hue and saturation perturbations, and Gaussian noise. Validation and test images were left unaltered aside from resizing to 224 × 224 pixels and normalization using ImageNet statistics. Stratified sampling was used to partition the dataset into training (60%), validation (20%), and test (20%) subsets.
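One plausible PyTorch realization of this training configuration is sketched below; `train_labels`, `train_loader`, and `model` are assumed variables, and the learning-rate and scheduler values (`lr=1e-4`, `T_0=10`, `T_mult=2`, `eta_min=1e-6`) are illustrative placeholders since the exact values are not restated here:

```python
import numpy as np
import torch
import torch.nn as nn

# Inverse-frequency class weights for the weighted cross-entropy loss.
counts = np.bincount(train_labels, minlength=9).astype(np.float64)
class_weights = torch.tensor(counts.sum() / (counts * len(counts)), dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Adam with default betas; cosine annealing with warm restarts (placeholder values).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6)

for epoch in range(300):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```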
Following training, saliency maps were generated for all test samples using the two considered post hoc gradient-based attribution methods: Grad-CAM++ and LayerCAM. For each classification model, the target layer for attribution was automatically selected as the last convolutional block in CNNs and the final normalization layer in transformer-based architectures. Transformer outputs were reshaped as needed to conform to the 2D spatial format expected by the saliency algorithms. Saliency maps were computed with respect to the predicted class by backpropagating gradients from the output logit to the selected layer. The resulting maps were normalized to the [0, 1] range, resized to the input resolution, and superimposed on the original images for visual inspection. This procedure enabled both qualitative evaluation and quantitative analysis of the spatial characteristics and clinical plausibility of the model explanations.
All experiments were executed on a workstation equipped with an Intel Core i5-11400 CPU, an NVIDIA GeForce RTX 3060 GPU, and 64 GB of DDR4 RAM, running Ubuntu 20.04 (64-bit). Model training and inference were conducted using PyTorch 1.6.0 with CUDA 10.1. This hardware–software configuration ensured consistent and efficient performance across all evaluated architectures and explainability methods.
5. Discussion
Our study presents two principal findings that together offer a new perspective on model trustworthiness in automated dermoscopy. First, we empirically demonstrate that modern deep learning architectures based on attention mechanisms systematically outperform classical convolutional designs in this multiclass skin lesion classification task. Second, and more importantly, we find that the saliency maps generated by the best-performing models show statistically significant structural differences between correctly and incorrectly classified samples. This indicates that higher predictive accuracy is accompanied by more consistent explanation patterns. It also suggests that the decision-making processes of these models, when successful, tend to follow more stable and interpretable spatial reasoning as captured by XAI techniques.
The performance hierarchy detailed in Table 2 confirms the advantages of transformer-based backbones for this complex visual task. The superior accuracy, precision, and recall of Swin-Tiny and MaxViT-Tiny can be attributed to their architectural innovations. Unlike the fixed, local receptive fields of traditional CNNs such as ResNet-50, attention mechanisms allow these models to create dynamic, input-specific connections across the entire image [13,30]. This is particularly advantageous in dermoscopy, where diagnostic clues may involve long-range spatial dependencies, such as the overall asymmetry of a lesion or subtle textural variations distributed across its surface. The strong performance of ConvNeXt-Tiny, a hybrid design that modernizes the CNN with transformer principles, further reinforces this conclusion, highlighting a clear trend towards architectures that can effectively model global context.
The morphometric analysis of saliency maps provides valuable insight into the observed performance differences. As shown in Table 3 and Table 4, a distinct pattern emerges regarding the strength and nature of the statistical differences in the explanations. The transformer-based models, Swin-Tiny and MaxViT-Tiny, exhibit highly significant differences primarily in metrics related to attentional focus. Lower entropy and higher Gini coefficients indicate more focused model attention, while increased compactness and radial symmetry suggest attention concentrated on the core lesion area rather than on background structures or artifacts. The high significance of these focus-related metrics for transformer-based models suggests that their attentional mechanisms are particularly stable and well-structured when predictions are correct, becoming notably more diffuse when errors occur. In contrast, while convolutional models also show some significant differences, these are scattered across various metrics and are only observed at the standard significance level ($p < 0.05$). These patterns yield a measurable morphometric signature that reflects the model’s confidence and confusion.
Synthesizing these two findings leads to our central argument: the improved accuracy of modern architectures is not an incidental outcome but may be a direct consequence of their ability to form more clinically relevant and interpretable internal representations. The quantitative difference in explanation morphology suggests the best models are not merely pattern matching based on opaque features. Instead, they appear to be learning a visual grammar that parallels human clinical assessment, where concepts like symmetry, compactness, and focused attention (as captured by our metrics) are paramount to a correct diagnosis, as codified in heuristics like the ABCDE rule [3]. Our analysis, therefore, provides quantitative evidence that the black box is learning to reason in a way that is not only more effective but also more transparent and aligned with established clinical expertise.
Figure 4 offers a compelling visual case study that illustrates our core findings. The traditional CNN architectures, ResNet-50 and EfficientNetV2-S, which misclassify the nevus, produce saliency maps with a centrally focused activation pattern. This may reflect a reliance on a coarse analysis of the lesion’s overall shape rather than its internal details. In contrast, the models that correctly classify the lesion, ConvNeXt-Tiny, Swin-Tiny, and MaxViT-Tiny, generate more complex and spatially specific heatmaps. Their activations highlight multiple distinct subregions, suggesting a more nuanced understanding of local texture and structure. This example supports our broader conclusion that the improved performance of modern architectures is associated with internal representations that are not only more clinically meaningful but also yield saliency maps whose characteristics differ statistically between correct and incorrect predictions.
From a practical standpoint, these findings suggest a promising new direction for clinical safety and model monitoring. Instead of just trusting a model’s output probability, our morphometric indicators offer a way to audit the reasoning process behind each decision. This can enhance clinical trust in two key ways. First, it moves beyond a black box system by providing a transparent second opinion on the quality of the AI’s explanation; a clinician can see not only what the model focused on, but also a quantitative score of how coherently it focused. Second, it helps mitigate the risk of silent failures by flagging predictions based on diffuse or unfocused saliency maps (e.g., high entropy or low compactness), even if the model’s confidence is high. This serves as a crucial safety net against automation bias.
In a clinical environment, these indicators could directly aid diagnosis and optimize workflows. For example, a system could automatically triage cases: predictions with high-quality explanation scores might be marked for faster review, while those with poor scores are prioritized for detailed expert assessment, regardless of the predicted class. This could also function as a second-look mechanism; if an AI-based diagnosis conflicts with a clinician’s initial assessment but is backed by a high-quality explanation score, it could prompt a valuable re-evaluation. Ultimately, these morphometric properties augment expert judgment by providing a real-time, quantitative layer of interpretability that audits the model’s decision-making process on a case-by-case basis.
Limitations and Future Directions
While this study establishes a foundational link between explanation morphology and reliability, it is essential to situate our findings within the specific scope of our methodology and to acknowledge the limitations inherent to our design. Notably, the novelty of our quantitative approach makes direct comparison with previous work unfeasible at this stage. Instead, our findings serve as a critical baseline for this emerging research direction, providing a structured framework for future studies to build upon.
Our experimental design was deliberately focused to ensure reproducibility and statistical control. For this reason, we centered our analysis on a single, high-quality public dataset with expert annotations and utilized a fixed data split. This controlled environment was crucial for isolating the novel relationship between saliency morphology and prediction correctness. This foundational work now paves the way for broader investigations into generalizability. A valuable next step, for instance, would be to validate our morphometric indicators on prospective, multi-center clinical datasets and employ k-fold cross-validation to provide an even more robust estimate of performance.
Similarly, our selection of models and explainability techniques was tailored to the goals of this study. We chose five representative architectures with comparable capacity to specifically isolate the impact of their inductive biases on the explanations. Our focus on gradient-based methods was also intentional, as they produce the dense, high-resolution maps required for our spatial analysis. Having established this baseline, future work could extend this comparative analysis to a wider range of emerging architectures or adapt the morphometric framework for other families of XAI techniques, such as perturbation-based methods like LIME.
Furthermore, a formal ablation study remains future work that would systematically analyze these architectures. This would involve isolating specific components, such as attention mechanisms or skip connections, to determine their precise influence on the morphology and reliability of the resulting explanations.
Finally, the morphometric framework itself, built upon five fundamental descriptors, provides a solid starting point that can be readily expanded. Future studies could incorporate a richer set of features, such as advanced texture descriptors or more sophisticated shape metrics, to potentially uncover even more subtle reliability indicators. Such work will be crucial to continue building trust in AI-driven diagnostic tools and to deepen our understanding of their decision-making processes.