1. Introduction
Skin cancer ranks among the most prevalent malignancies, and its incidence continues to rise. Melanoma in particular is responsible for the majority of deaths attributed to skin cancer. For this reason, the development of automated methods that improve diagnostic accuracy and reduce subjective interpretation remains a priority [1]. Skin lesion symmetry is a pivotal feature in dermatological assessment. Clinicians often employ the ABCDE rule (Asymmetry, Border irregularity, Color variegation, Diameter, Evolution) as a heuristic for early melanoma detection, where A stands for the asymmetry of the lesion [2,3]. In practice, benign nevi tend to exhibit symmetric shape and color, whereas malignant lesions such as melanoma are frequently asymmetric between halves. This emphasis on symmetry in clinical diagnostics underlines the importance of developing computational methods that can recognize and quantify lesion symmetry from images.
Although deep learning models, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved dermatologist-level accuracy in skin lesion classification [4], their “black box” nature hinders clinical trust. To address this, eXplainable Artificial Intelligence (XAI) techniques such as saliency maps are employed to visualize the image regions influencing the decision of the model. However, the analysis of these maps has been predominantly qualitative [5]. This work hypothesizes that the morphological characteristics of saliency maps, such as their symmetry and spatial distribution, contain latent signals about model behavior. We propose that these features differ systematically between correct and incorrect predictions, thereby offering a novel quantitative approach to evaluating the reliability of assisted diagnoses.
To investigate this, we present a systematic analysis of saliency map morphometrics in the context of skin lesion classification. Our study first benchmarks five state-of-the-art deep learning architectures, encompassing both convolutional and transformer-based models, on a challenging nine-class skin lesion dataset. For each prediction, we generate visual explanations using two advanced saliency techniques, Grad-CAM++ and LayerCAM, to identify the regions deemed important by the model. Subsequently, we compute a suite of five quantitative morphometric metrics, such as entropy, radial symmetry, and compactness, to characterize the spatial structure of these explanations. This methodology concludes with a statistical comparison of these metrics between correctly classified and misclassified cases to test our hypothesis that the morphology of a saliency map is associated with prediction correctness. Our findings confirm this relationship, revealing statistically significant differences which suggest that the morphometric properties of explanation maps could serve as valuable indicators of model reliability in skin lesion classification.
The remainder of this paper is organized as follows. Section 3 describes the dataset, preprocessing pipeline, model architectures, and saliency techniques employed in this study. Section 4 presents the classification performance of all models and analyzes the statistical behavior of the proposed symmetry-aware morphometric metrics across correct and incorrect predictions. Section 5 discusses the implications of the observed patterns for model trustworthiness and future explainability strategies. Finally, Section 6 concludes the paper by summarizing the key findings and outlining directions for future work.
2. Background
In recent years, deep learning (DL) has revolutionized automated skin lesion classification. CNNs in particular have achieved dermatologist-level accuracy in identifying malignant lesions from dermoscopic images [4]. The versatility of these DL paradigms has been proven in other complex medical domains, such as multi-label classification systems for radiological images, highlighting their broad applicability [6]. Subsequent advances leveraging larger datasets and model ensembles have further improved performance on challenging multiclass skin lesion tasks [7]. Modern scaling strategies, exemplified by EfficientNet, enhance feature-extraction efficiency while preserving robust generalization across diverse lesion types [8]. Recent CNN variants such as ConvNeXt, particularly when augmented with attention mechanisms, further boost accuracy by jointly capturing local and global context and selectively emphasizing diagnostically salient regions [9,10]. This approach is not unique to medicine: sophisticated segmentation architectures that leverage complex contextual information and innovative network topologies are continuously proposed in other demanding fields, such as infrared target detection and shadow detection, offering paradigms with clear potential for transfer to medical diagnostics [11,12]. More recently, ViT architectures [13] have been introduced to this domain, offering improved modeling of long-range dependencies and global context. Early studies indicate that transformer-based models can match or exceed CNN performance in skin cancer classification [14]. These developments demonstrate the state of the art in lesion classification, but they also bring into focus a critical issue: the interpretability of such complex models.
Although DL models achieve high accuracy, they are frequently described as black boxes, particularly in applications where decisions have significant consequences, such as medicine [15]. Techniques from XAI aim to mitigate this opacity by rendering the reasoning of the models more transparent. In medical image analysis, a prevalent XAI strategy involves the generation of saliency maps or class activation maps that highlight the regions of an image that most strongly influence a model prediction [16]. When studying skin lesions, these maps can reveal whether a CNN or ViT directs its attention toward the suspicious area of a mole or instead toward irrelevant artifacts. Recent methods, including Grad-CAM++ and LayerCAM, have proved effective at producing such visual explanations. Grad-CAM++ refines the localization of important features by incorporating higher-order gradients [17]. LayerCAM propagates class activations across several convolutional layers to create a more comprehensive highlight of relevant regions [18]. By applying these techniques, clinicians and researchers can determine whether the focus of a model corresponds to patterns that are meaningful in clinical practice, for example, concentration on the irregular pigmented portion of a lesion. This level of interpretability is essential for fostering trust in AI-assisted diagnosis [19].
To address this black box problem, significant research has focused on integrating XAI techniques into the diagnostic pipeline, often leveraging public benchmark datasets. For instance, [20] developed an optimized CNN for seven-class skin cancer classification; their work not only optimized the model by testing various activation and optimization functions but also incorporated Grad-CAM and Grad-CAM++ to interpret the model’s decisions, finding that Grad-CAM++ provided more detailed and accurate visual explanations, thereby enhancing model transparency. A common strategy involves applying transfer learning with pretrained architectures. Ref. [21] successfully employed a ResNet-18 model on the extensive ISIC 2019 dataset to classify eight different lesion types. To move beyond simple prediction, they utilized the Local Interpretable Model-agnostic Explanations (LIME) framework to generate visual explanations, demonstrating how such interpretations can increase trust and safety in a clinical setting by rationalizing the model’s outputs.
Building on these foundations, more sophisticated systems have been proposed. The Skin-CAD system [22] represents an advanced approach that fuses features from multiple deep layers—specifically the pooling and fully connected layers—of an ensemble of four distinct CNNs. By applying Principal Component Analysis for dimensionality reduction along with feature selection techniques, the system achieved very high classification accuracy on both binary and multiclass tasks. The integration of LIME provided explainability for this complex ensemble, validating the decisions of the multi-faceted model. In contrast with end-to-end image-based classification, some studies explore the utility of preextracted clinical features. Ref. [23] utilized features such as asymmetry and pigment networks to train an XGBoost model. They applied model-agnostic methods, such as SHapley Additive exPlanations (SHAP), to explain the predictions, confirming that the model’s most influential features align with established dermatological criteria, thereby bridging the gap between machine-learned patterns and clinical expertise. A systematic review by [24] observed that XAI is often applied superficially as a sanity check rather than being rigorously evaluated. The review highlighted a critical gap: the lack of studies systematically assessing the impact of these explanations on the diagnostic performance and confidence of dermatologists, which remains a crucial area for future research.
However, most existing studies rely on saliency maps solely for qualitative assessment, leaving a clear gap in the quantitative analysis of their properties [5,15,24]. Earlier research seldom evaluated measurable aspects of explanation maps, such as symmetry, shape, or the spatial distribution of salient regions, or investigated how these aspects might relate to the accuracy of model predictions. Recent surveys focusing on transformer-based dermatology AI have likewise observed that explainability is often underreported or neglected [14]. We argue that these maps go beyond simple visual aids and embed latent signals of model behavior. Our central hypothesis is that the morphology of a saliency map, particularly characteristics associated with symmetry, varies noticeably between correct and incorrect predictions. For instance, a model that accurately classifies a lesion may produce a saliency map with a concentrated and relatively symmetric focus, suggesting attention that is more clinically meaningful, whereas an incorrect prediction may be accompanied by a map that is more chaotic or irregularly dispersed. Confirming this hypothesis could deliver valuable insight into the reliability of a model decision and its susceptibility to error.
3. Materials and Methods
This section outlines the experimental setup and methodological framework adopted in this study. We begin in Section 3.1 by describing the dermoscopic image dataset and the preprocessing pipeline used to ensure a clinically realistic and reproducible evaluation. We then detail, in Section 3.2, the deep learning models selected to cover a diverse range of architectural paradigms and present the metrics used for their performance assessment in Section 3.3. Subsequently, we introduce the post hoc explainability techniques in Section 3.4 and the suite of morphological metrics designed to quantify their spatial characteristics in Section 3.5. Finally, we describe the complete experimental protocol, including the training configuration, in Section 3.6.
Using these methodological components, the overall experimental pipeline is depicted in Figure 1. The process begins with the ISIC dataset, which is preprocessed and subsequently fed into a set of deep learning models representing diverse architectural paradigms: ResNet-50 and EfficientNetV2-S as established CNN baselines, ConvNeXt-Tiny as a modern CNN inspired by transformer design principles, and Swin-Tiny plus MaxViT-Tiny as hierarchical vision transformers. This set was deliberately chosen to span complementary inductive biases (local convolutional priors vs. long-range attention) while keeping the computational scope tractable. After training, the pipeline branches into two complementary evaluation paths: one focused on predictive performance and the other on post hoc explainability. The performance branch evaluates the models using standard classification metrics and conducts a quantitative analysis of the results. The explainability branch generates saliency maps via gradient-based attribution methods (Grad-CAM++ and LayerCAM), which are then subjected to morphological analysis using symmetry-aware descriptors. In the last stage, for each model, we statistically compare the morphometric descriptors between correctly and incorrectly classified samples to assess whether explanation structure is systematically associated with prediction correctness, thus providing insight into model reliability and diagnostic behavior.
3.1. Dataset Description
In this study, we employ a publicly available dataset curated by Ahuja [25], derived from the International Skin Imaging Collaboration archive. The dataset comprises 2357 high-resolution dermoscopic RGB images, each annotated by board-certified dermatologists with one of nine diagnostic labels. As illustrated in Figure 2, these categories encompass a broad spectrum of skin lesion types, ranging from benign to malignant. The benign categories are Pigmented benign keratosis (478 images), Nevus (373), Seborrheic keratosis (80), Dermatofibroma (111), and Vascular lesion (142). These lesions are non-cancerous and usually do not call for aggressive intervention. Malignant lesions include Melanoma (454), Basal cell carcinoma (392), and Squamous cell carcinoma (197). Their invasiveness and potential for metastasis differ, with melanoma being the most aggressive and lethal. Additionally, Actinic keratosis (130), a lesion induced by chronic ultraviolet exposure, is included as a premalignant condition due to its potential progression to squamous cell carcinoma [26]. This clinically diverse and expertly annotated dataset is well suited for benchmarking multi-class classification models aimed at automated skin lesion diagnosis.
To establish a rigorous and clinically relevant learning setup, we implemented a preprocessing pipeline tailored for dermoscopic image analysis. The images were first combined into a single corpus and then divided into three mutually exclusive subsets through stratified sampling: 60% for training, 20% for validation, and 20% for testing. During training, we applied a set of conservative data augmentation techniques that aim to improve generalization while retaining diagnostically relevant detail. These augmentations include random rotations (applied with a probability of 50% and implemented with border padding), horizontal and vertical flips (each with a probability of 50%), adjustments of brightness and contrast (50% probability), subtle shifts in hue, saturation, and value (each with a probability of 30%), and Gaussian noise injection (20% probability). Each operation was selected to reproduce the clinical variability observed in practice without altering lesion morphology. For the validation and test sets, no augmentations are applied, ensuring consistent performance evaluation. All images were resized to 224 × 224 pixels to match the input resolution expected by the chosen model architectures and were normalized with ImageNet statistics. This preprocessing scheme balances the need for broader data-driven generalization with the preservation of clinically meaningful visual characteristics.
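To make this pipeline concrete, the sketch below shows one plausible realization using the albumentations library; `paths` and `labels` are assumed variables, and the magnitude limits (e.g., `limit=30`, `hue_shift_limit=10`) are illustrative placeholders rather than the exact values used in our experiments.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.model_selection import train_test_split
import cv2

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

# Stratified 60/20/20 split; `paths` and `labels` hold image paths and class ids.
train_x, rest_x, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.4, stratify=labels, random_state=0)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=0)

# Training-time augmentations; magnitude limits below are illustrative placeholders.
train_transform = A.Compose([
    A.Rotate(limit=30, border_mode=cv2.BORDER_REFLECT, p=0.5),  # rotation with border padding
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=10, p=0.3),
    A.GaussNoise(p=0.2),
    A.Resize(224, 224),
    A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ToTensorV2(),
])

# Validation/test images: resizing and normalization only.
eval_transform = A.Compose([
    A.Resize(224, 224),
    A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ToTensorV2(),
])
```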
3.2. Deep Learning Models
This section describes the set of DL architectures evaluated in our study. We selected five representative backbones that span the spectrum from purely convolutional to fully attention-based processing—namely, ResNet-50 and EfficientNetV2-S (classical/modern CNNs), ConvNeXt-Tiny (a CNN informed by transformer design principles), and Swin-Tiny plus MaxViT-Tiny (hierarchical vision transformers). Each architecture embodies a distinct inductive bias, thus offering complementary strengths for skin lesion classification, such as capturing fine-grained texture, modeling spatial relationships, or incorporating long-range contextual information. All five backbones have a broadly comparable parameter count (in the order of 20–30 million) and memory footprint, which helps us disentangle architectural inductive biases from sheer model capacity differences.
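For reference, the snippet below shows one way to instantiate these five backbones with ImageNet-pretrained weights and a nine-class head using the timm library; the model identifiers are assumptions and may differ across timm versions.

```python
import timm

NUM_CLASSES = 9

# Assumed timm identifiers for the five backbones (names may vary by timm version).
BACKBONES = {
    "resnet50": "resnet50",
    "efficientnetv2_s": "tf_efficientnetv2_s",
    "convnext_tiny": "convnext_tiny",
    "swin_tiny": "swin_tiny_patch4_window7_224",
    "maxvit_tiny": "maxvit_tiny_tf_224",
}

# Load each backbone with ImageNet weights and a freshly initialized 9-class head.
models = {
    name: timm.create_model(ident, pretrained=True, num_classes=NUM_CLASSES)
    for name, ident in BACKBONES.items()
}
```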
3.2.1. ResNet
ResNet [27] is a classical convolutional architecture that introduces identity skip connections to improve gradient propagation within deep networks. These residual links allow the model to learn adjustments to the identity mapping:

$$y = x + f(x; \theta),$$

where $f$ denotes a residual transformation composed of a three-layer bottleneck ($1{\times}1 \rightarrow 3{\times}3 \rightarrow 1{\times}1$) with BatchNorm and ReLU activations, and $\theta$ represents its trainable parameters. Here, $x$ and $y$ denote the input and output feature maps of the residual block.
In our experiments, we adopt the ResNet-50 variant, which offers a balance between depth and computational cost. Its hierarchical structure supports the extraction of both fine-grained textures and high-level morphological patterns, features that are essential for distinguishing benign from malignant skin lesions in dermoscopic images.
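For illustration, a minimal PyTorch sketch of such a bottleneck residual block (assuming equal input and output channel counts, so the identity shortcut needs no projection; ResNet applies a final ReLU after the addition) is:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: y = x + f(x), with f = 1x1 -> 3x3 -> 1x1 convolutions."""
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.f(x))  # identity skip connection
```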
3.2.2. EfficientNet
EfficientNetV2 [28] is a convolutional architecture optimized through compound scaling, a principled method for jointly tuning network depth, width, and resolution. Instead of using fixed scaling rules, it leverages neural architecture search to balance accuracy and computational efficiency across multiple deployment scenarios. The core building block is the fused-MBConv, which combines depthwise convolution with pointwise projection and incorporates squeeze-and-excitation (SE) for channel attention.
The functional form of a fused-MBConv block with SE can be expressed as:

$$y = x + \mathrm{SE}\big(\sigma(\mathrm{Conv}(x; W))\big),$$

where $\mathrm{Conv}$ denotes the depthwise and pointwise convolutions with learnable weights $W$, $\mathrm{SE}(\cdot)$ applies squeeze-and-excitation channel attention, and $\sigma$ is a non-linear activation function such as Swish or ReLU.
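To make the SE operator concrete, the following is a minimal PyTorch sketch; the channel-reduction ratio of 4 is an assumed, typical value:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE channel attention: reweight channels by globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 4):  # reduction is an assumed value
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3), keepdim=True)                  # squeeze: global average pool
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))  # excite: per-channel gates
        return x * w                                          # rescale input channels
```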
In our experiments, we adopt the EfficientNetV2-S variant, which offers a strong trade-off between model compactness and representational power. Its design enables precise modeling of both fine structures—such as pigment dots or globules—and broader spatial arrangements, which are critical for differentiating benign nevi from malignant lesions under dermoscopic imaging.
3.2.3. ConvNeXt
ConvNeXt [29] is a convolutional architecture that integrates key design principles from ViTs while preserving the inductive biases of traditional CNNs. Its core building block adopts an inverted bottleneck design and incorporates modern components such as Layer Normalization, GELU activations, and large convolutional kernels. Specifically, each block applies a depthwise convolution followed by a pointwise projection and non-linear transformation:

$$y = x + \sigma\big(\mathrm{Conv}_{pw}(\mathrm{Conv}_{dw}(x; W_d); W_p)\big),$$

where $\mathrm{Conv}_{dw}$ and $\mathrm{Conv}_{pw}$ denote depthwise and pointwise convolutions with weights $W_d$ and $W_p$, respectively, and $\sigma$ is a GELU activation function.
We adopt the ConvNeXt-Tiny variant, which offers a compact architecture with enhanced representational capacity. The use of large depthwise convolutional kernels (7 × 7) increases the effective receptive field, allowing the model to capture long-range spatial patterns without sacrificing local detail. This is particularly beneficial for dermoscopic analysis, where lesion characteristics such as irregular borders, pigment asymmetry, and large-scale structure must be jointly modeled for reliable classification.
3.2.4. Swin Transformer
Swin Transformer [30] is a hierarchical ViT designed to efficiently model both local and global patterns in visual data. Its architecture divides the image into non-overlapping windows, within which multi-head self-attention (MHSA) is applied. To promote contextual exchange across regions, the windows are shifted between consecutive layers—a design known as shifted window attention.
Let $x$ denote the input feature map to a Swin block. The block applies window-based self-attention followed by a feed-forward network (FFN) with residual connections:

$$z = x + \mathrm{MHSA}\big(\mathrm{LN}(x); W_{att}\big), \qquad y = z + \mathrm{FFN}\big(\mathrm{LN}(z); W_{ffn}\big),$$

where $W_{att}$ and $W_{ffn}$ are the learned projection weights of the attention heads and FFN layers, and $\mathrm{LN}$ denotes Layer Normalization.
In our experiments, we use the Swin-Tiny (patch4–window7) variant, which provides a lightweight yet expressive representation hierarchy. Its ability to preserve high-resolution local details while modeling a coarse-scale context makes it particularly suitable for dermoscopic images, where both fine-grained texture and global asymmetry can be clinically informative.
3.2.5. MaxViT
MaxViT [31] merges the advantages of convolutional and transformer-based designs within a unified architecture. It interleaves MBConv blocks with two forms of self-attention, namely grid attention and window attention. This hybrid arrangement enables the model to process visual information along both spatial axes and within localized regions. Each MaxViT block takes an input feature map $x$ and applies the following sequence:

$$y = \mathrm{GridAtt}\Big(\mathrm{WinAtt}\big(\mathrm{MBConv}(x; W_c); W_w\big); W_g\Big),$$

where $W_c$, $W_w$, and $W_g$ denote the trainable weights for the convolutional, window attention, and grid attention components.
In this study, the MaxViT-Tiny configuration, optimized for 224 × 224 input sizes, was employed. The layered integration of local and global attention pathways supplies strong spatial reasoning capacity, allowing the model to identify elongated structures, peripheral asymmetry, and complex pigmentation patterns, which are key indicators in melanoma diagnosis.
3.3. Evaluation Metrics
To provide a comprehensive and robust assessment of model performance, we evaluate the predictions on the held-out test set using a standard set of metrics. These metrics are derived from the multi-class confusion matrix, whose structure is shown in Table 1. In this $N \times N$ matrix, where $N$ is the number of classes, an element $M_{ij}$ quantifies the number of instances of the actual class $i$ that were predicted as class $j$ [10].
For a multi-class problem, metrics are calculated using a one-vs-rest (OvR) approach. For any given class $i$, the fundamental components are defined as follows; a brief code sketch follows this list.
True Positives ($TP_i$): The number of samples of class $i$ correctly predicted as class $i$. This corresponds to the diagonal element $M_{ii}$.
False Positives ($FP_i$): The number of samples from other classes incorrectly predicted as class $i$. This is the sum of all values in column $i$, excluding the diagonal element ($FP_i = \sum_{j \neq i} M_{ji}$).
False Negatives ($FN_i$): The number of samples of class $i$ incorrectly predicted as any other class. This is the sum of all values in row $i$, excluding the diagonal element ($FN_i = \sum_{j \neq i} M_{ij}$).
True Negatives ($TN_i$): The number of samples not belonging to class $i$ that were correctly not predicted as class $i$. It is the sum of all elements not in row $i$ or column $i$.
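A minimal numpy sketch of these one-vs-rest quantities, assuming a confusion matrix `M` with rows as actual classes and columns as predicted classes:

```python
import numpy as np

def ovr_components(M: np.ndarray):
    """Per-class TP/FP/FN/TN from a confusion matrix M (rows = actual, cols = predicted)."""
    tp = np.diag(M)                  # M_ii
    fp = M.sum(axis=0) - tp          # column sums minus the diagonal
    fn = M.sum(axis=1) - tp          # row sums minus the diagonal
    tn = M.sum() - tp - fp - fn      # everything outside row i and column i
    return tp, fp, fn, tn
```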
Based on these components, we compute the following metrics [10]. Accuracy measures the overall fraction of correct predictions:

$$\text{Accuracy} = \frac{\sum_{i} TP_i}{\sum_{i,j} M_{ij}}.$$

Sensitivity, also known as Recall or the True Positive Rate (TPR), measures the proportion of actual positives that are correctly identified:

$$\text{Sensitivity}_i = \frac{TP_i}{TP_i + FN_i}.$$

Specificity, or the True Negative Rate (TNR), measures the proportion of actual negatives that are correctly identified:

$$\text{Specificity}_i = \frac{TN_i}{TN_i + FP_i}.$$

Precision measures the proportion of positive predictions that were actually correct:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}.$$

F1-Score is the harmonic mean of Precision and Sensitivity, providing a single score that balances both metrics:

$$\text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Sensitivity}_i}{\text{Precision}_i + \text{Sensitivity}_i}.$$

For metrics other than Accuracy, we compute the macro-averaged version by calculating the metric for each class independently and then taking the unweighted mean [32]. This approach treats all classes equally, regardless of imbalance. For example, Macro-F1 is calculated as:

$$\text{Macro-F1} = \frac{1}{N} \sum_{i=1}^{N} \text{F1}_i.$$
Cohen’s Kappa ($\kappa$) measures the agreement between predictions and ground-truth labels while correcting for agreement that could occur by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed accuracy and $p_e$ is the expected agreement by chance.
Macro-AUROC is the Area Under the Receiver Operating Characteristic (AUROC) curve, averaged across classes. The ROC curve is a graphical plot that illustrates the diagnostic ability of a classifier. Specifically, it plots the Sensitivity (Recall) on the y-axis against the False Positive Rate (FPR) on the x-axis, where the FPR is defined as $\mathrm{FPR}_i = \frac{FP_i}{FP_i + TN_i} = 1 - \text{Specificity}_i$. For this multi-class problem, the final Macro-AUROC is the unweighted mean of the Area Under the Curve (AUC) computed for each class using the one-vs-rest strategy:

$$\text{Macro-AUROC} = \frac{1}{N} \sum_{i=1}^{N} \text{AUC}_i.$$
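In practice, these aggregate scores can be obtained directly from scikit-learn; in the sketch below, `y_true`, `y_pred`, and `y_prob` are assumed arrays of ground-truth labels, predicted labels, and predicted class probabilities:

```python
from sklearn.metrics import f1_score, cohen_kappa_score, roc_auc_score

# y_true: (n,) ground-truth labels; y_pred: (n,) predicted labels;
# y_prob: (n, N) predicted class probabilities for the N classes.
macro_f1 = f1_score(y_true, y_pred, average="macro")
kappa = cohen_kappa_score(y_true, y_pred)
macro_auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```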
3.4. Explainability Methods
To achieve clinically significant predictions in skin lesion classification, it is essential to understand both what a model predicts and the reasons for those predictions. This section introduces the post hoc explainability techniques adopted in the present study, Grad-CAM++ [17] and LayerCAM [18]. These methods create saliency maps that highlight the image regions most influential to the decision of the model. Both techniques are architecture agnostic and can be applied directly to every deep learning model evaluated in this study without modifying the architecture. They are computationally efficient and generate high-resolution visual explanations, enabling clinicians to check whether the network focuses on medically relevant features such as lesion borders or pigmentation patterns, or whether it is affected by irrelevant artifacts. Throughout this section, the rectified linear unit is denoted by $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$.
3.4.1. Grad-CAM++
Grad-CAM++ is an extension of the original Grad-CAM [33]. It incorporates higher-order gradients that refine the localization of distinct salient regions. Let $S^c$ denote the logit score for class $c$, and let $A^k$ be the $k$-th feature map of a selected convolutional layer, with activation $A^k_{ij}$ at spatial position $(i,j)$. The relevance weight $\alpha^c_k$ for feature map $k$ is then computed as

$$\alpha^c_k = \sum_{i,j} \frac{\dfrac{\partial^2 S^c}{(\partial A^k_{ij})^2}}{2\dfrac{\partial^2 S^c}{(\partial A^k_{ij})^2} + \sum_{a,b} A^k_{ab} \dfrac{\partial^3 S^c}{(\partial A^k_{ij})^3}} \, \mathrm{ReLU}\!\left(\frac{\partial S^c}{\partial A^k_{ij}}\right).$$

These weights modulate a linear combination of the feature maps, followed by a ReLU activation:

$$L^c = \mathrm{ReLU}\!\left(\sum_k \alpha^c_k A^k\right).$$
The ReLU operation suppresses negative relevance and retains only regions that make a positive contribution to the target class, a characteristic that enhances interpretability when lesions present fragmented or spatially distributed discriminative patterns.
3.4.2. LayerCAM
LayerCAM departs from the single-layer assumption by constructing saliency maps at multiple levels of abstraction. For each selected layer $l$, let $A^{(l)}$ be the activation tensor and $A^{(l)}_{ij}$ the activation vector at location $(i,j)$. The elementwise (Hadamard) product between this vector and its rectified gradient yields a locationwise relevance vector,

$$r^{(l)}_{ij} = A^{(l)}_{ij} \odot \mathrm{ReLU}\!\left(\frac{\partial S^c}{\partial A^{(l)}_{ij}}\right).$$

Reducing the channel dimension gives a 2D relevance map for the layer,

$$M^{(l)}_{ij} = \mathrm{ReLU}\!\left(\sum_k \big[r^{(l)}_{ij}\big]_k\right),$$

which is subsequently upsampled and aggregated across a predefined set of layers $\mathcal{L}$:

$$L^c = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \mathrm{Up}\!\left(M^{(l)}\right).$$
This hierarchical aggregation allows LayerCAM to combine fine-grained visual cues, such as local texture or pigment irregularities, from shallow layers with high-level structural features captured in deeper layers. The formulation above is fully differentiable and applies equally to convolutional and attention-based modules, making LayerCAM naturally compatible with all architectures considered in this study.
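Both methods are available in off-the-shelf attribution packages. The sketch below, assuming the pytorch-grad-cam package, a ResNet-50 backbone, and predefined `model` and `input_tensor` variables, illustrates how such maps can be generated for the predicted class:

```python
from pytorch_grad_cam import GradCAMPlusPlus, LayerCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model.eval()
pred = model(input_tensor).argmax(dim=1).item()   # class to explain: the prediction
targets = [ClassifierOutputTarget(pred)]

# For CNNs we attribute to the last convolutional block (ResNet-50 shown here);
# transformer backbones instead use a final normalization layer plus a reshape transform.
target_layers = [model.layer4[-1]]

cam_pp = GradCAMPlusPlus(model=model, target_layers=target_layers)
map_pp = cam_pp(input_tensor=input_tensor, targets=targets)[0]    # (H, W), scaled to [0, 1]

cam_layer = LayerCAM(model=model, target_layers=target_layers)
map_layer = cam_layer(input_tensor=input_tensor, targets=targets)[0]
```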
3.5. Morphology Metrics
To understand the spatial structure of the explanatory maps, we compute the five morphological descriptors detailed in this section. These descriptors were chosen to capture different facets of the saliency distribution: global concentration (entropy, Gini coefficient), local dispersion, and rotation-invariant shape regularity (radial symmetry, compactness). This ensures that our analysis remains relevant even after the geometric data augmentations applied during training.
These metrics collectively offer a concise and interpretable representation of saliency map morphology, which we then relate to model behavior. All calculations are performed on a normalized saliency map $S \in [0,1]^{H \times W}$, where $H$ and $W$ represent the image height and width, respectively. Each pixel $p$ (where $p = 1, \dots, HW$) is defined by its saliency value $s_p$ and Cartesian coordinates $(x_p, y_p)$, measured from the upper-left image corner.
Shannon entropy [34] quantifies the overall dispersion of saliency values:

$$E = -\sum_{p=1}^{HW} \tilde{s}_p \log \tilde{s}_p, \qquad \tilde{s}_p = \frac{s_p}{\sum_q s_q}.$$

Low entropy reflects a focused map where attention is confined to a small region, whereas high entropy suggests a diffuse or ambiguous explanation.
Gini coefficient [35] measures how unevenly the saliency mass is distributed. With saliency values sorted in ascending order, $s_{(1)} \leq \dots \leq s_{(n)}$ and $n = HW$:

$$G = \frac{\sum_{p=1}^{n} (2p - n - 1)\, s_{(p)}}{n \sum_{p=1}^{n} s_{(p)}}.$$

Values close to one indicate that only a few pixels carry most of the explanatory weight, while values near zero correspond to nearly uniform maps.
Dispersion [36] is defined as the coefficient of variation of the saliency values:

$$D = \frac{\sigma_s}{\mu_s},$$

where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of the saliency values. This index captures local irregularity: high dispersion is associated with granular or noisy maps, whereas low values indicate spatially coherent attention.
Radial Symmetry describes the extent to which a distribution remains unchanged when rotated around its center; a perfectly radially symmetric pattern appears identical from every viewpoint. In a saliency map, this property measures how evenly the saliency mass is positioned at equal radial distances from the centroid, weighted by saliency. Formally, let $(\bar{x}, \bar{y})$ be the saliency-weighted centroid of the map,

$$\bar{x} = \frac{\sum_p s_p x_p}{\sum_p s_p}, \qquad \bar{y} = \frac{\sum_p s_p y_p}{\sum_p s_p},$$

and define the radial distance $r_p = \sqrt{(x_p - \bar{x})^2 + (y_p - \bar{y})^2}$ with saliency-weighted mean $\mu_r$ and standard deviation $\sigma_r$. The radial symmetry index [37] is given by

$$RS = 1 - \frac{\sigma_r}{\mu_r}.$$

A high score indicates that saliency is distributed uniformly around the center of mass, matching the symmetry often expected in well-focused lesion representations.
Compactness evaluates the regularity of a shape. The saliency map $S$ is converted into a binary support $B$ by applying the Otsu thresholding method [38]. Formally, let $A$ and $P$ denote the area and perimeter of $B$, respectively. Compactness [39] is defined as:

$$C = \frac{P^2}{4\pi A} - 1.$$

This expression equals zero for a perfect disk and increases with elongation, fragmentation, or the presence of disconnected salient regions, that is, patterns that often reflect attention to spurious artifacts.
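Taken together, the five descriptors can be computed in a few lines. The sketch below uses numpy and scikit-image and follows the formulas as reconstructed in this section (in particular the radial symmetry and compactness forms):

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import perimeter

def morphometrics(S: np.ndarray) -> dict:
    """Five descriptors of a normalized saliency map S in [0, 1]^(H x W)."""
    s = S.ravel().astype(np.float64)
    eps = 1e-12

    # Shannon entropy of the saliency mass treated as a distribution.
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))

    # Gini coefficient of the saliency values.
    sorted_s = np.sort(s)
    n = sorted_s.size
    idx = np.arange(1, n + 1)
    gini = np.sum((2 * idx - n - 1) * sorted_s) / (n * sorted_s.sum() + eps)

    # Dispersion: coefficient of variation.
    dispersion = s.std() / (s.mean() + eps)

    # Radial symmetry: 1 minus the CV of saliency-weighted radial distances.
    H, W = S.shape
    ys, xs = np.mgrid[0:H, 0:W]
    w = S / (S.sum() + eps)
    cx, cy = (w * xs).sum(), (w * ys).sum()
    r = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    mu_r = (w * r).sum()
    sigma_r = np.sqrt((w * (r - mu_r) ** 2).sum())
    radial_symmetry = 1.0 - sigma_r / (mu_r + eps)

    # Compactness: P^2 / (4*pi*A) - 1 on the Otsu-thresholded support.
    B = S > threshold_otsu(S)
    area = B.sum()
    perim = perimeter(B)
    compactness = perim ** 2 / (4 * np.pi * area + eps) - 1.0

    return {"entropy": entropy, "gini": gini, "dispersion": dispersion,
            "radial_symmetry": radial_symmetry, "compactness": compactness}
```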
3.6. Experimental Protocol
To ensure full reproducibility, this paragraph details the key hyperparameters used for fine-tuning all models. All models were trained for 300 epochs with a mini-batch size of 32 using the Adam optimizer with default parameters. Weight decay and dropout regularization were omitted, and all weights were fine-tuned from ImageNet pretraining without freezing any layers. To address the class imbalance inherent in the dataset, we employed a weighted cross-entropy loss in which class weights were set to the inverse frequency of each class in the training set. The learning rate was modulated with a cosine annealing warm restart scheduler with a fixed initial cycle length, cycle extension factor, and minimum learning rate. During training, we applied the conservative data augmentation described in Section 3.1 to improve generalization, including random rotations, horizontal and vertical flips, mild brightness and contrast shifts, hue and saturation perturbations, and Gaussian noise. Validation and test images were left unaltered aside from resizing to 224 × 224 pixels and normalization using ImageNet statistics. Stratified sampling was used to partition the dataset into training (60%), validation (20%), and test (20%) subsets.
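One plausible PyTorch realization of this training configuration is sketched below; `train_labels`, `train_loader`, and `model` are assumed variables, and the learning-rate and scheduler values (`lr=1e-4`, `T_0=10`, `T_mult=2`, `eta_min=1e-6`) are illustrative placeholders since the exact values are not restated here:

```python
import numpy as np
import torch
import torch.nn as nn

# Inverse-frequency class weights for the weighted cross-entropy loss.
counts = np.bincount(train_labels, minlength=9).astype(np.float64)
class_weights = torch.tensor(counts.sum() / (counts * len(counts)), dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Adam with default betas; cosine annealing with warm restarts (placeholder values).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6)

for epoch in range(300):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```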
Following training, saliency maps were generated for all test samples using the two considered post hoc gradient-based attribution methods: Grad-CAM++ and LayerCAM. For each classification model, the target layer for attribution was automatically selected as the last convolutional block in CNNs and the final normalization layer in transformer-based architectures. Transformer outputs were reshaped as needed to conform to the 2D spatial format expected by the saliency algorithms. Saliency maps were computed with respect to the predicted class by backpropagating gradients from the output logit to the selected layer. The resulting maps were normalized to the [0, 1] range, resized to the input resolution, and superimposed on the original images for visual inspection. This procedure enabled both qualitative evaluation and quantitative analysis of the spatial characteristics and clinical plausibility of the model explanations.
All experiments were executed on a workstation equipped with an Intel Core i5-11400 CPU, an NVIDIA GeForce RTX 3060 GPU, and 64 GB of DDR4 RAM, running Ubuntu 20.04 (64-bit). Model training and inference were conducted using PyTorch 1.6.0 with CUDA 10.1. This hardware–software configuration ensured consistent and efficient performance across all evaluated architectures and explainability methods.
5. Discussion
Our study presents two principal findings that together offer a new perspective on model trustworthiness in automated dermoscopy. First, we empirically demonstrate that modern deep learning architectures based on attention mechanisms systematically outperform classical convolutional designs in this multiclass skin lesion classification task. Second, and more importantly, we find that the saliency maps generated by the best-performing models show statistically significant structural differences between correctly and incorrectly classified samples. This indicates that higher predictive accuracy is accompanied by more consistent explanation patterns. It also suggests that the decision-making processes of these models, when successful, tend to follow more stable and interpretable spatial reasoning as captured by XAI techniques.
The performance hierarchy detailed in Table 2 confirms the advantages of transformer-based backbones for this complex visual task. The superior accuracy, precision, and recall of Swin-Tiny and MaxViT-Tiny can be attributed to their architectural innovations. Unlike the fixed, local receptive fields of traditional CNNs such as ResNet-50, attention mechanisms allow these models to create dynamic, input-specific connections across the entire image [13,30]. This is particularly advantageous in dermoscopy, where diagnostic clues may involve long-range spatial dependencies, such as the overall asymmetry of a lesion or subtle textural variations distributed across its surface. The strong performance of ConvNeXt-Tiny, a hybrid design that modernizes the CNN with transformer principles, further reinforces this conclusion, highlighting a clear trend towards architectures that can effectively model global context.
The morphometric analysis of saliency maps provides valuable insight into the observed performance differences. As shown in Table 3 and Table 4, a distinct pattern emerges regarding the strength and nature of the statistical differences in the explanations. The transformer-based models, Swin-Tiny and MaxViT-Tiny, exhibit highly significant differences primarily in metrics related to attentional focus. Lower entropy and higher Gini coefficients indicate more focused model attention, while increased compactness and radial symmetry suggest attention concentrated on the core lesion area rather than on background structures or artifacts. The high significance of these focus-related metrics for transformer-based models suggests that their attentional mechanisms are particularly stable and well-structured when predictions are correct, becoming notably more diffuse when errors occur. In contrast, while convolutional models also show some significant differences, these are scattered across various metrics and are only observed at the standard significance level ($p < 0.05$). These patterns yield a measurable morphometric signature that reflects the model’s confidence and confusion.
Synthesizing these two findings leads to our central argument: the improved accuracy of modern architectures is not an incidental outcome but may be a direct consequence of their ability to form more clinically relevant and interpretable internal representations. The quantitative difference in explanation morphology suggests the best models are not merely pattern matching based on opaque features. Instead, they appear to be learning a visual grammar that parallels human clinical assessment, where concepts like symmetry, compactness, and focused attention (as captured by our metrics) are paramount to a correct diagnosis, as codified in heuristics like the ABCDE rule [3]. Our analysis, therefore, provides quantitative evidence that the black box is learning to reason in a way that is not only more effective but also more transparent and aligned with established clinical expertise.
Figure 4 offers a compelling visual case study that illustrates our core findings. The traditional CNN architectures, ResNet-50 and EfficientNetV2-S, which misclassify the nevus, produce saliency maps with a centrally focused activation pattern. This may reflect a reliance on a coarse analysis of the lesion’s overall shape rather than its internal details. In contrast, the models that correctly classify the lesion, ConvNeXt-Tiny, Swin-Tiny, and MaxViT-Tiny, generate more complex and spatially specific heatmaps. Their activations highlight multiple distinct subregions, suggesting a more nuanced understanding of local texture and structure. This example supports our broader conclusion that the improved performance of modern architectures is associated with internal representations that are not only more clinically meaningful but also yield saliency maps whose characteristics differ statistically between correct and incorrect predictions.
From a practical standpoint, these findings suggest a promising new direction for clinical safety and model monitoring. Instead of just trusting a model’s output probability, our morphometric indicators offer a way to audit the reasoning process behind each decision. This can enhance clinical trust in two key ways. First, it moves beyond a black box system by providing a transparent second opinion on the quality of the AI’s explanation; a clinician can see not only what the model focused on, but also a quantitative score of how coherently it focused. Second, it helps mitigate the risk of silent failures by flagging predictions based on diffuse or unfocused saliency maps (e.g., high entropy or low compactness), even if the model’s confidence is high. This serves as a crucial safety net against automation bias.
In a clinical environment, these indicators could directly aid diagnosis and optimize workflows. For example, a system could automatically triage cases: predictions with high-quality explanation scores might be marked for faster review, while those with poor scores are prioritized for detailed expert assessment, regardless of the predicted class. This could also function as a second-look mechanism; if an AI-based diagnosis conflicts with a clinician’s initial assessment but is backed by a high-quality explanation score, it could prompt a valuable re-evaluation. Ultimately, these morphometric properties augment expert judgment by providing a real-time, quantitative layer of interpretability that audits the model’s decision-making process on a case-by-case basis.
Limitations and Future Directions
While this study establishes a foundational link between explanation morphology and reliability, it is essential to situate our findings within the specific scope of our methodology and to acknowledge the limitations inherent to our design. Notably, the novelty of our quantitative approach makes direct comparison with previous work unfeasible at this stage. Instead, our findings serve as a critical baseline for this emerging research direction, providing a structured framework for future studies to build upon.
Our experimental design was deliberately focused to ensure reproducibility and statistical control. For this reason, we centered our analysis on a single, high-quality public dataset with expert annotations and utilized a fixed data split. This controlled environment was crucial for isolating the novel relationship between saliency morphology and prediction correctness. This foundational work now paves the way for broader investigations into generalizability. A valuable next step, for instance, would be to validate our morphometric indicators on prospective, multi-center clinical datasets and employ k-fold cross-validation to provide an even more robust estimate of performance.
Similarly, our selection of models and explainability techniques was tailored to the goals of this study. We chose five representative architectures with comparable capacity to specifically isolate the impact of their inductive biases on the explanations. Our focus on gradient-based methods was also intentional, as they produce the dense, high-resolution maps required for our spatial analysis. Having established this baseline, future work could extend this comparative analysis to a wider range of emerging architectures or adapt the morphometric framework for other families of XAI techniques, such as perturbation-based methods like LIME.
Furthermore, a formal ablation study remains future work that would systematically analyze these architectures. This would involve isolating specific components, such as attention mechanisms or skip connections, to determine their precise influence on the morphology and reliability of the resulting explanations.
Finally, the morphometric framework itself, built upon five fundamental descriptors, provides a solid starting point that can be readily expanded. Future studies could incorporate a richer set of features, such as advanced texture descriptors or more sophisticated shape metrics, to potentially uncover even more subtle reliability indicators. Such work will be crucial to continue building trust in AI-driven diagnostic tools and to deepen our understanding of their decision-making processes.