1. Introduction
Skin cancer represents one of the most prevalent malignancies worldwide, with melanoma accounting for approximately 75% of skin cancer-related deaths despite comprising fewer than 5% of cases. Early detection dramatically improves survival rates from 14% to over 99%, highlighting the urgent need for accurate and accessible diagnostic tools. While deep learning has shown promise in dermatological diagnosis, existing approaches lack clinical explainability and deployable interfaces that bridge the gap between research innovation and practical healthcare applications.
Traditional dermatoscopic image assessment faces significant challenges, particularly in regions with limited expert availability and variability in clinical experience. Tschandl et al. noted that neural network training for automated diagnosis is limited by small, non-diverse dermatoscopic datasets [1]. The emergence of artificial intelligence, particularly convolutional neural networks (CNNs), has opened new possibilities for automated diagnosis. Esteva et al. demonstrated that deep neural networks could achieve dermatologist-level performance using a dataset of 129,450 clinical images encompassing 2032 diseases, with CNNs matching 21 board-certified dermatologists across critical binary classification tasks [2].
The HAM10000 dataset by Tschandl et al. comprises 10,015 high-resolution dermatoscopic images from diverse populations, encompassing seven diagnostic categories: melanocytic nevi (66.9%), melanoma (11.1%), benign keratosis-like lesions (11.0%), basal cell carcinoma (5.1%), actinic keratoses (3.3%), vascular lesions (1.4%), and dermatofibroma (1.1%). Over 50% of lesions were confirmed histopathologically, with the remaining cases validated by expert consensus, follow-up examination, or confocal microscopy [1].
Significant progress has been achieved through deep learning architectures. Wu et al. showed that CNN architectures, such as VGGNet, GoogleNet, ResNet, and variants, have been successfully applied to skin cancer classification [3]. Ahmad et al. achieved accuracies of 99.3% and 91.5% on the HAM10000 and ISIC2018 datasets using deep learning with explainable AI, data augmentation, fine-tuned pre-trained models, and the Butterfly Optimization Algorithm for feature selection [4]. Recent innovations include Perez et al.’s data augmentation, achieving an AUC of 0.882 [5], Himel et al.’s Vision Transformer, achieving 96.15% accuracy [6], Liu et al.’s SkinNet, achieving 86.7% accuracy with an AUC of 0.96 [7], Munjal et al.’s SkinSage XAI, achieving 96% accuracy and 99.83% AUC [8], and Cino et al.’s test-time augmentation, achieving 97.58% balanced accuracy [9].
Network-level fusion architectures have emerged as promising approaches. Arshad et al. proposed a novel architecture, integrating multiple deep models through depth concatenation, achieving 91.3% and 90.7% accuracies on the HAM10000 and ISIC2018 datasets while maintaining lower computational complexity [10]. Hussein et al. introduced hybrid quantum deep learning utilizing HQCNN, BiLSTM, and MobileNetV2, achieving 97.7% training and 89.3% test accuracy [11]. Krishna et al.’s LesionAid utilized ViTGANs for class imbalance, achieving 99.2% training and 97.4% validation accuracy [12]. Wang et al.’s boundary-aware transformers integrated CNNs with transformer architectures for segmentation [13].
Multimodal approaches have shown significant promise. Tang and Lasser’s asymmetric multimodal fusion method reduced parameters from 58.49 M to 32.48 M without compromising performance [14]. Hasan and Rifat’s hybrid ensemble framework achieved a partial AUC of 0.1755 with an above 80% true positive rate [15]. Thomas demonstrated that combining image features with patient metadata enhanced classification across six deep neural networks [16], while Tran Van and Le achieved 95.73% accuracy through cross-attention fusion [17]. Advanced frameworks include Wang et al.’s self-supervised multi-modality learning [18], Yu et al.’s MDSIS-Net [19], and Christopoulos et al.’s SLIMP nested contrastive learning [20].
Recent contributions by Aksoy et al. evaluated seven deep learning models, with DenseNet169 achieving 85% accuracy and successful web-based deployment featuring visualization and automated medical knowledge extraction [21,22]. Yan et al.’s PanDerm foundation model, pre-trained on over 2 million images, achieved state-of-the-art performance across 28 benchmarks, outperforming clinicians by 10.2% in early-stage melanoma detection [23]. Vision–language models, like Kamal and Oates’ MedGrad E-CLIP, enhanced diagnostic transparency through weighted entropy mechanisms [24].
Clinical accessibility has been addressed through smartphone-based approaches. Oyedeji et al. investigated 13 deep learning models using clinical images, with DenseNet-161 achieving 79.40% binary accuracy and EfficientNet-B7 reaching 85.80% [25]. For clinical deployment, explainability frameworks have been essential. Wu et al. emphasized the importance of explainable AI [3], while Ahmad et al., Munjal et al., and Cino et al. implemented Grad-CAM, LIME, and t-SNE visualizations [4,8,9]. Advanced approaches include Patrício et al.’s concept-based explanations [26], Ieracitano et al.’s TIxAI trustworthiness index [27], and Metta et al.’s ABELE explainer [28].
Comprehensive evaluations have been conducted across ISIC datasets. Yao achieved a validation AUC greater than 94% using EfficientNet B6 and VGG 16 with GBDT ensemble learning [29]. Paccotacya-Yanque et al. compared seven explainability methods, identifying essential properties of fidelity, meaningfulness, and effectiveness [30].
Despite significant advances in deep learning-based skin lesion classification, important gaps remain in the literature. While explainability techniques like Grad-CAM and LIME have been extensively explored for pixel-level attribution and feature-importance visualization, XRAI has not previously been applied to skin lesion analysis. XRAI advances beyond these traditional methods by producing region-based explanations that are more coherent and spatially connected than conventional pixel-level attributions. Unlike Grad-CAM, which often produces scattered heatmaps, or LIME, which segments images into superpixels that may not align with clinical reasoning patterns, XRAI creates contiguous regions that better correspond to how dermatologists naturally examine lesions, focusing on unified areas of clinical significance rather than individual pixels or arbitrary segments. This region-based approach aligns closely with dermatological reasoning, where clinicians assess lesions based on coherent morphological features, such as asymmetry, border irregularity, color variation, and diameter changes. Furthermore, clinically deployable interfaces that provide both diagnostic capabilities and practical patient guidance are lacking in the existing literature.
These limitations have been addressed in this study through two key novel contributions. First, the first implementation of XRAI explainability for skin lesion classification has been presented, providing more coherent and spatially connected explanations compared to traditional pixel-level attribution methods. Second, the first deployable web-based clinical interface has been developed, which combines diagnostic predictions with visual explanations and evidence-based cosmetic recommendations for benign conditions, bridging the gap between AI research and practical clinical utility.
2. Methodology
2.1. Data Acquisition and Preprocessing
This study utilized the HAM10000 dataset (“Human Against Machine with 10,000 training images”), a comprehensive collection of dermatoscopic images specifically designed for machine learning applications in dermatological diagnosis. The dataset was originally developed for the International Skin Imaging Collaboration (ISIC) 2018 challenge and represents one of the largest publicly available collections of annotated skin lesion images for academic research purposes.
The HAM10000 dataset comprises 10,015 high-resolution dermatoscopic images collected from diverse populations using multiple acquisition modalities and imaging devices. This multi-source approach ensures enhanced generalizability and robustness compared to single-institution datasets. The images encompass seven distinct diagnostic categories representing the most clinically significant types of pigmented skin lesions encountered in dermatological practice.
As illustrated in Table 1, the dataset exhibits significant class imbalance, with melanocytic nevi representing approximately two-thirds of all samples, while rare conditions, such as dermatofibroma and vascular lesions, constitute less than three percent combined. This distribution reflects real-world clinical prevalence patterns but necessitates specialized handling techniques during model training to prevent bias toward majority classes.
The diagnostic ground truth for the HAM10000 dataset was established through multiple validation methodologies, ensuring high confidence in label accuracy. More than fifty percent of the lesions received histopathological confirmation, which represents the gold standard for dermatological diagnosis. The remaining cases were validated through expert consensus among board-certified dermatologists, follow-up examination protocols, or confirmation via in vivo confocal microscopy techniques.
This multimodal validation approach provides robust ground truth labels while accommodating the practical constraints of clinical practice, where not all lesions undergo invasive histopathological examination. The dataset includes lesions with multiple images, trackable through unique lesion identifiers, allowing for comprehensive analysis of individual cases across different imaging conditions and time points.
All dermatoscopic images underwent systematic preprocessing to ensure consistency and optimal model performance. The raw images, originally stored in JPEG format across two directory partitions, were first validated for accessibility and quality. The images were resized to a standardized input resolution of 224 × 224 pixels using high-quality interpolation algorithms to maintain visual fidelity while ensuring computational efficiency.
To optimize feature extraction and model convergence, custom normalization parameters were calculated specifically for the HAM10000 dataset rather than relying on generic ImageNet statistics. Through stratified sampling of 1996 images across all diagnostic categories, the following normalization parameters were derived:
Channel-wise means: [0.763, 0.544, 0.568] for RGB channels, respectively;
Channel-wise standard deviations: [0.141, 0.152, 0.169] for RGB channels, respectively.
These parameters reflect the unique color characteristics of dermatoscopic imagery, which typically exhibit higher red channel intensity and distinct color distributions compared to natural images used in ImageNet pretraining.
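As an illustration, the channel statistics above can be estimated with a routine of the following form. This is a minimal sketch, assuming images are loaded as tensors scaled to [0, 1]; the dataset object and sampling indices are placeholders rather than the exact implementation.

```python
import torch
from torch.utils.data import DataLoader, Subset

def estimate_channel_stats(dataset, sample_indices, batch_size=32):
    """Estimate per-channel mean and std from a stratified sample of images."""
    loader = DataLoader(Subset(dataset, sample_indices), batch_size=batch_size, shuffle=False)
    n_pixels = 0
    channel_sum = torch.zeros(3)
    channel_sq_sum = torch.zeros(3)
    for images, *_ in loader:                       # images: (B, 3, H, W), values in [0, 1]
        b, c, h, w = images.shape
        n_pixels += b * h * w
        channel_sum += images.sum(dim=(0, 2, 3))
        channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))
    mean = channel_sum / n_pixels
    std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
    return mean, std   # e.g., approximately [0.763, 0.544, 0.568] and [0.141, 0.152, 0.169]
```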
Beyond image data, the HAM10000 dataset provides comprehensive metadata, including patient demographics and lesion characteristics. This multimodal information was systematically processed and integrated to enhance diagnostic accuracy.
Age information was normalized using z-score standardization to account for the broad age distribution. Missing age values were imputed with population mean values. Gender information was encoded using binary representation (male = 1, female = 0) with appropriate handling for unspecified cases. Lesion location data encompassed twelve distinct body regions, including face, scalp, neck, trunk, extremities, back, abdomen, and chest. This categorical information was transformed using one-hot encoding to preserve spatial relationships.
The integration of demographic and anatomical metadata creates a comprehensive multimodal dataset that mirrors real-world clinical decision-making processes, where dermatologists consider patient characteristics alongside visual examination findings.
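A minimal sketch of this metadata encoding is shown below. Column names follow the public HAM10000 metadata file, while the exact location vocabulary and the resulting vector dimensionality are illustrative assumptions rather than the precise implementation.

```python
import numpy as np
import pandas as pd

LOCATIONS = ["abdomen", "back", "chest", "face", "foot", "hand",
             "lower extremity", "neck", "scalp", "trunk", "upper extremity", "unknown"]

def encode_metadata(df: pd.DataFrame) -> np.ndarray:
    """Encode age (z-scored, mean-imputed), sex (binary), and location (one-hot)."""
    age = df["age"].fillna(df["age"].mean())
    age = (age - age.mean()) / age.std()                    # z-score standardization
    sex = (df["sex"] == "male").astype(float)               # male = 1, female/unknown = 0
    loc = pd.Categorical(df["localization"], categories=LOCATIONS)
    loc_onehot = pd.get_dummies(loc).reindex(columns=LOCATIONS, fill_value=0)
    return np.column_stack([age, sex, loc_onehot.values]).astype(np.float32)
```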
To ensure robust model evaluation and prevent data leakage, the dataset was partitioned using stratified random sampling to maintain proportional class representation across training, validation, and test sets. The final partitioning scheme allocated seventy percent of samples for training (7010 images), fifteen percent for validation (1502 images), and fifteen percent for final testing (1503 images).
This partitioning strategy ensured adequate representation of minority classes in each subset while providing sufficient training data for complex model architectures. The stratified approach maintained the original class distribution across all partitions, preventing potential bias that could arise from uneven class allocation.
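The 70/15/15 stratified partitioning can be reproduced along these lines; this sketch assumes a metadata DataFrame `df` with a diagnosis column `dx`, and the random seed is illustrative.

```python
from sklearn.model_selection import train_test_split

# First split off 30% of the data, then divide it equally into validation and test,
# stratifying on the diagnosis label at each step to preserve class proportions.
train_df, temp_df = train_test_split(df, test_size=0.30, stratify=df["dx"], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df["dx"], random_state=42)
```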
Figure 1 presents representative samples from each diagnostic category, illustrating the visual diversity and characteristic features of different lesion types within the dataset. The samples demonstrate the challenging nature of automated skin lesion classification, with subtle visual differences between benign and malignant conditions requiring sophisticated pattern recognition capabilities.
2.2. Model Architectures and Training Methodology
This study employed a comprehensive multi-architecture strategy to evaluate different deep learning approaches for skin lesion classification. Three distinct convolutional neural network architectures were implemented and compared: DenseNet-121, EfficientNet-B3, and ResNet-50. Each architecture was selected to represent different designs and computational approaches, providing complementary perspectives on the classification task while enabling robust performance comparison and ensemble learning opportunities.
All models incorporated a unified multimodal design that combines dermatoscopic image analysis with patient metadata integration. This approach mirrors clinical decision-making processes where dermatologists consider both visual lesion characteristics and patient demographics when making diagnostic assessments.
2.2.1. DenseNet-121 Architecture
The DenseNet-121 model served as the primary architecture, utilizing dense connectivity patterns that facilitate feature reuse and gradient flow throughout the network. The implementation employed ImageNet pre-trained weights as initialization, utilizing transfer learning to benefit from features learned on natural images while adapting to dermatoscopic imagery characteristics.
The DenseNet backbone was modified by replacing the original classification layer with a custom multimodal fusion system. Image features extracted from the final dense block (1024 dimensions) were concatenated with processed metadata features to create a comprehensive representation. The metadata processing pipeline consisted of a two-layer neural network that transformed the twelve-dimensional demographic and anatomical input into a thirty-two-dimensional dense representation through batch normalization, ReLU activation, and dropout regularization.
The classification head employed a three-layer architecture with progressive dimensionality reduction from the fused feature space (1056 dimensions) through 512- and 256-dimensional hidden layers to the final seven-class output. Each layer incorporated batch normalization and dropout (rates of 0.5 and 0.3, respectively) to prevent overfitting while maintaining robust feature learning. The complete DenseNet-121 implementation contained 7,632,743 trainable parameters, providing substantial model capacity while remaining computationally efficient (Table 2).
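A condensed sketch of this fusion design is given below. Layer sizes follow the description above (1024-dimensional image features, a 12-to-64-to-32 metadata branch, and a 1056-to-512-to-256-to-7 head), while module names and minor details are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultimodalDenseNet(nn.Module):
    """Sketch of the DenseNet-121 + metadata fusion architecture described above."""
    def __init__(self, n_meta=12, n_classes=7):
        super().__init__()
        backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        backbone.classifier = nn.Identity()                  # expose 1024-d pooled features
        self.backbone = backbone
        self.meta_net = nn.Sequential(                       # metadata branch: 12 -> 64 -> 32
            nn.Linear(n_meta, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.3))
        self.head = nn.Sequential(                           # fused head: 1056 -> 512 -> 256 -> 7
            nn.Linear(1024 + 32, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes))

    def forward(self, image, metadata):
        img_feat = self.backbone(image)
        meta_feat = self.meta_net(metadata)
        return self.head(torch.cat([img_feat, meta_feat], dim=1))
```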
2.2.2. EfficientNet-B3 Architecture
EfficientNet-B3 was selected as the second architecture to use compound scaling principles that systematically balance network depth, width, and resolution. This architecture represents a modern approach to efficient neural network design, achieving high accuracy while maintaining computational efficiency through carefully optimized scaling coefficients.
The EfficientNet-B3 backbone, initialized with ImageNet pre-trained weights, provided 1536-dimensional feature representations from its final global average pooling layer. The same metadata processing pipeline used in DenseNet was applied, creating a 1568-dimensional fused feature vector for classification. The identical classification head architecture ensured a fair comparison between models while maintaining consistent training dynamics.
With 11,637,263 trainable parameters, EfficientNet-B3 offered increased model capacity compared to DenseNet-121 while incorporating modern architectural innovations, such as mobile inverted bottleneck convolutions and squeeze-and-excitation blocks. Forward pass execution time averaged 0.616 s for a batch of four images, demonstrating acceptable computational efficiency for practical deployment scenarios (Table 2).
2.2.3. ResNet-50 Architecture
ResNet-50 provided the third architecture, implementing residual learning principles that enable training of very deep networks through skip connections. This architecture served as a classical baseline, representing established deep learning approaches while contributing to ensemble diversity through its distinct feature learning characteristics.
The ResNet-50 backbone generated 2048-dimensional feature representations, requiring adaptation of the fusion architecture to accommodate the larger feature space. The metadata processing remained identical, but the fused feature vector expanded to 2080 dimensions before classification. Despite this increased dimensionality, the same classification head structure was maintained to ensure consistent comparison methodology.
ResNet-50 was the largest model in the comparison, with 24,711,207 trainable parameters, providing maximum model capacity at the cost of increased computational requirements. Forward pass execution time averaged 0.018 s per batch, demonstrating efficient inference despite the larger parameter count.
2.2.4. Multimodal Feature Integration
All three architectures implemented identical multimodal integration strategies to enable a fair performance comparison. Patient metadata, including age, sex, and anatomical location, underwent systematic preprocessing before neural network integration. Age values were normalized using z-score standardization, while categorical variables received appropriate encoding transformations.
The metadata processing network employed two fully connected layers with sixty-four and thirty-two neurons, respectively, incorporating batch normalization and ReLU activation functions. Dropout regularization (rate 0.3) was applied to prevent overfitting while maintaining generalization capability. This design created meaningful demographic representations that enhanced image-based classification through clinically relevant auxiliary information.
Feature fusion occurred through the simple concatenation of image and metadata representations, creating joint feature vectors that captured both visual and demographic patterns. This approach enabled the models to learn interactions between lesion appearance and patient characteristics, potentially improving diagnostic accuracy for cases where demographic factors influence lesion presentation.
2.2.5. Training Configuration and Optimization
All models employed identical training configurations to ensure a fair comparison and reproducible results. The AdamW optimizer was selected with a learning rate of 0.001 and weight decay of 1 × 10−4 to provide robust optimization with regularization. Learning rate scheduling utilized ReduceLROnPlateau with patience of three epochs and reduction factor of 0.5, enabling adaptive learning rate adjustment based on validation performance.
Focal Loss, with a gamma parameter of 2.0 and an alpha of 1.0, addressed the significant class imbalance present in the HAM10000 dataset. This loss function provided increased focus on difficult examples while maintaining stability during training. Gradient clipping with a maximum norm of 1.0 prevented gradient explosion and ensured stable training dynamics across all architectures.
Training proceeded for a maximum of twenty-five epochs with early stopping based on validation accuracy plateaus. The batch size was set to thirty-two samples to balance memory efficiency with stable gradient estimation. Each model was trained using the same 70/15/15 data split to enable direct performance comparison.
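The training configuration can be summarized in code as follows. This is a sketch: `model` stands for one of the multimodal networks described above, the focal loss is the standard formulation with the stated gamma and alpha, and the scheduler is assumed to monitor validation accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Focal loss with gamma = 2.0 and alpha = 1.0, as used during training."""
    def __init__(self, gamma=2.0, alpha=1.0):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)                                  # probability of the true class
        return (self.alpha * (1 - pt) ** self.gamma * ce).mean()

# `model` is assumed to be one of the multimodal networks defined earlier.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.5, patience=3)

# Inside the training loop, gradients are clipped to a maximum norm of 1.0:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()
```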
All experiments were conducted on an NVIDIA RTX 4060 GPU with 8 GB VRAM. The models were implemented in PyTorch (version 2.7.0) with CUDA acceleration, using Python 3.13 on a Windows-based workstation. Supporting libraries, such as scikit-learn and NumPy, were utilized for evaluation metrics, visualization, and statistical analysis.
2.2.6. Test-Time Augmentation Enhancement
Test-time augmentation (TTA) was implemented as a post-training enhancement technique to improve model robustness and accuracy without requiring model retraining. The TTA strategy employed eight distinct augmentation transformations applied to each test sample: original image, horizontal flip, vertical flip, fifteen-degree rotation, negative fifteen-degree rotation, top-left crop, bottom-right crop, and ninety-percent scaling.
During inference, each test image underwent all eight transformations, generating multiple predictions that were ensemble-averaged to produce the final classification result. This approach utilized the principle that consistent predictions across multiple image variations indicate robust model confidence, while averaging reduces prediction variance and improves overall accuracy.
The TTA implementation maintained metadata consistency across all augmented versions while transforming only the image component. The probability distributions from all eight predictions were averaged before final class selection, providing a more robust inference mechanism than single-image prediction. This technique required no additional training time while potentially improving test accuracy by 2–3 percentage points.
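A minimal sketch of the eight-view TTA procedure is shown below, assuming a multimodal `model(image, metadata)` call on tensor inputs; the exact crop and scaling parameters are illustrative.

```python
import torch
import torchvision.transforms.functional as TF

def tta_predict(model, image, metadata):
    """Average softmax predictions over eight augmented views of one image."""
    h, w = image.shape[-2:]
    ch, cw = int(0.9 * h), int(0.9 * w)
    views = [
        image,                                                            # original
        TF.hflip(image),                                                  # horizontal flip
        TF.vflip(image),                                                  # vertical flip
        TF.rotate(image, 15),                                             # +15 degrees
        TF.rotate(image, -15),                                            # -15 degrees
        TF.resized_crop(image, 0, 0, ch, cw, [h, w]),                     # top-left crop
        TF.resized_crop(image, h - ch, w - cw, ch, cw, [h, w]),           # bottom-right crop
        TF.resize(TF.center_crop(image, [ch, cw]), [h, w]),               # 90% scaling
    ]
    with torch.no_grad():
        probs = [torch.softmax(model(v.unsqueeze(0), metadata), dim=1) for v in views]
    return torch.stack(probs).mean(dim=0)                                 # averaged distribution
```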
2.2.7. Model Evaluation Methodology
Comprehensive evaluation protocols were established to ensure robust and reproducible assessment of model performance across all architectures. The evaluation framework incorporated multiple complementary metrics and visualization techniques to provide a thorough analysis of classification accuracy, per-class performance, and model reliability.
All models were evaluated using identical protocols, including overall accuracy calculation, precision, recall, and F1-score for each diagnostic category. Classification reports generated detailed per-class statistics with four-decimal precision to enable precise performance comparisons between architectures. Confusion matrices provided a visual assessment of classification patterns and common misclassification errors across different lesion types.
The stratified 70/15/15 data split ensured unbiased evaluation with sufficient sample sizes for statistical significance testing. Test set evaluation was performed only once per model using the best validation checkpoint to prevent data leakage and maintain evaluation integrity. Per-class accuracy calculations accounted for class imbalance by analyzing performance within each diagnostic category separately.
Comprehensive visualization pipelines generated standardized plots for direct model comparison. Training progression plots displayed accuracy and loss curves across epochs with target threshold lines at eighty-five percent accuracy. Confusion matrices employed consistent color schemes and annotation formats to enable visual comparison between architectures. Per-class accuracy bar charts highlighted strengths and weaknesses across different lesion types with sample size annotations.
To validate the statistical significance of observed performance differences, comprehensive statistical testing was conducted using paired t-tests, effect size analysis, and confidence interval estimation across all reported accuracy metrics (test accuracy, validation accuracy, and test-time augmentation accuracy).
The evaluation methodology ensured fair comparison between architectures while maintaining rigorous standards for medical AI research, providing the foundation for reliable performance assessment and clinical applicability analysis. The comprehensive multi-architecture approach enabled thorough evaluation of different deep learning paradigms while providing opportunities for ensemble learning and robust performance assessment across varying computational constraints.
2.3. Explainability Implementation (XRAI)
The implementation of explainable artificial intelligence (XAI) capabilities utilized the XRAI (eXplanation with Region-based Attribution for Images) algorithm, specifically adapted for medical imaging applications through the PAIRML saliency library. XRAI was selected over alternative explanation methods due to its strong performance in medical image analysis and its region-based explanations, which align with clinical diagnostic reasoning patterns. Unlike pixel-level attribution methods, such as Grad-CAM or LIME, XRAI generates coherent, spatially connected explanations that correspond to anatomically meaningful structures in dermatoscopic images.

The XRAI algorithm operates by partitioning the input image into hierarchical regions and computing attribution scores based on ranked area integrals. This approach ensures that explanations maintain spatial coherence while reflecting the model’s decision-making process. For skin lesion classification, this methodology proves particularly valuable because dermatologists naturally evaluate lesions by examining specific regions and structures rather than individual pixels.
2.3.1. Technical Implementation Framework
The explainability system was integrated with the best-performing EfficientNet-B3 architecture to provide interpretable predictions. The implementation required careful adaptation of the PAIRML saliency framework to accommodate the multimodal nature of the skin lesion classification model, incorporating both image features and patient metadata during explanation generation.

The explainability framework utilized PyTorch hooks to capture intermediate feature representations and gradients from the final convolutional layer of EfficientNet-B3. Specifically, the system registered forward and backward hooks on the Conv2d layer with dimensions (384, 1536, kernel_size = (1,1)), enabling extraction of both feature maps and gradient information required for attribution computation. The hook implementation employed tensor dimension manipulation to ensure compatibility between PyTorch’s channel-first format and XRAI’s expected channel-last configuration.

The explanation generation process maintained consistency with the training preprocessing pipeline, applying identical normalization parameters derived from the HAM10000 dataset statistics. The images underwent standardized resizing to 224 × 224 pixels, followed by tensor conversion and GPU memory allocation for efficient computation. The preprocessing function incorporated gradient requirement activation, enabling backpropagation through the network during attribution computation.
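The hook registration can be sketched as follows. The module path to the final convolutional layer is illustrative and depends on how the EfficientNet-B3 backbone is wrapped inside the multimodal model.

```python
import torch

feature_maps, gradients = {}, {}

def register_xrai_hooks(model):
    """Attach forward/backward hooks to the final Conv2d of the EfficientNet-B3 backbone."""
    target_layer = model.backbone.features[-1][0]   # Conv2d(384, 1536, kernel_size=(1, 1))

    def forward_hook(module, inputs, output):
        # Store feature maps in channel-last layout (B, H, W, C) for XRAI.
        feature_maps["value"] = output.permute(0, 2, 3, 1).detach()

    def backward_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0].permute(0, 2, 3, 1).detach()

    handles = [target_layer.register_forward_hook(forward_hook),
               target_layer.register_full_backward_hook(backward_hook)]
    return handles   # call handle.remove() on each when explanation generation is finished
```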
2.3.2. Multimodal Explanation Generation
The explainability implementation addressed the unique challenge of generating explanations for multimodal inputs, combining dermatoscopic images with patient demographics. The system developed a specialized model call function that integrated metadata features during explanation computation while maintaining focus on visual features most relevant to clinical interpretation. During explanation generation, patient metadata, including age, sex, and anatomical location, was processed through the same encoding pipeline used during training. For cases where specific metadata was unavailable, the system employed default values (age = 45, female encoding for unknown sex, zero-encoded anatomical location) to ensure explanation generation. In analysis modes, the system used zero-filled metadata vectors, focusing purely on image-based explanations.
The explanation framework computed attributions for both predicted classes and alternative diagnostic possibilities, providing comprehensive insight into model decision-making processes. For each input image, the system generated XRAI attributions targeting the predicted class while also computing explanations for other clinically relevant classes, enabling a comparative analysis of model focus patterns across different diagnostic hypotheses. This was achieved by iterating over the output logits for each target class and applying class-specific backpropagation to obtain the corresponding gradients. Gradient computations were batched for efficient execution, and attributions were calculated using Integrated Gradients and then refined with XRAI’s region-growing technique. The resulting attribution tensors were stored in dictionaries keyed by image ID and class label, allowing straightforward retrieval for interactive visualization within the user interface.
2.3.3. Explanation Visualization and Interpretation
The visualization framework generated comprehensive explanation displays, incorporating multiple complementary views to enhance clinical interpretability. The system produced standardized outputs, including original images, XRAI heatmaps, importance-filtered regions, and overlay visualizations, to support different aspects of clinical reasoning. The explanation system computed and displayed regions at multiple importance thresholds, typically showing the top ten percent, twenty percent, and thirty percent most important areas. This hierarchical approach enables clinicians to understand both primary diagnostic features and secondary supporting evidence used by the model. The visualization employed the inferno colormap for heatmap generation, providing intuitive color coding, where brighter regions indicate higher diagnostic importance. Each explanation included a comprehensive statistical analysis of attribution patterns, computing maximum importance scores, mean attribution values, and percentile thresholds for quantitative assessment. The system calculated focus ratios to distinguish between highly localized attention patterns and more distributed decision-making strategies, providing insight into model confidence and reasoning patterns for different lesion types.
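A short sketch of this threshold-based display is given below, assuming the XRAI attribution map and the original image are available as NumPy arrays; the figure layout is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_top_regions(image, xrai_attr, percentages=(10, 20, 30)):
    """Show the XRAI heatmap and the image masked to its most important regions."""
    fig, axes = plt.subplots(1, len(percentages) + 1, figsize=(4 * (len(percentages) + 1), 4))
    axes[0].imshow(xrai_attr, cmap="inferno")
    axes[0].set_title("XRAI heatmap")
    for ax, pct in zip(axes[1:], percentages):
        threshold = np.percentile(xrai_attr, 100 - pct)   # cut-off for the top pct% of pixels
        masked = image.copy()
        masked[xrai_attr < threshold] = 0                 # hide less important areas
        ax.imshow(masked)
        ax.set_title(f"Top {pct}% regions")
    for ax in axes:
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```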
2.3.4. Error Analysis and Validation Framework
The explainability implementation incorporated error analysis capabilities to understand model limitations and validate explanation quality. The system specifically analyzed misclassified cases to identify patterns in model failures and compare attribution patterns between correct and incorrect predictions.
The framework prioritized analysis of clinically dangerous errors, particularly cases where malignant lesions (melanoma, basal cell carcinoma) were misclassified as benign conditions. For each error case, the system generated dual explanations showing both the model’s actual focus (leading to incorrect prediction) and the regions that should have received attention for correct classification. This comparative analysis revealed systematic patterns in model failures and validated the reliability of XRAI explanations.
The error analysis computed correlation coefficients between attribution patterns for predicted versus true classes, quantifying the similarity or divergence in model focus. Correlation values below 0.3 indicated completely different focus patterns that explained prediction errors, while values above 0.6 suggested close diagnostic calls where model and ground truth reasoning showed substantial overlap.
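This correlation analysis reduces to a direct comparison of the two attribution maps, as sketched below; the function name is illustrative, and the threshold comments restate the interpretation given above.

```python
import numpy as np

def attribution_correlation(attr_pred: np.ndarray, attr_true: np.ndarray) -> float:
    """Pearson correlation between XRAI attribution maps for the predicted and true classes."""
    return float(np.corrcoef(attr_pred.ravel(), attr_true.ravel())[0, 1])

# Interpretation used in the error analysis:
#   r < 0.3 -> the model focused on different regions than those supporting the true class
#   r > 0.6 -> "close call": substantial overlap between predicted- and true-class focus
```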
2.4. Web Application Architecture
The web application was developed using the Gradio framework to provide an intuitive, clinically oriented interface for real-time skin lesion analysis. The interface design prioritized accessibility, clinical workflow integration, and clear presentation of diagnostic information while maintaining professional medical standards. The application architecture employed responsive design principles with a maximum container width of 1400 pixels to ensure optimal viewing across different device types and screen resolutions.
The application employed a two-column grid layout, with the left column dedicated to input collection (scale = 2) and the right column for displaying the results (scale = 3). This asymmetric layout prioritized result visibility while maintaining efficient space utilization for input controls. The upload section featured a distinctive dashed border and light gray background (#f8f9fa) to clearly delineate the input area, while the results section employed white backgrounds with subtle shadows for professional presentation.
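A minimal sketch of this layout using the Gradio Blocks API is shown below. Component choices and the `analyze` callback are illustrative placeholders rather than the deployed implementation.

```python
import gradio as gr

def analyze(image, age, sex, location):
    """Placeholder for the full preprocessing, inference, and XRAI pipeline."""
    raise NotImplementedError

with gr.Blocks(css=".gradio-container {max-width: 1400px !important;}") as demo:
    with gr.Row():
        with gr.Column(scale=2):                              # input column
            image_in = gr.Image(label="Dermatoscopic image")
            age_in = gr.Number(label="Age", value=45)
            sex_in = gr.Radio(["male", "female"], label="Sex")
            loc_in = gr.Dropdown(
                ["abdomen", "back", "chest", "face", "foot", "hand",
                 "lower extremity", "neck", "scalp", "trunk", "upper extremity"],
                label="Anatomical location")
            submit = gr.Button("Analyze lesion")
        with gr.Column(scale=3):                              # results column
            diagnosis_out = gr.HTML(label="Diagnostic assessment")
            xrai_out = gr.Image(label="XRAI explanation")
            probs_out = gr.Plot(label="Class probabilities")
    submit.click(analyze, [image_in, age_in, sex_in, loc_in],
                 [diagnosis_out, xrai_out, probs_out])

# demo.launch()
```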
2.4.1. Real-Time Inference Pipeline
The application implemented a real-time inference pipeline capable of processing dermatoscopic images with integrated XRAI explainability generation. The system was designed to handle variable input formats while maintaining consistent preprocessing standards and delivering results within acceptable latency constraints for clinical use.
The preprocessing pipeline maintained strict adherence to training data standards, applying identical normalization parameters derived from HAM10000 dataset statistics (mean = [0.763, 0.544, 0.568], std = [0.141, 0.152, 0.169]). The input images underwent automatic format detection and conversion, supporting both numpy arrays and PIL Image objects. The system implemented standardized resizing to 224 × 224 pixels using high-quality interpolation algorithms to ensure optimal model performance while preserving diagnostic features.
The application integrated patient demographic information with image analysis through a comprehensive metadata encoding system. Age values underwent normalization to a 0–1 scale (age/100), while categorical variables, including sex and anatomical location, received appropriate encoding. The location mapping system supported twelve anatomical regions, including abdomen, back, chest, face, foot, hand, lower extremity, neck, scalp, trunk, and upper extremity, with one-hot encoding implementation to maintain consistency with the training data structure. This structured approach ensured that both numerical and categorical patient data were combined with image-derived features for downstream model training and inference.
2.4.2. Model Integration and Deployment
The web application integrated the best-performing EfficientNet-B3 model with comprehensive error handling and graceful degradation capabilities. The system implemented automatic model loading with state dictionary restoration from trained weights, incorporating proper device management for both CPU and GPU deployment scenarios.
The application employed robust model initialization procedures with exception handling to ensure reliable deployment across different computational environments. The system automatically detected available hardware (CUDA-enabled GPU or CPU fallback) and configured device placement accordingly. Asynchronous inference requests were also supported through a task queue mechanism using background workers, preventing UI blocking and ensuring scalability for multiple simultaneous users.
For XRAI explainability integration, the application implemented PyTorch hook registration on the final convolutional layer of the EfficientNet-B3 backbone. Forward hooks captured feature maps while backward hooks collected gradient information, storing outputs in globally accessible dictionaries for attribution computation. The hook system employed proper tensor dimension manipulation to ensure compatibility between PyTorch’s channel-first format and XRAI’s expected channel-last configuration. The XRAI attributions were computed by combining integrated gradients with region-based segmentation masks generated via a hierarchical superpixel algorithm, enabling pixel-level contribution mapping. The resulting attribution maps were normalized to a 0–1 range, converted to NumPy arrays, and overlaid onto the original input images using OpenCV-based alpha blending for seamless visualization in the web interface.
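The final blending step can be illustrated with a short OpenCV sketch; function and parameter choices here are assumptions consistent with the description above rather than the exact implementation.

```python
import cv2
import numpy as np

def overlay_attribution(image_bgr: np.ndarray, attribution: np.ndarray, alpha: float = 0.5):
    """Normalize an attribution map to 0-1, colorize it, and alpha-blend it onto the image."""
    attr = (attribution - attribution.min()) / (attribution.max() - attribution.min() + 1e-8)
    heatmap = cv2.applyColorMap((attr * 255).astype(np.uint8), cv2.COLORMAP_INFERNO)
    heatmap = cv2.resize(heatmap, (image_bgr.shape[1], image_bgr.shape[0]))
    return cv2.addWeighted(image_bgr, 1 - alpha, heatmap, alpha, 0)   # blended uint8 image
```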
2.4.3. Safety Protocols and Risk Stratification
The application incorporated safety protocols to ensure responsible deployment in clinical contexts while maintaining clear boundaries regarding medical advice and diagnostic authority. The system implemented risk stratification with appropriate urgency messaging and safety disclaimers. The classification system employed a four-tier risk stratification framework: very high (melanoma), high (basal cell carcinoma), medium (actinic keratoses), and low (benign conditions). Each risk level triggered specific color coding (red for very high, orange for high, yellow for medium, green for low) and corresponding urgency messaging. Critical cases (melanoma, basal cell carcinoma) generated emergency alerts recommending immediate dermatological consultation, while lower-risk cases provided appropriate guidance for routine monitoring or evaluation. The application consistently emphasized its educational and research purpose through disclaimers positioned both within individual results and in the global footer information. The system explicitly stated that outputs should not replace professional medical diagnosis or treatment, directing users to consult healthcare professionals for medical concerns. This approach maintained ethical responsibility while providing valuable educational insights.
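A simplified sketch of this risk mapping is shown below; class keys follow the HAM10000 label abbreviations, while the colors and message wording are illustrative.

```python
# Four-tier risk stratification with color coding and urgency messaging (illustrative).
RISK_LEVELS = {
    "mel":   ("VERY HIGH", "red",    "Urgent: seek immediate dermatological consultation."),
    "bcc":   ("HIGH",      "orange", "High risk: prompt dermatological evaluation recommended."),
    "akiec": ("MEDIUM",    "yellow", "Medium risk: schedule a dermatological evaluation."),
    "nv":    ("LOW",       "green",  "Low risk: routine monitoring is usually sufficient."),
    "bkl":   ("LOW",       "green",  "Low risk: routine monitoring is usually sufficient."),
    "df":    ("LOW",       "green",  "Low risk: routine monitoring is usually sufficient."),
    "vasc":  ("LOW",       "green",  "Low risk: routine monitoring is usually sufficient."),
}

def risk_message(predicted_class: str) -> str:
    """Build the color-coded risk banner shown with each prediction."""
    level, color, advice = RISK_LEVELS[predicted_class]
    return (f"<div style='color:{color}'><b>Risk level: {level}</b><br>{advice}<br>"
            "<i>Educational use only; not a substitute for professional diagnosis.</i></div>")
```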
2.4.4. Evidence-Based Cosmetic Guidance System
The cosmetic recommendation system included evidence-based guidance for four benign condition types: melanocytic nevi, benign keratosis, dermatofibroma, and vascular lesions. Each recommendation set underwent validation to ensure safety and appropriateness. The system excluded cosmetic guidance for potentially malignant conditions (melanoma, basal cell carcinoma, actinic keratoses) to prevent inappropriate self-treatment of serious conditions. Cosmetic guidance was presented through styled cards featuring gradient backgrounds with hierarchical organization. Recommendations were structured as evidence-based tips with specific product categories (broad-spectrum sunscreen SPF 30+, fragrance-free moisturizers, gentle cleansers) and application instructions.
2.4.5. Explainability Visualization Framework
The application integrated a comprehensive XRAI explainability visualization to provide clinically meaningful insights into model decision-making processes. The visualization framework generated multiple complementary views to support different aspects of clinical interpretation and model transparency. The XRAI explanation system generated three-panel visualizations, including an original image display, an attribution heatmap using the inferno colormap, and an overlay visualization combining the original image with a semi-transparent attribution overlay. This multi-panel approach enables clinicians to understand both the raw model focus and its relationship to anatomical structures within a lesion. The application generated comprehensive probability visualizations showing confidence scores across all seven diagnostic categories using horizontal bar charts. The predicted class received distinctive red highlighting, while alternative classes were displayed in neutral gray. Probabilities were rounded to three decimal places to provide simple quantitative measures of model uncertainty and confidence, and grid lines with value labels were added to improve readability and clinical utility.
2.4.6. Web Application Validation Protocol
A comprehensive validation protocol was established to evaluate the successful translation of the research model into a functional clinical interface. The validation methodology employed systematic stratified sampling to select representative test cases from the held-out test set, ensuring unbiased evaluation across all diagnostic categories while accounting for class imbalance inherent in the HAM10000 dataset.
Seven test images were selected, one from each diagnostic class, using a random selection algorithm that validated file accessibility and extracted corresponding metadata, including patient demographics and lesion characteristics. The selected validation cases encompassed the full spectrum of diagnostic categories with varying patient demographics and anatomical locations: vascular lesions (50-year-old female, abdomen), melanocytic nevi (60-year-old female, chest), melanoma (70-year-old male, back), dermatofibroma (55-year-old male, upper extremity), benign keratosis (75-year-old male, chest), basal cell carcinoma (65-year-old male, back), and actinic keratoses (55-year-old female, upper extremity). This diverse selection provided comprehensive coverage of age ranges, anatomical locations, and lesion morphologies representative of clinical practice.
The validation protocol evaluated three key performance areas: (1) diagnostic accuracy and confidence scoring across all lesion types, (2) safety protocol implementation, including risk stratification and emergency response systems, and (3) technical performance metrics, including inference time, probability visualization accuracy, and XRAI explainability generation consistency. Each test case underwent complete workflow validation from image upload through final diagnostic output and safety protocol activation.
4. Discussion
The comprehensive evaluation of deep learning architectures with XRAI explainability integration demonstrates significant advances in automated dermatological diagnosis, establishing a foundation for clinically deployable AI systems that bridge the gap between research innovation and practical healthcare applications. This study’s approach, combining architectural comparison, explainable AI implementation, and clinical deployment validation, addresses critical gaps in existing dermatological AI research. Similar calls for clinically relevant deployment have been emphasized by Esteva et al. [2], Wu et al. [3], and Aksoy et al. [21], but few studies have combined high performance with interpretability and integration into web-based applications. The system was implemented using a modular PyTorch-based pipeline combined with Gradio, enabling integration into web applications and supporting both CPU and GPU inference through dynamic device allocation.
4.1. Model Performance and Comparative Analysis
The systematic comparison of three CNN architectures revealed fundamental insights into the effectiveness of different deep learning paradigms for medical image analysis. EfficientNet-B3’s superior performance (89.09% test accuracy, 90.08% validation accuracy) demonstrates the clinical relevance of compound scaling principles that systematically balance network depth, width, and resolution. This aligns with earlier findings by Wu et al. [3], who demonstrated that CNNs are robust baselines across architectures, and with Yao [29], who highlighted EfficientNet’s promise for lesion classification.
The substantial performance gap between EfficientNet-B3 and ResNet-50 (89.09% vs. 78.78%), despite ResNet-50’s larger parameter count (24.7 M vs. 11.6 M), highlights the importance of architectural innovation over mere model size in medical applications. Comparable insights were reported by Arshad et al. [10] through fusion strategies and Hussein et al. [11] through hybrid quantum deep learning, both showing that design choices can outweigh brute parameter count.
Benchmarking against recent state-of-the-art CNN approaches reveals important performance considerations within contemporary research trends. Hussain and Toscano [31] achieved exceptional performance using multiple CNN architectures with tailored data augmentation on HAM10000, with EfficientNetV2-B3 reaching over 98% accuracy through extensive preprocessing and class-specific augmentation strategies. While their accuracy substantially exceeds our 89.09%, their approach prioritized pure classification performance without incorporating explainability frameworks or clinical deployment considerations. Similarly, Roy et al. [32] demonstrated competitive results with 91.17% F1-score and 90.75% accuracy using wavelet-guided attention mechanisms and gradient-based feature fusion, representing sophisticated feature engineering approaches that complement our architectural comparison findings.
Recent Vision Transformer implementations have shown remarkable promise for skin lesion analysis. Agarwal and Mahto [33] further advanced hybrid approaches, achieving 92.81% accuracy on HAM10000 through sequential and parallel CNN–Transformer models with Convolutional Kolmogorov–Arnold Network (CKAN) fusion, showcasing the potential of architectural hybridization. Zoravar et al. [34] explored domain adaptation challenges through Conformal Ensemble of Vision Transformers (CE-ViTs), achieving 90.38% coverage rates across multiple datasets and highlighting the importance of uncertainty quantification in clinical applications.
When positioned within existing HAM10000 research, these results reveal important methodological trade-offs. Ahmad et al.’s framework achieved 99.3% accuracy using the Butterfly Optimization Algorithm [4], while Liu et al.’s SkinNet ensemble reached 86.7% through stacking techniques [7]. Krishna et al. [12] likewise pushed accuracy with Transformers and GAN-based imbalance correction. The current study’s 89.09% accuracy reflects deliberate prioritization of explainability integration and clinical deployment readiness over pure accuracy maximization. While recent state-of-the-art approaches [31,32,33,34] demonstrate higher classification accuracies, they predominantly focus on algorithmic optimization without addressing the critical gap between research performance and clinical utility. Unlike previous studies that concluded with performance metrics [31,32,33,34], this research provides a fully functional clinical prototype with comprehensive safety protocols, XRAI explainability, and evidence-based patient guidance systems.
The comparative analysis reveals distinct research philosophies within contemporary skin lesion classification. Pure accuracy-driven approaches [31,33] excel in controlled evaluation scenarios but lack the transparency and deployment infrastructure essential for clinical acceptance. Explainability-focused methods [32] demonstrate sophisticated feature analysis but remain research-grade implementations without clinical interfaces. Our integrated approach bridges this gap by accepting moderate accuracy trade-offs (89.09% vs. 96–98% in pure accuracy studies) in exchange for clinical explainability, real-time deployment capability, and comprehensive patient safety protocols that are absent in higher-performing but research-only implementations.
Per-class analysis reveals critical insights for clinical deployment. EfficientNet-B3’s strong performance for common conditions (95.6% for melanocytic nevi, 83.1% for basal cell carcinoma) establishes strong reliability for typical dermatological presentations. However, reduced performance for dermatofibroma (58.8% accuracy) reflects the inherent challenges of extremely rare conditions (17 test cases). Similar imbalance-related challenges were emphasized by Krishna et al. [12], Tang and Lasser [14], and recent studies [31,32,33,34], highlighting the importance of tailored approaches to minority-class detection in future research.
The comprehensive statistical validation provides robust evidence supporting EfficientNet-B3’s superior performance through multiple complementary analyses. All pairwise comparisons achieved statistical significance (p < 0.05) with large effect sizes (Cohen’s d > 0.8), indicating that observed differences represent meaningful practical improvements rather than statistical noise. The 95% confidence intervals that exclude zero further confirm the reliability of these performance advantages. These statistical findings validate the architectural comparison methodology and support the selection of EfficientNet-B3 for clinical deployment, addressing concerns about whether observed performance differences could be attributed to chance variation.
4.2. XRAI Explainability: Methodological Innovation
The implementation of XRAI explainability represents a significant methodological advancement over traditional attribution techniques. While previous studies employed Grad-CAM [4,8,9] or LIME [8] for visualization, these pixel-level methods often generate fragmented explanations that fail to align with clinical reasoning patterns. More advanced methods, including Patrício et al.’s concept-based explanations [26], Ieracitano et al.’s TIxAI trustworthiness index [27], and Metta et al.’s ABELE explainer [28], attempted to move toward higher-level interpretability, but none have applied region-based XRAI analysis in dermatology. The XRAI approach generates coherent, spatially connected explanations corresponding to anatomically meaningful structures, addressing this gap.
The comprehensive XRAI analysis revealed clinically meaningful attention patterns that align with established dermatological diagnostic criteria. For melanoma detection, concentrated attention on irregular pigmentation patterns and asymmetric borders aligns with ABCDE criteria, while focused attention on central ulcerated areas for basal cell carcinoma reflects appropriate morphological feature recognition. This complements observations by Munjal et al. [8] and Cino et al. [9], who used visualizations but lacked region-level coherence. The moderately focused attention patterns across all classes (focus ratios 0.103–0.132) demonstrate appropriate balance between specific feature detection and contextual analysis required for comprehensive dermatological evaluation.
Critical error analysis provided essential insights into model failure modes, particularly for high-stakes misclassifications. The case of a melanoma misclassified as a melanocytic nevus showed high correlation (0.862) between model attention and ideal features, indicating a “close call” scenario reflecting inherent diagnostic challenges even for experienced clinicians. The basal cell carcinoma error with high correlation (0.968) despite 100% confidence demonstrates that sophisticated attention mechanisms can fail with atypical presentations, emphasizing the importance of maintaining clinical oversight protocols. Similar emphasis on the role of oversight was highlighted by Wu et al. [3] and Thomas [16], especially when integrating AI into clinical workflows.
These findings indicate that misclassifications often arise not only from the inherent ambiguity of lesion morphology but also from dataset imbalance, as the HAM10000 collection is dominated by benign nevi while rarer conditions remain underrepresented [1]. This dual challenge underscores the need for clinical oversight and the development of balanced datasets to reduce systematic bias, consistent with observations by Tschandl et al. [1] and Tran Van and Le [17].
4.3. Multimodal Integration and Clinical Deployment Innovation
The successful integration of patient metadata with dermatoscopic images demonstrates significant advancement over purely image-based approaches that dominated previous HAM10000 research. This multimodal approach aligns with clinical reality, where patient characteristics significantly influence lesion presentation and diagnostic interpretation. The metadata processing pipeline enhanced classification accuracy while maintaining computational efficiency, reflecting the natural diagnostic process used by dermatologists.
The development of the first deployable web-based clinical interface with explainability and recommendations tailored to benign skin conditions represents a unique contribution that addresses the critical gap between research-grade AI models and clinical utility. Unlike previous studies focused on accuracy optimization, this research provides a fully functional clinical prototype with real-time inference, comprehensive safety protocols, and patient education components. The four-tier risk stratification system with appropriate urgency messaging and evidence-based cosmetic guidance for benign conditions demonstrates responsible medical AI deployment that prioritizes patient safety while maximizing diagnostic utility.
Validation across all diagnostic categories establishes robust performance suitable for controlled clinical trials, contrasting with previous studies that typically focused on algorithmic development without deployment validation. The integration of XRAI explainability within a real-time clinical interface enables healthcare professionals to validate AI reasoning processes during clinical decision-making, addressing transparency requirements essential for clinical acceptance.
4.4. Limitations and Future Research Directions
Several limitations require consideration for future research and clinical implementation. The persistent class imbalance challenges, particularly evident in dermatofibroma classification, highlight the need for specialized approaches to rare condition detection. Oversampling strategies, such as SMOTE or heavy synthetic augmentation, were not applied in this study, as such methods may introduce artificial dermoscopic patterns that fail to reflect true clinical presentations. While this approach preserved dataset authenticity and interpretability, it inevitably limited sensitivity in underrepresented categories, such as dermatofibroma and melanoma. Addressing this imbalance in future work will require integration of larger, multi-institutional datasets and collaborations with clinical partners to ensure more representative coverage of rare but clinically important lesions. In addition, future research should explore advanced data augmentation techniques, synthetic data generation, or federated learning approaches to address minority class limitations while maintaining diagnostic accuracy for common conditions.
Additionally, we acknowledge that the HAM10000 dataset contains multiple images per lesion, and while this study employed stratified random splitting to preserve class balance, this approach may allow images from the same lesion to appear across training, validation, and test sets. Although such stratified strategies have been commonly used in prior HAM10000 research to ensure minority-class representation, they may introduce potential data leakage. Future studies should therefore employ grouped splitting by lesion ID and extend validation to external datasets to further ensure robust generalization.
Moreover, this study emphasized clinical applicability and system deployment rather than exhaustive statistical validation. Formal significance testing was therefore limited to the paired comparisons of the reported accuracy metrics and was not extended to repeated training runs or multimodal variants. Future research should incorporate repeated-seed training and formal statistical validation across a broader set of models and multimodal variants to rigorously assess whether observed differences are statistically significant, thereby strengthening the robustness and reproducibility of comparative findings.

Epoch selection in this study was based on validation-driven early stopping, as both training and validation curves plateaued at approximately 24 epochs. While statistical testing does not directly determine the optimal epoch count, future work will combine repeated-seed training with paired tests and bootstrap confidence intervals to evaluate whether extended training yields statistically reliable improvements.

Furthermore, this study focused primarily on accuracy, F1-scores, and per-class accuracy as key evaluation metrics, given the application-oriented emphasis on clinical deployment. Operating points at fixed high-sensitivity thresholds, AUROC and PR-AUC scores, as well as per-class sensitivity and specificity with confidence intervals via bootstrapping, were not included in the present analysis. Future research should incorporate these metrics to provide a fuller assessment of model performance, particularly for high-stakes categories, such as melanoma, basal cell carcinoma, and actinic keratoses, where sensitivity at clinically relevant thresholds is critical.
Another limitation lies in the model’s performance on cases with indistinct lesion boundaries and atypical morphologies. As shown in the error analysis, such cases often led to misclassification despite strong alignment between model attention and clinically relevant regions. This issue is exacerbated by the significant class imbalance in the HAM10000 dataset, where benign lesions vastly outnumber malignant and rare categories, limiting the model’s exposure to difficult cases. Future research should therefore explore uncertainty quantification, ensemble predictions, and longitudinal imaging analysis, alongside balanced and diverse datasets, to provide more cautious and context-aware outputs in borderline scenarios. While this study evaluated performance on HAM10000, future work should explicitly incorporate independent test sets drawn from alternative sources, such as other ISIC challenge datasets or multi-institutional cohorts collected under different imaging conditions. Such external validation is critical for demonstrating true generalizability and for ensuring robustness across sites, devices, and populations.
An important limitation of this study is the restricted demographic diversity of the HAM10000 dataset. While the dataset is comprehensive in terms of lesion types, it primarily represents lighter Fitzpatrick skin phototypes and Central European populations. This underrepresentation of darker skin tones and broader ethnic groups may limit the generalizability of the model across diverse clinical settings. Future work should therefore focus on validating performance across multi-ethnic and international cohorts to ensure equitable diagnostic accuracy. Approaches such as federated learning and cross-dataset benchmarking may provide promising strategies to mitigate demographic bias and improve global clinical utility. While this study integrated patient metadata (age, sex, and anatomical site) alongside dermatoscopic imagery, we did not explicitly measure the incremental contribution of metadata compared to image-only models or analyze subgroup performance stratified by demographic factors such as sex, age, or lesion location. Similarly, robustness to missing or erroneous metadata was not systematically evaluated, as missing values were imputed or zero-encoded in the present work. Future research should therefore investigate metadata robustness more formally, including ablation studies (image-only vs. image + metadata), subgroup performance analysis, and sensitivity testing against incomplete or noisy metadata inputs.
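The following sketch illustrates one way such a sensitivity test could be run, assuming a hypothetical multimodal PyTorch model that accepts an image tensor and a floating-point metadata vector (age, sex, and site encodings); randomly zeroing metadata entries at inference time approximates missing age, sex, or site fields.

```python
import torch

def metadata_dropout_eval(model, loader, drop_prob=1.0, device="cpu"):
    """Evaluate accuracy when metadata entries are randomly zeroed out,
    mimicking missing or unrecorded demographic fields at inference time."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, metadata, labels in loader:
            images = images.to(device)
            metadata = metadata.to(device)
            labels = labels.to(device)
            # Each metadata entry is kept with probability (1 - drop_prob).
            mask = (torch.rand_like(metadata) > drop_prob).float()
            preds = model(images, metadata * mask).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Hypothetical sweep: full metadata, 50% missing, and image-only (all zeros).
# for p in [0.0, 0.5, 1.0]:
#     print(p, metadata_dropout_eval(fusion_model, test_loader, drop_prob=p))
```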
A further limitation is the absence of invasive squamous cell carcinoma (SCC) in the HAM10000 dataset. Although the dataset includes actinic keratoses/intraepithelial carcinoma (akiec), which represent early-stage precursors to SCC, invasive SCC lesions are not represented. While the present framework already covers melanoma, basal cell carcinoma, actinic keratoses, and benign conditions, future studies should expand training and validation datasets to include SCC cases in order to provide complete coverage of all major skin cancer types.
Finally, computational requirements for XRAI explainability generation, while acceptable for clinical deployment, may benefit from optimization techniques to reduce inference latency in high-volume clinical environments.
4.5. Clinical Impact and Healthcare Translation
The successful development of a clinically deployable skin lesion classification system addresses critical gaps in dermatological care accessibility, particularly in regions with limited specialist availability. The combination of diagnostic accuracy with transparent decision-making processes through XRAI visualization provides essential foundations for clinical acceptance and integration into existing healthcare workflows.
The web-based deployment architecture offers significant advantages for underserved regions by enabling AI-assisted diagnosis with only basic internet connectivity, eliminating the need for expensive local computational infrastructure or specialized hardware. This approach democratizes access to advanced dermatological AI by allowing healthcare providers in resource-constrained environments to leverage sophisticated diagnostic capabilities through standard web browsers, dramatically reducing implementation barriers and deployment costs compared to traditional on-premise AI solutions.
The web application’s comprehensive approach to patient education, incorporating diagnostic assessment and evidence-based guidance for benign conditions, has the potential to enhance patient engagement and self-monitoring capabilities while maintaining appropriate clinical boundaries. This educational component could contribute to improved health literacy and earlier detection of concerning lesion changes in populations with limited access to specialized dermatological care, particularly benefiting remote and rural communities where internet access may be the only available connection to advanced medical technologies.
5. Conclusions
This study successfully developed and validated a clinically deployable deep learning system for automated skin lesion classification, demonstrating significant advances in both diagnostic accuracy and practical clinical utility. Through a systematic comparison of three CNN architectures, EfficientNet-B3 emerged as the optimal model, achieving 89.09% test accuracy with superior performance across most diagnostic categories, including critical malignant conditions such as melanoma (81.6% confidence) and basal cell carcinoma (82.1% confidence).
The implementation of XRAI explainability represents a crucial methodological innovation that addresses the transparency requirements essential for clinical acceptance. Unlike traditional pixel-level attribution methods, XRAI generated coherent, spatially connected explanations that align with established dermatological diagnostic criteria, providing clinicians with interpretable insights into model decision-making processes.
The successful integration of patient metadata with dermoscopic images enhanced classification accuracy while reflecting the natural diagnostic process used by clinicians who consider both visual lesion characteristics and patient demographics. This multimodal approach contributed to the system’s robust performance across diverse patient populations and anatomical locations.
The development of the first deployable web-based clinical interface with integrated explainability and evidence-based recommendations represents a significant contribution to the field. The application successfully demonstrated real-time inference capabilities with safety protocols, including a four-tier risk stratification system and appropriate emergency response protocols for malignant conditions.
This research addresses critical gaps in dermatological care accessibility, particularly in regions with limited specialist availability. The web-based deployment architecture democratizes access to advanced diagnostic capabilities by requiring only standard internet connectivity, eliminating expensive infrastructure requirements. The comprehensive patient education components, including evidence-based cosmetic guidance for benign conditions, enhance patient engagement while maintaining appropriate clinical boundaries.
Future research should focus on addressing persistent class imbalance challenges through advanced data augmentation techniques, validating performance across international datasets to ensure global generalizability, and optimizing computational requirements for high-volume clinical environments. The successful development of this clinically deployable system establishes a foundation for responsible AI integration in dermatological practice, bridging the critical gap between research innovation and practical healthcare applications while prioritizing patient safety and clinical transparency.