1. Introduction
Skin cancer is among the most prevalent malignancies worldwide, and melanoma represents one of the deadliest forms due to its aggressive metastatic potential [1,2]. Early and accurate diagnosis is crucial to improving patient survival, yet clinical assessment of skin lesions remains challenging and highly dependent on dermatologist expertise. Dermoscopic imaging has become an essential noninvasive tool for evaluating skin lesions; however, its interpretation varies between observers and with clinical experience, particularly in primary care and resource-limited settings [3].
In recent years, artificial intelligence (AI)–driven image analysis, especially deep learning–based approaches, has demonstrated strong performance in automated dermoscopic image classification [4]. Convolutional neural networks (CNNs) and, more recently, Vision Transformers (ViTs) and hybrid CNN–ViT architectures have achieved dermatologist-level accuracy in controlled experimental settings. Consequently, AI-assisted dermoscopic analysis has been widely investigated as a potential decision support tool to improve diagnostic consistency and efficiency [5].
Despite these advances, most existing studies rely heavily on the HAM10000 dataset as the primary benchmark for model training and evaluation [6]. While this dataset has played a pivotal role in the advancement of dermoscopic image analysis, its limited scale, class imbalance, and constrained diversity may lead to overly optimistic performance estimates when models are evaluated under identical data distributions. In real-world clinical environments, variations in imaging conditions, acquisition devices, and patient populations often result in substantial distribution shifts, causing models trained on homogeneous datasets to exhibit degraded generalization performance.
Another important limitation in current research is the increasing performance saturation of deep learning models. As modern architectures become deeper and more expressive, the performance gains achieved through further scaling or architectural refinement have become marginal. This trend suggests that relying solely on single-model optimization may be insufficient to achieve meaningful improvements, particularly when dealing with large-scale and heterogeneous datasets. Consequently, new strategies are required to enhance robustness and generalization beyond what can be achieved through individual model advancements alone.
To systematically investigate these limitations, we curate HAM20000, a controlled expansion of the widely used HAM10000 dataset by integrating multiple publicly available ISIC releases from 2017 to 2024. Unlike datasets designed to maximize classification accuracy through extensive reannotation or aggressive curation, HAM20000 intentionally preserves the heterogeneous characteristics of real-world dermoscopic data, including class imbalance, annotation uncertainty, and inter-dataset variability. As a result, HAM20000 serves not as a new benchmark optimized for peak performance, but as a stress-test dataset for analyzing performance saturation, architectural sensitivity, and generalization behavior under realistic clinical conditions.
Beyond single-model evaluation, this study explores ensemble learning as a principled mechanism for mitigating performance saturation and improving robustness under domain shift. Rather than proposing a new ensemble algorithm, we focus on how greedy ensemble selection (GES) can be used to construct compact yet heterogeneous model ensembles that implicitly capture complementary inductive biases across convolutional, hierarchical Transformer, and hybrid CNN–Transformer architectures.
Since real-world clinical deployment inevitably involves domain shift, systematic external validation is essential but remains underexplored in previous work. Evaluation protocols are often confined to the same dataset or population, resulting in limited understanding of model generalizability. More broadly, recent studies have highlighted that the clinical usability of medical AI systems depends not only on predictive performance but also on the preservation of diagnostically relevant information and model trustworthiness. For instance, explainable AI–based approaches have been introduced to ensure that critical diagnostic features are retained during image processing tasks such as denoising [7].
To address this gap, we perform zero-shot cross-dataset evaluation using two independent external datasets: CSMUH, representing dermoscopic images collected from an East Asian population, and BCN20000, obtained from a different European clinical source [8,9]. These datasets differ substantially from ISIC-based data in terms of population characteristics, imaging protocols, and acquisition environments, providing a realistic and challenging assessment of model robustness under domain shift.
Specifically, this study aims to address the following research questions:
- (1) Does expanding the dataset scale and heterogeneity continue to benefit single-model performance, or does performance saturation emerge across architectural families?
- (2) How do different architectural inductive biases influence robustness under cross-population and cross-institution domain shifts?
- (3) Can heterogeneous ensemble selection mitigate performance saturation and improve generalization without target-domain fine-tuning?
By answering these questions, this work seeks to provide methodological insight into the design and evaluation of robust dermoscopic decision support systems beyond conventional benchmark-centred optimization.
2. Materials and Methods
2.1. Research Workflow
Figure 1 presents the overall research workflow of this study. HAM20000 was first used as the training and testing dataset. After data consolidation, image size normalization, and reclassification into seven lesion categories, a unified seven-class dataset was constructed. Subsequently, 27 pre-trained models with diverse architectures (including CNN, ViT, and hybrid CNN–ViT) were employed for model training. Soft voting was used to integrate predictions from multiple models, and five-fold cross-validation was applied to evaluate the training and test accuracy of each model.
In the ensemble stage, Greedy Ensemble Selection (GES) was further applied to perform weighted ensemble optimization, aiming to improve test performance. The best-performing single model and the optimal ensemble model on HAM20000 were then selected. To evaluate model generalizability, these selected models were applied directly to the BCN20000 and CSMUH datasets for external validation, without additional training, and their test accuracy was compared to assess performance in cross-dataset and cross-population scenarios.
All models in this study were trained exclusively using image-level annotations, with no pixel-level, lesion-level, or attribute-level supervision provided. Each dermoscopic image was associated with a single diagnostic or risk label, reflecting the overall clinical assessment of the lesion rather than localized morphological features. Consequently, the primary objective of the proposed framework is image-based risk stratification, aiming to estimate the relative likelihood of malignancy at the image level, rather than to perform fine-grained lesion analysis, segmentation, or structural interpretation. This design choice aligns with common clinical screening scenarios, in which rapid global risk assessment is of greater practical relevance than detailed lesion characterization.
2.2. HAM20000 Dataset
HAM20000 is a controlled expansion of the widely used HAM10000 dermoscopic dataset. HAM10000 contains 10,015 dermoscopic images covering seven common pigmented lesion categories [6] under the ISIC annotation framework. To systematically study performance saturation and robustness under realistic heterogeneity, HAM20000 integrates multiple public ISIC releases from 2017 to 2024 on top of HAM10000. HAM20000 includes ISIC 2017 [10], ISIC 2018 [11], ISIC 2019 [6,11,12], ISIC 2020 [13], and the most recent ISIC 2024 [14] dataset.
Importantly, HAM20000 is curated with an evaluation-driven goal rather than an accuracy-maximization goal. Specifically, we intentionally preserve real-world heterogeneity in acquisition conditions (devices, illumination, clinical workflows), inter-institution variability, and the inherent annotation uncertainty of multi-source dermoscopic data. This design choice positions HAM20000 not as a newly annotated benchmark but as a stress-test dataset for analyzing (i) the limits of single-model scaling on a larger and more diverse distribution, and (ii) the interaction between architectural inductive biases and domain shift in downstream deployment scenarios.
After compiling the five publicly available ISIC datasets, a total of 20,042 images were obtained. To prevent data leakage caused by overlap between successive ISIC releases, we removed duplicate images and cases that appeared in datasets from different years. The final HAM20000 dataset contains 19,968 dermoscopic images, classified according to the same seven diagnostic categories as HAM10000: malignant melanoma (MM), melanocytic nevus (Mv), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BK), dermatofibroma (DF), and vascular lesions (Vasc). Ground-truth labels follow the original ISIC/HAM annotation schema, where diagnoses are based on histopathology whenever available and otherwise on expert consensus. Dataset sources are publicly accessible through the ISIC Archive, enabling transparency and reproducibility.
To improve curation transparency, we report the final class distribution after integration (Table 1) and keep the taxonomy consistent with HAM10000 to support a direct comparison with the previous literature. Our curation protocol emphasizes (i) deduplication across yearly ISIC releases to avoid leakage, (ii) label harmonization into a seven-class taxonomy, and (iii) controlled enrichment of minority classes to mitigate extreme imbalance while preserving clinically realistic variability.
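The deduplication step (i) can be illustrated with a minimal sketch. It assumes that duplicates are recognized by a shared image identifier; the matching criterion is not stated in the text, so the identifiers, release names, and the `deduplicate` helper below are hypothetical and serve only to show the bookkeeping.

```python
from collections import OrderedDict


def deduplicate(releases):
    """Merge releases in order, keeping the first occurrence of each image ID.

    `releases` maps a release name to a list of (image_id, label) records.
    Returns (kept_records, n_removed).
    """
    kept = OrderedDict()
    removed = 0
    for release_name, records in releases.items():
        for image_id, label in records:
            if image_id in kept:
                removed += 1  # same image already seen in an earlier release
                continue
            kept[image_id] = (release_name, label)
    return kept, removed


# Toy data with hypothetical identifiers (not real ISIC records).
toy_releases = {
    "release_a": [("img_001", "MM"), ("img_002", "BCC")],
    "release_b": [("img_002", "BCC"), ("img_003", "DF")],
}
kept, removed = deduplicate(toy_releases)
```

With the counts reported above, the same bookkeeping gives 20,042 − 19,968 = 74 removed images.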
2.3. CSMUH Dataset
The CSMUH dataset consists of dermoscopic images collected at Chung Shan Medical University Hospital (CSMUH), Taiwan, by board-certified dermatologists in a real-world clinical setting [15]. This dataset was specifically assembled to represent East Asian patient populations, which are underrepresented in most publicly available dermoscopic datasets dominated by Western cohorts.
The dataset includes a total of 666 dermoscopic images, classified into the same seven lesion categories as HAM20000, following the ISIC labeling convention: malignant melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, and vascular lesions [15]. All cases were diagnosed through clinical evaluation and confirmed by histopathological examination, ensuring high annotation reliability.
Images in the CSMUH dataset were captured using clinical dermoscopy equipment in routine dermatological workflows. Due to differences in skin phenotype, lesion appearance, and imaging practices, the CSMUH dataset exhibits distinct characteristics compared to ISIC-based datasets, making it a suitable benchmark for evaluating cross-population generalization and domain shift effects. The dataset is not publicly released due to institutional and ethical constraints, but its collection and labeling protocol have been described in prior peer-reviewed studies. The class distribution of the CSMUH external validation set is summarized in Table 2.
2.4. BCN20000 Dataset
The BCN20000 dataset is a large-scale dermoscopic image collection developed by the Department of Dermatology at the Hospital Clínic of Barcelona, Spain, and formally released in Scientific Data. The original dataset comprises 18,946 dermoscopic images acquired between 2010 and 2016, reflecting a wide range of real-world clinical imaging conditions [9]. In this study, the BCN20000 dataset was reorganized to match the same seven diagnostic categories used in HAM20000 to ensure label consistency for cross-dataset evaluation.
Samples belonging to classes not included in the HAM taxonomy, as well as ambiguous and out-of-distribution (OOD) lesions, were excluded. After this harmonization process, the resulting BCN20000 subset consists of 16,917 dermoscopic images categorized into the following seven types of lesions: malignant melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, and vascular lesions.
The BCN20000 dataset was originally designed to represent dermoscopic lesions in the wild, encompassing clinically challenging scenarios such as large lesions that exceed the dermoscope field of view, lesions located in difficult regions (e.g., nails and mucosal surfaces), and hypopigmented or low-contrast lesions. Consequently, even after harmonization, the dataset retains substantial variability in lesion appearance and image acquisition characteristics.
The patient population in BCN20000 mainly represents Southern European (Mediterranean) individuals. Images were captured under routine clinical conditions using heterogeneous dermoscopy devices and acquisition protocols, resulting in pronounced intra-dataset variability. Diagnostic labels were assigned based on histopathological confirmation whenever available and supplemented by consensus assessments of expert dermatologists for cases without pathological verification.
The curated seven-class BCN20000 subset was used in this study as an external validation dataset to assess the robustness and generalizability under realistic clinical domain shift conditions, complementing the HAM20000 training data. The class distribution of the BCN20000 external validation set is summarized in Table 3.
2.5. Data Preprocessing
To ensure consistency and comparability across different datasets, as well as to enhance the stability and generalization capability of the models during both training and inference, a systematic data preprocessing pipeline was applied to all dermoscopic images in this study. The preprocessing procedures included image size standardization, pixel normalization, label harmonization, and data split configuration, and were consistently applied throughout all experiments.
2.5.1. Image Preprocessing and Size Standardization
All input images were resized to a unified resolution of 224 × 224 pixels prior to being fed into the models to meet the input requirements of the selected CNN and Vision Transformer architectures [16]. During preprocessing, the original color channels of the images were preserved and converted to the RGB color space to avoid potential learning bias caused by color space discrepancies [4].
No manual lesion localization was performed. Instead, complete dermoscopic images were used directly as model inputs to better reflect real-world clinical application scenarios and to avoid introducing additional prior assumptions into the learning process.
2.5.2. Pixel Normalization
To accelerate model convergence and improve training stability, all images were converted into tensors and subsequently normalized using the ImageNet standard mean and standard deviation as follows:
Mean = [0.485, 0.456, 0.406]
Std = [0.229, 0.224, 0.225]
This normalization strategy aligns the input data distribution with that of the pre-trained weights, which is particularly beneficial for transfer learning and maintaining stable performance under cross-dataset evaluation scenarios.
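As a minimal sketch of the normalization above, the per-channel operation is written out here in plain Python. It assumes that pixel values are first scaled from the 8-bit range to [0, 1] before the mean and standard deviation are applied, which is what a tensor conversion such as torchvision's `ToTensor` followed by `Normalize` would do; the helper name `normalize_pixel` is illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


# Extreme inputs show the range the network receives after normalization.
black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```

Because the three channels use different statistics, the same raw intensity maps to slightly different normalized values in R, G, and B.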
2.5.3. Class Harmonization and Label Consistency
As this study involves multiple dermoscopic datasets originating from different sources (HAM20000, CSMUH, and BCN20000), all datasets were mapped to a common seven-class skin lesion taxonomy to ensure fair cross-dataset comparison and external validation. The class definitions strictly follow those of HAM20000 and include: malignant melanoma (MM), melanocytic nevus (Mv), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BK), dermatofibroma (DF), and vascular lesions (Vasc) [6,11].
Samples in the original BCN20000 dataset that did not belong to the above classification system, as well as cases labeled uncertain or out-of-distribution (OOD), were excluded during preprocessing. Furthermore, to prevent evaluation bias caused by inconsistent label semantics, all external datasets were subjected to label remapping based on a predefined class correspondence table when loaded, ensuring that identical numerical labels represent the same clinical diagnoses across different datasets.
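The remapping step can be sketched as a lookup through a class correspondence table. The integer order of the classes and the raw label spellings below are assumptions chosen for illustration; the actual correspondence table is not given in the text.

```python
# Target taxonomy; the integer order is an assumption for illustration.
TARGET_INDEX = {"MM": 0, "Mv": 1, "BCC": 2, "AK": 3, "BK": 4, "DF": 5, "Vasc": 6}

# Hypothetical raw-label spellings for one external dataset.
EXTERNAL_TO_TARGET = {
    "melanoma": "MM",
    "nevus": "Mv",
    "basal cell carcinoma": "BCC",
    "actinic keratosis": "AK",
    "benign keratosis": "BK",
    "dermatofibroma": "DF",
    "vascular lesion": "Vasc",
}


def remap_labels(records):
    """Return (image_id, class_index) pairs, dropping labels outside the taxonomy."""
    remapped = []
    for image_id, raw_label in records:
        target = EXTERNAL_TO_TARGET.get(raw_label.strip().lower())
        if target is None:
            continue  # uncertain, OOD, or unsupported class: excluded
        remapped.append((image_id, TARGET_INDEX[target]))
    return remapped


sample = [("a", "Melanoma"), ("b", "scar"), ("c", "nevus "), ("d", "unknown")]
result = remap_labels(sample)
```

Routing every dataset through one table keeps a given integer label tied to the same diagnosis regardless of the source.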
2.5.4. Data Splitting and Experimental Consistency
During model training and validation, the HAM20000 dataset was divided into training and test sets according to established settings, and cross-validation was used for model selection and for screening ensemble candidates [6]. In the external validation stage, the CSMUH and BCN20000 datasets were used exclusively for testing, without any additional fine-tuning or retraining, to rigorously evaluate model generalization under cross-population and cross-clinical environment conditions [4,6].
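One way to build cross-validation folds for an imbalanced seven-class dataset is sketched below. Stratifying by class is an assumption made here because of the class imbalance; the text specifies five folds but not how samples were assigned to them, and the helper name `stratified_k_fold` is illustrative.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class.

    Returns a list of k lists of indices; fold i serves as the held-out
    part in round i and the remaining folds form the training part.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar class share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


toy_labels = [0] * 10 + [1] * 5  # imbalanced toy labels
folds = stratified_k_fold(toy_labels, k=5)
```

In this toy case each fold receives two samples of the majority class and one of the minority class, so the class ratio is preserved in every held-out part.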
Overall, the data preprocessing pipeline was strictly kept consistent across all models and experimental settings, ensuring that subsequent model performance comparisons and ensemble learning results reflect differences in model architecture and methodology, rather than inconsistencies in data handling.
2.6. Model Selection
The objective of this study was to conduct a comprehensive evaluation of the performance and integration of multiple deep learning architectures in the classification of dermoscopic images. To ensure that the evaluation results were representative and generalizable, we systematically selected three major categories of architecture families: Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and hybrid CNN–ViT models.
This design simultaneously considered: (1) the ability of traditional networks to learn local visual features, (2) the ability of Transformer-based models to capture global and long-range dependencies, and (3) differences in computational efficiency and parameter scale. Through this formulation, we analyzed how each architecture behaved under data augmentation and cross-dataset evaluation scenarios and further established a set of complementary candidate models for subsequent ensemble learning.
2.6.1. Convolutional Neural Networks (CNNs)
Within the CNN architecture category, this study selected the EfficientNet and EfficientNet-V2 series as the core comparison models. EfficientNet employs a compound scaling strategy that optimizes network depth, width, and input resolution, and has been shown to achieve a favorable balance between performance and parameter efficiency in medical image analysis tasks [17].
To investigate the impact of model scale on classification performance, we included EfficientNet-B4, B5 and B6, covering configurations from mid-scale to large-scale models, thus progressively enhancing representational capacity.
Furthermore, EfficientNet-V2/M and EfficientNet-V2/L were incorporated to evaluate the effects of improved training strategies and faster convergence behavior in dermoscopic image fine-tuning scenarios [18].
In addition, CSATv2 R540 and R680 were included as high-performance convolutional architectures for supplementary comparison, allowing an analysis of how different CNN design philosophies influence feature extraction under the same input resolution.
2.6.2. Vision Transformer Models
To assess the potential of self-attention mechanisms in dermoscopic image classification, this study systematically incorporated a diverse range of Vision Transformer (ViT) architectures. ViT-B/16, ViT-L/16, and ViT-L/32 were selected as canonical Transformer representatives, allowing evaluation of the effects of model capacity and patch size on global feature modeling [16].
Moreover, Swin Transformer and Swin Transformer V2 models (Tiny, Small, and Base) were included. Their hierarchical architectures and window-based self-attention mechanisms effectively balance local and global feature representation while improving computational efficiency for high-resolution images [19,20].
To further increase architectural diversity, NaFlexViT (GAP and Parallel GAP) [21,22,23] and MaxViT [24] were also selected, representing flexible attention designs and hybrid window-plus-grid attention mechanisms, respectively. Additionally, a Low-Latency Vision Transformer [25] was incorporated to investigate the trade-off between performance and computational efficiency in lightweight Transformer architectures, particularly for latency-sensitive or resource-constrained applications.
2.6.3. Hybrid CNN–ViT Models
To combine the strengths of CNNs in local texture extraction with the global dependency modeling capability of Transformers, several hybrid CNN–ViT architectures were included for comparison. The EfficientFormerV2 series (S0 to L) [26] integrates convolution and attention modules at different network stages, achieving a balance between computational efficiency and representational power.
SwiftFormer [27] and MobileViTV2 [25] emphasize lightweight design principles, making them suitable for deployment scenarios with limited computational resources or practical clinical applications. Furthermore, CoAtNet-Lite-Small [28] interleaves convolutional and self-attention operations at different depth levels, offering high structural diversity and complementary feature learning strategies.
Collectively, these hybrid architectures exhibit substantial variation in the structural design and feature learning mechanisms, providing strong model complementarity for the subsequent ensemble learning stage.
2.7. Transfer Learning
Transfer learning involves applying models trained on a source domain to a new target domain, allowing the reuse of learned feature representations and eliminating the need to train models from scratch [29]. One of its main advantages lies in significantly reducing the amount of task-specific training data required while maintaining strong performance [30].
In this study, transfer learning was implemented by fine-tuning pre-trained models, where the weights of the early layers were frozen to preserve general visual features, and the remaining higher-level layers were trained to adapt to the dermoscopic image classification task. This strategy allows the models to retain robust low-level representations learned from large-scale natural image datasets while learning task-specific discriminative features in deeper layers.
A total of 27 widely adopted pre-trained architectures were selected as backbone models, including representative convolutional neural networks (CNNs) and Vision Transformer (ViT)-based models, all of which were initialized with ImageNet pre-trained weights to ensure a fair and consistent comparison between architectures. All models were also trained under identical hyperparameter settings to keep the comparison controlled. Specifically, the Adam optimizer was used with a learning rate of 1 × 10⁻⁴, a batch size of 32, and 100 training epochs for all experiments. All experiments were implemented using Python 3.10.19, PyTorch 2.7.1 + cu118, torchvision 0.22.1 + cu118, timm 1.0.25, and CUDA 11.5.
2.8. Greedy Ensemble Selection for Model Integration
This study adopts greedy ensemble selection (GES) as a practical instrument to construct compact heterogeneous ensembles from a library of pretrained models. We emphasize that our contribution is not a new ensemble algorithm; rather, we use GES to operationalize a domain-shift–oriented evaluation framework that examines how architectural diversity translates into robustness when deployment data differ from training data.
GES incrementally builds an ensemble by adding, at each iteration, the candidate model that yields the largest marginal improvement on a held-out validation set when combined with the current ensemble prediction. This stepwise procedure implicitly favors models that contribute complementary error patterns rather than simply selecting the strongest models [31]. In our setting, the candidate pool spans convolutional networks (CNNs), hierarchical Vision Transformers, and hybrid CNN–Transformer architectures, which differ substantially in inductive bias (local texture sensitivity, hierarchical locality, and global contextual modeling).
While greedy ensemble selection (GES) is designed to improve overall performance, it does not guarantee that the resulting ensemble will outperform the best individual model in all cases. Instead, the method prioritizes robustness and stability across datasets by aggregating complementary model predictions.
To reduce redundancy and limit unnecessary complexity, we stop adding models when the marginal validation gain becomes negligible or when the ensemble reaches a compact size that remains feasible for deployment. The final ensemble prediction is produced through soft voting (probability averaging), which tends to be more stable under uncertainty than hard voting in multiclass medical image classification [8].
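The selection loop described in this section can be sketched in a few lines. This is an illustrative simplification and not the exact procedure used in the study: it selects without replacement, scores candidates by validation accuracy, and uses hypothetical stopping parameters (`max_size`, `min_gain`) and toy probabilities.

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several models, sample by sample."""
    n_models = len(prob_lists)
    return [
        [sum(values) / n_models for values in zip(*per_sample)]
        for per_sample in zip(*prob_lists)
    ]


def accuracy(probs, labels):
    """Fraction of samples whose arg-max class equals the label."""
    correct = sum(
        1 for vector, label in zip(probs, labels)
        if max(range(len(vector)), key=vector.__getitem__) == label
    )
    return correct / len(labels)


def greedy_ensemble_selection(candidates, labels, max_size=5, min_gain=1e-3):
    """Forward selection: repeatedly add the model that most improves accuracy.

    `candidates` maps model name -> validation probabilities (samples x classes).
    Stops when the best gain falls below `min_gain` or `max_size` is reached.
    """
    selected, best_score = [], 0.0
    while len(selected) < max_size:
        best_name, best_trial = None, best_score
        for name in candidates:
            if name in selected:
                continue
            trial = accuracy(
                soft_vote([candidates[m] for m in selected + [name]]), labels
            )
            if trial > best_trial:
                best_name, best_trial = name, trial
        if best_name is None or best_trial - best_score < min_gain:
            break
        selected.append(best_name)
        best_score = best_trial
    return selected, best_score


# Toy validation set: two classes, four samples, hypothetical probabilities.
labels = [0, 1, 0, 1]
candidates = {
    "model_a": [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.7, 0.3]],      # 3 of 4 correct
    "model_b": [[0.45, 0.55], [0.2, 0.8], [0.45, 0.55], [0.1, 0.9]],  # 2 of 4 correct
    "model_c": [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.7, 0.3]],      # copy of model_a
}
selected, score = greedy_ensemble_selection(candidates, labels)
```

In this toy case the weaker `model_b` is chosen over the redundant `model_c` because it corrects the one error of `model_a`, which illustrates why the procedure favors complementary models over individually strong ones.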
2.9. Cross-Population Dataset Testing and External Validation
To assess real-world robustness, we adopt a strict zero-shot external validation protocol: all model selection (including ensemble composition) is performed using HAM20000 only, and the resulting models are directly evaluated on two independent external datasets (CSMUH and BCN20000) without any additional fine-tuning. This design intentionally isolates generalization behavior from target domain adaptation, thereby providing a conservative estimate of deployment robustness under domain shift [32].
The CSMUH dataset represents an East Asian population collected in a single medical center, while BCN20000 represents a Southern European cohort acquired under heterogeneous clinical conditions. Both datasets are organized into the same seven-class taxonomy to avoid label semantic mismatch. Under this protocol, we analyze (i) the magnitude of performance drop from source to target domains, (ii) architecture-specific sensitivity to population and institutional shifts, and (iii) whether heterogeneous ensembling can consistently mitigate domain-shift degradation beyond what any single model achieves.
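The quantity in (i) can be made concrete with a short sketch that reports the absolute and relative accuracy drop from the source test set to each external set. The accuracy values used here are placeholders chosen for illustration and are not results from this study; the function and dataset names are likewise hypothetical.

```python
def domain_shift_report(source_acc, target_accs):
    """Absolute and relative accuracy drop from the source test set to each target."""
    report = {}
    for name, target_acc in target_accs.items():
        drop = source_acc - target_acc
        report[name] = {
            "absolute_drop": drop,
            "relative_drop": drop / source_acc,
        }
    return report


# Placeholder accuracies (not measured values).
report = domain_shift_report(0.90, {"external_a": 0.72, "external_b": 0.81})
```

Reporting the relative drop alongside the absolute one makes models with different source accuracies easier to compare.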
3. Results
3.1. HAM20000: Performance Comparison of Different Pretrained Models
Table 4 summarizes the classification accuracy of the 27 pre-trained models evaluated on the HAM20000 dataset, reporting mean accuracy and standard deviation in repeated experiments for both training and test sets. The results highlight clear performance differences between architectural families, as well as the limitations of single-model optimization.
Among convolutional neural network (CNN) architectures, the EfficientNet family demonstrated consistently strong performance, with test accuracies ranging from 0.832 to 0.853. The larger variants, such as EfficientNetB6 and EfficientNetV2-L, achieved higher test accuracies compared to their smaller counterparts, indicating that the increased model capacity contributes to an improved feature representation in HAM20000. However, the gains from scaling were relatively incremental, suggesting diminishing returns as the size of the model increased.
The CSATv2 models achieved the best performance among CNN-based approaches, with CSATv2 R540 and CSATv2 R680 reaching test accuracies of 0.871 and 0.867, respectively. These results suggest that CNN architectures optimized for high-resolution input and multi-scale feature aggregation can provide competitive performance in dermoscopic image classification tasks.
Vision Transformer (ViT) models exhibited more varied behavior. The classic ViT architectures (ViT-B/16, ViT-L/16, ViT-L/32) showed moderate test performance, with accuracies between 0.831 and 0.843, generally lower than those of their CNN counterparts. This indicates that standard ViT models may struggle to generalize effectively on dermoscopic images without additional inductive bias.
In contrast, hierarchical and window-based Transformers demonstrated stronger performance. The Swin Transformer and Swin Transformer V2 variants consistently achieved test accuracy above 0.860, with Swin Base, SwinV2 Base, and MaxViT Small reaching up to 0.871. These architectures benefit from hierarchical feature representations that better capture both local lesion structures and broader contextual information.
Lightweight Transformer variants, including NaFlexViT (GAP and Parallel GAP) and the Low-Latency Vision Transformer, achieved lower test accuracies (0.839–0.843), reflecting the trade-off between computational efficiency and representational capacity.
Hybrid CNN–ViT architectures generally exhibited strong and stable performance. In particular, the EfficientFormerV2 series showed a clear performance progression with increasing model scale. The test accuracy improved from 0.845 (S0) to 0.869 (S2), and EfficientFormerV2 L achieved the highest single-model test accuracy of 0.879 among all models evaluated.
Other hybrid models, including MobileViTV2_100 and Swin Tiny, also achieved competitive results, reinforcing the effectiveness of combining convolutional inductive bias with Transformer-based global context modeling. These findings indicate that hybrid architectures provide a favorable balance between representation power and generalizability for dermoscopic image analysis.
In general, the results in Table 4 show that, despite architectural diversity and scaling strategies, single-model test accuracy on HAM20000 tended to saturate at around 0.87–0.88. Although hybrid and advanced Transformer architectures marginally outperformed traditional CNNs, no individual model achieved a decisive breakthrough. This performance saturation motivated the adoption of ensemble learning strategies, as the combination of complementary models may offer a more effective path to further performance improvement.
3.2. Results of Greedy Ensemble Selection
This section presents the results of model selection and performance evaluation using the Greedy Ensemble Selection (GES) framework based on the HAM20000 test set.
Table 5 summarizes the candidate models and their selection outcomes within the GES procedure. Initially, all 27 trained models were included in the GES candidate pool. Model selection was performed using five-fold cross-validation (CV), with the GES algorithm applied independently within each fold. As a result, different subsets of models were selected across folds.
Accordingly, the models listed in the CV row of Table 5 represent the union of models selected in at least one of the five cross-validation folds, rather than a single fixed ensemble configuration. In contrast, the final ensemble was constructed based on the best-performing fold, from which five models were selected and used in subsequent experiments. This design ensures that ensemble selection is conducted entirely within the cross-validation framework and remains independent of the final holdout test set.
The performance reported in the CV section (training accuracy: 0.988 ± 0.006; test accuracy: 0.918 ± 0.008) reflects the fold-averaged performance of the GES ensembles across all five folds, rather than individual model results. By comparison, the holdout evaluation corresponds to a single ensemble configuration derived from the best-performing fold. The final ensemble, consisting of EfficientFormerV2_L, EfficientNetV2_M, MaxViT-Small, EfficientNet_B6, and EfficientNetV2_L, achieved a training accuracy of 0.982 and a holdout test accuracy of 0.930.
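The relationship between the per-fold selections, the CV row, and the final ensemble can be sketched as follows. The model names and fold accuracies are placeholders, and the use of the sample standard deviation is an assumption, since the text does not state which estimator underlies the reported ± values.

```python
from statistics import mean, stdev


def summarize_folds(fold_selections, fold_accuracy):
    """Summarize per-fold GES runs: union of models, mean/std, best-fold ensemble."""
    union = sorted(set().union(*fold_selections))
    best_fold = max(range(len(fold_accuracy)), key=fold_accuracy.__getitem__)
    return {
        "union": union,
        "mean": mean(fold_accuracy),
        "std": stdev(fold_accuracy),  # sample standard deviation (assumption)
        "best_fold": best_fold,
        "final_ensemble": list(fold_selections[best_fold]),
    }


# Placeholder per-fold outcomes (not values from this study).
fold_selections = [
    ["model_a", "model_b"],
    ["model_a", "model_c"],
    ["model_b", "model_d", "model_a"],
    ["model_a"],
    ["model_c", "model_b"],
]
fold_accuracy = [0.80, 0.82, 0.85, 0.81, 0.82]
summary = summarize_folds(fold_selections, fold_accuracy)
```

The union corresponds to the CV row, the mean and standard deviation to the fold-averaged figures, and `final_ensemble` to the single configuration carried into the holdout evaluation.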
These results demonstrate that while different models may be selected across folds, the GES framework consistently identifies complementary architectures and integrates them into a stable and high-performing final ensemble.
Table 6 reports the computational efficiency of the GES-selected models under both cross-validation and holdout evaluation settings. To ensure practical relevance, inference performance was evaluated using the largest selected architecture, EfficientNetV2-L, with an input resolution of 224 × 224 × 3 and a batch size of 32. Under FP32 precision, batch inference with EfficientNetV2-L requires approximately 2.5 GB of GPU memory, which includes model parameters, input tensors, intermediate activations, and framework overhead. This modest memory footprint indicates that the proposed framework can be deployed on GPUs with limited memory capacity commonly available in clinical environments.
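As a rough plausibility check on the reported footprint, the fixed components can be computed directly. The sketch below is back-of-the-envelope arithmetic only: the parameter count is an assumption (approximately the published size of EfficientNetV2-L, not a figure taken from this study), and activations and framework overhead are not modeled.

```python
BYTES_FP32 = 4  # bytes per 32-bit float


def tensor_bytes(*shape: int, bytes_per_element: int = BYTES_FP32) -> int:
    """Memory needed to store a dense tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


# Input batch from the text: 32 images of 224 x 224 x 3.
input_bytes = tensor_bytes(32, 224, 224, 3)

# ASSUMPTION: roughly 1.185e8 parameters for EfficientNetV2-L.
ASSUMED_PARAMS = 118_500_000
param_bytes = ASSUMED_PARAMS * BYTES_FP32

reported_total = 2.5e9  # approximately 2.5 GB, as reported in the text
remainder = reported_total - input_bytes - param_bytes

print(f"inputs:     {input_bytes / 1e6:.1f} MB")
print(f"parameters: {param_bytes / 1e9:.2f} GB (assumed count)")
print(f"remainder:  {remainder / 1e9:.2f} GB (activations + overhead)")
```

Under these assumptions, weights and inputs account for well under one quarter of the total, so most of the footprint would come from intermediate activations and framework overhead, which scale with batch size.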
In terms of inference latency, the total execution time for cross-validation is 3829 s, reflecting repeated inference across the five folds, whereas the holdout evaluation requires only 760 s, roughly one-fifth as long. This reduction is consistent with the holdout setting running a single ensemble configuration rather than one per fold, which is closer to real-world deployment, where inference is typically performed once per case rather than repeatedly. Overall, these results show that the GES-selected ensemble achieves a favorable balance between predictive performance and computational efficiency, supporting its feasibility as a clinical screening and decision-support tool.
3.3. GES Performance Evaluation in HAM20000
Figure 2 presents the five-fold averaged confusion matrix for the best GES model evaluated on the HAM20000 dataset. Overall, the model demonstrates strong diagonal dominance across all seven classes, indicating that most samples are correctly classified. High true positive counts are observed for the majority classes, particularly class 1 and class 0, reflecting the model’s effectiveness on categories with sufficient training samples. However, non-negligible misclassifications remain between visually similar lesion categories, such as confusions between class 0 and class 4, as well as between class 2 and class 3, suggesting intrinsic inter-class ambiguity in dermoscopic patterns.
In the confusion matrix shown in Figure 2, the support of each class corresponds to the total number of ground-truth samples (row-wise sum), whereas the diagonal element represents only the true positives. Therefore, for class 0, the support is 996, not 905 (see Table 7). For the best GES model, the overall classification accuracy on the HAM20000 dataset reaches 91.8%.
To provide a more fine-grained evaluation beyond overall accuracy, Table 7 reports per-class precision, recall, and F1-score. The results show that the model achieves consistently high precision across all classes, with precision values exceeding 0.85 for every category and exceeding 0.94 for classes 1 and 6. This indicates that the model produces relatively few false positives. Recall values are also high for most major classes (classes 0, 1, 2, and 4), confirming reliable sensitivity on frequently occurring lesion types. In contrast, minority classes such as class 3 and class 5 exhibit lower recall despite maintaining high precision, implying that the model tends to make conservative predictions for underrepresented categories and may miss some true cases rather than over-predicting them.
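The relationship between the confusion matrix and the per-class metrics can be made concrete with a short sketch. It assumes the usual convention that rows are ground-truth classes and columns are predictions, and it uses a hypothetical three-class matrix rather than the values in Figure 2.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """cm[i][j] = number of samples with true class i predicted as class j."""
    k = len(cm)
    results = []
    for c in range(k):
        tp = cm[c][c]                                # diagonal: true positives
        support = sum(cm[c])                         # row sum: all true class-c samples
        predicted = sum(cm[r][c] for r in range(k))  # column sum: all class-c predictions
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": support}
        )
    return results


# Hypothetical 3-class confusion matrix (not the study's data).
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 4],
]
for cls, m in enumerate(per_class_metrics(toy_cm)):
    print(cls, m)
```

In this toy matrix, class 2 is never predicted unless it is correct, so its precision is perfect while its recall is not. This is the same pattern of high precision with lower recall described above for the minority classes.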
Table 8 further summarizes the class-balanced evaluation using macro-averaged and weighted-averaged metrics. The macro-averaged F1-score of 0.881 indicates that, when all classes are weighted equally, the model maintains reasonably balanced performance across categories despite the pronounced class imbalance of the dataset. The weighted-average F1-score of 0.918 is consistent with the overall accuracy, and the moderate gap between the two averages suggests that strong performance on majority classes does not mask severe degradation on minority classes. Together, these results confirm that the proposed GES ensemble achieves both high overall accuracy and reasonable class-balanced performance, while highlighting remaining challenges in improving recall for rare lesion types.
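For reference, the two averages are written out below using their standard definitions, which are assumed here to be the ones used for Table 8. The symbols are introduced for illustration: K is the number of classes (seven in this study), n_k is the support of class k, N is the total number of samples, and P_k, R_k, and TP_k are the precision, recall, and true-positive count of class k.

```latex
\[
\mathrm{F1}_k = \frac{2\,P_k R_k}{P_k + R_k}, \qquad
\mathrm{F1}_{\mathrm{macro}} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{F1}_k, \qquad
\mathrm{F1}_{\mathrm{weighted}} = \sum_{k=1}^{K}\frac{n_k}{N}\,\mathrm{F1}_k,
\quad N = \sum_{k=1}^{K} n_k .
\]
\[
\sum_{k=1}^{K}\frac{n_k}{N}\,R_k
  = \sum_{k=1}^{K}\frac{n_k}{N}\cdot\frac{\mathrm{TP}_k}{n_k}
  = \frac{1}{N}\sum_{k=1}^{K}\mathrm{TP}_k
  = \text{accuracy}.
\]
```

The second identity shows that support-weighted recall equals overall accuracy by construction, which is why weighted-average metrics tend to track accuracy closely, whereas the macro average gives rare classes the same weight as common ones.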
3.4. External Validation Results on the CSMUH Dataset
Table 9 and Table 10 report classification accuracy on the CSMUH external dataset, grouped by architectural category. All models were trained exclusively on the HAM20000 dataset and evaluated on CSMUH without any additional fine-tuning. This experimental setting reflects a realistic and challenging cross-population generalization scenario in which differences in skin phenotype, lesion characteristics, and imaging conditions introduce a substantial domain shift.
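The external accuracies are reported as a mean with a spread. One plausible reading, assumed here rather than stated in this section, is that each of the five fold-trained models is applied unchanged to the external images and that the mean and sample standard deviation of the five accuracies are reported. The sketch below illustrates that aggregation with hypothetical predictions.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def zero_shot_summary(
    fold_preds: List[Sequence[int]], labels: Sequence[int]
) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracy across fold models.

    Each entry of fold_preds holds the external-set predictions of one model
    trained on the source dataset; no model is updated on the external data.
    """
    accs = [accuracy(p, labels) for p in fold_preds]
    return mean(accs), stdev(accs)


# Hypothetical external labels and predictions from three fold models.
external_labels = [0, 1, 2, 1]
fold_predictions = [
    [0, 1, 2, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
m, s = zero_shot_summary(fold_predictions, external_labels)
print(f"{m:.3f} ± {s:.3f}")
```

Whether the sample or the population standard deviation was used is not specified; the choice changes the reported spread slightly when only five folds are available.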
3.4.1. Single-Model Performance on CSMUH
As presented in Table 9, the test accuracies of the individual models varied from 0.370 to 0.629, showing substantial differences in robustness among architectural families. Overall, single-model performance on the CSMUH dataset was substantially lower than that observed on the HAM20000 test set, confirming the presence of a pronounced cross-population domain shift.
Among CNN-based models, the EfficientNet family achieved moderate generalization performance. EfficientNet-B5 delivered the best accuracy within this group (0.491 ± 0.019), closely followed by EfficientNet-B6 and the EfficientNetV2 variants. In contrast, the CSATv2 models (R540 and R680) exhibited markedly weaker generalization, with test accuracies below 0.40, suggesting higher sensitivity to domain shift despite their competitive performance on the source dataset.
Vision Transformer architectures demonstrated more robust cross-dataset generalization than conventional CNNs. In particular, hierarchical Transformer models consistently outperformed standard ViT variants. The Swin Tiny Transformer achieved the highest single-model accuracy on the CSMUH dataset (0.629 ± 0.036), followed by Swin Base (0.557 ± 0.027) and the SwinV2 variants, all exceeding 0.50. These results suggest that hierarchical feature representations and localized self-attention mechanisms may contribute to improved robustness under cross-population conditions. In contrast, the vanilla ViT models (ViT-B/16, ViT-L/16, and ViT-L/32) remained below 0.47, showing limited resilience to domain mismatch.
Hybrid CNN–ViT architectures provided competitive and comparatively stable performance across different configurations. As shown in Table 9, EfficientFormerV2-L achieved the best performance among hybrid models (0.513 ± 0.018). MobileViTV2-100 (0.505 ± 0.009) and CoAtNet Lite Small (0.493 ± 0.020) also performed favorably, outperforming most CNN-only models. However, except for Swin-based models, most single architectures achieved an accuracy of less than approximately 0.53, highlighting a practical performance ceiling for individual models under severe domain shift.
3.4.2. GES Performance on CSMUH
Table 10 presents the external validation results of the GES-based ensemble model on the CSMUH dataset. The ensemble, composed of EfficientNet-B6, EfficientNetV2-M, EfficientNetV2-L, MaxViT-Small, and EfficientFormerV2-L, was constructed through model selection on the HAM20000 dataset and achieved an accuracy of 0.565 ± 0.028. Although this predefined ensemble does not include every single model evaluated in external validation (e.g., the Swin Tiny Transformer in Table 9, which alone reached 0.629 ± 0.036), its accuracy exceeds that of most individual models, including all hybrid and CNN-only models evaluated on the CSMUH dataset.
The observed performance gain indicates that the GES strategy still effectively integrates complementary inductive biases from CNN, vision transformer, and hybrid architectures. Such integration helps alleviate limitations of individual models under cross-population and domain-shift conditions, leading to enhanced robustness and more stable predictive behavior in external validation settings.
In summary, external validation on the CSMUH dataset revealed substantial performance degradation in single-model approaches due to cross-population domain shift. Although hierarchical Vision Transformers, particularly Swin-based architectures, exhibited relatively stronger robustness among individual models, potentially due to their multi-scale hierarchical representations and context-aware attention mechanisms, their standalone performance remained limited under external validation conditions. By comparison, the proposed GES-based ensemble, predefined through model selection on the HAM20000 dataset, demonstrated more stable and competitive performance across the external dataset. Rather than delivering a substantial performance leap, these results indicate that the ensemble strategy provides a modest yet consistent improvement in cross-dataset generalization by integrating complementary inductive biases, thereby enhancing robustness under domain shift and supporting its potential applicability in real-world clinical settings.
3.5. External Validation Results on the BCN20000 Dataset
Table 11 and Table 12 summarize the external validation results on the BCN20000 dataset for individual models and the best GES-based ensemble model, respectively. All models were trained exclusively on HAM20000 and evaluated on BCN20000 without additional fine-tuning, thereby assessing their robustness under cross-population and cross-institution domain shifts.
3.5.1. Single-Model Performance on BCN20000
Table 11 indicates that single-model performance was generally better on the BCN20000 dataset than on the CSMUH dataset, with test accuracies varying from 0.447 to 0.601. This suggests a smaller domain shift between HAM20000 and BCN20000 than between HAM20000 and CSMUH, likely reflecting closer alignment in patient demographics, imaging equipment, and lesion acquisition protocols among datasets originating from Europe.
Among CNN-based models, the EfficientNet and EfficientNetV2 families demonstrated moderate and stable generalization. EfficientNet-B5 achieved the highest accuracy within this group (0.541 ± 0.019), followed by EfficientNetV2-M (0.539 ± 0.019) and EfficientNetV2-L (0.536 ± 0.015). In contrast, CSATv2 R680 exhibited noticeably reduced robustness, reflected in both a lower mean accuracy (0.447 ± 0.048) and a higher variance, suggesting sensitivity to cross-institution variability.
Vision Transformer models generally achieved competitive performance on the BCN20000 dataset, though with greater architectural variability than observed on CSMUH. Classical ViT architectures (ViT-B/16, ViT-L/16, and ViT-L/32) consistently achieved test accuracies around 0.57, with minimal differences between patch sizes. Among hierarchical Transformer variants, Swin Small, SwinV2 Small, and SwinV2 Base all achieved accuracies of approximately 0.56, while MaxViT Small achieved the highest single-model performance overall (0.601 ± 0.020). In contrast, the Swin Tiny Transformer underperformed (0.467 ± 0.042) despite being the strongest single model on CSMUH, suggesting that its reduced capacity may limit its effectiveness on this dataset and that single-model rankings are not stable across external cohorts.
Hybrid CNN–ViT architectures continued to demonstrate strong and consistent performance across configurations. The EfficientFormerV2 series showed a clear upward performance trend with increasing model capacity, improving from 0.553 ± 0.010 (S0) to 0.598 ± 0.009 (L). MobileViTV2-100 (0.582 ± 0.014) and SwiftFormer-Small (0.568 ± 0.017) also achieved competitive results, confirming the effectiveness of hybrid designs in balancing representational power, efficiency, and cross-dataset generalization.
Overall, while several single models approached 0.60 test accuracy, performance differences among the best-performing architectures remained limited, suggesting a saturation effect similar to that observed on HAM20000.
3.5.2. GES Performance on BCN20000
As shown in Table 12, the GES-based ensemble model further improved performance on the BCN20000 dataset. The selected ensemble, comprising EfficientNet-B6, EfficientNetV2-M, EfficientNetV2-L, MaxViT Small, and EfficientFormerV2-L, achieved a test accuracy of 0.616 ± 0.009, exceeding all individual model results.
This improvement demonstrated that even under relatively mild domain shift conditions, ensemble learning remained beneficial. By integrating complementary feature representations and inductive biases from CNN, Vision Transformer, and hybrid architectures, the ensemble effectively reduced model-specific errors and improved prediction stability.
In summary, external validation on the BCN20000 dataset demonstrated that single-model performance was generally higher than on the CSMUH dataset (the Swin Tiny Transformer being a notable exception), reflecting reduced domain shift between HAM20000 and BCN20000. Vision Transformer and hybrid CNN–ViT architectures generally outperformed conventional CNNs, with MaxViT Small and EfficientFormerV2-L representing the strongest individual models. Nevertheless, performance among top single models tended to plateau around 0.60 test accuracy. The proposed GES-based ensemble approach achieved the best overall performance, confirming its effectiveness in improving robustness and generalization in multi-institution clinical datasets.
4. Discussion
Artificial intelligence–based dermoscopic analysis has shown promising performance in controlled experimental settings; however, translating these advances into reliable clinical decision support systems remains challenging. This study focuses on two practical issues that are critical for real-world deployment: (i) the limited gains achieved by further optimizing single deep learning models, and (ii) the substantial performance degradation observed when models are applied to new populations or clinical environments. By using HAM20000 as a stress-test dataset and conducting strict zero-shot external validation on two independent cohorts (CSMUH and BCN20000), we evaluate model robustness from a clinically relevant perspective and assess whether heterogeneous ensembling can improve reliability under realistic domain shift conditions.
It is important to emphasize that the proposed models are not intended to provide detailed morphological explanations or region-specific lesion attributes. Instead, their role is to support high-level clinical triage through image-based risk stratification, based solely on global visual patterns present in dermoscopic images. While fine-grained lesion analysis may offer complementary insights, such capabilities would require dedicated lesion- or pixel-level annotations, which fall outside the scope of the current study.
4.1. In-Domain Performance and the Clinical Meaning of Performance Saturation
On the HAM20000 dataset, a wide range of modern CNN, Vision Transformer, and hybrid CNN–Transformer architectures achieve strong classification performance. Nevertheless, improvements obtained by selecting increasingly complex or larger models are incremental, with single-model accuracy converging to a narrow range. From a clinical standpoint, this saturation indicates that simply adopting newer or more computationally intensive architectures is unlikely to yield substantial diagnostic benefit once a certain performance level has been reached.
In dermoscopic image analysis, this behavior is clinically plausible. Many lesion categories share overlapping visual characteristics, particularly at early or borderline disease stages, and diagnostic ambiguity may persist even for experienced dermatologists. In addition, multi-source datasets inevitably contain annotation uncertainty and acquisition variability. These factors limit the achievable performance of any single image-based model and suggest that further gains require strategies that address robustness and error structure rather than model capacity alone.
4.2. External Validation Highlights the Impact of Population and Institutional Shift
A major strength of this study is the use of strict zero-shot external validation, which reflects a realistic deployment scenario in which models trained on public datasets are applied to new clinical settings without additional fine-tuning. When evaluated on the CSMUH dataset, representing an East Asian population, all single models exhibit a marked drop in performance compared with HAM20000. This finding underscores a critical clinical risk: high accuracy on a public benchmark does not guarantee reliable performance in a different population or institution.
Importantly, robustness varies across architectural families. Hierarchical Transformer-based models, particularly Swin-based variants, demonstrate comparatively better generalization on CSMUH than many CNN-only models. This suggests that architectures capable of capturing multi-scale contextual information may be less sensitive to changes in skin phenotype, lesion appearance, and acquisition conditions. On the BCN20000 dataset, where the population and imaging characteristics are closer to those of ISIC-based data, overall performance is higher; nevertheless, performance differences among top single models remain modest, again indicating a practical ceiling for single-model deployment.
From a clinical perspective, these results highlight the necessity of external validation across diverse cohorts before deployment. Models intended for decision support should not be evaluated solely on data drawn from similar distributions to their training sets, as this can mask clinically significant failure modes.
4.3. Inter-Class Confusion and Clinically Relevant Error Patterns
Analysis of confusion matrices provides further insight into model behavior beyond aggregate accuracy. Across internal and external evaluations, most misclassifications occur between visually and pathologically adjacent lesion categories. While the ensemble reduces overall misclassification rates, these confusion patterns persist, particularly under domain shift.
The observed confusion between visually similar lesion categories should be interpreted in the context of inherent ambiguities in dermatological imaging. Many benign and malignant lesions share overlapping visual characteristics, including color heterogeneity, irregular borders, and textural patterns, which can complicate discrimination even for experienced dermatologists. Such ambiguity is further influenced by variations in image acquisition conditions, lesion evolution stages, and inter-observer variability in diagnostic assessment. Therefore, a proportion of the misclassifications observed in our results likely reflects intrinsic diagnostic uncertainty embedded in the imaging modality itself, rather than solely limitations of the proposed models.
Such errors are clinically meaningful. Because persistent confusion between adjacent classes reflects inherent task difficulty rather than isolated model deficiencies, it reinforces the importance of cautious interpretation of automated predictions, especially in screening or triage settings, and supports the need for decision support systems that assist rather than replace clinical judgment.
4.4. Ensemble Learning as a Practical Robustness Strategy
Although this study does not introduce a novel ensemble algorithm, the results indicate that heterogeneous ensembling can provide tangible clinical value. Overall, the greedy ensemble selection (GES) framework demonstrates competitive performance across internal testing and external validation, with the most pronounced gains observed on the BCN20000 external validation set. However, its performance advantages are dataset-dependent. Notably, on the CSMUH external validation set, GES outperforms several single-model baselines but does not surpass the best-performing individual architecture (Swin Tiny Transformer), underscoring that ensemble approaches may not universally exceed the strongest standalone models. Accordingly, GES should be regarded as a robust and complementary strategy rather than a uniformly superior solution.
Clinically, this suggests that combining models with different inductive biases, such as CNNs emphasizing local texture and Transformers capturing broader contextual patterns, can reduce correlated errors and improve prediction stability. Notably, the ensemble remains relatively compact, which is advantageous for practical deployment compared with large, unwieldy ensembles.
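Whether two models actually make correlated errors can be checked directly from their predictions. The sketch below computes the Pearson (phi) correlation between two models' per-sample error indicators; it illustrates the concept and is not an analysis performed in this study, and the function name and toy predictions are hypothetical.

```python
from math import sqrt
from typing import Sequence


def error_correlation(
    preds_a: Sequence[int], preds_b: Sequence[int], labels: Sequence[int]
) -> float:
    """Phi correlation between the error indicators of two models.

    +1 means both models fail on exactly the same samples (averaging cannot
    help on those); values near 0 or below mean their mistakes differ, which
    is the situation in which combining them is most useful.
    """
    err_a = [int(p != y) for p, y in zip(preds_a, labels)]
    err_b = [int(p != y) for p, y in zip(preds_b, labels)]
    n = len(labels)
    mean_a, mean_b = sum(err_a) / n, sum(err_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(err_a, err_b)) / n
    var_a = mean_a * (1 - mean_a)  # variance of a 0/1 indicator
    var_b = mean_b * (1 - mean_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation undefined when a model never or always errs
    return cov / sqrt(var_a * var_b)


# Hypothetical predictions on four samples whose true class is 0.
truth = [0, 0, 0, 0]
cnn_like = [1, 1, 0, 0]
same_errors = [1, 1, 0, 0]
other_errors = [0, 0, 1, 1]
print(error_correlation(cnn_like, same_errors, truth))
print(error_correlation(cnn_like, other_errors, truth))
```

Computed over all pairs in a candidate pool, such a statistic would indicate which architectures are most complementary before any ensemble is assembled.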
Importantly, the ensemble improves robustness without requiring access to target-domain labels or fine-tuning, a scenario that closely mirrors real-world clinical adoption where local annotated data may be scarce or unavailable.
4.5. Implications for Clinical Deployment and Model Evaluation
The findings of this study have several implications for clinical AI development. First, reported benchmark accuracy should not be interpreted as a proxy for deployment reliability. External validation across populations and institutions should be considered a minimum requirement for dermoscopic decision support systems. Second, architectural choices should be informed not only by in-domain performance but also by robustness characteristics under domain shift. Third, ensemble-based approaches provide a pragmatic means of improving stability and mitigating risk when uncertainty about deployment conditions exists.
From a workflow perspective, AI systems based on such models are best positioned as supportive tools (for example, in triage, second-reader assistance, or prioritization) rather than as standalone diagnostic systems. Transparent reporting of limitations and error patterns is essential to foster appropriate clinical use and trust.
4.6. Limitations and Future Directions
This study has limitations. External validation results are reported primarily in terms of accuracy; future work should extend class-balanced metrics to the external cohorts and incorporate calibration analysis and uncertainty estimation to better reflect clinical risk, particularly for underrepresented lesion categories. In addition, the analysis is limited to image-level classification without explicit lesion localization or integration of clinical metadata, which may further enhance interpretability and robustness. Finally, although the proposed ensemble is compact and Table 6 reports its memory footprint and total inference time, computational efficiency and per-case inference latency should be quantified more thoroughly in future deployment-oriented studies.
A further limitation of this study is that the impact of label noise was not explicitly assessed. All experiments were conducted under the assumption that the provided image-level labels represent the reference standard, and no additional analyses were performed to quantify or model potential annotation uncertainty. Given that label noise is a well-recognized challenge in image-based medical diagnosis, particularly in dermatology, its influence on model performance warrants systematic investigation. Future work may explore noise-aware learning strategies, multi-reader annotation schemes, or consensus-based labeling to better account for this source of uncertainty.
Future research will focus on robustness-oriented model design, uncertainty-aware decision support, and prospective validation in real-world clinical settings. Expanding evaluation to larger and more diverse populations will be essential for establishing the generalizability and clinical utility of AI-based dermoscopic decision support systems.
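As one possible starting point for the calibration analysis proposed above, the expected calibration error (ECE) compares a model's stated confidence with its observed accuracy within confidence bins. The sketch below is a generic illustration and not a method used in this study; the equal-width binning, the default of ten bins, and the toy inputs are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between accuracy and confidence over equal-width bins.

    confidences: top-class probability per sample, in [0, 1].
    correct: 1 if the top-class prediction was right, else 0.
    Bins are half-open [lo, hi); a confidence of exactly 1.0 goes in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


# Hypothetical example: 95% stated confidence but only 3 of 4 predictions correct.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```

A value near zero means confidence scores can be read as approximate probabilities of being correct, which matters when predictions are used to prioritize cases in triage.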
5. Conclusions
In this study, we conducted a comprehensive evaluation of modern deep learning approaches for multiclass dermoscopic image classification, with a particular emphasis on clinical robustness and generalizability under realistic deployment conditions. By systematically comparing convolutional neural networks, Vision Transformers, hybrid CNN–Transformer architectures, and a compact greedy-selected heterogeneous ensemble, we examined not only in-domain performance but also behavior under strict zero-shot external validation across distinct populations and clinical environments.
Our results demonstrate that, although advanced single models can achieve strong performance on a large, heterogeneous source dataset, their diagnostic accuracy exhibits clear saturation and degrades substantially when applied to external cohorts. This performance decline is especially pronounced under cross-population shifts, underscoring a critical clinical risk: benchmark-level accuracy on public datasets does not necessarily translate into reliable real-world performance. These findings highlight the importance of evaluating dermoscopic AI systems beyond source-dataset optimization and of explicitly considering population and institutional variability during model assessment.
Importantly, we show that heterogeneous ensemble learning provides a practical and effective strategy to improve robustness under domain shift. Without introducing new algorithms or relying on target-domain fine-tuning, the greedy ensemble selection–based approach improves prediction stability and overall accuracy relative to most single models across both internal testing and external validation, although it does not surpass the best-performing single model on CSMUH. From a clinical perspective, this suggests that integrating complementary inductive biases from different architectural families can mitigate correlated errors and partially offset the limitations of individual models when deployment conditions differ from training data.
Analysis of confusion patterns further reveals that most residual errors occur between visually and pathologically adjacent lesion categories, reflecting inherent diagnostic ambiguity rather than isolated model failure. This observation reinforces the role of AI systems as decision-support tools rather than standalone diagnostic solutions and highlights the need for cautious interpretation of automated predictions, particularly in screening or triage scenarios.
Overall, this work emphasizes that robust dermoscopic decision support requires more than incremental architectural improvements. External validation across diverse cohorts, transparent reporting of failure modes, and robustness-oriented strategies such as compact heterogeneous ensembling are essential for bridging the gap between experimental performance and clinical utility. Future research should focus on uncertainty-aware decision support, clinically meaningful performance metrics, and prospective validation in real-world workflows to further enhance the safety, reliability, and trustworthiness of AI-based dermatological systems.