Systematic Failure of Vision Transformers in Imbalanced Skin Lesion Classification

Aksoy, Serra; Demircioglu, Pinar; Bogrekci, Ismail

doi:10.3390/dermato6020022

by Serra Aksoy ¹,*, Pinar Demircioglu ² and Ismail Bogrekci ²

¹ Institute of Computer Science, Ludwig Maximilian University of Munich (LMU), Oettingenstrasse 67, 80538 Munich, Germany

² Department of Mechanical Engineering, Aydin Adnan Menderes University (ADU), Aytepe, 09010 Aydin, Turkey

* Author to whom correspondence should be addressed.

Dermato 2026, 6(2), 22; https://doi.org/10.3390/dermato6020022

Submission received: 18 December 2025 / Revised: 16 March 2026 / Accepted: 20 May 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Melanoma: Updates and Path Forward)

Abstract

Background/Objectives: Vision Transformers (ViTs) have demonstrated impressive performance on large-scale natural image datasets and are increasingly used for medical image classification. However, their behavior under real-world conditions such as data scarcity and extreme class imbalance has not been well investigated. In this study, we examine the feasibility of training a standard ViT-Base model from scratch to classify skin lesion images into multiple classes using the ISIC 2019 dataset. Methods: The ViT was trained from scratch using stratified data splitting, class-balanced cross-entropy loss, multi-seed initialization, and controlled variation of hyperparameters such as patch size and dropout rate. The model was evaluated on a hold-out test set using accuracy, macro-F1, weighted-F1, and confusion-matrix analysis. Results: Across all configurations, training exhibited substantial instability and consistent overfitting, with an average accuracy gap between the validation and test sets of 22.7%. Test accuracy ranged from 8.0% to 37.8%, showing high sensitivity to initialization. For minority classes, the F1-score remained very low (F1 < 0.05) even though the loss function was class-weighted. Conclusions: The results indicate that a standard ViT-Base model trained from scratch can exhibit pronounced instability and a tendency toward majority-class bias when applied to multi-class skin lesion classification under extreme class imbalance and data scarcity. The findings point to the limitations of using simple transformer models without pre-training or other forms of inductive bias in scarce-data settings.

Keywords: skin lesion classification; skin cancer; DenseNet169; transfer learning

1. Introduction

This study investigates whether standard Vision Transformer (ViT) models trained from scratch can perform effectively on clinically realistic dermoscopic image classification tasks with limited data and extreme class imbalance, using the ISIC 2019 skin lesion dataset as a case study. Although ViTs are known to be highly effective on large-scale general image classification benchmarks, their behavior in resource-constrained medical imaging remains inadequately explored. The literature review provides context in four areas: first, the prevalence of convolutional neural networks (CNNs) in dermoscopic image analysis; second, the ISIC challenges and the well-documented shortcomings of their datasets; third, strategies for addressing extreme class imbalance in medical image analysis; and fourth, the role of pre-training in the development of general-purpose ViTs.

Early benchmark studies showed that deep CNNs can approach or even rival dermatologists at dermoscopic image classification. Notably, Esteva et al. [1] demonstrated dermatologist-level skin cancer classification using a single CNN trained on the entire available dataset [2]. Subsequent reader studies compared CNN-based models with dermatologists on melanoma classification, with the models frequently performing on par with or better than the dermatologists in specific experimental settings. Brinker et al. [3] then reported the most comprehensive comparison, involving clinical participants with different levels of experience, and again found that CNN-based models discriminated lesions better than the participants. This early success depended largely on transfer learning with ImageNet-pretrained CNN backbones, as well as on ensemble learning strategies. Architectures such as DenseNet became popular in medical image processing because their dense connectivity improves gradient flow and enables repeated feature reuse, which is advantageous when fine-tuning on relatively small medical image datasets. This CNN-dominated landscape forms the backdrop of the present study, since it highlights the potential risks associated with training the Vision Transformer examined here.

A substantial number of studies also use the ISIC image datasets [4] to investigate dermoscopic image analysis tasks. Initially, most studies concentrated on distinguishing melanoma from benign lesions, which was facilitated by the large number of available images. Noteworthy studies (Hekler et al. [5]; Rotemberg et al. [6]; Bisla et al. [7]; Bissoto et al. [8]; Hasan et al. [9]; Brinker et al. [10]) combine ISIC 2016 to 2020 images with additional image sets (PH², Dermofit, HAM10000). Some also remove duplicate images (Le et al., 2020 [11]; Rotemberg et al., 2020 [6]), but most do not, which makes comparisons across studies ambiguous.

Multi-class classification on larger datasets, such as ISIC 2018 and ISIC 2019, has also been increasingly studied. Both seven-class systems based on the HAM10000 categories (Le et al. [11], Carcagnì et al. [12], Tschandl et al. [13], Ratul et al. [14]) and eight-class systems that include atypical nevi (Gessert et al. [15], Kassem et al. [16], Rezvantalab et al. [17]) are used.

A smaller subset of the ISIC 2016–2018 images has been segmented (lesion boundaries and semantic segmentation), providing pixel-level annotations. Studies using these data (Xie et al. [18]; Hasan et al. [19]; Goyal et al. [20]; Tang et al. [21]) often combine segmentation and classification. Segmentation studies are comparatively fewer because ISIC discontinued its segmentation challenges starting in 2019 and because fewer annotated samples are available for segmentation than for classification.

The ISIC [4] challenges have been integral to building consensus in dermoscopic image analysis, driving progress by providing shared data and tooling and by allowing techniques to be compared on a common leaderboard. The ISIC 2018 challenge [22] is often cited as a standard reference: it provides clear definitions of the classification tasks, the dataset structure, and the benchmarking metrics, making different models easy to compare. Beyond dataset access, the challenge site also provides full citation and description resources, reinforcing its importance for the field. However, some studies have pointed out structural problems inherent to the ISIC datasets that make accurate generalization difficult. These problems are especially pronounced in class-imbalanced settings, where models tend to rely on implicit correlations in the data rather than learning meaningful representations.

Given this context, the present study critically evaluates the applicability of conventional Vision Transformer (ViT) models under the clinically realistic conditions of limited dataset availability and extreme class imbalance in dermoscopic image classification tasks. In contrast to previous studies that commonly rely on pre-trained models and/or hybrid CNN-Transformer architectures, we train and evaluate standard ViT models directly on the dataset without pre-training, in order to isolate and examine their learning dynamics under adverse conditions. Through a comprehensive analysis of training stability, generalization ability, sensitivity to initialization, and class-wise predictive performance on the ISIC 2019 dataset, this study shows that conventional ViT models exhibit inherent limitations in multi-class skin lesion classification from dermoscopic images.

Although class imbalance is widely acknowledged in medical image analysis, its direct impact on the stability and generalization behavior of ViTs trained from scratch has not been extensively investigated. The novelty of the present study lies not in introducing a new method for handling class imbalance, but rather in empirically analyzing its effect on transformer performance.

To provide a structured comparison between the present study and the existing literature, Table 1 summarizes representative methods in dermoscopic image classification in terms of architectural design, pre-training strategy, imbalance handling mechanism, and evaluation focus. In contrast to the vast majority of existing studies, which rely on pre-trained models or hybrid CNN-Transformer architectures, the present study evaluates the performance of a Vision Transformer model trained from scratch under severe class imbalance, with particular emphasis on the stability and reproducibility of the training process.

Therefore, the major contributions of this study can be summarized as follows:

- An empirical study of ViT model training under realistic clinical conditions, with particular emphasis on data availability and class imbalance in dermoscopic image classification.
- Design and implementation of a controlled experimental environment with multiple random seeds to assess the training stability and sensitivity of the ViT model.
- Analysis of generalization failure and overfitting, evidenced by the discrepancy between validation and test performance and by class-wise degradation on long-tailed medical image datasets.
- Illustration of the challenges of directly training standard ViT models and applying them to class-imbalanced classification, highlighting the importance of inductive biases and pretraining.
- Discussion of the methodological contribution of this work, namely the need to assess training stability and reproducibility when applying ViT models to class-imbalanced medical image classification.

2. Related Work

Vision Transformers (ViTs), introduced by Dosovitskiy et al. [23], replace convolutional inductive biases with patch tokenization and global self-attention, demonstrating strong scaling behavior on large-scale natural image datasets. Several follow-up studies improved data efficiency (such as distillation in DeiT [24]) and hierarchical locality (such as shifted-window attention in Swin Transformer [25]), enabling Transformers to become practical vision models.

In medical image analysis, transformer-based architectures are typically used in three settings: (i) as pure transformer classifiers or segmenters, (ii) in combination with CNNs, where convolutional components preserve fine textures in the image, or (iii) through transfer learning based on large-scale pre-training. For image segmentation, TransUNet combines a CNN-based feature extractor with a transformer encoder [26], whereas Swin-Unet adopts a U-shaped network built from Swin blocks to enable feature fusion across local and global scales [27]. Overall, this body of work suggests that transformers perform best when supported by strong inductive biases (e.g., those provided by CNNs or hierarchical window mechanisms) or by extensive pre-training, neither of which is used in the present study.

For image classification, CNN-based and transfer learning approaches remain widely adopted, and many challenge submissions have used ensemble methods, multi-resolution information, metadata fusion, and augmentation to improve performance. As mentioned above, dataset-related issues such as spurious correlation, acquisition artifacts, and label noise have been consistently reported for skin lesion image datasets; examples include the multi-source composition of the HAM10000 dataset [28] and label-bias analyses of ISIC-derived datasets [4]. Additionally, significant class imbalance is common, with minority classes receiving few gradient updates, which can lead the model to rely on such correlations rather than on meaningful representations.

Class imbalance mitigation in medical image classification comprises data-level, objective-level, and post-processing approaches. Objective-level methods include re-weighting and margin shaping: a widely used example is focal loss [29], which down-weights well-classified (easy) examples so that learning concentrates on hard ones. However, transformer models trained from scratch may still exhibit high variance and biased predictions when minority classes are extremely underrepresented.

Notably, many successful applications of transformers in medical imaging tasks benefit from architectural modifications that compensate for the absence of convolutional inductive bias. Typical variants include hierarchical window mechanisms (Swin Transformer) and hybrid encoder–decoder structures (TransUNet). In contrast, the behavior of a plain ViT-Base model trained from scratch under long-tailed conditions remains poorly understood. Table 2 summarizes several popular transformer backbones and their common training assumptions. Table 3 summarizes the objectives of imbalance-mitigation methods commonly applied to dermoscopic classification and other long-tailed medical image datasets.

Recent studies in medical image analysis have begun to adopt transformer-based architectures to capture long-range dependencies and global context in medical images. Unlike CNN-based architectures, where convolutional operations and strong inductive biases such as translation invariance play a central role, transformers use self-attention to model interactions among spatial regions across the entire image. This has motivated the use of ViT and hybrid CNN-Transformer architectures for a variety of tasks, including classification, segmentation, and detection. However, such models often rely on architectural modifications, hierarchical attention mechanisms, and large-scale pre-training to stabilize training and improve data efficiency. Table 4 summarizes recent surveys and transformer-based architectures published between 2023 and 2025, representing the main methodological trends in transformer-based image analysis.

While many of these studies focus on segmentation, they collectively highlight the importance of architectural inductive biases and domain adaptations in medical imaging.

3. Materials and Methods

3.1. Data Acquisition and Preprocessing

3.1.1. Dataset Description

This study used the ISIC 2019 Challenge dataset, a comprehensive collection of dermoscopic images for multi-class skin lesion classification. The dataset is freely accessible via the official ISIC repository and consists of two parts: a labeled training set and a test set. The training set contains 25,331 dermoscopic images with their ground-truth labels, whereas the test set contains 8238 images. The images are provided in JPEG format, vary in resolution, and are accompanied by one-hot-encoded labels.

The dataset comprises eight distinct diagnostic classes representing common skin lesions encountered in clinical dermatology: Melanoma (MEL), Nevus (NV), Basal Cell Carcinoma (BCC), Actinic Keratosis (AK), Benign Keratosis (BKL), Dermatofibroma (DF), Vascular Lesion (VASC), and Squamous Cell Carcinoma (SCC). The original dataset also includes an “Unknown” (UNK) class, which was removed from both the training and test sets during preprocessing to focus on the eight clinically relevant diagnostic classes.

3.1.2. Data Preprocessing and Quality Control

Label preprocessing converted the one-hot encoded ground-truth labels into integer class indices. Each image’s diagnostic category was determined by identifying the index of the maximum value in its one-hot encoded label vector, yielding a single integer label in the range 0 to 7, corresponding to the eight diagnostic classes. The conversion was verified to ensure that no labels were incorrectly assigned during preprocessing.
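The label conversion described above can be sketched in plain Python as follows. This is a minimal illustration rather than the pipeline used in the study: the class order in `CLASS_NAMES` simply follows the order in which the classes are listed in the text, and the helper names `one_hot_to_index` and `convert_labels` are hypothetical.

```python
from typing import List, Sequence

# Assumed class order (taken from the order used in the text); the real
# column order comes from the header of the ground-truth file.
CLASS_NAMES = ["MEL", "NV", "BCC", "AK", "BKL", "DF", "VASC", "SCC"]


def one_hot_to_index(row: Sequence[float]) -> int:
    """Return the index of the maximum entry (argmax) of a one-hot row."""
    if len(row) != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} columns, got {len(row)}")
    # Verification step: a valid one-hot row holds exactly one 1 and zeros elsewhere.
    if sorted(row) != [0.0] * (len(row) - 1) + [1.0]:
        raise ValueError(f"row is not one-hot: {list(row)!r}")
    return max(range(len(row)), key=row.__getitem__)


def convert_labels(rows: List[Sequence[float]]) -> List[int]:
    """Convert a list of one-hot rows into integer class indices (0 to 7)."""
    return [one_hot_to_index(r) for r in rows]


example_rows = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # NV
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # MEL
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # DF
]
labels = convert_labels(example_rows)
print(labels, [CLASS_NAMES[i] for i in labels])
```

Rejecting rows that are not strictly one-hot is one way to implement the verification mentioned above; it would also flag any row that was left without a diagnostic label.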

All dermoscopic images were preprocessed using a standard pipeline based on PyTorch’s torchvision.transforms module (v. 2.8), in line with the conventional computer-vision practice for medical image analysis. The pipeline comprised the following steps: (1) resizing all images to 224 × 224 pixels using bilinear interpolation; (2) converting the images to PyTorch tensors and scaling pixel values to the range [0, 1]; and (3) normalizing the images using ImageNet statistics (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]).
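To make steps (2) and (3) concrete, the arithmetic for a single RGB pixel can be written out without any framework. The sketch assumes 8-bit input images, so that scaling to [0, 1] means division by 255; `normalize_pixel` is a hypothetical helper and not part of torchvision.

```python
# ImageNet channel statistics as quoted in the text.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [value / 255.0 for value in rgb]
    return [(s - m) / sd for s, m, sd in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After normalization the inputs are no longer confined to [0, 1]; each channel is instead centered and scaled by the ImageNet statistics quoted above.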

3.1.3. Dataset Partitioning Strategy and Reproducibility Controls

The training dataset was split using stratified random sampling to preserve the class distribution. A fixed random seed (42) was used to ensure reproducibility across all experiments. The original 25,331 training images were split into 20,264 for training (80%) and 5067 for validation (20%) using scikit-learn’s train_test_split function, stratified on the diagnostic class labels. To ensure deterministic behavior during training, the same random seed was used for both the data split and the data loading. The independent test set of 8238 images was kept separate and used exclusively for final performance evaluation, ensuring no data leakage between training and testing. This three-way split (train/validation/test) follows standard machine-learning practice, enabling hyperparameter optimization while preserving unbiased performance estimates.
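The idea behind a stratified 80/20 split can be illustrated with a small standard-library sketch. It is not scikit-learn's implementation: `stratified_split` is a hypothetical helper that splits each class separately, and the toy label list is invented for illustration. The final lines check that rounding the 20% validation share up reproduces the split sizes reported above.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split sample indices per class so both parts keep the class proportions."""
    rng = random.Random(seed)  # fixed seed gives a reproducible split
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, val_idx = [], []
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        n_val = round(val_fraction * len(indices))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy long-tailed labels (illustrative only, not the ISIC class counts).
toy_labels = [0] * 500 + [1] * 180 + [2] * 10
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))

# Size arithmetic for the split reported in the text.
n_total = 25_331
n_val = math.ceil(0.2 * n_total)
n_train = n_total - n_val
print(n_train, n_val)
```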

3.1.4. Class Distribution Analysis

The ISIC 2019 dataset exhibits substantial class imbalance across the training, validation, and test sets (Table 5), as is typical of real-world medical imaging datasets. Nevus (NV) is the largest class, accounting for approximately 50.8% of the training set, followed by Melanoma (MEL) at 17.9%. The remaining classes occur progressively less often, with Dermatofibroma (DF) representing only 0.9%. This yields an approximate ratio of 56:1 between the largest and smallest classes, a pattern that is consistent across all three splits. To assess the effect of class imbalance on transformer performance, controlled experiments were conducted under two conditions: the original (imbalanced) data and a class-balanced variant.

Stratified sampling preserved the class distribution between the training and validation sets, with differences of less than 0.1% for all diagnostic classes. The independent test set, however, follows a different distribution, with a notably higher proportion of Melanoma cases (41.0% vs. 17.9% in training), reflecting the challenge-specific composition designed by the ISIC organizers.

3.1.5. Experimental Reproducibility Framework

To account for stochastic variation in neural-network training, the experiments were repeated with multiple random seeds. Three seeds (42, 123, 456) were used for each hyperparameter configuration to assess training stability and reproducibility. For each run, torch.manual_seed() and numpy.random.seed() were set to the configured value before model initialization, ensuring reproducible weight initialization and reproducible stochastic operations during training.

This multi-seed strategy was intended to distinguish genuine hyperparameter effects from random variation in training outcomes. This distinction is particularly important given the known sensitivity of transformer architectures to initialization. Where possible, deterministic settings were enabled to minimize non-deterministic behavior during GPU training.

3.1.6. Data Loading and Batch Processing

Data loading was implemented using PyTorch’s Dataset and DataLoader classes. A custom ISIC Dataset class handled file-path resolution, image loading with PIL (Python Imaging Library), and the application of preprocessing transforms. Images were loaded on demand during training to improve memory efficiency given the large dataset size; this eliminated the need to store preprocessed copies of the data and allowed on-the-fly transformation during batch preparation. The training and validation data loaders used SubsetRandomSampler instances initialized with the stratified split indices and fixed random seeds to ensure reproducible batch ordering across replications. The batch size was set to 32 across all experiments as a compromise between GPU memory constraints, computational efficiency, and the need for stable gradient estimates, particularly given the large parameter count of the ViT-Base architecture. The test data loader processed samples sequentially without shuffling to ensure consistent evaluation ordering across all experiments.

3.2. Model Architecture

3.2.1. Vision Transformer Implementation

The Vision Transformer architecture used in this study follows the ViT-Base configuration, adapted for skin lesion classification on the ISIC dataset. The model comprises four main components: patch embedding, transformer encoder layers, layer normalization, and a classification head. Each dermoscopic image is divided into fixed-size, non-overlapping patches, which are linearly projected into a high-dimensional embedding space. A learnable classification (CLS) token is prepended to the sequence of patch tokens, and positional embeddings are added to retain spatial information. The resulting token sequence is then fed into the transformer encoder layers, and the output corresponding to the CLS token is used to aggregate the global image representation for classification. Figure 1 shows the resulting architecture.

The patch embedding module converted each input image into a sequence of patch tokens through a convolutional projection layer. Input dermoscopic images of size 224 × 224 × 3 were divided into non-overlapping patches of configurable size (16 × 16 or 32 × 32 pixels, varied across experiments), yielding 196 or 49 patches, respectively. Each patch was linearly projected to a 768-dimensional embedding space using a convolutional layer with kernel size and stride equal to the patch size. A learnable classification (CLS) token was prepended to the sequence, and learnable positional embeddings were added to all patch tokens to preserve spatial information.
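The patch counts quoted above follow directly from the image and patch sizes. The short sketch below reproduces them and also reports the sequence length once the CLS token is prepended; `num_patches` is a hypothetical helper name.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


for patch_size in (16, 32):
    n = num_patches(224, patch_size)
    values_per_patch = patch_size * patch_size * 3  # flattened RGB patch
    print(f"patch {patch_size}: {n} patches, {n + 1} tokens with CLS, "
          f"{values_per_patch} input values per patch")
```

Doubling the patch side reduces the number of tokens by a factor of four, so the 32 × 32 configuration attends over a much shorter sequence but sees each lesion at coarser granularity.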

The transformer encoder consisted of 12 identical layers, each containing a multi-head self-attention (MHSA) block and a multilayer perceptron (MLP) block, both with residual connections. The MHSA module used a configurable number of attention heads (set to 8 in our experiments) with 768-dimensional embeddings, yielding 96 dimensions per head. Each MLP block followed a standard architecture with an expansion factor of 4: a linear layer expanding to 3072 dimensions, GELU activation, dropout regularization, and a second linear layer projecting back to 768 dimensions.

Layer normalization was applied before each MHSA and MLP block, following the pre-norm architecture. Dropout was applied within the MHSA and MLP blocks, with a configurable rate (0.1 or 0.3, varied across experiments). The final classification head comprised layer normalization followed by a linear projection from the 768-dimensional CLS token representation to 8 output logits, one per diagnostic class. The model contained 85,804,808 parameters for the 16 × 16 patch size and 87,461,384 for the 32 × 32 patch size, all of which were trainable. The small variation in parameter count arose from the patch-size-dependent layers in the model, namely the patch projection and the positional embeddings.
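As a consistency check, the reported parameter counts can be reproduced from the architectural description. The sketch below is an accounting exercise under stated assumptions, not the authors' code: every linear layer (including the query, key, and value projections) carries a bias, each encoder layer has two LayerNorms, and the positional embedding covers the patch tokens plus the CLS token. Under these assumptions the totals match the figures in the text exactly.

```python
def vit_param_count(image_size=224, patch_size=16, channels=3, dim=768,
                    depth=12, mlp_ratio=4, num_classes=8) -> int:
    """Count trainable parameters of a plain pre-norm ViT (assumptions above)."""
    tokens = (image_size // patch_size) ** 2 + 1                   # patches + CLS
    patch_embed = patch_size * patch_size * channels * dim + dim   # conv weight + bias
    cls_token = dim
    pos_embed = tokens * dim
    attention = 3 * (dim * dim + dim) + (dim * dim + dim)          # Q, K, V + output proj
    hidden = mlp_ratio * dim                                       # 3072 for dim = 768
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    norms = 2 * (2 * dim)                                          # two LayerNorms per layer
    encoder = depth * (attention + mlp + norms)
    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + encoder + final_norm + head


head_dim = 768 // 8  # 96 dimensions per attention head
count_16 = vit_param_count(patch_size=16)
count_32 = vit_param_count(patch_size=32)
print(head_dim, f"{count_16:,}", f"{count_32:,}")
```

The number of attention heads does not enter the count, because the heads only partition the 768-dimensional projections. The difference between the two configurations comes from a larger patch projection for 32 × 32 patches, partly offset by fewer positional embeddings.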

3.2.2. Classification Head and Output Objective

The severe class imbalance in the ISIC dataset required a specialized loss-function design to prevent the model from learning trivial solutions that favor the dominant classes. Class weighting was implemented using scikit-learn’s compute_class_weight function with the ‘balanced’ strategy, which computes weights inversely proportional to class frequencies in the training data.

Class weights were calculated as n_samples/(n_classes × np.bincount(y)), where n_samples is the total number of training samples, n_classes is 8 (the number of diagnostic classes), and np.bincount(y) gives the per-class sample counts. These weights were converted to PyTorch tensors and passed to the CrossEntropyLoss function, ensuring that errors on rare classes (e.g., Dermatofibroma and Vascular Lesion) received proportionally higher penalties during backpropagation.
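The weighting formula above can be reproduced in a few lines of standard-library Python. This is a sketch of the same arithmetic rather than a call to scikit-learn; `balanced_class_weights` is a hypothetical helper, and the toy label counts are invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels, n_classes):
    """w_c = n_samples / (n_classes * count_c), as in the formula in the text."""
    counts = Counter(labels)
    n_samples = len(labels)
    # A class with zero samples would raise ZeroDivisionError here; every
    # class is assumed to be present in the training labels.
    return [n_samples / (n_classes * counts[c]) for c in range(n_classes)]


toy_labels = [0] * 6 + [1] * 3 + [2] * 1  # illustrative 6:3:1 imbalance
weights = balanced_class_weights(toy_labels, n_classes=3)
print([round(w, 3) for w in weights])

# Weighted class totals are equal, so each class contributes equally overall.
counts = Counter(toy_labels)
print([round(weights[c] * counts[c], 3) for c in range(3)])
```

Because each weight is inversely proportional to its class count, the ratio between the largest and smallest weights equals the imbalance ratio between the largest and smallest classes.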

The weighted CrossEntropyLoss function was applied to the raw logits from the classification head, with softmax normalization handled internally by the loss function. This approach ensured stronger learning signals for underrepresented classes while using PyTorch’s optimized loss implementation.
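For readers who want to see what the weighted loss computes, the following framework-free sketch applies a log-softmax to the raw logits and averages the weighted negative log-likelihoods. Normalizing by the sum of the target weights follows the documented behavior of PyTorch's default 'mean' reduction when class weights are supplied; the function name `weighted_cross_entropy` and the example values are illustrative assumptions.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Class-weighted cross-entropy over a batch of raw logits."""
    total, weight_sum = 0.0, 0.0
    for z, y in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(z)
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        nll = log_norm - z[y]          # negative log-probability of the target
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum          # weighted mean over the batch


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
targets = [0, 2]
uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
rare_heavy = weighted_cross_entropy(logits, targets, [0.5, 1.0, 4.0])
print(round(uniform, 4), round(rare_heavy, 4))
```

With a larger weight on class 2, the second sample (a poorly classified rare-class example) dominates the batch loss, which is the intended effect of the weighting.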

3.2.3. Model Initialization and Configuration

All model parameters were initialized from scratch using PyTorch’s default layer-specific initialization schemes; no pre-trained weights were used. The CLS token and positional embeddings were initialized from a normal distribution with zero mean and unit variance. Linear layers were initialized using PyTorch’s default Kaiming uniform initialization, with bias terms set to zero unless otherwise stated. Layer normalization parameters were set to unit scale (γ = 1) and zero bias (β = 0), which are PyTorch’s defaults.

Architectural choices such as patch size, dropout rate, and the number of attention heads were made configurable to allow exploration of the design space. Due to computational limitations, only a subset of the planned hyperparameter combinations was explored. The learning rate was fixed at 5 × 10⁻⁴, and the number of attention heads was set to 8 across all experiments.

3.3. Training Procedure

3.3.1. Optimization Configuration

Model training used the AdamW optimizer, a variant of Adam with decoupled weight decay that has demonstrated superior performance for transformer architectures. The learning rate was fixed at 5 × 10⁻⁴ across all experiments, and a weight decay coefficient of 0.01 was applied to all trainable parameters for regularization. Default values were used for the remaining AdamW parameters (β₁ = 0.9, β₂ = 0.999, ε = 1 × 10⁻⁸) as implemented in PyTorch.
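The distinguishing feature of AdamW, decoupled weight decay, can be seen in a single scalar update. The sketch below uses the hyperparameters stated above and follows the standard AdamW update rule; it is an illustration for one parameter, not the optimizer implementation used in the study, and `adamw_step` is a hypothetical name.

```python
import math


def adamw_step(param, grad, m, v, step, lr=5e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (param, m, v)."""
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient.
    param = param - lr * weight_decay * param
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(round(p, 6))
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude, plus the small decay term.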

Gradient clipping was used to improve training stability: gradients were clipped to a maximum L2 norm of 1.0 using torch.nn.utils.clip_grad_norm_. This prevented exploding gradients, which is a particular concern for transformer architectures trained on limited datasets, where gradient magnitudes can vary considerably.
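Clipping by global L2 norm rescales all gradients by a common factor whenever their combined norm exceeds the threshold. The sketch below mirrors that idea on nested Python lists; the small `eps` term in the denominator is an assumption modeled on the PyTorch utility, and `clip_by_global_norm` is a hypothetical helper.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


large, large_norm = clip_by_global_norm([[3.0], [4.0]])
small, small_norm = clip_by_global_norm([[0.3], [0.4]])
print(large_norm, large)
print(small_norm, small)
```

Because a single scale factor is applied to every parameter, the direction of the update is preserved and only its length is limited.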

3.3.2. Training Regimen and Early Stopping

Training proceeded for a maximum of 20 epochs per experiment, with early stopping applied to prevent overfitting and reduce computational overhead. The early stopping mechanism monitored validation accuracy with a patience of 5 epochs; training was terminated if validation accuracy did not improve for 5 consecutive epochs. The model state with the highest validation accuracy was preserved and used for the final test evaluation.

In each epoch, the training data was divided into batches of 32 samples. For each batch, gradients were computed and the model parameters were updated. Training loss and accuracy were recorded iteratively to monitor progress. At the end of each epoch, the model was evaluated on the validation set; for evaluation, the model was switched to evaluation mode, which deactivated dropout.
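The early-stopping rule described in this subsection can be sketched as a loop over per-epoch validation accuracies. The function below is an illustrative reading of the rule (an epoch counts as an improvement only if it strictly exceeds the best accuracy so far); the accuracy sequence is made up for demonstration and is not taken from the experiments.

```python
def run_with_early_stopping(val_accuracies, max_epochs=20, patience=5):
    """Return (best_epoch, best_accuracy, epochs_run) under patience-based stopping."""
    best_acc = float("-inf")
    best_epoch = 0
    stale_epochs = 0
    epochs_run = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        epochs_run = epoch
        if acc > best_acc:
            # New best: this is where the model state would be checkpointed.
            best_acc, best_epoch, stale_epochs = acc, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_acc, epochs_run


# Hypothetical validation accuracies, one value per epoch.
history = [0.30, 0.42, 0.45, 0.44, 0.43, 0.45, 0.41, 0.40, 0.39]
print(run_with_early_stopping(history))
```

Under this reading, the earliest possible stop is epoch 6, which occurs when the first epoch is never improved upon.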

3.3.3. Class-Weighting Strategy for Imbalanced Training

The ISIC 2019 dataset has an extremely long-tailed class distribution that requires explicit mitigation during model training. To address this imbalance at the objective level, a class-weighted cross-entropy loss function was used.

Class weights were computed using scikit-learn’s compute_class_weight function with the ‘balanced’ strategy, which assigns each class weight inversely proportional to its frequency in the training set. The class weights are computed as:

\omega_{c} = \frac{N}{C \times n_{c}} \qquad (1)

where N is the total number of training samples, C is the total number of classes, and n_{c} is the number of training samples in class c. These weights were converted to PyTorch tensors and passed to the CrossEntropyLoss function. No resampling strategies were applied; the original class distribution was preserved to reflect the realistic imbalance found in clinical data.

3.3.4. Hardware and Computational Environment

All experiments were conducted in a Google Colab environment (Google LLC, Mountain View, CA, USA) using an NVIDIA Tesla T4 GPU (16 GB memory; NVIDIA Corporation, Santa Clara, CA, USA) with CUDA acceleration. Owing to the substantial memory footprint of ViT-Base and the large image dataset, torch.cuda.empty_cache() was called after each experiment to release cached GPU memory.

Training time varied from approximately 60 to 134 min per run, depending on when early stopping was triggered. Each epoch consisted of 634 training and 159 validation batches.
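The batch counts per epoch follow from the split sizes and the batch size of 32, assuming that the final, partially filled batch is kept rather than dropped. A quick check:

```python
import math

BATCH_SIZE = 32
N_TRAIN, N_VAL = 20_264, 5_067  # split sizes reported in the text

train_batches = math.ceil(N_TRAIN / BATCH_SIZE)
val_batches = math.ceil(N_VAL / BATCH_SIZE)
last_train_batch = N_TRAIN - (train_batches - 1) * BATCH_SIZE

print(train_batches, val_batches, last_train_batch)
```

The last training batch holds only 8 samples under this assumption, which is consistent with the reported 634 training batches.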

3.3.5. Performance Evaluation Protocol

Model performance was evaluated using overall test accuracy and two F1-based metrics: the macro-averaged F1 score, which is the unweighted mean of the per-class F1 scores across all eight classes, and the weighted F1 score, in which the per-class F1 scores are averaged with weights equal to the support (number of samples) of each class. Per-class F1 scores were also computed to assess performance on individual classes, which are heavily imbalanced in this diagnostic setting.
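The difference between the two averaged F1 scores matters most when a model collapses onto the majority class. The sketch below computes both from scratch on a toy two-class example; it illustrates the definitions given above and does not replace scikit-learn's implementation. The data are invented.

```python
def per_class_f1(y_true, y_pred, n_classes):
    """Return per-class F1 scores and supports (true sample counts)."""
    scores, supports = [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
        supports.append(tp + fn)
    return scores, supports


def macro_and_weighted_f1(y_true, y_pred, n_classes):
    scores, supports = per_class_f1(y_true, y_pred, n_classes)
    macro = sum(scores) / n_classes
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted


# A classifier that always predicts the majority class (toy data).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred, n_classes=2)
print(round(macro, 4), round(weighted, 4))
```

In this example the weighted F1 stays comparatively high because the majority class dominates the average, whereas the macro F1 is pulled down by the zero score on the minority class.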

Confusion matrices were also computed to provide an overview of the model’s predictions across diagnostic classes. All metrics were computed using scikit-learn.

3.3.6. Hyperparameter Study and Statistical Analysis

A 2 × 2 × 2 × 2 factorial design was originally planned, varying learning rate (5 × 10⁻⁴, 1 × 10⁻³), number of attention heads (8, 12), patch size (16, 32), and dropout rate (0.1, 0.3), with three seed replications per condition (seeds 42, 123, 456), totaling 48 planned runs. Owing to the substantial training time per run (60–134 min) and limited computational resources, the design was reduced to a 2 × 2 factorial varying only patch size and dropout rate, with the learning rate fixed at 5 × 10⁻⁴ and the number of attention heads fixed at 8. Eleven runs were completed, covering all four patch–dropout combinations with 2–3 seed replications each.
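The run counts for the planned and reduced designs can be enumerated directly, which makes the bookkeeping explicit. The sketch only generates the grids from the factor levels listed above; it says nothing about which of the reduced-design runs was not completed.

```python
from itertools import product

SEEDS = (42, 123, 456)

# Originally planned 2 x 2 x 2 x 2 design.
planned = list(product((5e-4, 1e-3),   # learning rate
                       (8, 12),        # attention heads
                       (16, 32),       # patch size
                       (0.1, 0.3)))    # dropout rate
planned_runs = len(planned) * len(SEEDS)

# Reduced 2 x 2 design with learning rate and heads fixed.
reduced = list(product((16, 32), (0.1, 0.3)))
reduced_runs = len(reduced) * len(SEEDS)

print(len(planned), planned_runs)
print(len(reduced), reduced_runs)
for patch_size, dropout in reduced:
    print(f"patch={patch_size}, dropout={dropout}, seeds={SEEDS}")
```

The reduced grid allows at most 12 runs; the 11 completed runs reported above therefore imply that one configuration has only two seed replications.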

Training and validation loss and accuracy curves were monitored during all experiments to detect overfitting. The primary evaluation metric was test accuracy, with the macro-averaged F1 score, weighted F1 score, and overfitting gap (the difference between validation and test accuracy) reported as secondary metrics. Independent-samples t-tests were used to compare configurations, with Cohen’s d reported as the effect size where differences were significant. Pearson correlation coefficients were computed to examine relationships between hyperparameters and performance metrics.
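For reference, Cohen's d for two independent groups can be computed as below. The text does not state which variant of the effect size was used, so the pooled-standard-deviation form is an assumption here, and the accuracy values are hypothetical placeholders rather than results from the study.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical test accuracies (%) for two configurations, three seeds each.
config_a = [30.0, 32.0, 34.0]
config_b = [20.0, 22.0, 24.0]
print(cohens_d(config_a, config_b))
```

With only two or three seeds per configuration, estimates of this kind carry wide uncertainty, so effect sizes are best read alongside the raw per-seed values.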

4. Results

4.1. Training Stability and Convergence Analysis

In the hyperparameter optimization study, training stability problems were observed across all 11 experiments. None reached the maximum limit of 20 epochs; early stopping was triggered in every run because validation accuracy had either plateaued or begun to decline.

Training duration averaged 8.9 ± 3.0 epochs across all conditions before early stopping was triggered, indicating rapid saturation. Training length ranged from 6 to 17 epochs, with substantial variation even within identical hyperparameter configurations across seeds. This variability suggests that training instability was not solely attributable to specific hyperparameter choices but reflected a more fundamental challenge in applying Vision Transformers to the ISIC skin lesion dataset.
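The stopping behavior described here can be illustrated with a minimal sketch of patience-based early stopping on validation accuracy. The patience value and the validation curve below are hypothetical, since the exact stopping rule is not restated in this section; only the 20-epoch cap is taken from the text.

```python
def stopping_epoch(val_acc, patience=3, max_epochs=20):
    """Return the 1-based epoch at which training stops.

    Training stops once validation accuracy has failed to improve on its
    best value for `patience` consecutive epochs (hypothetical setting),
    or when the epoch cap or the end of the curve is reached.
    """
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(val_acc[:max_epochs], start=1):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_acc), max_epochs)

# Hypothetical validation curve that peaks at epoch 3 and then plateaus.
curve = [0.30, 0.41, 0.45, 0.44, 0.45, 0.43, 0.42]
print(stopping_epoch(curve))  # 6
```

Under such a rule, a curve that peaks within the first few epochs ends training well before the 20-epoch limit, consistent with the short training durations reported above.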

Overfitting severity was quantified through the validation–test accuracy gap, which averaged 22.7 ± 7.9 percentage points across all experiments. Seven of the eleven experiments (63.6%) showed severe overfitting, defined as a gap exceeding 20 percentage points; three more (27.3%) showed moderate overfitting, defined as a gap of 10–20 percentage points. Only one experiment had a gap below 10 percentage points, highlighting the pervasive nature of generalization failure across hyperparameter configurations.
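The severity thresholds and the reported shares can be expressed as a short sketch. The accuracy pairs are hypothetical, and the label for the lowest category is an assumption, as the text does not name it.

```python
def overfitting_category(val_acc, test_acc):
    """Classify a run by its validation-test gap, in percentage points."""
    gap = val_acc - test_acc
    if gap > 20:
        return "severe"
    if gap >= 10:
        return "moderate"
    return "below 10"  # hypothetical label; not named in the text

def share(count, total=11):
    """Percentage of the 11 completed runs, rounded to one decimal."""
    return round(100 * count / total, 1)

# Hypothetical (validation, test) accuracy pairs, in percent.
examples = [(45.0, 20.0), (40.0, 25.0), (30.0, 25.0)]
categories = [overfitting_category(v, t) for v, t in examples]

print(categories)                    # ['severe', 'moderate', 'below 10']
print(share(7), share(3), share(1))  # 63.6 27.3 9.1
```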

This pervasive overfitting suggests that the instability cannot be attributed solely to suboptimal hyperparameter choices. It more likely reflects the structural difficulty of training a complex transformer architecture from scratch under severe class imbalance and limited data. The absence of locality-based inductive biases, combined with high sensitivity to initialization, appears to make optimization particularly difficult, leading to unstable training and poor generalization to minority classes.

Several regularization approaches could help mitigate this instability, including more aggressive data augmentation, higher weight decay, higher dropout (with appropriate tuning), label smoothing, and imbalance-aware loss functions such as focal loss. The architectural inductive biases provided by hybrid CNN-Transformer architectures, or those acquired through pre-training on large-scale datasets, could also yield more stable feature representations under low-resource conditions. Such investigations are beyond the scope of the present study, which is intentionally focused on the standard ViT-Base model under realistic class imbalance.

4.2. Hyperparameter Effect Analysis

4.2.1. Statistical Comparison of Main Effects

The findings reveal substantial run-to-run variability in model performance, despite the reduced search space and fixed data splits (Table 6). Test accuracy varied from 8.0% to 37.8%, driven primarily by the random seed rather than by differences in training dynamics. Overfitting was persistent in all but one configuration, with validation–test gaps averaging above 20 percentage points.

This variability indicates that the optimization landscape is highly sensitive to initialization, especially when training from scratch on medical data that is both limited in size and imbalanced. In such settings, the weak inductive biases of ViT, combined with the large parameter count of ViT-Base, can amplify stochastic effects during training, causing training trajectories to vary considerably across seeds, particularly in generalization behavior. The result is substantial variability in test accuracy and macro-F1 scores even within identical hyperparameter configurations.

No clear pattern emerged between performance and either patch size or dropout setting. Larger patch sizes (32 × 32) sometimes yielded higher peak accuracy, but this effect was inconsistent across seeds. Increasing the dropout rate to 0.3 appeared to exacerbate overfitting rather than mitigate it, though this trend was not statistically significant. The consistently low macro-F1 scores across all configurations further highlight the difficulty of training standard ViT models directly on highly imbalanced medical imaging datasets.

4.2.2. Patch Size Effects

Statistical analysis of patch-size effects on test accuracy revealed no significant difference between 16 × 16 and 32 × 32 patches (Figure 2). The 16 × 16 patch size achieved a mean test accuracy of 23.5 ± 7.3% (n = 6), whereas the 32 × 32 patch size yielded 22.3 ± 11.6% (n = 5) (t = 0.218, p = 0.832, Cohen’s d = 0.132). The overlapping confidence intervals indicated substantial uncertainty in the effect magnitude, reflecting the high inter-experiment variability.

In each box, the central horizontal line denotes the median, and the box edges mark the first and third quartiles (interquartile range, IQR). The vertical whiskers extend to the most extreme values within 1.5 × IQR, and the overlaid dots represent individual experimental runs, with points beyond the whiskers indicating outliers. The colored marker within each box indicates the mean. Colors distinguish the configurations within each panel (16 × 16 vs. 32 × 32 patch size; dropout 0.1 vs. 0.3) and carry no additional meaning. At small sample sizes, the notch (95% confidence interval of the median) may extend beyond the box edges; this is a rendering artifact of notched box plots and does not indicate a data anomaly.

Despite comparable final test performance, patch size had a noticeable impact on training dynamics. Experiments with 32 × 32 patches required substantially longer training (11.0 ± 4.8 epochs) than those with 16 × 16 patches (7.2 ± 0.6 epochs), suggesting greater optimization difficulty for larger patches. This longer training translated into higher computational cost: 32 × 32 patch experiments averaged 87.6 min, compared with 72.7 min for 16 × 16 patches.

4.2.3. Dropout Rate Effects

Varying the dropout rate showed a non-significant trend favoring lower values (Figure 2). A dropout rate of 0.1 achieved a mean test accuracy of 25.0 ± 7.2% (n = 6), whereas a rate of 0.3 yielded 20.5 ± 11.1% (n = 5) (t = 0.811, p = 0.438, Cohen’s d = 0.491). However, a positive correlation between dropout rate and overfitting gap was observed (r = 0.507, p ≈ 0.11), suggesting that higher dropout rates may be associated with larger validation–test gaps, although this relationship did not reach statistical significance.
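As a rough consistency check, the test statistics for the patch-size and dropout comparisons can be approximated from the reported group summaries. The sketch below assumes a pooled-variance (Student) t-test and treats the ± values as sample standard deviations; neither assumption is confirmed in the text, and because the inputs are rounded, the results are expected to land near, not exactly on, the reported values.

```python
from math import sqrt

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def student_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic computed from group summaries."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    return (mean1 - mean2) / (sp * sqrt(1 / n1 + 1 / n2))

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

# Reported summaries (mean, SD, n) of test accuracy in percent.
patch = (23.5, 7.3, 6, 22.3, 11.6, 5)     # 16 x 16 vs. 32 x 32
dropout = (25.0, 7.2, 6, 20.5, 11.1, 5)   # dropout 0.1 vs. 0.3

print(round(student_t(*patch), 2), round(cohens_d(*patch), 2))      # 0.21 0.13
print(round(student_t(*dropout), 2), round(cohens_d(*dropout), 2))  # 0.81 0.49
```

Both comparisons have 9 degrees of freedom under this assumption, so t statistics of this size fall far short of conventional significance thresholds, in line with the reported p-values.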

4.2.4. Reproducibility Assessment

Performance varied substantially across hyperparameter configurations and seeds, with an average coefficient of variation (CV) of 30.2%. Variability was largest for the 32 × 32 patch size with 0.3 dropout (CV = 44.2%) and smallest for the 16 × 16 patch size with 0.1 dropout (CV = 10.0%). This substantial variability reduced confidence in isolating the effects of individual hyperparameters and indicated that the outcomes were driven largely by random variation rather than by the hyperparameter settings.
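For reference, the coefficient of variation is the standard deviation expressed as a percentage of the mean. The sketch below uses the sample standard deviation, which is an assumption (the text does not state which estimator was used), and the accuracies are hypothetical rather than taken from the experiments.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100 * stdev(values) / mean(values)

# Hypothetical test accuracies (percent) for three seeds of one configuration.
accuracies = [20.0, 25.0, 30.0]
print(coefficient_of_variation(accuracies))  # 20.0
```

With only two or three seeds per configuration, such an estimate is itself highly uncertain, which is one reason the per-configuration CV values should be read as indicative.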

4.3. Performance Metrics and Class-Wise Analysis

4.3.1. Model Performance Distribution

Test accuracy across all experiments ranged from 8.0% to 37.8%, with a mean of 23.0 ± 9.0%. The best-performing experiment (ID 5) achieved 37.8% with a 16 × 16 patch size, 0.3 dropout, and seed 123, whereas the worst (8.0%, ID 10) used a 32 × 32 patch size, 0.3 dropout, and seed 42. This 4.7-fold range within a limited hyperparameter space highlights the pronounced sensitivity of training outcomes to random initialization.

Detailed confusion matrix analysis of the best and worst performing experiments revealed contrasting error patterns. The best-performing experiment showed a relatively balanced accuracy distribution across the major classes, indicating that the classifier handled them reasonably well. In contrast, the worst-performing experiment exhibited a strong prediction bias towards a rare class, Actinic Keratosis (AK), even though AK comprised only 3.4% of the training set.

4.3.2. Class-Specific Performance Analysis

Performance varied considerably across diagnosis classes, consistent with the class imbalance in the training set (Table 7). Nevus (NV), the largest class (50.8% of the training set), was also the best-performing class, with a mean F1 score of 0.449. The second-best-performing class was Melanoma (MEL) with a mean F1 score of 0.181.

For the three least frequent classes (DF, VASC, and SCC), F1-scores were consistently below 0.05 across all experimental conditions. Despite class-weighted optimization, the model failed to learn stable representations for these classes, instead favoring the most frequent ones. These class-wise results show that class weighting alone does not mitigate the effects of extreme data imbalance, particularly when data are limited.

4.4. Correlation Analysis and Training Dynamics

4.4.1. Performance-Overfitting Relationships

Correlation analysis was used to examine relationships between the experimental variables and performance metrics (Figure 3). As expected, the three performance metrics—test accuracy, macro-F1, and weighted-F1—were strongly intercorrelated (r = 0.90–0.98), since they summarize the same test predictions. Test accuracy was also strongly negatively correlated with the overfitting gap (r = −0.920); however, because the gap is defined using test accuracy, this relationship—and the related correlations of the gap with macro-F1 (r = −0.796) and weighted-F1 (r = −0.897)—is largely definitional and is not interpreted further. More informatively, validation accuracy was only moderately correlated with test accuracy (r = 0.481, not significant, p ≈ 0.13), indicating that validation performance was an unreliable predictor of test performance, consistent with the high run-to-run instability reported above.
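The definitional nature of the correlation between test accuracy and the gap can be shown with a small constructed example. In the sketch below, the validation and test values are hypothetical and deliberately uncorrelated with each other, yet test accuracy still correlates negatively with the gap because the gap is computed as validation minus test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Hypothetical, mutually uncorrelated validation and test scores.
val = [0.0, 1.0, 0.0, 1.0]
test = [0.0, 0.0, 1.0, 1.0]
gap = [v - t for v, t in zip(val, test)]

print(pearson_r(val, test))            # 0.0
print(round(pearson_r(test, gap), 3))  # -0.707
```

When validation and test accuracy are uncorrelated and equally variable, the expected correlation between test accuracy and the gap is −1/√2 ≈ −0.71, so a strongly negative value arises even when validation accuracy carries no information about test accuracy.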

Among the hyperparameters, neither patch size (r = −0.073) nor dropout rate (r = −0.261) was significantly correlated with test accuracy or F1, reinforcing that performance was dominated by random initialization rather than by hyperparameter choices. The only significant hyperparameter relationship was between patch size and the number of epochs trained (r = 0.657, p < 0.05), with larger patches requiring longer training. Training time was, in turn, strongly associated with the number of epochs (r = 0.946) and showed only a moderate, non-significant association with test accuracy (r = 0.454); the latter likely reflects the interaction between training duration and early stopping rather than a causal effect.

4.4.2. Training Curve Characteristics

Analysis of the training curves revealed several consistent trends across all experiments. In general, validation accuracy increased during the initial epochs, whereas training accuracy continued to improve even after validation accuracy had plateaued or begun to decline. This divergence, typically beginning within the first three to five epochs, is an early sign of overfitting. None of the experiments showed the stable loss convergence expected of well-regularized training; instead, the instability appears to stem from several interacting factors: the limited amount of training data, the weak inductive biases of the architecture, and a mismatch between plain Vision Transformers and this dermoscopic classification task. Because the experiments used multiple random seeds and hyperparameter configurations, the training curves varied considerably across runs, so presenting individual curves could be misleading. Training behavior is therefore described using aggregate measures (epochs to early stopping, the validation–test performance gap, and cross-experiment comparisons) to better characterize the observed instability.

5. Discussion

5.1. Training Stability and Architectural Mismatch

The systematic failure of all 11 experiments to achieve convergence represents a fundamental challenge in applying Vision Transformers to medical imaging datasets with limited scale and severe class imbalance. The consistent early stopping within 8.9 ± 3.0 epochs, regardless of hyperparameter configuration, suggests that this instability was not merely a consequence of suboptimal hyperparameter selection but reflects a deeper architectural incompatibility between the ViT design and the ISIC skin lesion classification task.

Vision Transformers were originally developed and validated on large-scale natural image datasets, particularly ImageNet, with approximately 14 million images across 21,000 categories. The architectural assumptions underlying the ViT design, including patch-based tokenization, multi-head attention, and deep transformer layers, were optimized for settings with abundant training data and relatively balanced classes. In contrast, the ISIC 2019 dataset provides only 25,331 training images across eight highly imbalanced diagnostic classes, roughly a 550-fold reduction in scale relative to the ViT pre-training context.

The rapid onset of overfitting across all experiments, with validation–test accuracy gaps averaging 22.7 ± 7.9%, indicates that the models stopped learning generalizable representations within the first few epochs. This was particularly pronounced in the experiments with severe overfitting (63.6% of cases), where validation performance plateaued while training accuracy continued to improve, suggesting that the models memorized training examples rather than learning robust diagnostic features. That training accuracy kept rising while validation and test performance stagnated indicates that the limitation lay in generalization rather than in model capacity.

5.2. Hyperparameter Insensitivity and Optimization Challenges

No significant differences were found for patch size (p = 0.832) or dropout rate (p = 0.438), indicating that performance was largely insensitive to hyperparameters generally assumed to influence regularization and generalization. This non-significance should be interpreted cautiously, given the small number of runs per condition (n = 5–6); nonetheless, any hyperparameter effects were clearly small relative to the seed-driven variability that dominated the results. Overall, the instability appears to stem from deeper architectural and data-related factors rather than from suboptimal hyperparameter tuning.

A positive but non-significant correlation was also observed between dropout rate and the overfitting gap (r = 0.507, p ≈ 0.11). Although this trend did not reach significance, its direction is notable: dropout is intended to reduce overfitting by randomly deactivating neurons during training, yet here higher dropout was associated with larger, not smaller, generalization gaps. If genuine, this pattern would be consistent with the limited training data, where aggressive regularization may prevent the model from learning even basic discriminative features.

The high cross-seed variability (mean coefficient of variation of 30.2%) indicates that stochastic factors had a greater influence on the results than the choice of hyperparameters. This suggests that training outcomes were governed largely by random initialization rather than by deliberate architectural or hyperparameter choices, which poses a serious challenge for clinical deployment, where reliability and reproducibility are essential.

5.3. Class Imbalance and Medical Domain Challenges

While this study did not investigate data augmentation, augmentation strategies tailored to dermoscopic imagery have been shown to improve performance in similar settings [42]. Given the severe overfitting and limited training data, such methods represent a promising direction for future work.

The limited performance on rare diagnostic classes represents a major challenge for transformer-based approaches in this setting. Despite class-balanced loss weighting, the F1 score for the least frequent classes (Dermatofibroma, Vascular Lesion, and Squamous Cell Carcinoma) was consistently below 0.05 across all experimental configurations, indicating that the models failed to classify samples from these classes regardless of hyperparameter choices. This points to a fundamental limitation of standard imbalance mitigation strategies in this regime.

The extreme class imbalance ratio of 56:1 between Nevus (the most common class, 50.8%) and Dermatofibroma (the least common, 0.9%) creates a learning environment in which the models receive minimal exposure to minority classes during training. Some of these minority classes, such as Squamous Cell Carcinoma, correspond to clinically important diagnostic decisions. The systematic bias of the models toward the majority classes observed here indicates that they are unsuitable for clinical deployment.
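The imbalance ratio and its consequence for per-batch exposure follow directly from the two class shares quoted above. The sketch below is illustrative: the inverse-frequency weighting is one common form of class-balanced loss weighting and is assumed here, since the exact weighting formula is not restated in this section, and the batch size is hypothetical.

```python
# Class shares of the training set as stated in the text.
shares = {"NV": 0.508, "DF": 0.009}

# Ratio of the most common to the least common class.
imbalance_ratio = shares["NV"] / shares["DF"]

def inverse_frequency_weights(class_shares):
    """Weights proportional to 1 / share, normalized to sum to 1 (assumed scheme)."""
    raw = {c: 1.0 / s for c, s in class_shares.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

weights = inverse_frequency_weights(shares)

# Expected number of samples per class in a batch (hypothetical batch size).
batch_size = 32
expected_per_batch = {c: batch_size * s for c, s in shares.items()}

print(round(imbalance_ratio, 1))             # 56.4
print(round(weights["DF"] / weights["NV"], 1))  # 56.4
print(round(expected_per_batch["DF"], 2))    # 0.29
```

Under this assumed scheme, each Dermatofibroma sample carries about 56 times the loss weight of a Nevus sample, yet a batch of 32 contains fewer than one such sample on average. Reweighting changes how much each rare sample counts but not how often, or in how many variations, the model sees the class.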

Furthermore, the architectural premise of Vision Transformers may be fundamentally unsuited to dermoscopic image analysis. Such analysis depends on subtle textural and chromatic features that require discrimination at a finer scale than the object-level recognition for which ViT was designed. The patch-based tokenization strategy may therefore be disadvantageous in this setting.

5.4. Computational Limitations and Methodological Constraints

The incomplete factorial design, covering only 11 of 48 planned runs (4 of 16 planned hyperparameter configurations), represents a substantial limitation that constrains the generalizability of these findings. Computational resource limitations on the Google Colab infrastructure required fixing the learning rate (5 × 10⁻⁴) and the number of attention heads (8), preventing exploration of potentially critical hyperparameter interactions. The limited range of tested configurations may have missed hyperparameter combinations that could have yielded improved performance.

The small number of random seed replications (2–3 per condition) further limits the statistical power for detecting hyperparameter effects, particularly given the high inter-experiment variability. Robust assessment of training stability and reproducibility would require substantially larger sample sizes, on the order of 10–15 replications per condition, to distinguish genuine hyperparameter effects from stochastic variation. The high coefficient of variation observed across conditions suggests that the present design was underpowered to detect meaningful differences between hyperparameter settings.

Furthermore, relying solely on validation accuracy for early stopping may have terminated some experiments before convergence, which might have been reached with more extensive training. However, the consistent plateau in the validation accuracy curves across experiments suggests that extended training would more likely have exacerbated overfitting than improved generalization.

5.5. Clinical Implications and Model Reliability

The extreme variability in model performance across identical hyperparameter configurations raises fundamental questions about the reliability and clinical deployability of Vision Transformer approaches for medical image classification. In clinical settings, diagnostic models must demonstrate consistent and predictable behavior to earn the trust of healthcare practitioners and regulatory authorities. The observed 4.7-fold performance range (8.0% to 37.8% test accuracy) across the tested experiments would raise serious concerns about clinical reliability.

Moreover, the systematic failure to detect rare but clinically important conditions represents a critical safety concern. Medical diagnostic models that exhibit high sensitivity for common conditions while missing rare diseases could create false confidence in automated screening systems, potentially leading to delayed diagnosis and adverse patient outcomes. The performance patterns observed here suggest that ViT-based approaches, at least under the conventional training strategies investigated in this study, may not yet provide sufficient reliability for applications requiring robust detection of minority classes in medical image classification.

The training instability also has implications for model maintenance and updates in clinical environments. Healthcare institutions implementing automated diagnostic systems require models that can be retrained on new data while maintaining consistent performance. The high sensitivity to initialization observed in this study would make such systems difficult to maintain and update reliably.

5.6. Comparison with Literature and Methodological Insights

The findings of the present study contrast markedly with successful applications of Vision Transformers in medical imaging reported in the recent literature. Careful examination of these successful implementations reveals several critical differences that may explain the divergent outcomes: many successful ViT applications in medical imaging have used substantially larger datasets, employed sophisticated pre-training strategies, or focused on binary classification tasks that avoid the severe class imbalance encountered in multi-class skin lesion classification.

For instance, studies reporting successful ViT performance in radiology applications typically use datasets with hundreds of thousands of images and employ transfer learning from models pre-trained on large-scale natural image corpora. The scale disparity between these successful applications and the present study highlights the data-hungry nature of transformer architectures and suggests that ViT approaches may require substantially larger training datasets than traditional CNN architectures to achieve stable training.

The methodological insights from this hyperparameter analysis contribute to the growing understanding of transformer limitations in specialized domains. The observation that conventional regularization (dropout) and architectural choices (patch size) yielded no detectable benefit in this context informs future research directions and highlights the need for domain-specific adaptations of transformer architectures for medical applications.

5.7. Study Limitations and Future Research Directions

The present study has several limitations. The incomplete factorial design prevented comprehensive exploration of hyperparameter interactions, particularly those involving learning rate and the number of attention heads. Future studies should prioritize more extensive hyperparameter searches, potentially using automated optimization techniques to efficiently explore larger parameter spaces within computational constraints.

The exclusive focus on standard ViT architectures, without exploration of recent variants designed for limited data scenarios, represents another limitation. Emerging approaches such as hybrid CNN–ViT architectures, compact ViT variants and domain-adaptive transformer designs may help mitigate some of the limitations observed here. Similarly, data augmentation techniques designed for medical images and transfer learning strategies using medical image pre-trained models could help mitigate the data scarcity challenges encountered.

Future research should also investigate alternative approaches to address class imbalance beyond simple loss-function weighting, including sampling strategies, synthetic data generation, or hierarchical classification schemes that group related diagnostic classes. Systematic evaluation of ensemble methods combining multiple models or architectures could also improve reliability and reduce sensitivity to initialization.

5.8. Implications for Medical AI Development

These findings have broader implications for the development and deployment of artificial intelligence systems in medical imaging. The training instability and hyperparameter insensitivity highlight the importance of rigorous experimental validation before deploying deep learning models in clinical contexts. The high variability across random seeds suggests that single-model implementations may be insufficient for reliable clinical deployment without additional stabilization strategies such as ensembling or controlled initialization.

Furthermore, the systematic failure to achieve meaningful performance on rare diagnostic classes underscores the need for specialized approaches to severe class imbalance in medical datasets. Standard machine learning techniques developed for balanced datasets may require substantial modification or replacement when applied to medical classification tasks where minority classes may carry clinical importance.

The computational resource requirements and training instability observed in this study also have practical implications for healthcare institutions considering AI deployment. The need for extensive hyperparameter exploration and multiple training runs to achieve reliable results may exceed the computational resources available in many clinical settings, requiring cloud-based solutions or investment in specialized hardware.

These findings argue for continued development of architectures specifically designed for medical imaging, rather than direct adaptation of models developed for natural images. The fundamental differences between medical and natural images, including subtle textural features, extreme class imbalance, and limited data availability, suggest that purpose-built architectures may be necessary to achieve reliable clinical performance.

5.9. Theoretical Considerations on Training Stability

The observed instability can be partially explained by considering the inductive bias of the model and the complexity of the training data [23]. Convolutional neural networks have an inherent inductive bias toward locality and translation invariance, which aligns well with the fine-grained texture features in the dermoscopic images [43]. The Vision Transformer, by contrast, relies on global self-attention, which, in the absence of locality constraints, may increase the sample complexity required for stable learning.

Transformer-based architectures have shown favorable scaling properties on large-scale datasets, whereas their behavior on smaller-scale datasets remains less well understood. Under extreme class imbalance, loss gradients may be dominated by the majority classes, leading to unstable representation learning.

From an optimization perspective, the model’s high sensitivity to random-seed initialization may reflect the presence of multiple unstable local minima in the parameter space, particularly under long-tailed distributions. In such cases, the model may converge to parameter settings biased toward the majority classes while failing to learn discriminative representations for the minority classes. Together, these theoretical considerations provide a plausible account of the instability observed across experimental configurations.

6. Conclusions

The performance of the standard Vision Transformer architecture (ViT-Base) trained from scratch was investigated for multi-class dermoscopic image classification under the realistic clinical conditions of limited data availability and extreme class imbalance, using the ISIC 2019 dataset. Across all experiments, substantial training instability, overfitting, and poor generalization were observed.

Performance was highly sensitive to random-seed initialization, with substantial variability in test accuracy and consistently low macro-F1 scores. Changes to patch size and dropout rate did not produce detectable improvements; higher dropout was associated with larger overfitting gaps rather than smaller ones, although this trend did not reach statistical significance. These findings indicate that standard ViT-Base models trained from scratch may be limited on data-scarce, class-imbalanced medical image classification tasks. The results apply specifically to the train-from-scratch setting and do not extend to pre-trained transformers or hybrid CNN-Transformer architectures, which may benefit from better inductive biases and optimization. Further work is needed to determine whether large-scale pre-training or hybrid CNN-Transformer designs can alleviate the instability observed here. More broadly, this study highlights the need to critically examine the assumptions of standard deep learning architectures developed on large-scale natural images when applying them to the limited-data settings of clinical imaging.

Author Contributions

Conceptualization, S.A.; methodology, S.A.; formal analysis, S.A.; investigation, S.A.; resources, S.A.; data curation, S.A.; writing—original draft preparation, S.A., P.D. and I.B.; writing—review and editing, S.A., P.D. and I.B.; visualization, S.A.; supervision, P.D. and I.B.; project administration, P.D. and I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Projects Commission of Aydin Adnan Menderes University, Research Projects Supporting Publication Continuity (MF-26006; Principal Investigator: Pinar Demircioglu).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

The authors thank the Scientific Research Projects Coordination Unit (BAP) of Aydin Adnan Menderes University for institutional support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks. Nature 2017, 542, 115–118.
Aksoy, S. Multi-Input Melanoma Classification Using MobileNet-V3-Large Architecture. J. Autom. Mob. Robot. Intell. Syst. 2025, 19, 73–84.
Brinker, T.J.; Hekler, A.; Enk, A.H.; Klode, J.; Hauschild, A.; Berking, C.; Schilling, B.; Haferkamp, S.; Schadendorf, D.; Holland-Letz, T.; et al. Deep Learning Outperformed 136 of 157 Dermatologists in a Head-to-Head Dermoscopic Melanoma Image Classification Task. Eur. J. Cancer 2019, 113, 47–54.
ISIC International Skin Imaging Collaboration. Available online: https://www.isic-archive.com/ (accessed on 19 May 2026).
Hekler, A.; Kather, J.N.; Krieghoff-Henning, E.; Utikal, J.S.; Meier, F.; Gellrich, F.F.; Upmeier Zu Belzen, J.; French, L.; Schlager, J.G.; Ghoreschi, K.; et al. Effects of Label Noise on Deep Learning-Based Skin Cancer Classification. Front. Med. 2020, 7, 177.
Rotemberg, V.; Kurtansky, N.; Betz-Stablein, B.; Caffery, L.; Chousakos, E.; Codella, N.; Combalia, M.; Dusza, S.; Guitera, P.; Gutman, D.; et al. A Patient-Centric Dataset of Images and Metadata for Identifying Melanomas Using Clinical Context. Sci. Data 2021, 8, 34.
Bisla, D.; Choromanska, A.; Berman, R.S.; Stein, J.A.; Polsky, D. Towards Automated Melanoma Detection with Deep Learning: Data Purification and Augmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Long Beach, CA, USA, 15–20 June 2019; pp. 2720–2728.
Bissoto, A.; Fornaciali, M.; Valle, E.; Avila, S. (De) Constructing Bias on Skin Lesion Datasets. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Long Beach, CA, USA, 15–20 June 2019; pp. 2766–2774.
Hasan, M.K.; Elahi, M.T.E.; Alam, M.A.; Jawad, M.T.; Martí, R. DermoExpert: Skin Lesion Classification Using a Hybrid Convolutional Neural Network through Segmentation, Transfer Learning, and Augmentation. Inform. Med. Unlocked 2022, 28, 100819.
Brinker, T.J.; Hekler, A.; Enk, A.H.; Berking, C.; Haferkamp, S.; Hauschild, A.; Weichenthal, M.; Klode, J.; Schadendorf, D.; Holland-Letz, T.; et al. Deep Neural Networks Are Superior to Dermatologists in Melanoma Image Classification. Eur. J. Cancer 2019, 119, 11–17.
Le, D.N.T.; Le, H.X.; Ngo, L.T.; Ngo, H.T. Transfer Learning with Class-Weighted and Focal Loss Function for Automatic Skin Cancer Classification. arXiv 2020, arXiv:2009.05977.
Carcagnì, P.; Leo, M.; Cuna, A.; Mazzeo, P.L.; Spagnolo, P.; Celeste, G.; Distante, C. Classification of Skin Lesions by Combining Multilevel Learnings in a DenseNet Architecture. In Image Analysis and Processing—ICIAP 2019; Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., Sebe, N., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11751, pp. 335–344. ISBN 978-3-030-30641-0.
Tschandl, P.; Codella, N.; Akay, B.N.; Argenziano, G.; Braun, R.P.; Cabo, H.; Gutman, D.; Halpern, A.; Helba, B.; Hofmann-Wellenhof, R.; et al. Comparison of the Accuracy of Human Readers versus Machine-Learning Algorithms for Pigmented Skin Lesion Classification: An Open, Web-Based, International, Diagnostic Study. Lancet Oncol. 2019, 20, 938–947.
Ratul, M.A.R.; Mozaffari, M.H.; Lee, W.-S.; Parimbelli, E. Skin Lesions Classification Using Deep Learning Based on Dilated Convolution. bioRxiv 2019.
Gessert, N.; Nielsen, M.; Shaikh, M.; Werner, R.; Schlaefer, A. Skin Lesion Classification Using Ensembles of Multi-Resolution EfficientNets with Meta Data. MethodsX 2020, 7, 100864.
Kassem, M.A.; Hosny, K.M.; Fouad, M.M. Skin Lesions Classification into Eight Classes for ISIC 2019 Using Deep Convolutional Neural Network and Transfer Learning. IEEE Access 2020, 8, 114822–114832.
Rezvantalab, A.; Safigholi, H.; Karimijeshni, S. Dermatologist Level Dermoscopy Skin Cancer Classification Using Different Deep Learning Convolutional Neural Networks Algorithms. arXiv 2018, arXiv:1810.10348.
Xie, Y.; Zhang, J.; Xia, Y.; Shen, C. A Mutual Bootstrapping Model for Automated Skin Lesion Segmentation and Classification. IEEE Trans. Med. Imaging 2020, 39, 2482–2493.
Hasan, M.K.; Dahal, L.; Samarakoon, P.N.; Tushar, F.I.; Martí, R. DSNet: Automatic Dermoscopic Skin Lesion Segmentation. Comput. Biol. Med. 2020, 120, 103738.
Goyal, M.; Oakley, A.; Bansal, P.; Dancey, D.; Yap, M.H. Skin Lesion Segmentation in Dermoscopic Images with Ensemble Deep Learning Methods. IEEE Access 2020, 8, 4171–4181.
Tang, P.; Liang, Q.; Yan, X.; Xiang, S.; Sun, W.; Zhang, D.; Coppola, G. Efficient Skin Lesion Segmentation Using Separable-Unet with Stochastic Weight Averaging. Comput. Methods Programs Biomed. 2019, 178, 289–301.
ISIC Challenge 2018 Dataset. Available online: https://challenge.isic-archive.com/landing/2018/ (accessed on 1 April 2025).
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. arXiv 2020, arXiv:2012.12877. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Computer Vision—ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2023; Volume 13803, pp. 205–218. ISBN 978-3-031-25065-1. [Google Scholar]
Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 Dataset, a Large Collection of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions. Sci. Data 2018, 5, 180161. [Google Scholar] [CrossRef] [PubMed]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar] [CrossRef]
King, G.; Zeng, L. Logistic Regression in Rare Events Data. Polit. Anal. 2001, 9, 137–163. [Google Scholar] [CrossRef]
Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling Representation and Classifier for Long-Tailed Recognition. arXiv 2019, arXiv:1910.09217. [Google Scholar] [CrossRef]
Azad, R.; Kazerouni, A.; Heidari, M.; Aghdam, E.K.; Molaei, A.; Jia, Y.; Jose, A.; Roy, R.; Merhof, D. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review. Med. Image Anal. 2024, 91, 103000. [Google Scholar] [CrossRef]
Xiao, H.; Li, L.; Liu, Q.; Zhu, X.; Zhang, Q. Transformers in Medical Image Segmentation: A Review. Biomed. Signal Process. Control 2023, 84, 104791. [Google Scholar] [CrossRef]
Pu, Q.; Xi, Z.; Yin, S.; Zhao, Z.; Zhao, L. Advantages of Transformer and Its Application for Medical Image Segmentation: A Survey. Biomed. Eng. OnLine 2024, 23, 14. [Google Scholar] [CrossRef]
Sun, G.; Pan, Y.; Kong, W.; Xu, Z.; Ma, J.; Racharak, T.; Nguyen, L.-M.; Xin, J. DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation. Front. Bioeng. Biotechnol. 2024, 12, 1398237. [Google Scholar] [CrossRef]
Shi, Y.; Li, F.; Zhao, S.; Yu, H.; Chen, X.; Liu, Q. IAP-TransUNet: Integration of the Attention Mechanism and Pyramid Pooling for Medical Image Segmentation. Front. Neurorobot. 2025, 19, 1706626. [Google Scholar] [CrossRef]
Zeng, Z.; Xiao, J.; Yi, S.; Liu, Q.; Zhu, Y. M3-TransUNet: Medical Image Segmentation Based on Spatial Prior Attention and Multi-Scale Gating. J. Imaging 2025, 12, 15. [Google Scholar] [CrossRef] [PubMed]
Krishnan, P.T.; Krishnadoss, P.; Khandelwal, M.; Gupta, D.; Nihaal, A.; Kumar, T.S. Enhancing Brain Tumor Detection in MRI with a Rotation Invariant Vision Transformer. Front. Neuroinform. 2024, 18, 1414925. [Google Scholar] [CrossRef]
Sankari, C.; Jamuna, V.; Kavitha, A.R. Hierarchical Multi-Scale Vision Transformer Model for Accurate Detection and Classification of Brain Tumors in MRI-Based Medical Imaging. Sci. Rep. 2025, 15, 38275. [Google Scholar] [CrossRef]
Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in Medical Imaging: A Survey. Med. Image Anal. 2023, 88, 102802. [Google Scholar] [CrossRef] [PubMed]
Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural Networks? arXiv 2021, arXiv:2108.08810. [Google Scholar] [CrossRef]

Figure 1. Architectural configuration of the standard Vision Transformer (ViT-Base) model employed in this study.

Figure 2. Box plots illustrating the effect of patch size and dropout rate on test accuracy across experimental configurations.

Figure 3. Performance metrics correlation matrix.

Table 1. Positioning of the present study relative to representative CNN- and ViT-based approaches in dermoscopic image classification.

Study Type	Pretraining	Hybrid CNN–ViT	Imbalance Handling	Stability Analysis	Proposed Study
CNN-based	Yes	Not applicable	Weighted/Focal	No	–
Hybrid CNN–ViT	Yes	Yes	Yes	No	–
Pure ViT (pretrained)	Yes	No	Limited	No	–
This work	No	No	Weighted CE	Yes (multi-seed)	√

Table 2. Representative transformer backbones and medical imaging adaptations.

Refs	Domain/Task	Key Idea	Typical Training Requirement
ViT [23]	Natural images/classification	Patch tokenization + global self-attention.	Large-scale pretraining or large supervised data.
DeiT [24]	Natural images/classification	Data-efficient training via transformer-specific distillation.	ImageNet-scale data; distillation improves efficiency.
Swin Transformer [25]	General vision backbone	Hierarchical shifted-window attention for locality and scalability.	Standard supervised training; widely used as backbone.
TransUNet [26]	Medical imaging/segmentation	Hybrid CNN features + transformer encoder within U-Net decoder.	Benefits from pretrained encoders and strong locality priors.
Swin-Unet [27]	Medical imaging/segmentation	U-shaped network built from Swin blocks for local-global fusion.	Often trained with augmentation; benefits from hierarchical windows.
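
The patch-tokenization idea listed for ViT in Table 2 determines how many tokens self-attention has to process. The following is a minimal sketch of that arithmetic for the two patch sizes examined in this study (16 and 32). The 224 × 224 RGB input resolution is an assumption made for illustration and is not taken from the table, and `patch_grid` is a hypothetical helper name.

```python
def patch_grid(image_size: int, patch_size: int, channels: int = 3):
    """Return (number of patch tokens, values per flattened patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size          # patches along one edge
    tokens = per_side * per_side                 # non-overlapping square patches
    patch_dim = patch_size * patch_size * channels
    return tokens, patch_dim


# Assumed 224 x 224 input; patch sizes 16 and 32 as in the hyperparameter study.
for p in (16, 32):
    tokens, dim = patch_grid(224, p)
    print(f"patch size {p}: {tokens} tokens, {dim} values per patch")
```

Under this assumption, doubling the patch size cuts the token count by a factor of four, so each image offers far fewer positions for attention to relate to one another.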

Table 3. Common objective-level strategies for addressing extreme class imbalance in long-tailed vision and medical image classification.

Method	Core Mechanism	Applicability	Practical Limitations
Weighted cross-entropy [30]	Re-weights classes inversely proportional to class frequency.	Mild-to-moderate imbalance; simple baseline.	Can over-compensate and destabilize optimization.
Focal loss [29]	Down-weights well-classified samples and focuses learning on hard examples.	Foreground/background or long-tailed settings.	Needs tuning (γ, α); may hurt calibration.
Class-balanced loss [31]	Uses effective number of samples to compute weights.	Long-tailed datasets with moderate label noise.	Still limited by extremely scarce minority samples.
Re-sampling (over/under) [32]	Adjusts mini-batch class composition through over- or under-sampling.	Improves minority exposure without changing loss.	Risk of overfitting minority; duplicates amplify noise.
Two-stage training/fine-tuning [33]	Representation learning followed by class-balanced fine-tuning.	When representation learning is data-limited.	May require careful scheduling and validation.
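
Two of the mechanisms in Table 3 can be written in a few lines. The sketch below gives the per-sample focal loss [29] and weights based on the effective number of samples [31] in plain Python. The function names, the default values of `gamma`, `alpha`, and `beta`, and the normalisation of the weights are illustrative assumptions, not settings used in this study.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one sample, given the predicted probability of its true class.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def class_balanced_weights(counts, beta: float = 0.999):
    """Weights proportional to 1 / effective number, (1 - beta) / (1 - beta**n).

    The weights are rescaled to sum to the number of classes (an illustrative choice).
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


# A confidently correct sample contributes much less under focal loss than under CE.
print(focal_loss(0.9, gamma=0.0), focal_loss(0.9, gamma=2.0))

# Hypothetical two-class example: 1000 majority samples versus 10 minority samples.
print(class_balanced_weights([1000, 10]))
```

Both functions only rescale the contribution of samples that are already present. Neither adds information about a class with very few images, which is the limitation noted in the last column of Table 3.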

Table 4. Recent transformer-based studies in medical image analysis.

Study	Task	Transformer Model	Methodological Contribution
Azad et al. (2024) [34]	Multi-task medical image analysis (survey)	Various ViT-based models	Comprehensive review of transformers in medical imaging (classification, segmentation, detection).
Xiao et al. (2023) [35]	Medical image segmentation (review)	ViT, Swin, hybrid models	Systematic analysis of transformer architectures for segmentation tasks.
Pu et al. (2024) [36]	Medical image segmentation	Transformer-based networks	Discusses advantages of transformers over CNNs in medical segmentation.
DA-TransUNet (2024) [37]	Medical image segmentation	Dual-attention TransUNet	Integrates channel and positional attention for improved feature representation.
IAP-TransUNet (2025) [38]	Medical image segmentation	Lightweight TransUNet variant	Improves efficiency using attention pyramids and depthwise convolutions.
M3-TransUNet (2025) [39]	Medical image segmentation	Multi-scale TransUNet	Enhances multi-scale feature fusion for better boundary delineation.
Krishnan et al. (2024) [40]	Brain tumor classification (MRI)	Rotation-invariant ViT	Introduces rotation invariance into ViT for robust MRI analysis.
Sankari et al. (2025) [41]	Brain tumor detection	Domain-informed ViT	Incorporates clinical domain knowledge into transformer architecture.

Table 5. Class distribution across training, validation, and test partitions of the ISIC 2019 dataset.

Diagnostic Class	Training Count	Training %	Validation Count	Validation %	Test Count	Test %
MEL (Melanoma)	3618	17.9	904	17.8	3374	41.0
NV (Nevus)	10,300	50.8	2575	50.8	2495	30.3
BCC (Basal Cell Carcinoma)	2658	13.1	665	13.1	975	11.8
AK (Actinic Keratosis)	694	3.4	173	3.4	374	4.5
BKL (Benign Keratosis)	2099	10.4	525	10.4	660	8.0
DF (Dermatofibroma)	191	0.9	48	0.9	91	1.1
VASC (Vascular)	202	1.0	51	1.0	104	1.3
SCC (Squamous Cell Carcinoma)	502	2.5	126	2.5	165	2.0
Total	20,264	100.0	5067	100.0	8238	100.0
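
To make the degree of imbalance in Table 5 concrete, the sketch below derives inverse-frequency class weights from the training counts. It uses the common "balanced" heuristic w_c = N / (K · n_c), which is an assumption made for illustration and is not necessarily the exact weighting formula used in this study. `inverse_frequency_weights` is a hypothetical helper name.

```python
# Training-set counts copied from Table 5.
train_counts = {
    "MEL": 3618, "NV": 10300, "BCC": 2658, "AK": 694,
    "BKL": 2099, "DF": 191, "VASC": 202, "SCC": 502,
}


def inverse_frequency_weights(counts):
    """w_c = N / (K * n_c): a class holding exactly N / K samples gets weight 1."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


weights = inverse_frequency_weights(train_counts)
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

# Print classes from the most down-weighted to the most up-weighted.
for name, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name:5s} weight = {w:6.3f}")
print(f"largest/smallest class ratio = {imbalance_ratio:.1f}")
```

With these counts the rarest class (DF) is weighted roughly 54 times more heavily than the most common class (NV), which illustrates how strongly a weighted loss has to rescale gradients at this level of imbalance.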

Table 6. Complete experimental results showing hyperparameter configurations and performance outcomes.

Exp ID	Patch Size	Dropout	Random Seed	Test Accuracy (%)	Val Accuracy (%)	Overfitting Gap (%)	Macro F1	Training Time (min)	Epochs
1	16	0.1	42	18.6	42.0	23.3	0.114	70.2	6
2	16	0.1	123	20.7	40.4	19.7	0.101	69.3	7
3	16	0.1	456	22.7	46.0	23.3	0.125	78.7	8
4	16	0.3	42	18.2	49.6	31.5	0.097	59.9	6
5	16	0.3	123	37.8	50.5	12.7	0.161	69.6	7
6	16	0.3	456	23.2	50.7	27.5	0.123	88.5	9
7	32	0.1	42	20.9	45.2	24.3	0.132	80.0	10
8	32	0.1	123	29.8	46.3	16.5	0.149	72.3	9
9	32	0.1	456	37.4	45.4	8.0	0.142	133.8	17
10	32	0.3	42	8.0	41.8	33.7	0.044	80.0	10
11	32	0.3	123	15.4	44.3	29.0	0.089	72.1	9
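
The summary figures quoted in the Abstract can be recomputed directly from Table 6. The following sketch takes the overfitting gap as validation accuracy minus test accuracy, which agrees with the tabulated gap column to within rounding, and uses only the Python standard library. The variable names are illustrative.

```python
from statistics import mean

# (patch size, dropout, seed, test accuracy %, validation accuracy %) from Table 6.
runs = [
    (16, 0.1, 42, 18.6, 42.0),
    (16, 0.1, 123, 20.7, 40.4),
    (16, 0.1, 456, 22.7, 46.0),
    (16, 0.3, 42, 18.2, 49.6),
    (16, 0.3, 123, 37.8, 50.5),
    (16, 0.3, 456, 23.2, 50.7),
    (32, 0.1, 42, 20.9, 45.2),
    (32, 0.1, 123, 29.8, 46.3),
    (32, 0.1, 456, 37.4, 45.4),
    (32, 0.3, 42, 8.0, 41.8),
    (32, 0.3, 123, 15.4, 44.3),
]

test_acc = [test for _, _, _, test, _ in runs]
gaps = [val - test for _, _, _, test, val in runs]

# Mean test accuracy for each patch size (main effect of patch size).
by_patch = {
    p: mean(test for patch, _, _, test, _ in runs if patch == p)
    for p in (16, 32)
}

print(f"mean overfitting gap: {mean(gaps):.1f}%")
print(f"test accuracy range: {min(test_acc)}% to {max(test_acc)}%")
for p, acc in by_patch.items():
    print(f"patch size {p}: mean test accuracy {acc:.1f}%")
```

The recomputed mean gap (22.7%) and test accuracy range (8.0% to 37.8%) match the values reported in the Abstract. The two patch sizes differ by about one percentage point in mean test accuracy, far less than the spread between seeds within a single configuration.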

Table 7. Class-wise performance analysis showing relationship between training prevalence and F1 scores.

Diagnostic Class	Training %	Mean F1 Score	Std F1	Performance Rank
NV (Nevus)	50.8	0.449	0.051	1
MEL (Melanoma)	17.9	0.181	0.153	2
BCC (Basal Cell Carcinoma)	13.1	0.109	0.084	3
AK (Actinic Keratosis)	3.4	0.077	0.041	4
SCC (Squamous Cell Carcinoma)	2.5	0.039	0.038	5
BKL (Benign Keratosis)	10.4	0.034	0.044	6
DF (Dermatofibroma)	0.9	0.023	0.033	7
VASC (Vascular)	1.0	0.017	0.028	8
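
Because macro-F1 is the unweighted mean of the class-wise F1 scores, the class means in Table 7 should average to the mean macro-F1 across the runs in Table 6. The sketch below performs that consistency check and lists the classes whose mean F1 falls below 0.05. The 0.05 threshold is the one quoted in the Abstract, and the variable names are illustrative.

```python
from statistics import mean

# (class, training prevalence %, mean F1) copied from Table 7.
class_f1 = [
    ("NV", 50.8, 0.449), ("MEL", 17.9, 0.181), ("BCC", 13.1, 0.109),
    ("AK", 3.4, 0.077), ("SCC", 2.5, 0.039), ("BKL", 10.4, 0.034),
    ("DF", 0.9, 0.023), ("VASC", 1.0, 0.017),
]

# Macro F1 column copied from Table 6 (experiments 1 to 11).
run_macro_f1 = [0.114, 0.101, 0.125, 0.097, 0.161, 0.123,
                0.132, 0.149, 0.142, 0.044, 0.089]

macro_from_classes = mean(f1 for _, _, f1 in class_f1)
macro_from_runs = mean(run_macro_f1)
low_f1_classes = [name for name, _, f1 in class_f1 if f1 < 0.05]

print(f"macro-F1 from class means: {macro_from_classes:.3f}")
print(f"mean macro-F1 over runs:   {macro_from_runs:.3f}")
print("classes with mean F1 < 0.05:", low_f1_classes)
```

Both routes give approximately 0.116, so the two tables are mutually consistent. The list of low-scoring classes includes BKL, which accounts for 10.4% of the training set, so training prevalence alone does not fully determine the ranking in Table 7.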

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Aksoy, S.; Demircioglu, P.; Bogrekci, I. Systematic Failure of Vision Transformers in Imbalanced Skin Lesion Classification. Dermato 2026, 6, 22. https://doi.org/10.3390/dermato6020022
