Figure 1.
Visual overview of the two sequential training strategies investigated in this study. Left (SST_SC): the model is first trained on segmentation (Ia), then transferred to binary classification (IIa), and finally fine-tuned for multiclass classification (IIIa). Right (SST_CS): training starts with binary classification (Ib), followed by multiclass classification (IIb), and ends with segmentation (IIIb). In both settings, the Swin backbone is shared across tasks, and task-specific heads are modularly swapped to enable sequential transfer learning.
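The head-swapping scheme in Figure 1 can be summarized in a few lines of PyTorch. The sketch below is our illustration, not the authors' code: it assumes a timm Swin backbone, and it shows only classification heads (a segmentation decoder would consume intermediate feature maps rather than pooled features, which this minimal version omits).

```python
# Hedged sketch of the shared-backbone, swappable-head scheme from Figure 1.
import timm
import torch.nn as nn

class SequentialSwin(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder kept across all sequential training stages.
        # num_classes=0 makes timm return pooled features (dim 768 for Swin-T).
        self.backbone = timm.create_model("swin_tiny_patch4_window7_224",
                                          pretrained=True, num_classes=0)
        self.head = nn.Identity()  # replaced at each stage

    def set_head(self, head: nn.Module):
        """Swap in a task-specific head while keeping backbone weights."""
        self.head = head

    def forward(self, x):
        return self.head(self.backbone(x))

model = SequentialSwin()
model.set_head(nn.Linear(768, 2))  # e.g., binary classification stage
model.set_head(nn.Linear(768, 7))  # e.g., multiclass classification stage
```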
Figure 2.
Representative skin lesion images from the HAM dataset, annotated with their respective diagnostic classes. From left to right: Nevus (NV), Benign Keratosis-like Lesions (BKL), Melanoma (MEL), Basal Cell Carcinoma (BCC), Dermatofibroma (DF), Actinic Keratosis/Intraepithelial Carcinoma (AKIEC), and Vascular Lesions (VASC).
Figure 3.
Visual comparison of segmentation results for a representative lesion from the test set, evaluated under two sequential learning configurations. Each row presents, from left to right, the original input image, the predicted segmentation map from the best-performing model, and the ground-truth mask. (a) SST_SC (Our_D): segmentation followed by classification. (b) SST_CS (Our_C): classification followed by segmentation.
Figure 4.
Confusion matrices, per-class ROC curves (with legend), and macro/weighted ROC curves (from top to bottom) for the models Our_A (left) and Our_B (right), as reported in Table 1. Each column corresponds to an experimental setup using the best validation checkpoint. In the confusion matrices (top), the true labels are on the vertical axis and the predicted labels on the horizontal axis; class indices 0–6 correspond to MEL, NV, BCC, AKIEC, BKL, DF, and VASC. The middle row displays per-class ROC curves, showing True Positive Rate (TPR, sensitivity) versus False Positive Rate (FPR), while the bottom row reports macro and weighted average ROC curves, plotting sensitivity against specificity.
Figure 5.
Confusion matrices, per-class ROC curves (with legend), and macro/weighted ROC curves (from top to bottom) for models Our_C (left) and Our_D (right), as reported in Table 1. Each column represents an experimental setup using the best validation checkpoint. In the confusion matrices (top), the true labels are on the vertical axis and the predicted labels on the horizontal axis; classes 0–6 correspond to MEL, NV, BCC, AKIEC, BKL, DF, and VASC. The middle row displays per-class ROC curves, showing True Positive Rate (TPR, sensitivity) versus False Positive Rate (FPR), while the bottom row reports macro and weighted average ROC curves, plotting sensitivity against specificity.
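The macro and weighted curves in Figures 4 and 5 aggregate the seven one-vs-rest ROC curves either uniformly (macro) or by class support (weighted). The snippet below is a minimal sketch of how such AUC summaries can be computed with scikit-learn; it is our illustration, not the authors' pipeline, and the variable names are assumptions.

```python
# Hedged sketch: one-vs-rest ROC-AUC aggregation for the 7-class HAM setup.
import numpy as np
from sklearn.metrics import roc_auc_score

CLASS_NAMES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]  # indices 0-6

def multiclass_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """y_true: (N,) integer labels in 0-6; y_score: (N, 7) softmax probabilities."""
    return {
        # Macro: unweighted mean of the per-class one-vs-rest AUCs.
        "macro": roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"),
        # Weighted: per-class AUCs weighted by class prevalence.
        "weighted": roc_auc_score(y_true, y_score, multi_class="ovr", average="weighted"),
    }
```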
Figure 6.
Grad-CAM visualizations on two representative HAM cases comparing SST_SC and SST_CS. From left to right: input image with the ground-truth mask overlaid in red for visualization purposes, predicted binary mask, and Grad-CAM maps for the two sequential configurations.
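The maps in Figure 6 follow the standard Grad-CAM procedure. Since the text does not name the tooling, the sketch below assumes the pytorch-grad-cam package; the reshape transform is required because Swin blocks emit token sequences rather than spatial feature maps, and the target-layer choice is illustrative.

```python
# Hedged sketch: Grad-CAM on a Swin backbone via pytorch-grad-cam (assumed tooling).
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def swin_reshape(tensor, height=7, width=7):
    # Swin blocks output (B, H*W, C); Grad-CAM expects (B, C, H, W).
    result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)

def gradcam_map(model, image: torch.Tensor, class_idx: int):
    """image: (1, 3, 224, 224) normalized tensor; returns a (224, 224) heatmap."""
    target_layers = [model.layers[-1].blocks[-1].norm1]  # illustrative layer choice
    cam = GradCAM(model=model, target_layers=target_layers,
                  reshape_transform=swin_reshape)
    return cam(input_tensor=image, targets=[ClassifierOutputTarget(class_idx)])[0]
```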
Figure 7.
Temporal evolution of Grad-CAM activations for the two representative HAM cases shown in Figure 6. Each row corresponds to one lesion. From left to right: input with ground-truth mask overlay, followed by Grad-CAM maps extracted at early, mid, late, and best training epochs for the SST_SC configuration.
Figure 8.
t-SNE projections of the latent features on the external HAMt test set. (Left) Swin Transformer trained for classification only. (Right) Sequential Swin Transformer (SST) trained with the classification–segmentation order. Each color corresponds to one of the seven diagnostic classes.
Figure 9.
t-SNE projections of multiclass feature embeddings on the HAMt validation set at different training stages of the SST model. As training progresses, clusters become more compact and inter-class separation increases, confirming the effectiveness of sequential transfer learning in shaping the latent space.
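The projections in Figures 8 and 9 are standard t-SNE embeddings of the pooled backbone features. As the hyperparameters are not restated here, the sketch below is an assumed scikit-learn configuration rather than the authors' exact setup.

```python
# Hedged sketch: 2-D t-SNE projection of pooled backbone embeddings.
import numpy as np
from sklearn.manifold import TSNE

def project_features(features: np.ndarray) -> np.ndarray:
    """features: (N, D) backbone embeddings; returns (N, 2) coordinates to plot."""
    return TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)
```

Coloring the resulting coordinates by the seven diagnostic classes reproduces plots of this kind.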
Figure 10.
Representative gastrointestinal images from the Kvasir dataset, each annotated with its corresponding class label.
Figure 11.
Confusion matrices, per-class ROC curves (with legend), and macro/weighted ROC curves (from top to bottom) for the models Our_E (left column) and Our_F (right column), as reported in Table 6. Each column corresponds to the respective experimental setup, based on the best-performing checkpoint on the validation set. In the confusion matrices (top row), true labels are shown on the vertical axis and predicted labels on the horizontal axis; class indices follow the Kvasir class labels shown in Figure 10. The ROC curves in the middle row display the True Positive Rate (TPR, sensitivity) on the y-axis versus the False Positive Rate (FPR) on the x-axis for each class. In the bottom row, macro and weighted average ROC curves are reported, with the y-axis representing sensitivity and the x-axis representing specificity.
Table 1.
Benchmark results of the SST model on the HAM dataset. TA denotes test accuracy, with the Dataset column distinguishing between HAM (full dataset) and HAMt (external test set). Jaccard and Dice scores evaluate segmentation performance, while TAb and TAm report binary and multiclass classification accuracy, respectively. TPm, TRm, and TF1m correspond to multiclass precision, recall, and F1 score. All values are averaged over five runs.
| Author | Dataset | Jaccard (%) | Dice (%) | TAb (%) | TAm (%) | TPm (%) | TRm (%) | TF1m (%) |
|---|---|---|---|---|---|---|---|---|
| Our_A | HAM | | | | | | | |
| Our_B | HAM | | | | | | | |
| Our_C | HAM+HAMt | | | | | | | |
| Our_D | HAM+HAMt | | | | | | | |
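For reference, TPm, TRm, and TF1m follow the standard per-class definitions below, averaged across the seven classes (the caption does not restate the averaging scheme):

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
$$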
Table 2.
Centroid-based quantification for the segmentation-followed-by-classification configuration. The reported distances are computed on the validation set at representative training epochs.
| Epoch | Mean Inter-Class Distance | Closest Class Pair | Minimum Distance |
|---|---|---|---|
| 1 | 28.64 | AKIEC–DF | 8.05 |
| 6 | 44.98 | AKIEC–BCC | 13.97 |
| 18 | 47.01 | AKIEC–BCC | 18.50 |
| 32 | 53.42 | BKL–DF | 22.57 |
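The distances in Table 2 can be reproduced from per-sample embeddings with a simple centroid computation. The sketch below is a minimal illustration assuming the validation-set features (e.g., the 2-D t-SNE coordinates of Figure 9) are available as a NumPy array; all names are ours.

```python
# Hedged sketch: mean inter-class centroid distance and closest class pair.
import numpy as np
from itertools import combinations

def centroid_statistics(embeddings: np.ndarray, labels: np.ndarray, class_names):
    """embeddings: (N, D) features; labels: (N,) integer class ids."""
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}
    dists = {
        (a, b): float(np.linalg.norm(centroids[a] - centroids[b]))
        for a, b in combinations(sorted(centroids), 2)
    }
    closest = min(dists, key=dists.get)
    return {
        "mean_inter_class_distance": float(np.mean(list(dists.values()))),
        "closest_pair": (class_names[closest[0]], class_names[closest[1]]),
        "minimum_distance": dists[closest],
    }
```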
Table 3.
Segmentation benchmark on the HAM dataset. All values are percentages. Asterisks (*) denote values as originally reported. The Jaccard index refers to the Intersection over Union (IoU) metric.
| Method | Jaccard (%) | Dice (%) |
|---|---|---|
| GAN-UNET [15] | 77.00 | 85.20 |
| SkinSAM [19] | 78.43 * | 88.79 * |
| DenseNet121-UNET [15] | 83.50 | 89.70 |
| DeepLabV3+ [15] | 82.80 | 89.80 |
| CA-Net [16] | – | 92.08 |
| BAT [18] | 84.30 | 92.10 |
| Polar Image Transformation [41] | 87.43 | 92.53 |
| SST_SC (Our_A) | | |
| SST_CS (Our_B) | | |
| SST_CS (HAM+HAMt, Our_C) | | |
| SST_SC (HAM+HAMt, Our_D) | | |
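Both segmentation metrics are overlap measures between the predicted mask $P$ and the ground-truth mask $G$:

$$
\mathrm{Jaccard}(P, G) = \frac{|P \cap G|}{|P \cup G|}, \qquad \mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}
$$

Per image the two are monotonically related ($\mathrm{Dice} = 2J/(1 + J)$), although dataset-level averages can rank methods differently, as the DenseNet121-UNET and DeepLabV3+ rows above illustrate.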
Table 4.
Multiclass classification benchmark results on the HAM dataset (HAM_b denotes the use of HAM as a binary dataset). The “Split” column indicates the dataset partitioning into training, validation, and test sets. For entries with parentheses (e.g., “80 (90/10) – 20”), the portion inside represents the training/validation split, while the value outside refers to the test set. An asterisk (*) indicates that no test set was used or reported. TA_multiclass (%) reports the accuracy for multiclass classification.
| Author | Method | Dataset | Split | TA_Multiclass (%) |
|---|---|---|---|---|
| Mushtaq et al. [43] | Ensemble VGG16 | HAM | 80–20 + 15% non-duplicate in test | 89 |
| Jain et al. [44] | TL on Xception | HAM | 80 (90/10) – 20 | 89.66 |
| Himel et al. [33] | SAM + ViT (binary–multiclass transfer) | HAM_b | 80–20 * | 91.80 |
| Chaturvedi et al. [45] | InceptionV3 | HAM | 88–12 * | 91.56 |
| Manzoor et al. [32] | VGG16–UNet + EfficientFormerV2/SwiftFormer | HAM | 80–20 * | 92.50 |
| Chaturvedi et al. [45] | InceptionResNetV2 | HAM | 88–12 * | 93.20 |
| Shetty et al. [46] | CNN models | HAM | 80–20 | 95.18 |
| Aladhadh et al. [42] | Medical Vision Transformer | HAM | 70–20–10 | 96.14 |
| Lan et al. [26] | FixCaps | HAM | 85–15 | 96.49 |
| Exp1 | FixCaps | HAM+HAMt | 80 + HAMt | 76.19 |
| Exp2 | FixCaps | HAM+HAMt | 80 + HAMt | 75.28 |
| Exp3 | FixCaps | HAM+HAMt | 80 + HAMt | 75.60 |
| Our_A | SST_SC | HAM | 75–15–10 | |
| Our_B | SST_CS | HAM | 75–15–10 | |
| Gallazzi et al. [4] | Swin Transformer | Large Dataset | 80–20 + HAMt | 86.37 |
| Our_C | SST_CS | HAM+HAMt | 80–20 + HAMt | |
| Our_D | SST_SC | HAM+HAMt | 80–20 + HAMt | |
Table 5.
Comparative results between our previous preprocessing-based pipeline [6], which used segmentation outputs as inputs to classification, and the proposed sequential SST models trained on HAM and tested on HAMt. The first three rows report the baseline strategy, while SST_CS and SST_SC represent the joint sequential learning configurations from Table 1. Metrics include segmentation performance (Jaccard %) and multiclass classification results: accuracy (TAm %), precision (TPm %), recall (TRm %), and F1 score (TF1m %).
| Author | Method | Jaccard (%) | TAm (%) | TPm (%) | TRm (%) | TF1m (%) |
|---|---|---|---|---|---|---|
| Gallazzi et al. [6] | HAM+YOLO | 77.00 | 84.01 | 84.09 | 84.12 | 83.53 |
| Gallazzi et al. [6] | HAM+DeepLabV3 | 81.66 | 84.12 | 84.38 | 84.12 | 83.75 |
| Gallazzi et al. [6] | HAM+ST | 82.75 | 84.13 | 84.41 | 84.12 | 83.58 |
| Our_C | SST_CS | | | | | |
| Our_D | SST_SC | | | | | |
Table 6.
Overall benchmark results of the SST model on the Kvasir datasets. TA denotes test accuracy. Jaccard and Dice scores evaluate segmentation performance, while TAb and TAm indicate binary and multiclass classification accuracy. TPm, TRm, and TF1m represent multiclass precision, recall, and F1 score.
| Author | Jaccard (%) | Dice (%) | TAb (%) | TAm (%) | TPm (%) | TRm (%) | TF1m (%) |
|---|---|---|---|---|---|---|---|
| Our_E | | | | | | | |
| Our_F | | | | | | | |
Table 7.
Segmentation benchmark on the KvasirS dataset. Evaluation based on Jaccard and Dice metrics.
| Method | Jaccard (%) | Dice (%) |
|---|---|---|
| DUCK-net [35] | 90.51 | 95.02 |
| EffiSegNet-B4 [36] | 90.56 | 94.83 |
| EffiSegNet-B6 [36] | 90.60 | 94.77 |
| EffiSegNet-B5 [36] | 90.65 | 94.88 |
| SST_CS (Our_E) | | |
| SST_SC (Our_F) | | |
Table 8.
Multiclass classification accuracy on the KvasirC dataset. TA_multiclass (%) indicates classification accuracy.
| Method | TA_multiclass (%) |
|---|---|
| Multi-model classification [53] | 90.20 |
| Single Shot MultiBox Detector [54] | 90.40 |
| Transfer Learning framework [55] | 93.00 |
| Deep CNN-based SAM [56] | 93.19 |
| Spatial-attention ConvMixer [37] | 93.37 |
| SST_CS (Our_E) | |
| SST_SC (Our_F) | |