Abstract
Early and reliable detection of skin cancer is critical for improving patient outcomes and minimizing diagnostic uncertainty in dermatological practice. This study proposes an interpretable hybrid framework that integrates ConvMixer-based deep feature extraction with gradient boosting classifiers to perform multi-class skin lesion classification on the publicly available PAD-UFES-20 dataset. The dataset contains 2298 dermoscopic and clinical images with associated patient metadata (age, gender, and anatomical site), enabling a joint evaluation of demographic and anatomical factors influencing model performance. After data augmentation, normalization, and class balancing using Borderline-SMOTE, image embeddings extracted via ConvMixer were integrated with patient metadata and subsequently classified using CatBoost, XGBoost, and LightGBM. Among these, CatBoost achieved the highest macro-AUC of 0.94 and macro-F1 of 0.88, with a melanoma sensitivity of 0.91, while maintaining good calibration (Brier score = 0.06). Grad-CAM and SHAP analyses confirmed that the model’s attention and feature importance correspond to clinically relevant lesion regions and attributes. The results highlight that age and body-region imbalances in the PAD-UFES-20 dataset modestly influence predictive behavior, emphasizing the importance of balanced sampling and stratified validation. Overall, the proposed ConvMixer–CatBoost framework provides a compact, explainable, and generalizable solution for AI-assisted skin cancer classification.
1. Introduction
Skin cancer remains one of the most prevalent malignancies worldwide and poses a significant public health challenge []. According to recent epidemiological estimates, non-melanoma skin cancers account for nearly one-third of all diagnosed malignancies globally [], while melanoma—although less frequent—contributes disproportionately to skin cancer–related mortality []. Early and accurate diagnosis is therefore critical for improving prognosis and enabling timely treatment intervention. However, visual diagnosis by dermatologists is subject to inter-observer variability and depends heavily on clinical experience, lighting conditions, and image quality [].
In recent years, advances in deep learning [] have enabled powerful image-based diagnostic models for medical imaging [,]. Convolutional neural networks (CNNs) [] have demonstrated expert-level performance in recognizing melanoma and other skin lesions []. However, their real-world deployment remains limited by data imbalance, insufficient transparency, and poor generalizability across demographic variations. Traditional CNN architectures also demand extensive computational resources and large-scale annotated datasets, which are not always available in real-world hospital settings. To address these challenges, hybrid frameworks that combine compact CNN feature extractors with interpretable machine learning classifiers have emerged as a promising direction.
Recent studies have explored several architectures, such as EfficientNet [], ResNet [], and Vision Transformers [], for diverse image-recognition tasks, including lesion analysis; however, relatively few studies have investigated the potential of lightweight CNNs such as ConvMixer when combined with ensemble learning methods []. Boosting algorithms, including XGBoost [], LightGBM [], and CatBoost [], provide robust and scalable alternatives for tabular or mixed-feature inputs, offering explainability through feature importance while maintaining high predictive accuracy. Nevertheless, their integration with CNN-derived representations in dermatology remains underexplored.
The present study investigates an explainable hybrid framework that fuses ConvMixer-based deep features with boosting algorithms to classify six distinct categories of skin lesions using the publicly available PAD-UFES-20 dataset. Unlike previous work focused solely on image-based features, this study leverages both visual and patient-level metadata (e.g., gender, age, anatomical site, and clinical history) to improve classification robustness and interpretability. PAD-UFES-20 was selected due to its diverse inclusion of demographic and environmental factors, allowing a comprehensive examination of potential bias sources such as age, gender, and body region.
The main contributions of this research are as follows:
- We present a reproducible preprocessing pipeline with patient-level data splitting, data augmentation, and Borderline-SMOTE balancing to address class imbalance and data leakage issues.
- We develop a compact ConvMixer-based feature extractor integrated with XGBoost, LightGBM, and CatBoost classifiers, systematically comparing their performance and calibration.
- We analyze the demographic and anatomical characteristics of PAD-UFES-20 to interpret how gender, body region, and age distributions influence model performance.
- We provide an interpretable analysis using Grad-CAM and SHAP visualizations, linking model attention regions and feature importance to clinically relevant attributes.
2. Materials and Methods
The proposed study follows a structured workflow integrating data preprocessing, ConvMixer-based feature extraction, and classification using gradient boosting algorithms. An overview of the complete process is illustrated in Figure 1, summarizing all steps from dataset collection to model evaluation.
Figure 1.
End-to-end workflow: data acquisition, preprocessing (incl. Borderline-SMOTE), ConvMixer feature extraction, boosting-based classification, and evaluation.
The figure highlights the end-to-end pipeline used in this work, beginning with dataset characterization and concluding with model evaluation. Each methodological stage is detailed in the following subsections.
2.1. Dataset Description
The PAD-UFES-20 dataset [] comprises 2298 dermoscopic and clinical images collected from Brazilian public hospitals, accompanied by patient metadata in CSV format. Each record includes demographic attributes (age, gender), clinical factors (anatomical site, cancer history, skin cancer history), and environmental variables (smoking, drinking, pesticide exposure, Fitzpatrick skin type). The dataset covers six diagnostic classes: actinic keratoses (AKIEC), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanoma (MEL), and melanocytic nevi (NV). Representative metadata and sample lesion images are shown in Figure 2.
Figure 2.
Overview of the PAD-UFES-20 dataset used in this study: (a) selected metadata fields and sample records, (b) representative lesion images for six diagnostic categories.
2.2. Data Preprocessing and Augmentation
All images were resized to a fixed input resolution and normalized to the [0, 1] range for computational consistency. Standard augmentations, including horizontal and vertical flips, rotations, and brightness variations, were applied exclusively to the training partition to enhance generalization. To address class imbalance, Borderline-SMOTE was applied only on training data, while validation and test sets were kept intact. Numeric features were standardized using z-score normalization, and categorical variables were label-encoded for compatibility with gradient boosting algorithms.
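For illustration, a minimal sketch of the tabular-side preprocessing (z-scoring of numeric features, label encoding of categorical features, and Borderline-SMOTE applied to the training partition only) is given below; the arrays and variable names are synthetic placeholders rather than the exact pipeline used in this study.

```python
# Sketch of the preprocessing described above, assuming scikit-learn and
# imbalanced-learn; all data below are synthetic stand-ins.
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.preprocessing import StandardScaler, LabelEncoder

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(300, 256))            # stand-in for ConvMixer image features
age = rng.integers(20, 90, size=300).astype(float)  # numeric metadata
site = rng.choice(["face", "chest", "back"], 300)   # categorical metadata
labels = rng.choice(6, size=300, p=[0.30, 0.25, 0.15, 0.12, 0.10, 0.08])  # imbalanced classes

age_std = StandardScaler().fit_transform(age.reshape(-1, 1))   # z-score normalization
site_enc = LabelEncoder().fit_transform(site).reshape(-1, 1)   # label encoding

X_train = np.hstack([embeddings, age_std, site_enc])
X_bal, y_bal = BorderlineSMOTE(random_state=42).fit_resample(X_train, labels)
# Validation and test partitions are never passed through the resampler.
```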
2.3. Data Partitioning and Validation Strategy
All preprocessing and resampling steps were performed strictly within the training folds to avoid information leakage. A patient-level split of 70% training, 10% validation, and 20% test data was implemented, ensuring that no image from the same patient appeared in multiple partitions. Model reliability was verified through five-fold repeated stratified cross-validation, and results were reported as mean ± standard deviation across folds.
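A hedged sketch of this grouped splitting strategy is shown below, using scikit-learn's GroupShuffleSplit and StratifiedGroupKFold; the column names and synthetic records are illustrative assumptions, not the actual PAD-UFES-20 loader.

```python
# Patient-level 70/10/20 split plus grouped, stratified cross-validation (sketch).
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, StratifiedGroupKFold

rng = np.random.default_rng(0)
meta = pd.DataFrame({
    "patient_id": rng.integers(0, 120, size=600),   # several images per patient
    "diagnostic": rng.choice(["BCC", "NV", "MEL", "AKIEC", "BKL", "DF"], size=600),
})

# 80/20 grouped split by patient, then 12.5% of the 80% held out as validation (~10% overall).
outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
trainval_idx, test_idx = next(outer.split(meta, groups=meta["patient_id"]))
trainval = meta.iloc[trainval_idx]
inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=42)
train_rel, val_rel = next(inner.split(trainval, groups=trainval["patient_id"]))

# Five-fold stratified, grouped cross-validation for model selection.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for tr_rel, va_rel in cv.split(trainval, trainval["diagnostic"], groups=trainval["patient_id"]):
    tr_patients = set(trainval.iloc[tr_rel]["patient_id"])
    va_patients = set(trainval.iloc[va_rel]["patient_id"])
    assert tr_patients.isdisjoint(va_patients)       # no patient appears in both folds
```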
2.4. Feature Extraction Using ConvMixer
The ConvMixer architecture was used as a lightweight convolutional feature extractor. It employs a patch-embedding layer followed by alternating depthwise and pointwise convolutional blocks enhanced with residual connections and GELU activations. Each input image produced a 256-dimensional feature vector encapsulating spatial and textural characteristics. These learned features were concatenated with patient metadata (age, gender, body region) to form a multimodal feature representation for subsequent classification.
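The following PyTorch sketch illustrates a ConvMixer-style extractor of the kind described above; the hyperparameters (depth, kernel size, patch size) are illustrative assumptions and may differ from the exact configuration used in this study.

```python
# Compact ConvMixer-style feature extractor: patch embedding followed by
# alternating depthwise/pointwise convolutions with residual connections and GELU.
import torch
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x                          # residual around the depthwise mixing

def convmixer_features(dim=256, depth=8, kernel_size=9, patch_size=7):
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),   # patch embedding
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),  # depthwise
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            nn.Conv2d(dim, dim, kernel_size=1),                                 # pointwise
            nn.GELU(),
            nn.BatchNorm2d(dim))
          for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),                                  # -> 256-dimensional embedding per image
    )

extractor = convmixer_features()
with torch.no_grad():
    feats = extractor(torch.randn(4, 3, 224, 224))     # shape (4, 256)
# These embeddings are then concatenated with the encoded metadata (age, gender, site).
```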
2.5. Classification Using Gradient Boosting Models
Three boosting algorithms were compared—XGBoost, LightGBM, and CatBoost—due to their proven robustness and scalability. Hyperparameters such as learning rate (0.01–0.1), maximum tree depth (4–10), and number of estimators (100–500) were optimized via grid search with early stopping based on validation AUC. Among them, CatBoost demonstrated superior handling of categorical features and better calibration, yielding the most balanced performance across all lesion classes.
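As an illustration of this model-selection procedure, the sketch below runs a small grid search for CatBoost with early stopping on a held-out validation set and selects the configuration with the highest macro one-vs-rest AUC; the synthetic data and exact grid values are assumptions for demonstration only.

```python
# Grid search over learning rate, depth, and number of estimators with early stopping (sketch).
import numpy as np
from itertools import product
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(400, 60)), rng.integers(0, 6, 400)   # stand-in features/labels
X_va, y_va = rng.normal(size=(100, 60)), rng.integers(0, 6, 100)

best_auc, best_params = -np.inf, None
for lr, depth, n_est in product([0.01, 0.05, 0.1], [4, 6, 10], [100, 300, 500]):
    model = CatBoostClassifier(learning_rate=lr, depth=depth, iterations=n_est,
                               loss_function="MultiClass", random_seed=42, verbose=False)
    model.fit(X_tr, y_tr, eval_set=(X_va, y_va), early_stopping_rounds=30)
    auc = roc_auc_score(y_va, model.predict_proba(X_va), multi_class="ovr", average="macro")
    if auc > best_auc:
        best_auc = auc
        best_params = {"learning_rate": lr, "depth": depth, "iterations": n_est}
print(best_params, round(best_auc, 3))
```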
2.6. Evaluation Metrics and Implementation Details
Model performance was assessed using precision, recall, F1-score, and area under the ROC curve (AUC). Macro/micro AUC values summarized discrimination across six classes. Calibration (Brier score) and interpretability (SHAP, Grad-CAM) were also evaluated. All experiments were implemented in Python 3.11 using scikit-learn, CatBoost, XGBoost, and LightGBM on an NVIDIA RTX 3070 GPU (NVIDIA Corporation, Santa Clara, CA, USA), ensuring reproducibility via fixed seeds and patient-level splits.
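The reported metrics can be computed as in the following sketch (macro-F1, macro one-vs-rest AUC, and a common multi-class generalization of the Brier score); the arrays here are synthetic stand-ins for the model outputs.

```python
# Evaluation-metric sketch: macro-F1, macro one-vs-rest AUC, and a multi-class Brier score.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(1)
y_true = rng.integers(0, 6, 200)
proba = rng.dirichlet(np.ones(6), size=200)          # stand-in predicted probabilities
y_pred = proba.argmax(axis=1)

macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_auc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")

# Multi-class Brier score: mean squared error between one-hot targets and probabilities.
onehot = label_binarize(y_true, classes=np.arange(6))
brier = np.mean(np.sum((proba - onehot) ** 2, axis=1))
print(f"macro-F1={macro_f1:.3f}  macro-AUC={macro_auc:.3f}  Brier={brier:.3f}")
```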
3. Results
All experiments were conducted on the PAD-UFES-20 dataset using a patient-level 70/10/20 train–validation–test split repeated over five folds. The best-performing configuration was selected on the validation folds and evaluated on the unseen test set.
3.1. Dataset Characteristics
The PAD-UFES-20 dataset comprises dermoscopic and clinical images acquired from patients in public hospitals in Brazil. Figure 3 summarizes the demographic and anatomical distribution of the images. Basal cell carcinoma (BCC) and melanocytic nevi (NV) are the most prevalent categories, accounting for over half of all cases, while melanoma (MEL) remains the least represented. Panel (a) shows that lesion occurrence is comparable across genders, with a slightly higher prevalence of BCC in females. Panel (b) indicates that most lesions appear on the face, chest, and back—consistent with sun-exposed body regions. Panel (c) illustrates that lesion frequency increases with age, peaking in the 50–70-year range. These observations confirm the intrinsic class and demographic imbalance of PAD-UFES-20, justifying the use of Borderline-SMOTE and weighted loss adjustments in subsequent experiments.
Figure 3.
PAD-UFES-20 distributions by gender, body region, and age group, illustrating class and demographic imbalance: (a) number of patients per cancer type by gender; (b) number of patients per cancer type by body region; (c) distribution of skin cancer types across age groups (10-year intervals).
3.2. Overall Classification Performance
The proposed ConvMixer–Boosting framework demonstrated consistent improvements across all major evaluation metrics. Among the three algorithms, CatBoost yielded the best overall results with a macro-AUC of 0.94, macro-F1 of 0.88, and melanoma (MEL) sensitivity of 0.91. XGBoost and LightGBM followed closely with macro-AUC values of 0.91 and 0.90, respectively. The improvement was statistically significant under paired t-tests across cross-validation folds, as summarized in Table 1.
Table 1.
Summary of overall performance of boosting classifiers using ConvMixer features (test set, patient-level split).
3.3. Per-Class Results
Table 2 lists the per-class precision, recall, F1-score, and AUC for the best model (CatBoost). The highest scores were observed for melanoma (AUC 0.96, F1 0.92) and benign keratosis (AUC 0.95), confirming effective separation of malignant and benign lesions.
Table 2.
Per-class metrics for CatBoost (patient-level test split).
Representative classification outcomes for the three boosting algorithms are presented in Figure 4. The examples illustrate both correct and incorrect predictions across common lesion categories. Correct classifications are marked in green, while misclassifications are shown in red. The CatBoost model demonstrates visibly fewer false predictions, particularly in the differentiation between melanocytic nevi (NV) and basal cell carcinoma (BCC), which were often confused by XGBoost and LightGBM. This qualitative comparison reinforces the quantitative findings reported in Table 2.
Figure 4.
Representative qualitative classification outcomes for the three boosting models on the PAD-UFES-20 test set. Each panel shows correctly and incorrectly classified lesion examples, with predicted and true labels indicated above each image (green = correct, red = misclassified). The models compared are (a) CatBoost, (b) XGBoost, and (c) LightGBM, all trained on ConvMixer-derived features. The figure illustrates inter-model differences in visual decision patterns and highlights CatBoost’s comparatively higher consistency across lesion categories.
3.4. Effect of Class-Balancing and Validation Strategy
To mitigate class imbalance, several strategies were tested, including class weighting, random under-sampling, and Borderline-SMOTE applied only on the training set. Borderline-SMOTE provided the highest and most stable performance (macro-AUC 0.94 ± 0.01) with minimal variance across folds (<1.2%).
3.5. Interpretability and Calibration
The interpretability analysis revealed strong alignment between learned representations and dermatological features. Grad-CAM visualizations highlighted attention toward lesion centers and border irregularities, whereas SHAP analysis identified color, texture, and asymmetry as the most influential attributes. The CatBoost model was well calibrated (Brier score = 0.06), indicating reliable probability estimates for decision-support applications.
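As an illustration of how the per-class SHAP attributions can be obtained for a fitted CatBoost model, the sketch below uses CatBoost's built-in SHAP-value computation; the feature matrix, class index, and feature names are placeholders rather than the study's actual inputs.

```python
# Per-class SHAP attributions for a multimodal feature vector (ConvMixer dims + metadata).
import numpy as np
from catboost import CatBoostClassifier, Pool

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 259))                  # 256 image dims + 3 metadata dims (stand-ins)
y = rng.integers(0, 6, 300)
feature_names = [f"conv_{i}" for i in range(256)] + ["age", "gender", "body_region"]

model = CatBoostClassifier(iterations=150, depth=4, loss_function="MultiClass",
                           random_seed=42, verbose=False).fit(X, y)

# For multi-class models the returned array has shape (n_samples, n_classes, n_features + 1),
# with the last column holding the expected value (bias) term.
shap_vals = model.get_feature_importance(Pool(X, y), type="ShapValues")
mel_class = 4                                    # illustrative index for the melanoma class
mean_abs = np.abs(shap_vals[:, mel_class, :-1]).mean(axis=0)
top10 = np.argsort(mean_abs)[::-1][:10]
print([feature_names[i] for i in top10])         # features most influential for that class
```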
4. Discussion
The experimental results demonstrate that combining ConvMixer-derived features with gradient boosting classifiers provides a robust and explainable approach for skin lesion classification on the PAD-UFES-20 dataset. This section interprets the observed trends in relation to the underlying dataset characteristics and model behavior, addressing both quantitative outcomes and clinical implications.
4.1. Dataset Influence and Model Performance
The PAD-UFES-20 dataset displays notable class and demographic imbalances, with BCC and NV being dominant and MEL and DF underrepresented. In the absence of balancing, XGBoost and LightGBM exhibited lower sensitivity toward the rarer classes. Applying Borderline-SMOTE on training folds improved macro-AUC by 3–5% and increased MEL sensitivity. Across models, CatBoost achieved the best trade-off (macro-AUC 0.94; macro-F1 0.88), likely due to its handling of categorical encodings and regularization. Residual errors, concentrated in morphologically similar pairs (e.g., NV ↔ BCC; AKIEC ↔ SCC), are consistent with overlapping visual cues.
4.2. Demographic Insights and Clinical Relevance
Performance was comparable across genders (≤2% deviation), suggesting no systematic gender bias. Accuracy peaked in the 40–70 age group—the densest region of the data—while sparser youth/elderly groups showed modest declines, indicating a need for broader sampling. Grad-CAM emphasized lesion centers and border irregularities, and SHAP highlighted color/texture and metadata signals, aligning model focus with dermatological heuristics. Calibration (Brier = 0.06) supports use in triage or second-opinion settings where probability reliability matters.
4.3. Limitations and Future Work
Despite these encouraging findings, several limitations remain. First, PAD-UFES-20 is geographically and demographically constrained to Brazilian hospital populations, limiting generalizability to other skin tones and imaging environments. Second, the dataset contains an uneven distribution of lesion sites and lighting conditions that could bias the model toward common regions. Third, external validation on independent datasets and prospective testing in real-world settings are required before clinical translation. Future work will address these limitations through multi-center collaborations, inclusion of under-represented phenotypes, and comparative assessments.
4.4. Summary of Implications
Overall, the integration of ConvMixer feature representations with CatBoost classification yielded a strong balance between accuracy, interpretability, and calibration. The results emphasize the importance of controlling for dataset biases (gender, region, age) and validating models across diverse patient cohorts. Such methodological rigor is crucial for the responsible adoption of AI-based diagnostic support tools in dermatology.
5. Conclusions
This study shows that integrating ConvMixer-based feature extraction with gradient boosting classifiers—especially CatBoost—produces a robust and interpretable framework for multi-class skin lesion classification on the PAD-UFES-20 dataset. The approach achieved strong macro-AUC and melanoma sensitivity while maintaining reliable calibration and close alignment with dermatological feature relevance. Analysis of demographic and anatomical distributions confirmed that dataset imbalance—especially in age and lesion location—can influence predictive behavior and must be addressed through balanced sampling and stratified validation. Overall, the findings highlight the promise of explainable boosting frameworks for clinical decision support in dermatology and underscore the need for future multi-center validation across diverse populations.
Author Contributions
Conceptualization, D.J.; methodology, D.J.; software, D.J.; validation, R.H.A.; formal analysis, U.A.; writing—original draft preparation, R.H.A., D.J. and H.I.; writing—review and editing, R.H.A. and U.A.; supervision, R.H.A. and T.A.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original data presented in this study are openly available in the Skin Cancer (PAD-UFES-20) data card on Kaggle at https://www.kaggle.com/datasets/mahdavi1202/skin-cancer (accessed on 29 October 2025).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Roky, A.H.; Islam, M.M.; Ahasan, A.M.F.; Mostaq, M.S.; Mahmud, M.Z.; Amin, M.N.; Mahmud, M.A. Overview of Skin Cancer Types and Prevalence Rates Across Continents. Cancer Pathog. Ther. 2025, 3, 89–100. [Google Scholar] [CrossRef] [PubMed]
- Huang, S.; Jiang, J.; Wong, H.; Zhu, P.; Ji, X.; Wang, D. Global Burden and Prediction Study of Cutaneous Squamous Cell Carcinoma from 1990 to 2030: A Systematic Analysis and Comparison with China. J. Glob. Health 2024, 14, 04093. [Google Scholar] [CrossRef] [PubMed]
- Zhou, L.; Zhong, Y.; Han, L.; Xie, Y.; Wan, M. Global, Regional, and National Trends in the Burden of Melanoma and Non-Melanoma Skin Cancer: Insights from the Global Burden of Disease Study 1990–2021. Sci. Rep. 2025, 15, 5996. [Google Scholar] [CrossRef] [PubMed]
- Kibriya, H.; Siddiqa, A.; Khan, W.Z. Melanoma Lesion Localization Using UNet and Explainable AI. Neural Comput. Appl. 2025, 37, 10175–10196. [Google Scholar] [CrossRef]
- Yang, G.; Luo, S.; Greer, P. Advancements in Skin Cancer Classification: A Review of Machine Learning Techniques in Clinical Image Analysis. Multimed. Tools Appl. 2025, 84, 9837–9864. [Google Scholar] [CrossRef]
- Babar, W.; Ali, R.H.; Faheem, A.; Mansoor, S.A. Using Convolutional Neural Networks for Enhanced Pneumonia Detection via Chest X-Rays. In Proceedings of the 2024 International Conference on IT and Industrial Technologies (ICIT), Islamabad, Pakistan, 10–12 December 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Ishaq, M.H.; Ali, R.H.; Koutaly, R.; Khan, T.A.; Ahmad, I. Enhanced Biometric Security through Infrared Vein Pattern Recognition. In Proceedings of the 2025 International Conference on Innovation in Artificial Intelligence and Internet of Things (AIIT), Berlin, Germany, 29–30 September 2025; pp. 1–6. [Google Scholar] [CrossRef]
- Khan, A.; Ali, R.H.; Akmal, U.; Ramazan, A. ASL Recognition Using Deep Learning Algorithms. In Proceedings of the 2024 International Conference on IT and Industrial Technologies (ICIT), Islamabad, Pakistan, 10–12 December 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Singh, J.; Sandhu, J.K.; Kumar, Y. An Analysis of Detection and Diagnosis of Different Classes of Skin Diseases Using Artificial Intelligence-Based Learning Approaches with Hyper Parameters. In Archives of Computational Methods in Engineering; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1051–1078. [Google Scholar] [CrossRef]
- Van Thanh, H.; Quang, N.D.; Phuong, T.M.; Jo, K.-H.; Hoang, V.-D. A Compact Version of EfficientNet for Skin Disease Diagnosis Application. Neurocomputing 2025, 620, 129166. [Google Scholar] [CrossRef]
- Hassanie, S.; Gohar, A.; Ali, R.H.; Khan, T.A.; Ahmed, I.; Faiz, S. A Scalable AI Approach to Bird Species Identification for Conservation and Ecological Planning. IEEE Access 2025, 13, 159859–159871. [Google Scholar] [CrossRef]
- Krishna, G.S.; Supriya, K.; Sorgile, M. LesionAid: Vision Transformers-Based Skin Lesion Generation and Classification—A Practical Review. Multimed. Tools Appl. 2025, 84, 41405–41442. [Google Scholar] [CrossRef]
- Fırat, H.; Üzen, H. DXDSENet-CM Model: An Ensemble Learning Model Based on Depthwise Squeeze-and-Excitation ConvMixer Architecture for the Classification of Multi-Class Skin Lesions. Multimed. Tools Appl. 2025, 84, 9903–9938. [Google Scholar] [CrossRef]
- Ileri, K. Comparative Analysis of CatBoost, LightGBM, XGBoost, RF, and DT Methods Optimised with PSO to Estimate the Number of k-Barriers for Intrusion Detection in Wireless Sensor Networks. Int. J. Mach. Learn. Cybern. 2025, 16, 6937–6956. [Google Scholar] [CrossRef]
- Jain, E.; Singh, A. Optimizing Gradient Boosting Algorithms for Obesity Risk Prediction: A Comparative Analysis of XGBoost, LightGBM, and CatBoost Models. In Proceedings of the 2024 International Conference on Cybernation and Computation (CYBERCOM), New Delhi, India, 15–16 November 2024; pp. 320–324. [Google Scholar] [CrossRef]
- Zhang, L.; Jánošík, D. Enhanced Short-Term Load Forecasting with Hybrid Machine Learning Models: CatBoost and XGBoost Approaches. Expert Syst. Appl. 2024, 241, 122686. [Google Scholar] [CrossRef]
- Pacheco, A.G.C.; Lima, G.R.; Salomão, A.S.; Krohling, B.; Biral, I.P.; de Angelo, G.G.; Alves, F.C.R., Jr.; Esgario, J.G.M.; Simora, A.C.; Castro, P.B.C.; et al. PAD-UFES-20: A Skin Lesion Dataset Composed of Patient Data and Clinical Images Collected from Smartphones. Data Brief 2020, 32, 106221. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).