DermAI 1.0: A Robust, Generalized, and Novel Attention-Enabled Ensemble-Based Transfer Learning Paradigm for Multiclass Classification of Skin Lesion Images

Skin lesion classification plays a crucial role in dermatology, aiding in the early detection, diagnosis, and management of life-threatening malignant lesions. However, standalone transfer learning (TL) models failed to deliver optimal performance. In this study, we present an attention-enabled ensemble-based deep learning technique, a powerful, novel, and generalized method for extracting features for the classification of skin lesions. This technique holds significant promise in enhancing diagnostic accuracy by using seven pre-trained TL models for classification. Six ensemble-based DL (EBDL) models were created using stacking, softmax voting, and weighted average techniques. Furthermore, we investigated the attention mechanism as an effective paradigm and created seven attention-enabled transfer learning (aeTL) models before branching out to construct three attention-enabled ensemble-based DL (aeEBDL) models to create a reliable, adaptive, and generalized paradigm. The mean accuracy of the TL models is 95.30%, and the use of an ensemble-based paradigm increased it by 4.22%, to 99.52%. The aeTL models’ performance was superior to the TL models in accuracy by 3.01%, and aeEBDL models outperformed aeTL models by 1.29%. Statistical tests show significant p-value and Kappa coefficient along with a 99.6% reliability index for the aeEBDL models. The approach is highly effective and generalized for the classification of skin lesions.


Introduction
An abnormality or a change in the appearance of the skin is referred to as a lesion [1][2][3].In addition to moles, rashes, growths, discolorations, and other abnormalities, it can represent a variety of conditions.For several reasons, it is crucial to classify skin lesions [4][5][6][7][8].A timely diagnosis and treatment of skin cancer depends on early detection and accurate classification of skin lesions.Detecting and treating skin cancer early can save lives.Secondly, healthcare professionals may have limitations in terms of accuracy and consistency when inspecting skin lesions manually [9][10][11].To visually analyze skin lesions, dermatologists rely on their expertise and experience, but human error is common.By developing automatic classification systems using neural networks, we can enhance the accuracy and objectivity of skin lesion classification [12,13].
Convolutional Neural Networks (CNN) have been dominantly used in medical image classification and segmentation because of their property to automatically extract features that are different from a traditional neural network [14].But for CNN models to perform, they require large training data, and thus, pre-trained models on large-scale datasets have been used as base models and fine-tuned for specific classification called transfer learning (TL) [15,16].Various studies have been conducted on the use of pre-trained TL models for classification purposes [17][18][19][20][21], and they have performed well, but also these pre-trained models can have limitations like overfitting on a specific task [22,23].Ensemble-based deep learning (EBDL) represents a significant improvement in the field of deep learning (DL), providing better performance than standalone models [24,25].It enables the training of data of different types on different architectures to produce one single predictive output.To further enhance the performance of ensemble-based models, an attention mechanism is incorporated into the model architecture to increase its robustness and enable a more focused analysis of skin lesions.
We discussed the innovative application of aeTL models and aeEBDL approaches for skin lesion categorization.CNN models have traditionally been employed in medical image processing with promising results, although the requirement for large training datasets can be a disadvantage.To get around this, the work fixes them specifically for skin lesion classification by employing pre-trained models from large-scale datasets as foundation models.However, employing pre-trained models to overfit a certain job may have some downsides.The research introduces EBDL models, which mix different data kinds and architectures into a single predicted output to outperform solo TL models.Attention mechanisms are also included in the system for targeted classification of skin lesions.We designed various experiments for our study and achieved excellent performance on the models.We also preprocessed the dataset and balanced the classes to mitigate the bias in the models.Also, we performed statistical tests like the Wilcoxon p-test and Cohen's Kappa for system reliability and robustness.Lastly, we performed a thorough reliability analysis, using misclassification of the models over the entire dataset and averaging the rate for each type of model.For the validation, we developed three hypotheses in this study based on the investigation of skin lesions.Our first hypothesis is that EBDL models will perform better than solo TL models due to their ability to capture various representations.Second, we expected that integrating an attention mechanism will improve the system's performance over non-attention models.Attention models improve performance by allowing the model to focus on critical areas or skin lesion traits.Our final hypothesis is that attention-enabled EBDL (aeEBDL) models are superior to solo attention models due to the advantage of capturing multiple representations while focusing on salient elements.
In the spirit of our hypotheses, we developed seven TL models with six EBDL models using techniques like max voting, stack, and weighted average.An attention mechanism allows models to focus on salient features, thus improving accuracy.We embedded attention mechanisms into our TL models to envision seven new attention-enabled transfer learning (aeTL) models.Also, we incorporated three additional aeEBDL models.All of these models were evaluated on the HAM10000 dataset using five experimental protocols.
Figure 1 portrays the online system of our scheme.With the data acquisition from the HAM10000 dataset, we perform quality control, including data resizing, categorical label encoding, data augmentation, and class balancing.The processed images are then classified using our classifier system, incorporating TL and EBDL classifiers.Finally, we enhance the robustness, reliability, and stability of our system by employing crossvalidation protocols, receiver operating characteristics (ROC) curves, reliability analysis, and conducting statistical tests.
due to the advantage of capturing multiple representations while focusing on salient elements.
In the spirit of our hypotheses, we developed seven TL models with six EBDL models using techniques like max voting, stack, and weighted average.An attention mechanism allows models to focus on salient features, thus improving accuracy.We embedded attention mechanisms into our TL models to envision seven new attention-enabled transfer learning (aeTL) models.Also, we incorporated three additional aeEBDL models.All of these models were evaluated on the HAM10000 dataset using five experimental protocols.
Figure 1 portrays the online system of our scheme.With the data acquisition from the HAM10000 dataset, we perform quality control, including data resizing, categorical label encoding, data augmentation, and class balancing.The processed images are then classified using our classifier system, incorporating TL and EBDL classifiers.Finally, we enhance the robustness, reliability, and stability of our system by employing cross-validation protocols, receiver operating characteristics (ROC) curves, reliability analysis, and conducting statistical tests.This paper follows a systematic flow, beginning with Section 2, which describes the literature review.Section 3 holds the methodology of our approach, including the dataset we have used, types of the classifiers, and the experimental protocols.Section 4 has the results of our performed experiments.Section 5 covers the performance evaluation of our proposed models, followed by Section 6 with a discussion of our approach and benchmarking of the models.Finally, the conclusions are drawn in Section 7.

Literature Review
Cancer is a disease that occurs through the uncontrolled multiplication of body cells, occupying peripheral tissues.Although skin cancer occurs less frequently than many other types of cancer, it is highly important due to its high mortality rate [26].
For a long time, machine learning (ML) algorithms like k-Nearest Neighbors (KNN) and Decision Trees (DT) have been widely used as supervised ML algorithms for image classification [27].Sajid et al. [28] proposed a KNN-based classification system for skin lesions.The system incorporates filters to remove noise and extracts features from the lesion images.Their system successfully classifies the lesions into cancerous and non-cancerous categories.However, it is important to note that KNN systems are computationally expensive and do not perform well with high-dimensional inputs, which makes the algorithm expensive and unsuitable for skin lesion classification systems.Senan et al. [29] proposed a Support Vector Machine (SVM)-based system.They employed a Gaussian filter as part of the preprocessing process for image enhancement and utilized the active contour technique (ACT) to separate the lesion area from the healthy body.To extract features from the region of interest, they applied the Gray Level Co-occurrence Matrix (GLCM) This paper follows a systematic flow, beginning with Section 2, which describes the literature review.Section 3 holds the methodology of our approach, including the dataset we have used, types of the classifiers, and the experimental protocols.Section 4 has the results of our performed experiments.Section 5 covers the performance evaluation of our proposed models, followed by Section 6 with a discussion of our approach and benchmarking of the models.Finally, the conclusions are drawn in Section 7.

Literature Review
Cancer is a disease that occurs through the uncontrolled multiplication of body cells, occupying peripheral tissues.Although skin cancer occurs less frequently than many other types of cancer, it is highly important due to its high mortality rate [26].
For a long time, machine learning (ML) algorithms like k-Nearest Neighbors (KNN) and Decision Trees (DT) have been widely used as supervised ML algorithms for image classification [27].Sajid et al. [28] proposed a KNN-based classification system for skin lesions.The system incorporates filters to remove noise and extracts features from the lesion images.Their system successfully classifies the lesions into cancerous and non-cancerous categories.However, it is important to note that KNN systems are computationally expensive and do not perform well with high-dimensional inputs, which makes the algorithm expensive and unsuitable for skin lesion classification systems.Senan et al. [29] proposed a Support Vector Machine (SVM)-based system.They employed a Gaussian filter as part of the preprocessing process for image enhancement and utilized the active contour technique (ACT) to separate the lesion area from the healthy body.To extract features from the region of interest, they applied the Gray Level Co-occurrence Matrix (GLCM) method.While SVM can convert observations of possibly correlated variables into linearly uncorrelated variables using an orthogonal transformation, it may encounter difficulties in managing noisy input [30].
DL methods have enabled the development of intelligent medical-imaging-based diagnosis systems for medical image analysis.Lopez et al. [31] incorporated a VGG-net TL model for classifying skin lesions in their system.Serte et al. [32] proposed a deep CNN (DCNN) based on Gabor wavelets to detect malignant melanoma and seborrheic keratosis.
In their method, an input image is decomposed into seven directional sub-bands, and eight parallel CNNs are used to generate probabilistic predictions based on these sub-band images and the input image.The sum rule is employed for decision fusion to classify the skin lesion.
Mirunalini et al. [33] developed an automated classification system using the Incep-tionV3 model to classify skin lesions as malignant or benign, and they also considered the cause of cancer for image classification.Mahbod et al. [34] proposed an automatic classification system using three pre-trained deep models: AlexNet, VGG16, and ResNet18, as deep feature generators.They then trained an SVM classifier using these features and obtained classification results by fusing the outputs.Younis et al. [35] developed a fast and reliable model using a pre-trained MobileNet, achieving good categorical accuracy over the HAM10000 dataset.Additionally, EBDL models have played a major role in improving detection accuracy.Harangiet et al. [36] proposed aggregating robust CNN models into one neural architecture, achieving classification using the weighted output of the member CNN models, outperforming all individual CNN models.Another ensemble-based model was introduced by Shehzal et al. [37], combining the abilities of EfficientNetV2S and Swin Transformer models to detect the early focal zone of skin cancer.This ensemble-based construction effectively minimized loss and increased the accuracy of the model.
After conducting a literature review, we concluded that there is a need for studies with superior performance analysis, more generalization, and reliability.Further validation is required for these AI models, as they have the potential to play a significant role in skin lesion detection.

Methodology
In this section, we will discuss the dataset and data demography, overall architecture, TL-based classification, EBDL paradigm, the use of attention in TL models and ensemble-based models, along with training methods, experimental protocol, and performance metrics.

Dataset and Data Demography
Dermatoscopy is a diagnostic technique that has shown promise in improving the diagnosis of pigmented skin lesions.The HAM10000 dataset, which stands for "Human Against Machine with 10,000 training images" [38], comprises a collection of 10,015 dermatoscopic images obtained from diverse Australian and Austrian patients using various imaging modalities.The data acquisition was carried out by two institutions: Cliff Rosendahl in Queensland, Australia, and the Medical University of Vienna, Austria.The dataset encompasses a representative range of important diagnostic categories related to pigmented skin lesions.
The dataset visualized in Figure 2 is freely accessible for experimental purposes and serves as a valuable resource for research.It includes dermatoscopic images representing seven distinct classes: actinic keratoses and intraepithelial carcinoma/Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines/seborrheic keratoses and lichen-planus-like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas, and hemorrhage, vasc).The ground truth for this dataset was established through pathology confirmation, master agreement, or confocal microscopy [39].It is important to note that the images in the dataset may differ from what a layperson or end-user would provide in a real-world scenario.Images that were insufficiently magnified or out-of-focus were removed during the dataset curation process.
Diagnostics 2023, 13, x FOR PEER REVIEW 5 of 33 through pathology confirmation, master agreement, or confocal microscopy [39].It is important to note that the images in the dataset may differ from what a layperson or enduser would provide in a real-world scenario.Images that were insufficiently magnified or out-of-focus were removed during the dataset curation process.

Quality Control
The dataset used for this project consists of images with an initial size of 450 × 600 pixels, which were later resized to 112 × 112 pixels to ensure uniformity for training a DL model.The dataset includes various classes, but there is an imbalance in the distribution of these classes, with some having significantly more samples than the other.

Quality Control
The dataset used for this project consists of images with an initial size of 450 × 600 pixels, which were later resized to 112 × 112 pixels to ensure uniformity for training a DL model.The dataset includes various classes, but there is an imbalance in the distribution of these classes, with some having significantly more samples than the other.
One specific class, "nv" (melanocytic nevi), had an excessive number of samples and outweighed other classes.To address this issue, a class balancing technique was implemented using data augmentation.Additional synthetic data were generated for the underrepresented class by augmenting the existing images.Various augmentation techniques were utilized, including cropping, image expansion, flipping (horizontal, vertical, and diagonal), color filtration, image shifting, and blurring.Some augmentations were also applied to the other classes but to a lesser extent.
Finally, the image labels were encoded using categorical encoding.The categorical class names were converted to single digits between zero and six, representing the classes efficiently for multi-classification purposes.

Global Architecture of the Proposed System
The proposed overall architecture of DermAI 1.0 is depicted in Figure 3.In this system, we utilized and preprocessed the HAM10000 [38] dataset.We mapped meta-data to the corresponding images and ensured dataset cleanliness.The HAM10000 dataset contains 10,015 images, initially exhibiting an imbalance.To address this, we employed a one vs.all class balancing technique and augmented the dataset to increase the training data size for our models.Class weights were then applied to the processed data, enhancing the model's sensitivity.The dataset was split into training and testing sets, with 80% allocated for training and 20% for testing purposes.The training data were used to train the tuned TL models and EBDL models.Subsequently, attention mechanisms were incorporated into these models to further improve accuracy and reliability.

Architecture of the Classifiers
For building the skin lesion detection system, we utilized four different sets of clas- In this study, we hypothesized that the mean accuracy of the EBDL models would surpass that of the individual TL models.Additionally, we anticipated that the inclusion of attention mechanisms would lead to better performance in the aeTL models compared to the fine-tuned TL models alone.Furthermore, we hypothesized that aeEBDL methods would outperform aeTL models.To evaluate the performance of the models, we conducted scientific validation as well as statistical analysis and calculated metrics such as precision, recall, F1-score, and AUC.These evaluations allow us to assess the effectiveness and reliability of our proposed system in detecting and classifying skin lesions accurately.

Architecture of the Classifiers
For building the skin lesion detection system, we utilized four different sets of classifiers: solo TL models, EBDL models, aeTL models, and aeEBDL models.All of these figures are visualized in Figures 4-14.
for complex image recognition tasks.Lastly, Neural Architecture Search Network (NASNet) is an architecture designed using neural architecture search methods, demonstrating excellent performance on mobile devices and resource-constrained environments.Given the unique characteristics of each model, it is highly beneficial to experiment with them individually to determine their performance on our specific task.
Ensemble-based techniques play a crucial role in medical imaging, helping to strengthen weak learners and improve overall performance and generalization.By combining multiple models, EBDL methods leverage the diversity of individual TL models.In our study, we utilized ensemble-based techniques to enhance the models' capability to generalize.Each TL model had a different architecture, allowing them to learn distinct features from the data.
Among the EBDL methods we employed were softmax voting, weighted ensemble, and stack-based ensemble, with a specific focus on softmax voting.To create ensemblebased predictions using softmax voting, we collected predictions from each solo model and summed them along the prediction axis.The index of the highest value for each prediction determined the class with the majority votes, representing the ensemble prediction.By employing softmax voting, we effectively combined the predictions from multiple models, harnessing their collective knowledge.

Ensemble-Based Transfer Learning Models
Additionally, we utilized the weighted average ensemble technique in our study.This method involved assigning weights to the models based on their significance in contributing to the ensemble prediction.The predictions from each model were averaged along the prediction axis, and the index of the highest value indicated the class with the highest weighted average ensemble prediction.
Furthermore, we incorporated a stack-based ensemble method for our EBDL models.This approach involved stacking the predictions from our models together to create a composite input, which was then used to train a meta-model.In our research, we used Logistic Regression as the meta-model to generate our final ensemble prediction.This process allowed us to further enhance the predictive power of our EBDL models.

Solo Transfer Learning Models
TL [40] is considered a leading method for classification and offers several advantages over DL-based classification [17,41,42].In our study, we utilized seven state-of-the-art TL models: ResNet101 [43], MobileNet [44], InceptionV3 [45], EfficientNet B3 [46], EfficientNet B7 [46], DenseNet-201 [47], and NASNet Mobile [48].These models were pre-trained on the ImageNet dataset.By leveraging these advanced architectures as base models, we designed custom top layers for each model.These top layers consisted of a pair of dense layers with 4096 units, employing the ReLU activation function, followed by a dropout layer to mitigate overfitting.The final dense layer had 7 units, corresponding to the number of classes in our classification task.It utilized the softmax activation function to output probabilities for each class.
Using TL models like ResNet101, MobileNet, InceptionV3, EfficientNetB3, Efficient-NetB7, DenseNet201, and NASNetMobile can be highly effective.These models allow us to leverage pre-trained networks that were initially trained on large datasets to learn important features from the data.ResNet is a renowned deep residual network that utilizes skip connections.ResNet101, with its 101 layers, is particularly well-suited for complex image recognition tasks, especially in multiclass classification scenarios.MobileNet, on the other hand, is designed to be lightweight and efficient, making it ideal for processing on mobile devices and low-resource systems.It employs depth-wise separable convolutions to reduce the number of parameters and computations while maintaining reasonable accuracy, which is advantageous for embedded systems in lesion classification.
InceptionV3 incorporates multiple levels of feature extraction and boasts a fastprocessing speed, outperforming other architectures in certain cases.EfficientNet provides excellent feature extraction and strikes a good balance between model size and accuracy.Among its architecture, EfficientNetB3 and EfficientNetB7 have shown remarkable performance in challenging image classification tasks.DenseNet, with its feature reuse and gradient flow mechanisms, is also notable.DenseNet201, with its 201 layers, is well-suited for complex image recognition tasks.Lastly, Neural Architecture Search Network (NAS-Net) is an architecture designed using neural architecture search methods, demonstrating excellent performance on mobile devices and resource-constrained environments.Given the unique characteristics of each model, it is highly beneficial to experiment with them individually to determine their performance on our specific task.
Ensemble-based techniques play a crucial role in medical imaging, helping to strengthen weak learners and improve overall performance and generalization.By combining multiple models, EBDL methods leverage the diversity of individual TL models.In our study, we utilized ensemble-based techniques to enhance the models' capability to generalize.Each TL model had a different architecture, allowing them to learn distinct features from the data.
Among the EBDL methods we employed were softmax voting, weighted ensemble, and stack-based ensemble, with a specific focus on softmax voting.To create ensemblebased predictions using softmax voting, we collected predictions from each solo model and summed them along the prediction axis.The index of the highest value for each prediction determined the class with the majority votes, representing the ensemble prediction.By employing softmax voting, we effectively combined the predictions from multiple models, harnessing their collective knowledge.

Ensemble-Based Transfer Learning Models
Additionally, we utilized the weighted average ensemble technique in our study.This method involved assigning weights to the models based on their significance in contributing to the ensemble prediction.The predictions from each model were averaged along the prediction axis, and the index of the highest value indicated the class with the highest weighted average ensemble prediction.
Furthermore, we incorporated a stack-based ensemble method for our EBDL models.This approach involved stacking the predictions from our models together to create a composite input, which was then used to train a meta-model.In our research, we used Logistic Regression as the meta-model to generate our final ensemble prediction.This process allowed us to further enhance the predictive power of our EBDL models.

Attention-Enabled Transfer Learning Models
Attention mechanisms in deep neural networks are employed to improve model performance by enabling the network to selectively focus on crucial aspects of the input.In our study, we incorporated attention into our TL models to assess their impact on classification performance.As displayed in Equations ( 1)- (3).The attention weights were computed by considering the average prediction scores made with each model across the classes' axis.Subsequently, self-attention was applied to the predictions of each model.By multiplying the attention weights element-wise with the predictions, we assigned greater importance to specific predictions based on their corresponding attention weights.This approach effectively enhances the model's performance by enabling it to concentrate on relevant and significant features within the input.

Attention-enabled Ensemble-based Deep Learning Models
In our study, we developed aeEBDL models by combining the predictions of multiple models and incorporating a self-attention layer on the concatenated prediction layer.The process of calculating attention weights involves considering the average prediction scores made using each model across the classes' axis.As displayed in Equations ( 4)-( 6) these attention weights are then multiplied element-wise with the predictions, assigning greater significance to specific predictions based on their corresponding attention weights.
To generate ensemble predictions, we identified the class with the highest score from the attention-enabled combined prediction.This approach enhances the overall prediction performance by leveraging the attention weights to prioritize and emphasize certain predictions.
Here, P m denotes the m th model's prediction score, o m is the attention output of the m th model, C denotes the number of classes in the multiclass framework, w m is the attention weights of the m th model, M represents the total models, o (c) denotes the ensembled output from the solo M models, and P aeEBDL is the final attention-enabled ensemble-based model output.

Training Parameters
We utilized the Idaho State University (ISU) GPU cluster to execute all models using our dataset.The models were designed using libraries like Tensorflow 2.0, Pandas, Numpy, and OpenCV.The GPU cluster is a one-head node with eight Nvidia RTX 3090s with a GPU memory of 192 GB and a clock boost of 1.70 GHz; the cluster also has a storage of 12 TB for the head node.
For better convergence during training, we set the learning rate to 0.0001.Since we were dealing with multiclass classification, we employed categorical cross-entropy as our loss function.The loss function denoted with L CCE is defined as follows: where N m is the number of total images (input), C denotes the number of classes in the multiclass framework, and y i ∈ C C is the indicator function representing the i th term in the c th category.The models were trained over 50 epochs with a batch size of 16, which allowed for efficient processing of the dataset.To mitigate overfitting, we applied a dropout rate of 0.5.For the activation functions, we utilized softmax for the classification layer and ReLU for the other layers.This software and hardware environment helped to optimally train the DermAI system effectively.

Experimental Protocols
Based on our preliminary analysis, we developed an experimental workflow, which we have mentioned in this section.Initially, we examined our fine-tuned TL models and evaluated their performance.Subsequently, we investigated how we can make ensemble-based models out of our fine-tuned TL models using techniques like softmax voting, weighted average, and stack-based.We then evaluated the impact of the attention mechanism on the models on the skin lesion classification task.We also researched making ensemble-based models using the attention mechanism to see how they affect the classification performance.

Experiment 2: Comparison of EBDL Models vs. TL Models
The main objective of this study is to examine the effectiveness of an ensemble-based paradigm in skin lesion classification.To achieve this, we designed EBDL models using techniques such as softmax voting, weighted average, and stack-based.We generated six EBDL models using these three techniques and compared them with seven TL models.To ensure the reliability of the results, the experiment utilized the K10 cross-validation protocol.

Experiment 3: Comparison of TL Models vs. aeTL Models
In this experiment, our focus was to examine and compare the impact of employing an attention mechanism on TL models for skin lesion classification.To achieve this, we used an attention block and built seven aeTL models, which were then compared with seven non-attention TL models.The study was conducted using constant K10 crossvalidation protocols.

Experiment 4: Comparison of aeTL Models vs. aeEBDL Models
In this experiment, our focus was on evaluating the effectiveness of the ensemblebased paradigm in an attention-enabled framework by comparing aeTL and aeEBDL models for skin lesion classification.To achieve this, we built and compared three aeEBDL models along with seven aeTL models.The study was conducted using constant K10 cross-validation protocols.

Experiment 5: Effect of Training Size on the Performance of TL and aeTL Models
To validate the robustness of our TL models in substantially lesser amounts of training data, we analyzed four cross-validation protocols: K2, K4, K5, and the default K10 protocol.Using these protocols, we verified the performance drop of the models with varying training data sizes for each of the seven TL models.By averaging the results across the models, we intend to measure the difference in performance metrics from 90% Training data size to 50% Training data size.

Performance Metrics
For measuring the performance of the models on various fronts we have used various performance metrics.True positive (TP), true negative (TN), false positive (FP), and false negative (FN) came in handy to estimate the performance.We have used accuracy (η), recall (R), precision (ρ), and F1-score (F ).After calculating the accuracy of the TL, EBDL, aeTL, and aeEBDL models, we calculated their mean accuracies (∆) in Equations (8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18), respectively.In these equations "m" is the current model, "M" is the number of models, "d" is the current dataset, and "D" is the number of datasets.The probability curve ROC and degree of separability Area-under-the-curve (AUC) have been also calculated for every model.In standard deviation (σ), each value from the population is denoted by x i and µ, the population mean.

Results
The protocols were employed to evaluate the effectiveness of the system in skin lesion classification with the HAM10000 dataset in a multiclass framework.To analyze the system, a total of 23 AI models were utilized: seven TL models, six EBDL models, and seven aeTL models, and three aeEBDL models.16)

Performance of TL Models Using the HAM10000 Dataset
In this study, we demonstrate the performance of various TL models on the HAM10000 dataset.The experiment involved training and testing seven state-of-the-art baseline TL models of comparable architecture on an augmented dataset.As shown in Table 1, our models achieved a mean accuracy of 95.30%.Notably, EfficientNet B3 outperformed all other models, achieving an accuracy of 97.01% with an F1-score of 0.9698.This exceptional performance can be attributed to the ability of TL models to process skin lesion images and extract essential features for accurate classification.

Comparison of EBDL Models vs. TL Models
In this experiment, we developed six EBDL models using baseline TL models and utilized techniques like softmax voting, weighted average, and stack-based to create these models.The performance of the EBDL models is presented in Table 2, which showcases various performance metrics.The stack-based EBDL6: DenseNet201 + ResNet101 + NAS-NetMobile model outperformed other models with an accuracy of 99.69% and an F1-score of 0.9969.Furthermore, we compared the results of EBDL models with those of solo models, and the EBDL models demonstrated superior performance.Table 3 demonstrates the effectiveness of EBDL techniques in significantly improving the overall classification model, with EBDL techniques resulting in a 4.22% increase in mean accuracy.This improvement can be attributed to the fact that EBDL models leverage the strengths of different individual models and compensate for their weaknesses, leading to enhanced accuracy and generalization ability.Moreover, EBDL models can effectively address the limitations of TL models by combining predictions from multiple models, each with its unique strengths and weaknesses, thereby reducing the risk of overfitting and enhancing the robustness of the models.

Comparison of TL Models vs. aeTL Models
In this experiment, we implemented attention mechanisms on seven TL models and evaluated the benefits gained by incorporating attention into these models.According to Tables 4 and 5, the mean accuracy of the aeTL models is 98.31%, showing an average increase of 3.01% compared to the TL models.Among the aeTL models, aeEfficientNet B3 achieved the highest individual accuracy of 98.96% and an F1-score of 0.9896.This attention block proved to be most beneficial for the ResNet101 model, which experienced a significant increase of 4.10% in accuracy after incorporating attention.These results further support the hypothesis that attention blocks are a powerful paradigm in classification tasks.

Model Type
Mean Accuracy (%) This successful application of attention mechanisms on TL models demonstrates their ability to improve the models' performance, resulting in higher accuracy and better adaptation to the dataset.Due to their capability to focus on important features, they allow the models to better distinguish and classify complex patterns within the data.

Comparison of aeTL Models vs. aeEBDL Models
In this experiment, we developed three aeEBDL models that incorporated solo aeTL models.The results are presented in Table 6, with the best aeEBDL model achieving an accuracy of 99.73% and an F1-score of 0.9973.Upon comparing the performance with the seven aeTL models, as shown in Table 7, we observed that aeEBDL models outperform the aeTL models with a mean accuracy increase of 1.2%.This validates the performance gain in the ensemble-based paradigm, even in an attention-ensembled framework, where aeEBDL leverages the strengths of different individual models and compensates for their weaknesses, leading to enhanced accuracy and generalization ability.

Effect of Training Size on the Performance of TL and aeTL Models
In this experimental research, we examined how different sizes of training data affect the performance of TL and aeTL models.We assessed the performance metrics using various cross-validation protocols: K10 (default), K5, K4, and K2.The outcomes, presented in Tables 8 and 9, revealed a gradual decrease in performance metrics across these protocols.Both the best TL model (EfficientNet B3) and the best aeTL model (aeEfficientNet B3) were tested.The computed average mean accuracy demonstrated a decline in mean accuracy from 97.01% in K10 to 91.39% in K2, signifying a 5.62% reduction for EfficientNet B3.Likewise, for aeEfficientNet B3, the accuracy declined from 98.96% in K10 to 92.68% in K2, a 6.28% decrease.Despite the reduced training data in the K2 (50:50) validation protocol, our AI models exhibited reliable performance metrics.This finding highlights the effectiveness of our approach and how our models demonstrated robust performance even when faced with limited training data, proving their ability to maintain consistent performance under such conditions.

Performance Evaluation
During the performance evaluation, we utilized ROC curves to visualize the performance of the AI models.Furthermore, we measured the stability of the system by conducting two statistical tests: Wilcoxon signed-rank test and Cohen's Kappa coefficient on all the models.Also, we conducted a reliability test on all the AI models using misclassification rates of the images.

Receiver Operating Characteristics
ROC curves are commonly used to evaluate the performance of models across their entire operating range.In this study, we have plotted multiple ROC curves to assess the performance of different models.Figure 15 displays the ROC curve for the best-performing TL model, EfficientNet B3, which achieved an AUC of 0.9870.Figure 16 compares the ROC curves of the mean AUC of TL models (0.9776) with EBDL models (0.9982). Figure 17 visualizes the ROC curves of the best-performing aeTL models, with aeEfficientNet.
B3 has an AUC of 0.9960.Figure 18 presents a comparison of the mean AUC of TL models (0.9776) with aetTL models (0.9914). Figure 19 portrays the AUC of the overall best-performing model, aeEBDL 3, which achieved an AUC of 0.9989.Lastly, Figure 20 compares the ROC curves of the best-performing aeTL model (aeEfficientNet B3) with the best-performing aeEBDL model (aeEBDL 3).
Additionally, to establish the statistical significance of the results for all classes in each dataset, p-values were calculated.The obtained p-values were less than 0.01, indicating a high level of confidence in the observed text classification results.Overall, the use of ROC curves and p-values strengthens the evaluation of the models and provides valuable insights into their performance.

Stability Validation Using Statistical Tests
The stability of the system was validated by conducting two statistical tests on all the models.The tests performed were the Wilcoxon signed-rank test and Cohen's Kappa coefficient, which provide insights into the system's stability.The Wilcoxon signed-rank test is a non-parametric test used to confirm statistical hypotheses.It compares the locations of two separate populations in data using two matched samples to measure the paired difference.Cohen's Kappa coefficient is a statistical measure used to assess the reliability of two raters for categorical features.The results of these tests are presented in Table 10.The significance of the predicted labels is identified with p-values < 0.001.Furthermore, all the Kappa values ranged between 0.9 and 1.0, indicating an almost perfect agreement among the classifiers.

Receiver Operating Characteristics
ROC curves are commonly used to evaluate the performance of models across their entire operating range.In this study, we have plotted multiple ROC curves to assess the performance of different models.Figure 15 displays the ROC curve for the best-performing TL model, EfficientNet B3, which achieved an AUC of 0.9870.Figure 16 compares the ROC curves of the mean AUC of TL models (0.9776) with EBDL models (0.9982). Figure 17 visualizes the ROC curves of the best-performing aeTL models, with aeEfficientNet.Additionally, to establish the statistical significance of the results for all classes in each dataset, p-values were calculated.The obtained p-values were less than 0.01, indicating a high level of confidence in the observed text classification results.Overall, the use of ROC curves and p-values strengthens the evaluation of the models and provides valuable insights into their performance.

Reliability Analysis through Misclassification Results
For the reliability assessment of the system, we conducted predictions on the entire dataset using all available models.Subsequently, we computed the predictions made using each model for every image.By comparing these predicted labels with the true labels, we calculated the number of misclassified labels, visualized through Equations ( 19)- (21).This analysis allowed us to infer the probability of each image being misclassified and identify the types of lesion images that were most challenging to classify.
As shown in Figure 21, the misclassification rate for all the augmented images was approximately 2% across twenty-three models.However, TL models exhibited a higher misclassification rate of 4% for images.Also, incorporating attention with TL models reduced the misclassification of images to 2%.Notably, EBDL models and aeEBDL models demonstrated superior performance by misclassifying only 0.5% and 0.4% of images, respectively.
where, R is the reliability index, M class is the misclassification value, and I is the total number of images.

Discussion
The proposed system has been trained on the HAM1000 dataset and incorporates superior quality control techniques and class balancing using augmentation.We implemented seven TL models with and without attention, as well as six EBDL and three aeEBDL models on the dataset.Based on our comprehensive analysis, we summarize the primary and secondary findings.Through extensive experimentation and model analysis, we have proven our hypothesis.Additionally, we have benchmarked our proposed techniques (aeTL and aeEBDL) against existing studies in the field of skin cancer classification.Furthermore, the incorporation of attention mechanisms in our models proved to be advantageous.The attention paradigms enhanced the models' ability to capture important features, resulting in a more accurate and robust system.

Principal Findings
Extensive experimentation was conducted on various models, leading to valuable insights into their performances and effectiveness.(i) Incorporating ensemble-based architectures (max voting, weighted ensemble, and stack) increased the mean accuracy of EBDLs by 4.22% compared to their component TL models.(ii) aeTL models exhibited better performance compared to baseline TL models increasing the mean accuracy by 3.01%.(iii) aeEBDL models outperformed aeTL models in terms of accuracy, efficiency, and robustness, leading to a mean accuracy increase of 1.29%.(iv) The proposed models validated the hypotheses and demonstrated their behavior on data augmentation and class balancing.(v) Fine-tuned TL models initially outperformed existing approaches, with the fine-tuned EfficientNet B3 achieving the highest accuracy of 97.01% on the augmented dataset, while ResNet-101 achieved the lowest accuracy of 94.07% on the stack.

Discussion
The proposed system has been trained on the HAM1000 dataset and incorporates superior quality control techniques and class balancing using augmentation.We implemented seven TL models with and without attention, as well as six EBDL and three aeEBDL models on the dataset.Based on our comprehensive analysis, we summarize the primary and secondary findings.Through extensive experimentation and model analysis, we have proven our hypothesis.Additionally, we have benchmarked our proposed techniques (aeTL and aeEBDL) against existing studies in the field of skin cancer classification.Furthermore, the incorporation of attention mechanisms in our models proved to be advantageous.The attention paradigms enhanced the models' ability to capture important features, resulting in a more accurate and robust system.

Principal Findings
Extensive experimentation was conducted on various models, leading to valuable insights into their performances and effectiveness.(i) Incorporating ensemble-based architectures (max voting, weighted ensemble, and stack) increased the mean accuracy of EBDLs by 4.22% compared to their component TL models.(ii) aeTL models exhibited better performance compared to baseline TL models increasing the mean accuracy by 3.01%.(iii) aeEBDL models outperformed aeTL models in terms of accuracy, efficiency, and robustness, leading to a mean accuracy increase of 1.29%.(iv) The proposed models validated the hypotheses and demonstrated their behavior on data augmentation and class balancing.(v) Fine-tuned TL models initially outperformed existing approaches, with the fine-tuned EfficientNet B3 achieving the highest accuracy of 97.01% on the augmented dataset, while ResNet-101 achieved the lowest accuracy of 94.07% on the stack.(vi) Incorporating attention in the architecture further improved classification performance, with aeResNet-101 showing the highest accuracy increase of 4.10% when adapted to attention.(vii) The proposed models were validated through statistical tests and analyses of the effect of the training sample size on training accuracy.(viii) An elaborate and novel misclassification paradigm was utilized to inspect reliability of the proposed models.(ix) The proposed system achieved superior performance over the benchmark in the domain, by a significant margin.
The methodologies presented in this research paper are advanced and effective techniques for the classification of skin lesions.
Table 11 focused on eight studies that focused on building classifiers for skin lesion classification.Moldovan et al. [54] presented a dual-step classification approach based on TL.They used a DenseNet121 pre-trained model to predict the half number of classes in the first step and the other remaining classes in the second step for the overall classification.Lim et al. [53] used a fine-tuned MobileNet TL model and compared their models with their performance on augmented and non-augmented data.Rahi et al. [52] based the idea on the usage of CNN and TL models for classification and conducted a comparison of the performance of models.They used ResNet50, DenseNet121, and VGG11 as pretrained models and made their CNN from scratch, then evaluated their results.Chaturvedi et al. [51] proposed a fine-tuned ResNetXt101 for the classification of skin lesions and conducted a comparative study of five state-of-the-art TL models and constructed four EBDL models out of them.Iqbal et al. [55] demonstrate their deep CNN models that outperforms the state-of-the-art models for the classification on ISIC-19 data.Ali et al. [39] used normalization and data augmentation paradigms to train their proposed DCNN model that outperforms TL models such as AlexNet, VGG16, and ResNets.Kauser et al. [49] proposed EBDL models using TL models like InceptionResNetV2 as base models and used majority voting and weighted majority voting techniques for their classifiers.Charan et al. [56] proposed a two-path neural network, with one path taking input as original images and the second with multiple original images and their segmentation mask for the classification.Hoang et al. [47] proposed a entropy-based weighting and first-order cumulative combined segmentation technique to segment the HAM10000 dataset and further used these images for the channel shuffling function to train their proposed wide ShuffleNet for classification.Our results showcase the effectiveness of our best-in-class models, which outperform these existing models and demonstrate the effectiveness of aeTLs and aeEBDL.Our study also highlights the effectiveness of data augmentation methodology in medical imaging for increasing the training sample size, effectiveness, and robustness of a model.

Special Note on the Use of Attention in Skin Lesion Classification
Attention mechanisms play a vital role in enhancing skin lesion detection by enabling models to selectively focus on important features of the input image for classification, rather than treating every image part equally.This utilization of attention proves particularly advantageous in skin lesion classification as it allows the models to concentrate selectively on lesion areas, rather than the surrounding normal skin, facilitating the identification of the required features for accurate classification.
Moreover, attention mechanisms assist the models in improving feature extraction by prioritizing relevant features over irrelevant ones.This emphasis on relevant features enhances the model's performance and leads to a more accurate classification of skin lesions.By employing attention mechanisms, the models can effectively discern the distinguishing characteristics of skin lesions, resulting in a more robust and precise classification system.

Special Note on the Use of Segmentation of Lesion
While using a segmentation-based approach can be reliable and effective in identifying lesions in medical images [57][58][59][60][61], we chose not to implement it for several reasons.Our main objective was multiclass classification, and we discovered that we were able to achieve accurate results without the need for segmentation.The lesion sizes in our images were already relatively large, and the region of interest (ROI) around the lesion was closely positioned and centered, resulting in smaller background regions.This allowed our classification models to focus on relevant features without the requirement of external lesion segmentation.
Additionally, when we experimented with incorporating segmentation, we observed only a minor 1-2% overall improvement in classification performance, which was not significant enough to justify the added complexity and computational resources it entailed.By solely focusing on the classification task, we streamlined and pipelined our approach to achieve accurate results more efficiently.
Our classification models effectively learned to distinguish between different lesion classes, and just omitting the segmentation phase did not hinder their performance.This decision enabled us to allocate more resources to optimizing the classification models and exploring the attention-enabled and ensemble-based learning paradigm, ultimately leading to successful lesion detection without relying on segmentation.While segmentation was not the primary objective, segmentation methods based on level sets can be possible solutions for lesion segmentation [62][63][64][65].Classifications that combine deep learning-based features with machine learning for classifiers [66] have also proved effective.

Special Strengths, Weakness, and Extension
The research focused on the application of TL, EBDL, aeTL, and aeEBDL methods in skin lesion classification.The study showed significant improvement in the classification process, making the proposed methods a benchmark in the fields for skin lesion classification.The proposed models outperform existing studies.The use of attention in models gave an upper hand in classification, compared to existing models, and the EBDL technique leveraged the trained models' capabilities for increasing the prediction accuracy.Additionally, cross-validation and statistical tests prove the system's robustness and domain adaptability.
Due to the limited availability of high-quality dermatoscopic images, the existing study focused mainly on training classifiers on specific datasets.The HAM10000, although extensive, still has limited variation in representing the full spectrum of skin cancer or pigmented skin lesions.Also, the imbalanced class distribution potentially will overlook other types of skin cancer.The system requires an extra step of augmentation, which may introduce synthetic samples into the training data, inheriting uncertainties and limitations.
The exploration of generalizing the models to adapt more to the domain and regularization techniques such as weight decay can help in reducing the overfitting of AI models to datasets and their ability to generalize more to unseen and clinical data.[67][68][69] There have been studies in different domains, such as immunology [70,71], cardiovascular risk assessment [72], psoriasis diagnosis [73], and thyroid Doppler [74], where cloud-based end-to-end systems are used for detection and classification.We therefore intend to deploy our models on a similar paradigm to create an automated, scalable, and accessible skin cancer classification that will allow easy integration with other applications, remote access, and effective use of computational resources [71,75].The proposed cloud-based system follows a layered architecture, where the presentation layer operates locally on the users' devices, while the business and persistence layers are hosted on the cloud.This ensures a user-friendly system providing a real-time classification system on the device itself [72].Moreover, the design systems can be pruned to reduce the size of the trained models, leading to less consumption of computational resources and memory requirements [76,77].Artificial Intelligent systems are prone to bias, which can emerge due to imbalanced data distributions, demographic disparities, or variations in data collection protocol.Techniques involving bias mitigation and ranking them according to their bias can be introduced into the system [78][79][80].Lastly, adopting transformer-based architectures [81] for their excellence in capturing long-range dependencies and feature information can be beneficial for the detection of pigmented skin lesions.Vision transformers [82] will be a healthy approach in handling large-image resolutions, making more interpretable attention maps or enhancing the scalability and parallel processing powers.
Another area that can improve the classification is the adoption of automated skin lesion segmentation using advanced segmentation methods such as level sets [65,83] or stochastic medical image analysis methods [64].

Conclusions
Our study demonstrates a novel paradigm for skin classification based on image input.The system employs seven TL models, six EBDL models derived from the solo TL models, and seven aeTL models, along with three aeEBDL models.We designed a comprehensive experimental structure to assess model performance and validate our hypothesis.The system's generalization and reliability were evaluated through performance assessments using ROC curves, statistical tests, and reliability analyses through misclassification.Furthermore, we developed a benchmarking strategy concerning related strategies for skin lesion classification.The proposed system demonstrated high reliability and stability.

Figure 4 .
Figure 4. Architecture of max voting ensemble-based deep learning model.

Figure 4 .
Figure 4. Architecture of max voting ensemble-based deep learning model.Diagnostics 2023, 13, x FOR PEER REVIEW 9 of 33

Figure 5 .
Figure 5. Architecture of weighted averaged ensemble-based transfer learning model.

Figure 6 .
Figure 6.Architecture of stack ensemble-based deep learning model.

Figure 5 .
Figure 5. Architecture of weighted averaged ensemble-based transfer learning model.

Figure 5 .
Figure 5. Architecture of weighted averaged ensemble-based transfer learning model.

Figure 6 .
Figure 6.Architecture of stack ensemble-based deep learning model.

Figure 6 .
Figure 6.Architecture of stack ensemble-based deep learning model.

Figure 5 .
Figure 5. Architecture of weighted averaged ensemble-based transfer learning model.

Figure 6 .
Figure 6.Architecture of stack ensemble-based deep learning model.

Figure 14 .
Figure 14.Architecture of attention-enabled ensemble-based deep learning model.

Figure 14 .
Figure 14.Architecture of attention-enabled ensemble-based deep learning model.

Figure 15 .
Figure 15.ROC chart of best-performing TL models.Blue line: It denotes random classifier at an AUC of 0.5.

Figure 15 . 33 Figure 16 .
Figure 15.ROC chart of best-performing TL models.Blue line: It denotes random classifier at an AUC of 0.5.Diagnostics 2023, 13, x FOR PEER REVIEW 20 of 33

Figure 16 .
Figure 16.ROC chart of the mean performance of TL vs. EBDL models.Figure 16.ROC chart of the mean performance of TL vs. EBDL models.

Figure 16 .
Figure 16.ROC chart of the mean performance of TL vs. EBDL models.

Figure 20 .
Figure 20.ROC chart of the mean performance of aeTL vs. aeEBDL models.

Figure 21 .
Figure 21.Classification performance of different models.
(vi) Incorporating attention in the architecture further improved classification performance, with aeResNet-101 showing the highest accuracy increase of 4.10% when adapted to attention.(vii) The proposed models were validated through statistical tests and analyses of the effect of the training sample size on training accuracy.(viii) An elaborate and novel misclassification paradigm was utilized to inspect reliability of the proposed models.(ix) The

Figure 21 .
Figure 21.Classification performance of different models.

Table 1 .
Performance metrics of seven TL models.

Table 2 .
Performance metrics of six EBDL models.

Table 3 .
Comparison of EBDL models vs. TL models.

Table 4 .
Performance metrics of seven aeTL models.

.52 Mean accuracy of all models = 98.31%Table 5 .
Comparison of TL models vs. aeTL models.

Table 6 .
Performance metrics of three aEBDL models.

Table 8 .
Effect of training data on performance for EfficientNet B3 model.

Table 9 .
Effect of training data on performance for aeEfficientNet B3 model.

Table 10 .
Results of conducted statistical tests on 23 AI models.TL: transfer learning; EBDL: ensemble deep learning; aeTL: attention-based transfer learning; and aeEBDL: attention-based ensemble deep learning.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.82

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.Diagnostics 2023, 13, x FOR PEER REVIEW

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.
:It means that the particular column was not done/ conducted by the studies.

Table 11 .
Benchmarking of studies that were implemented for skin lesion classification.:It means that the particular column was not done/ conducted by the studies.