An Effective Skin Cancer Classification Mechanism via Medical Vision Transformer

Skin Cancer (SC) is among the deadliest diseases in the world, killing thousands of people every year. Early SC detection can increase the survival rate of patients by up to 70%; hence, it is highly recommended that regular head-to-toe skin examinations are conducted to determine whether there are any signs or symptoms of SC. The use of Machine Learning (ML)-based methods is having a significant impact on the classification and detection of SC diseases. However, there are certain challenges associated with the accurate classification of these diseases, such as low detection accuracy, poor generalization of the models, and an insufficient amount of labeled data for training. To address these challenges, in this work we developed a two-tier framework for the accurate classification of SC. During the first stage of the framework, we applied different data augmentation methods to increase the number of image samples for effective training. In the second tier of the framework, taking into consideration the promising performance of the Medical Vision Transformer (MVT) in the analysis of medical images, we developed an MVT-based classification model for SC. The MVT splits the input image into image patches and then feeds these patches to the transformer in a sequence structure, similar to word embeddings. Finally, a Multi-Layer Perceptron (MLP) is used to classify the input image into the corresponding class. Based on the experimental results achieved on the Human Against Machine (HAM10000) dataset, we conclude that the proposed MVT-based model achieves better results than current state-of-the-art techniques for SC classification.


Introduction
Over the last few years there have been substantial increases in the number of skin cancers (SC) caused by several factors such as smoking, drinking alcohol, and, most importantly, the harmful UV rays emitted by the sun [1]. In the United States alone, more than two people die of this deadly disease every hour, and cancer is equally deadly in other parts of the world. Cancer occurs when cells in a part of the body grow uncontrollably and spread to other parts of the body. There are several types of cancer found in various parts of the body. SC is one of the most prevalent types of cancer, and some of its variants can be potentially hazardous to human life [2]. The two most commonly occurring types of SC are non-melanoma and melanoma. Non-melanoma cancer can usually be cured with surgery. Machine learning methods are of great importance in medical diagnostics [35,36]. Recently, Huang et al. [37] presented an efficient DL-based method for the classification of SCs. They used two pre-trained models, DenseNet and EfficientNet, and optimized their features accordingly. In their experimental evaluation, they used the HAM10000 dataset and achieved an accuracy of 85.8%. A major objective of that work was to make the method applicable to low-cost devices such as smartphones and the Jetson Nano. Carcagni et al. [38] proposed a CNN-based method for multiclass SC classification. The DenseNet CNN architecture was used, fine-tuning was done according to the problem, and an SVM classifier was employed for the final classification. In their experiments, they used the HAM10000 dataset and obtained a 90% accuracy rate. However, their method seemed to perform much better on balanced class datasets. An ensemble of DL-based models for multiclass SC classification was proposed in [39].
The researchers used five pre-trained deep learning models, MobileNetV2, Inception-ResNetV2, DenseNet201, InceptionV3, and GoogleNet, and tuned them according to the problem. In addition, they used a plain classifier and a hierarchy of classifiers to classify the data. HAM10000 was used during the experiments, and they achieved an accuracy of 87.7%. DenseNet models attained good results in their experiments and can be useful when balancing the datasets. A DL-based method for multiclass SC classification was presented by Mohammed et al. [40]. This study proposed a two-tiered framework to train models on all deeply connected layers and to resolve the issue of an imbalanced dataset. In the second step, they applied two pre-trained models, MobileNet and DenseNet121, for classification. As a result of the balanced training data, they achieved 92.7% accuracy on the HAM10000 dataset. The proposed model has the potential to be used in mobile applications. A DL-based classification method was presented by Chaturvedi et al. [41] for multiclass SC classification. As a first step, the input images are normalized and resized according to the DL models. A total of five different pre-trained models are then used to extract features and to classify them. The accuracy reached 92.83% using the balanced HAM10000 dataset. One of the major contributions of this work is the integration of different DL models, fusing information from them for better results. Next, [42] employed two DL models, InceptionV3 and ResNet50, to classify SCs. Using the ISBI2018 and HAM10000 datasets, their method achieved accuracies of 89.05% and 89.90%, respectively. Almaraz-Damian et al. [43] proposed a fusion framework for SC classification based on dermoscopic images.
In the first stage, they combined the well-known clinical features Asymmetry, Border, Color, and Diameter (ABCD) with handcrafted features to evaluate results more accurately. In the next stage, DL-based features were extracted and fused with those of the first stage. Classification was performed using relevance vector machine and SVM classifiers, which achieved 92.4% accuracy on the ISBI2018 dataset. Researchers in [44] developed a residual DL framework for SC classification and achieved 93% accuracy on the ISBI2018 dataset. Agrahari et al. [45] used the MobileNet model for efficient SC classification and performed experiments on the HAM10000 dataset, achieving 80.81%, 91.25%, and 96.26% top-1, top-2, and top-3 accuracy, respectively. In another approach [46], researchers used the data augmentation technique to solve the data imbalance problem in the dataset. They used the pre-trained weights of several DL-based models, and Xception Net achieved promising results in their experiments. Nawaz et al. [47] proposed a hybrid model for SC classification based on DL and fuzzy k-means clustering algorithms and achieved 95.6%, 93.1%, and 95.40% accuracy on the PH2, ISIC-2017, and ISIC-2016 datasets, respectively. Another hybrid model for SC classification was proposed by Sharma et al. [48], who fused the features of a cascaded ensemble of CNNs and a handcrafted-features-based DL model and achieved state-of-the-art performance. There is no doubt that vision transformers play an important role in several challenging vision-based applications, such as fire detection [49,50], anomaly detection [51], and medical image classification [52,53]. According to the recent literature, multiclass SC classification is not an easy task because of the large amount of similarity among dermoscopic images. The second problem faced in the studies mentioned above is imbalanced datasets.
Several researchers used data augmentation to deal with imbalanced datasets but, unfortunately, augmentation was applied only to a selection of low-frequency classes, leading to an overestimation of the likelihood of the dataset. This is not an optimal strategy because some of the images were overlooked during the training process. In this work, we propose an extensive evaluation of the MVT model to effectively classify SC images. The images under analysis were split into nine patches, and these patches were then transformed into a sequence by flattening and embedding. To keep positional information, a position embedding was added to the patches. The acquired patch sequences were then fed to a number of Multi-Head Self-Attention (MHSA) layers to generate the final representation. For classification, the first token of the sequence is fed as input to a classifier, which classifies the given input images into their corresponding classes. The major contributions of this work are as follows:

•
The major problem in the field of SC is the lack of publicly available datasets. Although there are some datasets available on the internet with a limited number of samples, these datasets are imbalanced, which negatively affects the performance of a model. Thus, we used extensive data augmentation with various parameters and different techniques to fill the data gap and make the system invariant to transformations and noise.

•
The current literature focuses on CNN-based models for SC classification, but these models reduce the dimensionality of the input data, which causes a loss of meaningful information. To fill this gap, we developed a vision transformer-based model with the potential to learn from the whole set of features, providing better accuracy.

•
We evaluated the proposed model on a publicly available HAM10000 dataset and obtained higher performance in terms of F1-score, specificity, sensitivity, and accuracy when compared to state-of-the-art methods.
The rest of the work is arranged in the following manner. Section 2 discusses our method. Section 3 deals with the results and analysis of our experiment. Finally, in Section 4, we present the conclusions of our experiment.

The Proposed Method
In this section we describe the data preprocessing steps and our proposed framework, as shown in Figures 1 and 2. The proposed framework consists of two main phases: data preprocessing and training the MVT model for the classification of SCs.

Data Preprocessing
Pre-processing of data is an important step in machine learning that enhances the quality of the data to be used. When training the CNN model on raw input data, for example, classification accuracy may decrease. Therefore, in the preprocessing stage, data augmentation is used to generate supplementary images from the existing images to vary their scale, position, orientation, contrast enhancement, and brightness adjustment, as shown in Figure 1.
Brightness adjustment: Different lighting conditions can cause brightness variation in images, and the gamma-correction transformation in Equation (1), applied with different gamma values, generates under-lit and overly lit images, as demonstrated in Figure 1. Hence, the selected images can exhibit such illumination-induced variations, and by using the brightness adjustment transformation we can account for them.
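Equation (1) itself is not reproduced in this excerpt; the following is a minimal sketch of the standard gamma-correction transform it most likely refers to. The function name and the gamma values used for the under-lit and overly lit variants are illustrative assumptions.

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float) -> np.ndarray:
    """Gamma correction: normalize to [0, 1], raise to gamma, rescale to [0, 255]."""
    normalized = image.astype(np.float64) / 255.0
    corrected = np.power(normalized, gamma)
    return np.clip(corrected * 255.0, 0, 255).astype(np.uint8)

# Generate an under-lit (gamma > 1) and an overly lit (gamma < 1) variant
# of a dummy mid-gray 72x72 RGB image.
img = np.full((72, 72, 3), 128, dtype=np.uint8)
under_lit = gamma_correct(img, 2.0)
overly_lit = gamma_correct(img, 0.5)
```

Applying several gamma values per image is one simple way to obtain the brightness-varied copies shown in Figure 1.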
Contrast Enhancement: This transformation is used to compensate for contrast variations in images caused by varying lighting conditions. The contrast stretching mechanism in Equation (2) is used to adjust contrast variation through different factors, as shown in Figure 1.
where f(x,y) is the input and g(x,y) is the output pixel value; s1, s2, r1, and r2 are the user-defined parameters for contrast adjustment; z1, z2, and z3 are the scale factors of the three segments of the grayscale mapping, the first of which is formulated as s1/r1; and L is the maximum gray-level value. Geometric Transformations: Translation, rotation, and scaling were applied to each image of the dataset to obtain new images, as shown in Figure 1. This step is very valuable for allowing the architecture to see the same object from different perspectives, which enhances the generalization capability of the model.
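Equation (2) is likewise not reproduced in this excerpt; this sketch implements the standard piecewise-linear contrast-stretching mapping using the user-defined parameters (r1, s1) and (r2, s2) named above. The middle and upper slopes are taken from the textbook formulation and are therefore assumptions.

```python
import numpy as np

def contrast_stretch(f, r1, s1, r2, s2, L=256):
    """Piecewise-linear contrast stretching.

    Pixels below r1 are scaled by s1/r1 (the z1 factor named in the text);
    the middle and upper segment slopes are the standard assumptions
    (s2 - s1)/(r2 - r1) and (L - 1 - s2)/(L - 1 - r2).
    """
    f = f.astype(np.float64)
    z1 = s1 / r1
    z2 = (s2 - s1) / (r2 - r1)
    z3 = (L - 1 - s2) / (L - 1 - r2)
    g = np.where(
        f < r1, z1 * f,
        np.where(f <= r2, s1 + z2 * (f - r1), s2 + z3 * (f - r2)),
    )
    return np.clip(np.rint(g), 0, L - 1).astype(np.uint8)

# Stretch the mid-tones of a synthetic grayscale gradient image.
img = np.tile(np.arange(0, 256, dtype=np.uint8), (8, 1))
out = contrast_stretch(img, r1=70, s1=30, r2=180, s2=220)
```

Choosing r1 < s1 and r2 > s2 compresses the mid-tones instead; varying these four parameters yields the contrast-varied copies shown in Figure 1.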


Our Proposed MVT Model for SC Classification
The Transformer architecture was proposed by Vaswani et al. [54]; it is an encoder-decoder module that transforms a given sequence of elements into another sequence. The major theme behind transformers is to enable parallel processing of the data. In this research, we explored a vision-based Transformer architecture for SC classification, as shown in Figure 2, where the MVT architecture receives an image of size 72 × 72 as input. First, the input image is converted into patches as in [55]. The MVT supports a different number of patches depending on the underlying scenario; we split the input image into nine patches. To deal with 2D images, the image X ∈ R^(H×W×C) is reshaped, in a sequence structure similar to word embeddings, into 2D patches X_p ∈ R^(N×(P²·C)), where (H, W) is the resolution of the original image and (P, P) is the resolution of each image patch. N = HW/P² is the effective sequence length for the transformer, which treats these patches in the same way as tokens in natural language processing. The transformer uses a constant width in each layer, and a trainable linear projection maps each vectorized patch to the model dimension D; the output of this projection is referred to as the patch embeddings. The MVT comprises an embedding layer, an encoder layer, and a classifier layer, which are described as follows: Embedding Layer: The transformer processes each patch as an individual token and maps it to dimension D with a learnable linear projection E. The embedded projections are fused with a learnable class token U_class, which is key to completing the classification process. The positional embedding E_pos is used to track and maintain the arrangement of each patch so that the actual image can be identified. The concatenation of the encoded patches with the class token, Z_0, is given in the following equation.
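The patch-embedding step described above can be sketched as follows. Random matrices stand in for the learnable projection E, the class token U_class, and the positional embedding E_pos; the dimensions (P = 24 so that a 72 × 72 image yields nine patches, D = 64) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 72; C = 3; P = 24; D = 64        # nine patches: N = HW / P^2 = 9
N = (H * W) // (P * P)

image = rng.random((H, W, C))

# Reshape X in R^(HxWxC) into N flattened patches X_p in R^(N x (P^2 * C)).
patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2).reshape(N, P * P * C)

E = rng.standard_normal((P * P * C, D))   # stand-in for the learned projection E
u_class = rng.standard_normal((1, D))     # learnable class token U_class
E_pos = rng.standard_normal((N + 1, D))   # positional embeddings E_pos

# Z_0 = [U_class; X_p^1 E; ...; X_p^N E] + E_pos
z0 = np.concatenate([u_class, patches @ E], axis=0) + E_pos
```

The resulting Z_0 has N + 1 rows: the class token followed by the nine projected patches, each carrying its positional offset.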
Encoder Layer: In this step, the transformer encoder receives the series of embedded patches Z_0. The MVT stacks L encoder blocks, each of which is divided into two subcomponents: MHSA and the Multi-Layer Perceptron (MLP).
The MHSA block is the key element of the transformer encoder and contains self-attention and concatenation layers. The self-attention receives an input x = x_1, x_2, . . . , x_n, and the transformer performs the attention operation simultaneously on a set of queries (Q) with all keys (K) and values (V), as formulated in Equation (4).
Here Atten denotes self-attention, and W_Q, W_K, and W_V are weight matrices to be learned that produce the weights on the values. First, the dot product of Q with all K is calculated, then the result is scaled by the square root of D, and finally a SoftMax function is applied. The transformer runs scaled dot-product attention many times in parallel with different weights, one set per head. All these attention heads are fused to generate the final output, as formulated in Equation (5).
where MHSA denotes the combined attention heads and W_Q^i, W_K^i, W_V^i, and W_O are the learned parameter matrices.
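A minimal sketch of Equations (4) and (5): randomly initialized matrices stand in for the learned W_Q, W_K, W_V, and W_O, and the head count and model width are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Atten(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  -- Equation (4)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def mhsa(x, heads, d_model, rng):
    """Run attention in parallel over several heads and fuse with W_O -- Equation (5)."""
    d_h = d_model // heads
    outs = []
    for _ in range(heads):
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_h)) for _ in range(3))
        outs.append(attention(x @ W_q, x @ W_k, x @ W_v))
    W_o = rng.standard_normal((heads * d_h, d_model))
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 64))   # class token + nine patch embeddings
out = mhsa(tokens, heads=4, d_model=64, rng=rng)
```

Each head attends over all ten tokens independently; concatenating the heads and projecting with W_O restores the model width.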
The encoder is composed of L identical layers, each with two key subcomponents: a multi-head self-attention (MHSA) block and a fully connected feed-forward dense (MLP) block, as shown in Equations (6) and (7). The MLP block consists of two dense layers with GeLU activation; a skip connection is used around each subcomponent in the encoder, and each is preceded by layer normalization (LN).

z′_l = MHSA(LN(z_{l−1})) + z_{l−1}, where l = 1, 2, 3, . . . , L (6)

z_l = MLP(LN(z′_l)) + z′_l, where l = 1, 2, 3, . . . , L (7)

Classification layer: The first item of the sequence produced by the encoder's final layer, z⁰_L, is fed to an external head classifier to predict the class. In Algorithm 1, we provide the training and testing steps of the proposed MVT model.
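The pre-norm residual structure of Equations (6) and (7) can be sketched as follows. The attention subcomponent is replaced by an identity placeholder so the skip connections and layer normalization remain the focus; the weights and dimensions are random illustrative stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_block(z, mhsa, W1, W2):
    """One pre-LN encoder block with two residual (skip) connections."""
    z = mhsa(layer_norm(z)) + z                  # Equation (6)
    return gelu(layer_norm(z) @ W1) @ W2 + z     # Equation (7): two dense layers

rng = np.random.default_rng(0)
D, hidden = 64, 128
W1 = rng.standard_normal((D, hidden)) * 0.02
W2 = rng.standard_normal((hidden, D)) * 0.02
identity_attention = lambda x: x                 # placeholder for the MHSA block
z = rng.standard_normal((10, D))
z_out = encoder_block(z, identity_attention, W1, W2)
```

Stacking L such blocks and reading out the first token of the final output reproduces the classification path described above.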

Experimental Setup and Results
In this section, we provide a detailed explanation of the implementation and evaluation of the proposed method. The dataset used for the experimental evaluation was HAM10000 [34]. Several classes in the dataset are highly imbalanced: Dermatofibroma (Df), Melanoma (Mel), Melanocytic nevi (Nevi), Actinic keratoses (Akiec), Vascular lesions (Vasc), Basal cell carcinoma (Bcc), and Benign keratosis-like lesions (Bkl). In total, the dataset contains 10,015 images. To improve the generalization of the model, we applied preprocessing steps to increase visibility and the number of training examples; these steps also address the issue of data imbalance. We did not apply any preprocessing step to the Bkl, Nv, and Mel classes, as can be seen in Table 1, because these classes already have a high number of images. Table 1 shows the number of images in the original dataset and in the preprocessed dataset. Overall, 70% of the data were used for training, 20% for validation, and 10% for testing. Figure 3 shows sample images from the HAM10000 dataset.
It is worth mentioning that all experiments were carried out with Python 3.6 and TensorFlow with Keras front ends on a Core i5-7200U Central Processing Unit (CPU) (2.7 GHz) with a main memory of 8 GB and a GeForce 2060 Graphics Processing Unit (GPU) with 6 GB. To evaluate the performance, we employed different evaluation metrics, namely precision, recall, F1-measure, and accuracy, which were obtained from the confusion matrix as discussed in [56][57][58] and formulated in the following equations; these metrics have previously been employed by several researchers for different vision-based classification problems.

Precision = True positive / (True positive + False positive) (9)

Recall or Sensitivity = True positive / (True positive + False negative) (10)

F1-measure = 2 × (Precision × Recall) / (Precision + Recall) (11)

Accuracy = (True positive + True negative) / (True positive + True negative + False negative + False positive) (12)
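A quick numeric check of these metrics from raw confusion-matrix counts, with the F1-measure computed as the standard harmonic mean of precision and recall (the example counts are illustrative):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall/sensitivity, F1-measure, and accuracy from counts."""
    precision = tp / (tp + fp)                            # Equation (9)
    recall = tp / (tp + fn)                               # Equation (10)
    f1 = 2 * precision * recall / (precision + recall)    # Equation (11)
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Equation (12)
    return precision, recall, f1, accuracy

# Hypothetical binary counts for one class of the confusion matrix.
p, r, f1, acc = metrics(tp=90, fp=10, fn=5, tn=95)
```

In the multiclass setting, these quantities are computed per class from the confusion matrices in Tables 2 and 3 and then averaged.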

Results Evaluation with and without Preprocessing
This section explains how we used a pre-trained MVT model for SC classification. We trained the model on the target dataset for 50 epochs, both before and after preprocessing. Based on the results reported in Tables 2 and 3, it is clear that the MVT is effective both with and without data preprocessing. Figure 4 presents the training and validation accuracy and loss of the proposed model without the preprocessing steps; the x-axis represents the number of epochs and the y-axis represents accuracy and loss over time. In this figure, the blue color indicates training accuracy and loss, whereas the orange color represents validation accuracy and loss. At the initial stage of training, after a few epochs, the validation accuracy was higher than the training accuracy. It can be observed that the training accuracy reached 80% after five epochs, but a sudden reduction can be seen in the validation accuracy, which was 77% at that point. The training and validation accuracy of the proposed model, without the use of preprocessing, reached 98% and 90%, respectively. Throughout the experiment, the validation loss was higher than the training loss; on the final epoch, the training and validation loss reached 0.02 and 0.1, respectively. Due to this, the proposed model started to overfit after 20 epochs.
Table 2 shows the confusion matrix for the proposed model without the preprocessing step, giving the mispredicted and correctly predicted values for each class. Based on the data in Table 2, it can be seen that the Mel class had the lowest accuracy and the Df and Vasc classes had the highest accuracy. As a result, an overall test accuracy of 90.28% was achieved.
Table 2. Confusion matrix of the proposed model without the preprocessing stage.
The classification report of our model on the test set of the unaugmented dataset is given in Figure 5, where the recall, precision, and F1-measure of each class are shown. It can be observed that the lowest recall, precision, and F1-measure were achieved by the Mel class, whereas the Vasc and Df classes achieved the highest performance in terms of all evaluation metrics due to a large number of training examples. In addition, with regard to the loss, it can be seen in Figure 6 that the training and validation losses were significantly reduced; by the end of the training and validation epochs, both had reached 0.01. The classification report in Figure 7 shows that the proposed MVT model achieved higher precision, recall, and F1-measure on the augmented dataset compared to the unaugmented dataset.
Table 3 shows the outcomes of the data preprocessing steps with the MVT. It is noteworthy that the performance of SC classification significantly improved after the data were preprocessed. The confusion matrix of the proposed model using the preprocessing stage is given in Table 3; the matrix shows the correctly predicted and mispredicted values of each class. As can be seen from Table 3, the lowest accuracy was achieved by the Bcc class, while the highest accuracy was reached by the Df and Vasc classes. The overall test accuracy was 96.14%. The classification report of the proposed model is given in Table 4, where the precision, recall, and F1-measure of each class can be observed.

Comparison of the Proposed Method with State-of-the-Art Methods
In this section, we discuss the comparison between the proposed model and state-of-the-art methods. As can be seen in Table 4, the proposed model achieved the highest performance in terms of precision, recall, F1-measure, and accuracy when compared to state-of-the-art methods. The proposed model achieves 96.14% accuracy, a 97.0% F1-measure, 96.50% recall or sensitivity, and 96.0% precision.
We conclude that our model outperforms state-of-the-art methods by 3.78%, 12.3%, 8.97%, and 0.34%, respectively. Additionally, we used the Grad-CAM visualization mechanism to generate heat-maps of our proposed model applied to the target dataset; a heat-map precisely locates an SC lesion. Figure 8 shows sample heat-maps for each class. It is clear from these visual results that the proposed model is robust and effective in the area of SC classification.

Comparison values (precision, recall, F1-measure, accuracy, in %):
[60]: 89.00, 83.00, 83.00, 83.10
Chaturvedi et al. [41]: 88.00, 88.00, 88.00, 93.20
Huang et al. [37]: 75.18, --, --, 85.80
Carcagni et al. [38]: 88.00, 76.00, 82.00, 90.00
Shahin et al. [42]: 86.20, 79.60, 82.90, 89.90
Jain et al. [46]: 88


Conclusions and Future Work
Machine learning algorithms have the potential to greatly augment the capabilities of medical experts in detecting early signs of skin cancer. This paper presented an MVT-based framework for skin cancer classification from dermoscopic images. The proposed method was evaluated on the HAM10000 dataset and achieved state-of-the-art performance, outperforming existing methods in terms of precision, recall, F1-measure, and accuracy by 3.78%, 12.3%, 8.97%, and 0.34%, respectively. Our method's performance improved when the number of images in all classes was increased to overcome the problem of data imbalance. The proposed MVT has a large number of training parameters and a large model size, and also requires a huge amount of training data, which makes it unsuitable for running on edge devices. In the future, we will employ model pruning and quantization techniques to overcome these challenges. Furthermore, we will employ enhanced data augmentation techniques to overcome the problem of inequities in the data and thus improve the performance of SC classification.