Skin Lesion Segmentation Based on Vision Transformers and Convolutional Neural Networks—A Comparative Study

: Melanoma skin cancer is considered as one of the most common diseases in the world. Detecting such diseases at early stage is important to saving lives. During medical examinations, it is not an easy task to visually inspect such lesions, as there are similarities between lesions. Technological advances in the form of deep learning methods have been used for diagnosing skin lesions. Over the last decade, deep learning, especially CNN (convolutional neural networks), has been found one of the promising methods to achieve state-of-art results in a variety of medical imaging applications. However, ConvNets’ capabilities are considered limited due to the lack of understanding of long-range spatial relations in images. The recently proposed Vision Transformer (ViT) for image classiﬁcation employs a purely self-attention-based model that learns long-range spatial relations to focus on the image’s relevant parts. To achieve better performance, existing transformer-based network architectures require large-scale datasets. However, because medical imaging datasets are small, applying pure transformers to medical image analysis is difﬁcult. ViT emphasizes the low-resolution features, claiming that the successive downsampling results in a lack of detailed localization information, rendering it unsuitable for skin lesion image classiﬁcation. To improve the recovery of detailed localization information, several ViT-based image segmentation methods have recently been combined with ConvNets in the natural image domain. This study provides a comprehensive comparative study of U-Net and attention-based methods for skin lesion image segmentation, which will assist in the diagnosis of skin lesions. The results show that the hybrid TransUNet, with an accuracy of 92.11% and dice coefﬁcient of 89.84%, outperforms other benchmarking methods.


Introduction
Cancer is one of the deadliest diseases in the world.According to WHO, in 2020, nearly 10 million people have died due to cancer.According to 2020 data, among the new cancer cases found, the most common cases were breast cancer (2.27 M), lung cancer (2.21 M), colon and rectum (1.93 M), skin cancer (1.2 M) and stomach (1.09 M) [1].Skin cancers are distinguished into two categories, namely non-melanoma and melanoma.Nonmelanoma is a more common type of skin cancer whereas melanoma is a less common skin cancer.However, melanomas are the most dangerous ones and have a high spread rate.Even though the amount is only 5% of skin malignancy, the mortality rate is over 75% [2].Skin cancers can spread to other parts of the body and are often not curable; detection at a very early stage can help them from spreading as they may be curable.Melanoma skin cancers are a highly aggressive type of cancer, and their incidence has dramatically increased over the past three decades [3].To stop their spread and treat them, it is important to detect such cancers at a very early stage.For melanoma skin cancer, the five-year survival rate is 98% if detected at an early stage; however, it drops by 14% if detected in later stages.Due to the nature of melanoma, it is very important to detect it in a timely and accurate manner.The best way to detect skin lesions is through dermatoscopy.This technique is used as a primary examination to detect skin lesions.Due to the high resolution and quality, and enhanced visualization competence of dermoscopic images, these images help dermatologists to examine the skin lesions with naked eyes.Nevertheless, good expertise and deep knowledge are required to make the right diagnosis and such diagnoses/interpretations may differ from one dermatologist to another.
It is evident from the literature that skin lesion (melanoma) detection based on CNN (Convolutional Neural Networks) has obtained better performance than dermatologists' [4].Incorporating AI in smartphones has become a trend and is proven productive in terms of accuracy.Advancement in image processing and incorporating the deep learning techniques, especially CNN, has given a different edge to use in medical science.Processing of images using CNN and its automatic analysis methods can be proven a powerful tool and can provide a user-friendly intelligent system to scan the lesion images to detect melanoma outside the clinic [5,6].As a result, automatic analysis of skin lesions has become an important step in computer-aided diagnosis [7].
Using dermoscopy images, different approaches have been proposed to detect melanoma skin lesions based on the appearance of colour and texture patterns of skin for instant classical pattern analysis [8], ABCDE rules [9] and a seven-point checklist [10,11].In ABCDE rules, an easy and general framework has been provided in order to identify melanoma.The defined rules are border irregularity, un-uniform colour, 6 mm diameter or above, and growing lesions in colour, shape or size.Usually, it is the borders of the melanoma which are uneven and may contain rough edges, which are imprecisely defined.Hence, skin segmentation is performed at an early stage to get the border information or regions of interest (ROI).Such approaches have been proven to be beneficial for the subsequent classification or detection task [11,12].
It is a challenging task to perform automatic skin lesion segmentation on some skin lesions where lesions are not that clear or have a light pigment.It gets complicated when colour or visual patterns are similar, and the boundaries are alike for different melanomas.In such cases, skin segmentation becomes enormously difficult.It gets more complicated for computers to read images directly which are of high resolution, and resource intensive.Having such properties make the process slow.To expedite the process, the image size is reduced using down-sampling, which in turn has a negative effect on the textures and subtlety.Due to that, it gets more complicated to differentiate the boundaries of these skin lesions.Moreover, the colour and texture distribution gets affected by some elements present in skin lesions such as blood vessels, colour illumination and hair artefacts, which have a negative impact on learning.
Figure 1 shows some sample images of the ISIC dataset [13].ISIC contains the digital skin images of melanoma.The aim of keeping such images publicly available is to educate the public/professionals about melanoma, examine them and provide some diagnoses through teledermatology, clinical decision support and automated diagnosis.Applying skin lesion segmentation on such a dataset is a challenging task due to the characteristics of these images.As it can be seen from the figure some images are covered with hair, some lesions have blood vessels around them and some lesion images have fuzzy boundaries, which makes skin lesion segmentation difficult.Different approaches have been proposed for skin lesion segmentation by incorporating different CNN architectures with multiscale information [12,14,15] and multi-task learning frameworks [16,17] with auxiliary information [18].The aim of these proposed approaches is to utilize as much data available to detect skin lesions and make the right diagnosis.Nevertheless, such methods either require extra labelling or use extensive parameters which are impractical in real situations.Although CNNs have achieved the best results on various vision tasks, they have still shown limited performance on skin lesion segmentation tasks due to a lack of understanding of long-range spatial relations in skin lesion images.In order to overcome this limitation, the recently proposed Vision Transformer (ViT) [19] for image classification employs a purely self-attention-based model that learns long-range spatial relations to focus on the image's relevant parts.To achieve better performance, existing transformer-based network architectures require large-scale datasets.However, because medical imaging datasets are small, applying pure transformers to medical image analysis is difficult.ViT emphasizes low-resolution features, claiming that the successive downsamplings result in a lack of detailed localization information, rendering it unsuitable for skin lesion image classification.To improve the recovery of detailed localization information, several ViTbased image segmentation methods have recently been combined with ConvNets in the natural image domain.Furthermore, different architecture called UNet [20], based on CNN with few changes in its architecture, has been developed and incorporated into biomedical images not only to just classify the disease but also to identify the area of infection.Due to its success rate on medical images, many researchers [21][22][23][24] have adopted this model and use it for skin segmentation.This study provides a combined architecture that combines ViT and ConvNet to provide an efficient skin lesion image segmentation, which will assist in the diagnosis of skin lesions.
Recently, some of the feature extraction studies have shown some improvement in terms of accuracy [25][26][27].Such studies mainly follow parallel-based and serial-based approaches.The aim of these studies is to enrich the information related to the subject gathered from different sources.Nevertheless, incorporating this technique gives the rise to the number of predictors, which in turn has a negative impact on computational time.Due to this reason, many researchers only focus on introducing techniques that select only the best features.Feature-based techniques are of two types: heuristic-based and meta-heuristic technique.Meta-heuristic techniques are more valuable for the selection process and provide a fewer number of predictors.
In this paper, a detailed comparative analysis based on the properties of the Transformers and CNNs is carried out.The following points summarize the contribution of the paper:

•
We discuss the challenges in the skin lesion segmentation.

•
We provide a detailed comparative analysis of the state-of-the-art methods for the task of skin lesion segmentation.

•
We evaluate the efficacy of the state-of-the-art methods, and the experiments demonstrate the effectiveness of the U-Net and Transformer-based methods for skin lesion segmentation.Although CNNs have achieved the best results on various vision tasks, they have still shown limited performance on skin lesion segmentation tasks due to a lack of understanding of long-range spatial relations in skin lesion images.In order to overcome this limitation, the recently proposed Vision Transformer (ViT) [19] for image classification employs a purely self-attention-based model that learns long-range spatial relations to focus on the image's relevant parts.To achieve better performance, existing transformer-based network architectures require large-scale datasets.However, because medical imaging datasets are small, applying pure transformers to medical image analysis is difficult.ViT emphasizes low-resolution features, claiming that the successive downsamplings result in a lack of detailed localization information, rendering it unsuitable for skin lesion image classification.To improve the recovery of detailed localization information, several ViTbased image segmentation methods have recently been combined with ConvNets in the natural image domain.Furthermore, different architecture called UNet [20], based on CNN with few changes in its architecture, has been developed and incorporated into biomedical images not only to just classify the disease but also to identify the area of infection.Due to its success rate on medical images, many researchers [21][22][23][24] have adopted this model and use it for skin segmentation.This study provides a combined architecture that combines ViT and ConvNet to provide an efficient skin lesion image segmentation, which will assist in the diagnosis of skin lesions.
Recently, some of the feature extraction studies have shown some improvement in terms of accuracy [25][26][27].Such studies mainly follow parallel-based and serial-based approaches.The aim of these studies is to enrich the information related to the subject gathered from different sources.Nevertheless, incorporating this technique gives the rise to the number of predictors, which in turn has a negative impact on computational time.Due to this reason, many researchers only focus on introducing techniques that select only the best features.Feature-based techniques are of two types: heuristic-based and meta-heuristic technique.Meta-heuristic techniques are more valuable for the selection process and provide a fewer number of predictors.
In this paper, a detailed comparative analysis based on the properties of the Transformers and CNNs is carried out.The following points summarize the contribution of the paper:

•
We discuss the challenges in the skin lesion segmentation.

•
We provide a detailed comparative analysis of the state-of-the-art methods for the task of skin lesion segmentation.

•
We evaluate the efficacy of the state-of-the-art methods, and the experiments demonstrate the effectiveness of the U-Net and Transformer-based methods for skin lesion segmentation.
The rest of the paper is organized as follows: related works to this research are reported in Section 2. In Section 3, the details about the comparative methods are presented.Section 4 shows the experimental details and performance evaluation of the comparative methods and the benchmarking models.Finally, Section 5 concludes the paper.

Related Work
In different computer vision tasks, the convolutional neural network models have made exceptional progress [28][29][30][31][32], and have been proven to be the best in solving segmentation problems.Skin lesion classification accuracy directly depends on the lesion segmentation.The more accurate the lesion segmentation is, the better the result of lesion classification.Lately, a lot of research has been done to develop such models, which can automatically conduct the lesion segmentation in such a way that differentiates the normal area of skin from the lesion.Different techniques have been used, such as region growing, threshold, edge detection, etc. to develop such models.
Segmentation problems have been treated as classification problems by some researchers, such as Jafari et al. [33].They proposed an automatic skin segmentation model for non-dermoscopic images.For better performance, this model uses the guided filter technique [34] as a pre-processing technique to reduce the noise artefacts on the input images.A window around the pixel is placed to identify whether it belongs to a lesion area or not and then is fed to a CNN and output labels pixel in the centre of the patch.A segmentation map is generated with the label as 1 (lesion) and 0 (normal) for each pixel of the pre-processed images and then the segmentation mask is generated by selecting the largest connected component of an image.The output of this model is the predicted labels of the pixels.Nonetheless, such a model requires a dense prediction because it is based on pixel-level prediction.U-NET is a well-known model proposed by Ronneberger et al. [20].It has a good success rate for medical images when it comes to segmentation problems.Due to its popularity, many models have been proposed [21][22][23][24] adopting U-NET architecture.They have used it for melanoma skin segmentation and classification.In [23], the authors have modified the original U-NET by widening the kernel at each convolutional block using a dilated convolution technique in order to expand the receptive field of the proposed model; whereas in [24], the authors have utilized colour transformation to select different colour bands in order to improve the performance of the proposed approach.Another model proposed by Yuan et al. [35] incorporated the deconvolution method and replaced the regular cross-entropy function with a loss function based on Jaccard distance; whereas a full resolution CNN model is proposed by Al-masni et al. [36], in which models directly learn the full resolution features of every single pixel from the input images without reducing the size of input images.However, there is an issue of using full resolution CNN as an overfitting of the model, which is likely to happen on the features of non-melanoma skin lesions.In that case, such models perform poorly when it comes to complex features of melanoma skin lesions due to their inconsistent boundary appearance and different textures.Bi et al. [37] proposed a new model to overcome the limitation of full-resolution CNN.Their model learns the necessary skin lesion characteristics for each class (be it melanoma or non-melanoma) individually.They incorporated a new probability-based, stepwise integration to combine the segmentation output generated from each class.In another study [38] of skin lesion segmentation, a dilated residual network has been used along with pyramid pooling networks.To attain sharp boundaries, they have used a negative log-likelihood and endpoint error loss together.In a recent study, a bootstrapping CNN method was proposed by Xie et al. [17] for both skin lesion segmentation and skin classification problem.In this method, one task aids another in a bootstrapping way where a coarse segmentation network is trained, and then a predicted coarse mask is utilized to guide the classification network.Simultaneously, localization maps are generated to increase the prediction of masks outperforming the coarse mask.Recently, another model of CNN architecture using auxiliary information has been proposed [18].The model does not need any pre-post-processing of data.This model predicts edges and does the classification of skin lesions simultaneously by using two parallel branches.Predicting edges helps the neural network to pay attention to segmentation mask boundaries.
In brief, the aforementioned methods focused mainly on feature extraction.They used pre-trained CNN models in order to achieve it.Furthermore, they incorporated CNN segmentation models for lesion detection.Their primary aim is to improve the model's accuracy in terms of performing skin segmentation.However, hardly anyone has focused on reducing the systems' time in terms of prediction.
Since the task of skin lesion segmentation is quite different from multiorgan segmentation, as in skin lesion segmentation, the task is to predict the binary mask rather than the multiclass pixel-wise segmentation.Therefore, we modified the implementation of these models in order to apply these models for this specific task of skin lesion segmentation.The following subsections outline a brief overview of these methods and the architectural changes that were carried out for this specific task of skin lesion segmentation.

U-Net
As stated in [20], U-Net consists of contracting and expansive path (downsampling and upsampling) as shown in Figure 2.For this particular task, we used four downsampling and upsampling levels with one bottom level.Two convolutional layers per downsampling level and one convolutional layer after concatenation per up sampling level.Gaussian Error Linear Unit (GELU function) was used as an activation function with batch normalization and Softmax as an output function.GELU multiplies its input by the cumulative density function of the normal distribution.The reason for using GELU is that it provides a welldefined negative gradient to prevent neurons from dying while also limiting how far into the negative regime activations can have an effect, allowing for improved feature mixing between layers.In addition to this, max pooling was used for downsampling and reflective padding for upsampling.
Appl.Sci.2022, 12, x FOR PEER REVIEW 5 of 18 the classification of skin lesions simultaneously by using two parallel branches.Predicting edges helps the neural network to pay attention to segmentation mask boundaries.
In brief, the aforementioned methods focused mainly on feature extraction.They used pre-trained CNN models in order to achieve it.Furthermore, they incorporated CNN segmentation models for lesion detection.Their primary aim is to improve the model's accuracy in terms of performing skin segmentation.However, hardly anyone has focused on reducing the systems' time in terms of prediction.
Since the task of skin lesion segmentation is quite different from multiorgan segmentation, as in skin lesion segmentation, the task is to predict the binary mask rather than the multiclass pixel-wise segmentation.Therefore, we modified the implementation of these models in order to apply these models for this specific task of skin lesion segmentation.The following subsections outline a brief overview of these methods and the architectural changes that were carried out for this specific task of skin lesion segmentation.

U-Net
As stated in [20], U-Net consists of contracting and expansive path (downsampling and upsampling) as shown in Figure 2.For this particular task, we used four downsampling and upsampling levels with one bottom level.Two convolutional layers per downsampling level and one convolutional layer after concatenation per up sampling level.Gaussian Error Linear Unit (GELU function) was used as an activation function with batch normalization and Softmax as an output function.GELU multiplies its input by the cumulative density function of the normal distribution.The reason for using GELU is that it provides a well-defined negative gradient to prevent neurons from dying while also limiting how far into the negative regime activations can have an effect, allowing for improved feature mixing between layers.In addition to this, max pooling was used for downsampling and reflective padding for upsampling.

V-Net
V-Net [39] originally proposed for 3d MRI images was modified to 2d inputs of skin lesion images as shown in Figure 3.The same number of downsampling and upsampling

V-Net
V-Net [39] originally proposed for 3d MRI images was modified to 2d inputs of skin lesion images as shown in Figure 3.The same number of downsampling and upsampling levels were used as in U-Net implementation, i.e., four with one bottom level.The number of stacked convolutional layers in the residual path increases from one to three downsampling levels (symmetrically decreasing with upsampling levels).Unlike U-Net, the Parametric ReLU (PReLU) activation function was used with batch normalization and Softmax as an output activation function.PReLU is a type of LeakyReLU that instead of having a predetermined slope, makes it a parameter for the neural network to figure out itself, which helps in the better adaption to other parameters such as weights and biases.Downsampling was carried through stridden convolutional layers and upsampling through transpose convolutional layers.

Attention U-Net
Attention U-Net [40] was proposed for multi-class CT abdominal image segmentation as shown in Figure 4. We modified this model for skin lesion image segmentation.We used four downsampling and upsampling levels with two convolutional layers per downsampling level and two convolutional layers after concatenation per upsampling level.We used the ReLU activation function with batch normalization.In addition to this, we used additive attention and ReLU attention activation.We used stridden convolution layers for downsampling and bilinear interpolation for upsampling.

Attention U-Net
Attention U-Net [40] was proposed for multi-class CT abdominal image segmentation as shown in Figure 4. We modified this model for skin lesion image segmentation.We used four downsampling and upsampling levels with two convolutional layers per downsampling level and two convolutional layers after concatenation per upsampling level.We used the ReLU activation function with batch normalization.In addition to this, we used additive attention and ReLU attention activation.We used stridden convolution layers for downsampling and bilinear interpolation for upsampling.
Appl.Sci.2022, 12, x FOR PEER REVIEW 6 of 18 levels were used as in U-Net implementation, i.e., four with one bottom level.The number of stacked convolutional layers in the residual path increases from one to three downsampling levels (symmetrically decreasing with upsampling levels).Unlike U-Net, the Parametric ReLU (PReLU) activation function was used with batch normalization and Softmax as an output activation function.PReLU is a type of LeakyReLU that instead of having a predetermined slope, makes it a parameter for the neural network to figure out itself, which helps in the better adaption to other parameters such as weights and biases.Downsampling was carried through stridden convolutional layers and upsampling through transpose convolutional layers.

Attention U-Net
Attention U-Net [40] was proposed for multi-class CT abdominal image segmentation as shown in Figure 4. We modified this model for skin lesion image segmentation.We used four downsampling and upsampling levels with two convolutional layers per downsampling level and two convolutional layers after concatenation per upsampling level.We used the ReLU activation function with batch normalization.In addition to this, we used additive attention and ReLU attention activation.We used stridden convolution layers for downsampling and bilinear interpolation for upsampling.

TransUNet
The TransUNet [41] architecture has been used for the medical image segmentation on the Synapse multi-organ segmentation dataset for segmenting the abdominal CT images into eight abdominal organ classes as shown in Figure 5. Since the task of skin lesion segmentation is quite different from multiorgan segmentation, as in skin lesion segmentation, the task is to predict the binary mask rather than the multiclass pixel-wise segmentation.Therefore, we modified the basic architecture of the TransUNet, which is a combination of the Transformer and U-Net architectures in order to address the challenges in skin lesion image segmentation.For an image, instance i ∈ R H×W×C , with a spatial resolution of H × W and a channel count of C is determined.The goal is to forecast the label map with size H × W that corresponds to the related ground truth label map.The most common and conventional method is to train a CNN (such as U-Net) to encode images into high-level feature representations, which are subsequently decoded back to full spatial resolution.Unlike other approaches, this method uses Transformers to incorporate self-attention mechanisms into the encoder architecture.
The TransUNet [41] architecture has been used for the medical image segmentation on the Synapse multi-organ segmentation dataset for segmenting the abdominal CT images into eight abdominal organ classes as shown in Figure 5. Since the task of skin lesion segmentation is quite different from multiorgan segmentation, as in skin lesion segmentation, the task is to predict the binary mask rather than the multiclass pixel-wise segmentation.Therefore, we modified the basic architecture of the TransUNet, which is a combination of the Transformer and U-Net architectures in order to address the challenges in skin lesion image segmentation.For an image, instance  ∈ ℝ  ×  ×  , with a spatial resolution of  ×  and a channel count of C is determined.The goal is to forecast the label map with size  ×  that corresponds to the related ground truth label map.The most common and conventional method is to train a CNN (such as U-Net) to encode images into high-level feature representations, which are subsequently decoded back to full spatial resolution.Unlike other approaches, this method uses Transformers to incorporate self-attention mechanisms into the encoder architecture.The input to the model is (512, 512, 3) images.The model has four down and upsampling levels-two convolutional layers per downsampling and upsampling layers, respectively.We used 12 transformers blocks and the number of attention heads was set to 12.The other changes that have been made to the basic architecture of the TransUNet are the number of nodes in the multi-layer perceptron (MLP) and the embedding dimensions.In the proposed model, each transformer has 1536 multi-layer perceptron (MLP) nodes that embed the images into 384 dimensions.The reason for modifying these was to make the network focus on more precise localization and produce accurate binary masks as compared to the multiclass segmentation in the base TransUNet.For the transformer, an MLPs Gaussian Error Linear Unit (GELU) was used as an activation function and upsampling through bilinear interpolation was carried out.Sigmoid was used as an output function.

Swin-UNet
Swin-UNet [42] is transformer-based U-shaped architecture with an encoder, bottleneck, decoder and skip connections as shown in Figure 6.The encoder, bottleneck and decoder are all built using the Swin-Transformer block.The inputs are divided into nonoverlapping image patches.Each patch is treated as a token and fed into the Transformerbased encoder to learn deep feature representations.The extracted context features are then up-sampled by the decoder with a patch expanding layer and fused with the multiscale features from the encoder via skip connections to restore the spatial resolution of the feature maps and further perform segmentation prediction.Three downsamplings and upsampling with one bottom level were used.Two Swin transformers per downsampling level and two Swin transformers per upsampling level were used.Two by two patches were extracted from the input and then embedded into 64 dimensions.In addition to this, The input to the model is (512, 512, 3) images.The model has four down and upsampling levels-two convolutional layers per downsampling and upsampling layers, respectively.We used 12 transformers blocks and the number of attention heads was set to 12.The other changes that have been made to the basic architecture of the TransUNet are the number of nodes in the multi-layer perceptron (MLP) and the embedding dimensions.In the proposed model, each transformer has 1536 multi-layer perceptron (MLP) nodes that embed the images into 384 dimensions.The reason for modifying these was to make the network focus on more precise localization and produce accurate binary masks as compared to the multiclass segmentation in the base TransUNet.For the transformer, an MLPs Gaussian Error Linear Unit (GELU) was used as an activation function and upsampling through bilinear interpolation was carried out.Sigmoid was used as an output function.

Swin-UNet
Swin-UNet [42] is transformer-based U-shaped architecture with an encoder, bottleneck, decoder and skip connections as shown in Figure 6.The encoder, bottleneck and decoder are all built using the Swin-Transformer block.The inputs are divided into nonoverlapping image patches.Each patch is treated as a token and fed into the Transformerbased encoder to learn deep feature representations.The extracted context features are then up-sampled by the decoder with a patch expanding layer and fused with the multiscale features from the encoder via skip connections to restore the spatial resolution of the feature maps and further perform segmentation prediction.Three downsamplings and upsampling with one bottom level were used.Two Swin transformers per downsampling level and two Swin transformers per upsampling level were used.Two by two patches were extracted from the input and then embedded into 64 dimensions.In addition to this, 512 nodes per Swin transformer, shift attention windows and Softmax output function were used.

Experimental Results
The implementation details of the comparative methods are presented in this section.After that, the dataset and evaluation metrics are described.Finally, the performance of the different methods is assessed and compared to state-of-the-art approaches.

Implementation Details
All the models are trained using ISIC 2018 [18] training data, and their performance is assessed using testing data.A dermoscopic image (input), its matching ground-truth segmentation mask and an edge (contour) image (outputs) are required for training the proposed model.By using a contour detection technique, the ground truth of the edge picture may be automatically extracted from the ground truth of the segmentation mask.We trained all the models with stochastic gradient descent (SGD) with a learning rate of 0.001, momentum = 0.9 and decay = 0.0005.The batch size is set to 1 and the models were trained for 100 to 200 epochs with early stopping criteria.The models were trained with binary cross-entropy loss.
The methods were implemented in Keras with the TensorFlow backend.All the experiments were carried out on Nvidia Tesla V100 GPUS with 32 GB memory.

Dataset
ISIC 2018 Dataset: ISIC dataset [13] contains of three parts-skin lesion segmentation, attribute detection and classification of the skin lesion into type.For segmentation, the dataset contains 2594 and 1000 number of images for training and testing, respectively.The training set is provided along with ground truth images.The images were resized to a uniform dimension of 512 by 512 as a pre-processing step in order to have a uniform evaluation of all the methods.The images present in dataset have illumination variations as well as different artefacts such as hair, colour-marks, rulers and glue.Figure 7 shows some of the image samples of dataset ISIC 2018.

Experimental Results
The implementation details of the comparative methods are presented in this section.After that, the dataset and evaluation metrics are described.Finally, the performance of the different methods is assessed and compared to state-of-the-art approaches.

Implementation Details
All the models are trained using ISIC 2018 [18] training data, and their performance is assessed using testing data.A dermoscopic image (input), its matching ground-truth segmentation mask and an edge (contour) image (outputs) are required for training the proposed model.By using a contour detection technique, the ground truth of the edge picture may be automatically extracted from the ground truth of the segmentation mask.We trained all the models with stochastic gradient descent (SGD) with a learning rate of 0.001, momentum = 0.9 and decay = 0.0005.The batch size is set to 1 and the models were trained for 100 to 200 epochs with early stopping criteria.The models were trained with binary cross-entropy loss.
The methods were implemented in Keras with the TensorFlow backend.All the experiments were carried out on Nvidia Tesla V100 GPUS with 32 GB memory.

Dataset
ISIC 2018 Dataset: ISIC dataset [13] contains of three parts-skin lesion segmentation, attribute detection and classification of the skin lesion into type.For segmentation, the dataset contains 2594 and 1000 number of images for training and testing, respectively.The training set is provided along with ground truth images.The images were resized to a uniform dimension of 512 by 512 as a pre-processing step in order to have a uniform evaluation of all the methods.The images present in dataset have illumination variations as well as different artefacts such as hair, colour-marks, rulers and glue.Figure 7 shows some of the image samples of dataset ISIC 2018.

Evaluation Metrics
For the evaluation, different matrices have been used in this research work, such as Intersection over Union (IoU), Precision (P), Recall (R) and Accuracy (ACC).These requirements are as defined as follows: wherein TP (True Positive) denotes the number of correctly recognized foreground pixels (interest region), the number of background pixels accurately recognized as a background is referred to as TN (True Negative) (skin region), the number of background pixels labelled as the foreground is known as FP (False Positive) and the number of foreground pixels incorrectly labelled as background is known as FN (False Negative).
The ratio of overlapping and union areas between the expected segmentation mask and the ground truth mask is represented by IoU.The IoU metric measures how similar the prediction mask is to the ground truth mask.The percentage of correctly identified pixels in the total number of pixels is represented by ACC.The fraction of successfully segmented foreground pixels versus the total number of foreground pixels is represented by recall (R).

Qualitative Results
A qualitative examination of the different method's performance is undertaken in this section.The segmentation and edge prediction branches' final results are shown in Figures 8-12.In most cases, as illustrated in Figures 8, 9 and 11, the attention-based methods combined with the properties of U-Net can appropriately segregate the pigment regions as compared to the U-Net based methods when used separately.The first row shows the output predictions given an input image, which is a simple scenario because the colour

Evaluation Metrics
For the evaluation, different matrices have been used in this research work, such as Intersection over Union (IoU), Precision (P), Recall (R) and Accuracy (ACC).These requirements are as defined as follows: TP + TN TP + TN + FP + FN wherein TP (True Positive) denotes the number of correctly recognized foreground pixels (interest region), the number of background pixels accurately recognized as a background is referred to as TN (True Negative) (skin region), the number of background pixels labelled as the foreground is known as FP (False Positive) and the number of foreground pixels incorrectly labelled as background is known as FN (False Negative).
The ratio of overlapping and union areas between the expected segmentation mask and the ground truth mask is represented by IoU.The IoU metric measures how similar the prediction mask is to the ground truth mask.The percentage of correctly identified pixels in the total number of pixels is represented by ACC.The fraction of successfully segmented foreground pixels versus the total number of foreground pixels is represented by recall (R).

Qualitative Results
A qualitative examination of the different method's performance is undertaken in this section.The segmentation and edge prediction branches' final results are shown in Figures 8-12.In most cases, as illustrated in Figures 8, 9 and 11, the attention-based methods combined with the properties of U-Net can appropriately segregate the pigment regions as compared to the U-Net based methods when used separately.The first row shows the output predictions given an input image, which is a simple scenario because the colour contrast between the foreground and background regions of the input image is high.As a result, the attention-based approaches can accurately detect the pigment region.The input images in the bottom rows have hairs and low contrast and deformed shapes, which can distort the textures of the skin lesions and make learning difficult.Despite this, the attention-based methods (TransUNet and Attention U-Net) succeed in segmenting the pigment regions.To put it another way, the attention-based methods show robust performance to the distortions and exhibit good overall performance.It can be clearly seen in the output results from U-Net and Attention U-Net that these two methods generate larger segmentation maps with unclear boundaries as compared to the TransUNet method, whereas the Swin-UNet completely fails to segment the lesions, thus depicting the efficacy of the hybrid method for skin lesion segmentation.To quantitatively evaluate the prediction results, we have used Intersection over Union (IoU), Precision (P), Recall (R) and Accuracy (ACC) metrics.The training and testing results are shown in Tables 1 and 2, respectively.The accuracy rate of the TransUNet model reached 50-55% within the 10-12 iterations and the accuracy rates constantly increased up to 120 iterations.From 120 iterations, the accuracy remains almost consistent and reaches 92%.It is due to the fact that the hybrid CNN-transformer, as the hybrid CNN-transformer encoder with long range dependencies, retains the high-level features and passes directly to each stage of decoder and thus results in better output predictions and precise localizations, whereas when compared with other models, it can be seen that there is not much consistency found in terms of accuracy while training the other models (U-Net and Attention U-Net).It can be seen that the accuracy reached 89% and 87% in Attention U-Net and U-Net models, respectively.The results in Tables 1 and 2 summarize the performance of the different benchmarking architectures in terms of (IoU), Precision (P), Recall (R) and Accuracy (ACC) metrics and validates the efficacy of the TransUNet as compared to the other benchmarking architectures.In addition to the evaluation metrics, we also compared the performance of these models on the basis of training time, inference time and dataset size as shown in Table 3.It can be seen in the results that the hybrid architectures take more time to train than the U-Net and V-Net architectures and take more time for the inference as well.It can be clearly seen in the output results from U-Net and Attention U-Net that thes two methods generate larger segmentation maps with unclear boundaries as compared to the TransUNet method, whereas the Swin-UNet completely fails to segment the lesions thus depicting the efficacy of the hybrid method for skin lesion segmentation.
To quantitatively evaluate the prediction results, we have used Intersection over Un  During the training, it can be depicted from the Table 1 that the TransUNet model outperforms U-Net, V-Net, Attention U-Net and SwinUNet in terms of IoU, Precision and Recall.The accuracy of TransUNet is significantly better than U-Net and Attention U-Net.During testing, again TransUNet performs better than U-Net and Attention U-Net.The accuracy rate increased to around 12% when compared to other models.

Conclusions
Skin cancer is considered as one of the deadliest cancers.It is important to detect skin lesions at a very early stage to control its spread as well as to cure it.In this paper, the different methods were evaluated on an ISIC 2018 skin lesion image dataset.Experimental results on the ISIC 2018 data have shown that the hybrid architecture of TransUNet, combining the properties of transformers and U-Net, outperforms other benchmarking methods both qualitatively and quantitatively in terms of various evaluation metrics.Furthermore, it has been shown that the TransUNet method is robust for various deformations, low contrast and noise.As a future avenue, the results of skin lesion segmentation can be combined with skin lesion classification for further diagnosis of lesion areas of skin in terms of identifying the possible progression of the lesion.Malignant skin lesions are considered fatal, and in order to identify such lesions at the right time, early diagnosis is highly recommended.
For that reason, incorporating the clinical knowledge into the lesion image identification system to automatically analyse and diagnose the current state of skin lesions with a high accuracy rate is crucial.

Figure 1 .
Figure 1.Sample of Skin Lesion Images.

Figure 1 .
Figure 1.Sample of Skin Lesion Images.
Appl.Sci.2022,12,  x FOR PEER REVIEW 6 of 18 levels were used as in U-Net implementation, i.e., four with one bottom level.The number of stacked convolutional layers in the residual path increases from one to three downsampling levels (symmetrically decreasing with upsampling levels).Unlike U-Net, the Parametric ReLU (PReLU) activation function was used with batch normalization and Softmax as an output activation function.PReLU is a type of LeakyReLU that instead of having a predetermined slope, makes it a parameter for the neural network to figure out itself, which helps in the better adaption to other parameters such as weights and biases.Downsampling was carried through stridden convolutional layers and upsampling through transpose convolutional layers.
512 nodes per Swin transformer, shift attention windows and Softmax output function were used.

Figure 8 .
Figure 8.The TransUNet method output visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.

Figure 8 .Figure 9 .
Figure 8.The TransUNet method output visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.

Figure 9 .Figure 10 .
Figure 9.The U-Net output visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.Figure 9.The U-Net visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.

Figure 10 .
Figure 10.The V-Net output visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.

Figure 11 .
Figure 11.The Attention U-Net output visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.

Figure 11 .
Figure 11.The Attention U-Net output visualizations: (a) input test image; (b) the corresponding ground truth segmentation mask and (c) the output probability map of the segmentation prediction.

Figure 12 .
Figure 12.The Swin-UNet output visualizations (a) input test image; (b) the corresponding ground truth segmentation mask; (c) the output probability map of the segmentation prediction.

Figure 12 .
Figure 12.The Swin-UNet output visualizations (a) input test image; (b) the corresponding ground truth segmentation mask; (c) the output probability map of the segmentation prediction.

Table 1 .
Comparative results of the different methods on train set.

Table 2 .
Comparative results of the different methods on the test set.

Table 3 .
Comparative results in terms of training time, inference time and dataset limit.