SkinSwinViT: A Lightweight Transformer-Based Method for Multiclass Skin Lesion Classification with Enhanced Generalization Capabilities

Abstract: In recent decades, skin cancer has emerged as a significant global health concern, demanding timely detection and effective therapeutic interventions. Automated image classification via computational algorithms holds substantial promise in significantly improving the efficacy of clinical diagnoses. This study is committed to mitigating the challenge of diagnostic accuracy in the classification of multiclass skin lesions. This endeavor is inherently formidable owing to the resemblances among various lesions and the constraints associated with extracting precise global and local image features within diverse dimensional spaces using conventional convolutional neural network methodologies. Consequently, this study introduces the SkinSwinViT methodology for skin lesion classification, a pioneering model grounded in the Swin Transformer framework featuring a global attention mechanism. Leveraging the inherent cross-window attention mechanism within the Swin Transformer architecture, the model adeptly captures local features and interdependencies within skin lesion images while additionally incorporating a global self-attention mechanism to discern overarching features and contextual information effectively. The evaluation of the model's performance involved the ISIC2018 challenge dataset. Furthermore, data augmentation techniques augmented training dataset size and enhanced model performance. Experimental results highlight the superiority of the SkinSwinViT method, achieving notable metrics of accuracy, recall, precision, specificity, and F1 score at 97.88%, 97.55%, 97.83%, 99.36%, and 97.79%, respectively.


Introduction
Skin diseases represent a pervasive health concern across all age cohorts. Among the primary types of skin cancer, namely melanoma and non-melanoma, melanoma exhibits a higher mortality rate and is considered the most malignant form [1]. Timely detection of melanoma substantially elevates the 5-year survival rate to 95%, in stark contrast to a dismal 20% without intervention [2]. As incidence and mortality rates ascend, the imperative of early detection becomes increasingly conspicuous. Presently, machine learning and deep learning methodologies enable automated diagnosis of skin lesions via high-resolution dermoscopic images. To facilitate global research endeavors in skin cancer detection and analysis, the International Skin Imaging Collaboration (ISIC) orchestrates the annual ISIC Grand Challenge [3].
With continual advancements in medical technology, there is burgeoning recognition of artificial intelligence and machine learning's potential in skin lesion identification and diagnosis [4]. Particularly in the realm of skin cancer detection, the intricate and diverse nature of skin ailments poses challenges to traditional diagnostic modalities grounded in subjective clinical assessment, heightening the propensity for misdiagnosis. Hence, the development of automated and precise dermatological lesion identification systems assumes paramount importance [5]. Computer-aided diagnosis (CAD) systems have made substantial strides in identifying and assessing various malignancies [6], spanning lung cancer [7], breast cancer [8], thyroid cancer [9], brain cancer [10], and liver cancer [11], among others. In the domain of skin cancer detection, CAD system implementation becomes indispensable, enhancing efficiency, curtailing time and costs, and compensating for the scarcity of dermatologists.
Recent years have witnessed rapid advancements in dermatological lesion recognition, attributed to significant strides in CAD systems, owing to the rapid evolution of machine learning techniques [12]. Traditional machine learning approaches, such as support vector machines (SVM) and random forest models, offer interpretability and small-sample learning advantages [13]. These methods rely on features designed by domain experts and furnish explanatory lesion attributes, which facilitate medical practitioners' comprehension of the model's decision-making process. Conversely, classic deep learning methods, exemplified by the ResNet, MobileNet, and VGGNet models [14], excel in discerning deep abstract features and in exploiting transfer learning for skin disease lesion identification. They adeptly capture intricate characteristics such as lesion texture, shape, and structure, leveraging pre-trained models on extensive image datasets to enhance model efficacy. Deep learning methodologies predicated on transformers, exemplified by the Vision Transformer (ViT) and Swin Transformer (SwinViT) [15], offer automatic feature learning, context modeling, scalability, and resilience in skin disease lesion recognition. These methodologies effectively encompass contextual information within images, spanning local and global contexts via a self-attention mechanism, affording a deeper understanding of lesion relationships.
Nonetheless, a notable challenge stems from the substantial inter-class resemblance observed in numerous skin lesion images, rendering the identification of unique and discernible custom features arduous. Convolutional neural networks (CNNs) may conflate low-level features in such scenarios, resulting in crucial information loss. Moreover, CNNs exhibit constraints in capturing global contextual information, potentially omitting vital details. To surmount these challenges, this study proposes a deep learning framework employing a local-global hierarchical attention mechanism grounded in transformers for multi-category skin lesion classification diagnosis. The study encompasses seven skin lesion types: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AKIEC), benign keratosis (BKL), dermatofibroma (DF), and vasculopathy (VASC), as delineated in Figure 1. The principal objective of this framework is to augment the accuracy, precision, and resilience of multiclass skin lesion classification. Our main contributions to this endeavor are as follows: (1) Data augmentation techniques are employed to address training data imbalances and generate a more representative set of skin lesion image samples. This methodology enables the model to glean robust representations of lesion features. (2) A novel local-global hierarchical attention mechanism is introduced to capture pivotal features across different abstraction levels. By amalgamating local and global features across multiple levels, the model harnesses both intricate and contextual information to amplify feature representation and diagnostic precision. This hierarchical attention mechanism furnishes adaptability in handling features across distinct abstraction levels, enabling the model to tackle complex diagnostic tasks adeptly.
(3) This study proposes an encoder-decoder framework based on the Transformer model, which leverages its scalability. The model's representation capacity is bolstered by modulating the encoder and decoder layer counts alongside the number of attention heads. Pre-training and fine-tuning strategies are additionally employed to refine the performance of skin lesion identification tasks. (4) Comparative experiments against state-of-the-art methodologies corroborate the superior performance of our proposed method in multiclass skin lesion classification.
The experimental results demonstrate substantial enhancements in accuracy, precision, specificity, and F1 index, confirming the effectiveness of our method.
The article is structured as follows: following the abstract and introduction, Section 2 provides a comprehensive literature review, offering a systematic overview of the current progress and status of related research. Subsequently, Section 3 elaborates on the proposed methodology, detailing the data sources and adopted data augmentation techniques; state-of-the-art models and methods are also introduced and presented. In Section 4, this study presents the experimental results and conducts an extensive comparative analysis based on various indicators, including precision, accuracy, and F1 index. The experimental results are thoroughly interpreted to provide a comprehensive evaluation of the proposed method's performance. Finally, the conclusion summarizes the effectiveness and innovativeness of the proposed method.

Related Work
In the realm of dermatological image diagnosis, significant advancements have been achieved by scholars and practitioners from both domestic and international spheres. Notably, conventional machine learning methodologies have exhibited promising outcomes in delineating skin lesions through contour analysis. For instance, Chatterjee et al. [16] conducted a study based on the ABCDE criterion [17], encompassing the scrutiny of shape, edge regularity, texture, and color attributes of skin lesions. Leveraging image processing tools, they extracted quantitative features and employed SVM for classification, thereby evaluating the efficacy of the proposed framework. Dhivyaa et al. [18] deployed the region-growing technique for lesion segmentation and feature extraction, followed by decision-tree-driven classification, culminating in the selection of the random forest algorithm as the final model. Relative to SVM, random forest demonstrates diminished computational complexity, pliability in design, and proficiency in handling diverse categories of dermatological conditions. Pham et al. [19], in their comparative investigation of melanoma classification, explored diverse data preprocessing methods, feature extraction strategies, and classification algorithms. Their findings underscored that employing linear normalization, HSV feature extraction, and a balanced random forest classifier yielded optimal results on the HAM10000 dataset, achieving an accuracy rate of 74.75%. Despite the strides made by assorted machine learning methodologies in skin classification, wherein some have surpassed the expertise of human practitioners in select cases [20], conventional machine learning algorithms, rooted in statistical principles, exhibit notable performance fluctuations across varied scenarios. Furthermore, these classical algorithms often necessitate intricate feature engineering and lack adaptability.
In recent times, CNNs have garnered considerable traction in the realm of computer vision [21], eclipsing traditional machine-learning approaches in terms of classification and detection prowess. Shen et al. [22] proposed a robust data augmentation technique adaptable to any deep-learning paradigm for skin lesion classification. Their EfficientNet-b0 methodology outperformed its counterparts on the ISIC2018 dataset, boasting an accuracy rate of 85.3%. Huang et al. [23] conducted a comparative study of multiple models, affirming that DenseNet [24] exhibited superior performance in benign/malignant binary classification tasks, while EfficientNet [25] excelled in multi-classification endeavors. Liu et al. [26] introduced the multi-level relationship capture network (MRCN), employing a region correlation learning module to model interrelations among distinct significant regions within the central lesion zone. Moreover, a cross-image learning module was employed to model profound semantic correlations across multiple images. Rigorous experiments across three arduous datasets validated the exceptional performance of MRCN. Tahir et al. [27] introduced DSCCNet, a deep learning architecture tailored for multi-classification diagnosis of skin cancer using dermoscopic imagery. Outperforming the baseline model, DSCCNet furnishes robust diagnostic assistance to dermatologists and medical practitioners. Nonetheless, CNNs confront certain constraints in skin lesion recognition, such as discerning analogous features, information loss, and limits in capturing holistic, contextual information. While refinements in deep learning models and alternative methodologies exist to bolster recognition efficacy, sustained inquiry and enhancement are imperative to fortify accuracy and resilience. Consequently, sustained endeavors and exploration remain requisite in the realm of skin lesion identification.
Xie et al. [28] propounded Swin-SimAM, a melanoma detection approach amalgamating SwinViT for feature extraction with the parameter-free attention module SimAM. Demonstrating commendable performance, their method attained an impressive AUC of 90% in discriminating melanoma from non-melanoma entities, encompassing nevi and seborrheic keratoses. Eskandari et al. [29] devised a skin lesion segmentation framework predicated on a U-shaped hierarchical Transformer and an inter-scale context fusion (ISCF) methodology. This approach amalgamates each stage adaptively, harnessing attention correlation within the encoder at each juncture. Empirical validations underscored the robust applicability and efficacy of the ISCF model within each stage's context. Khan et al. [30] unveiled the SKINVIT model, founded on Outlook and Transformer architectures. This model adeptly captures both fine-grained and global features to bolster the accuracy of melanoma and non-melanoma classification. Validation across three datasets yielded promising results. However, notwithstanding their high accuracy in classification and detection endeavors, these models necessitate substantial computational resources and time, curtailing their real-time feasibility and scalability.
To address these challenges, this study introduces SkinSwinViT as a lightweight model. Building upon global attention and SwinViT, SkinSwinViT incorporates a local-global hierarchical attention mechanism to capture pivotal features across diverse abstraction tiers. By amalgamating local and global features, the model adeptly harnesses both granular details and contextual comprehension, thereby enriching feature representation and diagnostic accuracy and ultimately enhancing overall performance. Furthermore, SkinSwinViT is engineered to be lightweight, demanding fewer computational resources and less time while offering enhanced real-time performance and scalability. The primary objective of this study is to proffer a high-precision skin lesion classification model proficient in automatically and accurately identifying seven types of skin maladies. This model aspires to mitigate skin cancer mortality, alleviate the burden on dermatologists, and narrow the accuracy chasm in early-stage lesion diagnosis. Through refined feature representation capabilities and heightened accuracy, SkinSwinViT furnishes dependable support for the identification and treatment of incipient skin lesions, thereby endowing dermatologists with more reliable diagnostic assistance.

Materials and Methods
This section elucidates the SkinSwinViT methodology and presents the dermoscopic image dataset utilized for the classification of seven distinct lesion types.
In this experiment, the ISIC2018 dataset undergoes partitioning into distinct training and testing subsets, adhering to an 80% allocation for training and a 20% allocation for testing, as shown in Table 1.
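As an illustration, the 80/20 partition can be sketched as a random index split. The fixed seed below is an illustrative assumption, not a detail reported in the paper; the ISIC2018 Task 3 training archive contains 10,015 dermoscopic images.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and split them into train/test subsets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)        # random, non-repeating order
    n_test = int(n_samples * test_fraction)     # 20% held out for testing
    return indices[n_test:], indices[:n_test]   # (train, test)

train_idx, test_idx = train_test_split_indices(10015)
```

In practice a stratified split (preserving per-class proportions) would be preferable given the class imbalance discussed below.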

Data Augmentation
Considering that the imbalance among different categories of samples has a significant impact on model performance, this study performs data augmentation on the existing dataset. Several different transformations, such as random rotation within 180°, horizontal or vertical translation, random scaling, and vertical or horizontal flip operations, are applied to the training set samples. An example of a data-augmented image is shown in Figure 2. Data augmentation was implemented on the remaining six categories of samples, excluding melanocytic nevus (NV) samples, thereby mitigating the risk of overfitting attributed to inadequate sample sizes and enhancing the model's generalization capacity. The quantity of data pertaining to melanocytic nevus remained consistent, comprising 5364 images. The resulting training dataset size subsequent to data augmentation is presented in Table 2. Nevertheless, the images within the dataset exhibit variability in size, necessitating their conversion to a uniform size to align with the input requirements of the deep learning model. Accordingly, the images are uniformly scaled to dimensions of 244 × 244 and subsequently normalized.
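As a minimal illustration of the flip and rotation operations, the sketch below applies random flips and 90° rotations to an image array with NumPy; the full pipeline described above additionally uses arbitrary-angle rotation, translation, and scaling, for which an image library such as torchvision would typically be used.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and rotate an H x W x C image array (simplified sketch)."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]   # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :, :]   # vertical flip
    k = rng.integers(0, 4)          # rotate by a random multiple of 90 degrees
    return np.rot90(image, k)

rng = np.random.default_rng(0)
img = np.zeros((244, 244, 3), dtype=np.uint8)  # matches the 244 x 244 input size
out = augment(img, rng)
```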

Proposed SkinSwinViT Architecture
The SkinSwinViT model, as proposed in this study, is designed to identify skin disorders classified into seven distinct categories. Figure 3 presents the architectural design of the SkinSwinViT model. Drawing inspiration from SwinViT [32], the model primarily integrates components including data preprocessing, a SwinTransformer Block, a Global Attention Block, and a Classifier Block. To address the imbalance within the skin lesion dataset during data preprocessing, augmentation techniques are employed to enrich the segmented training set. Subsequently, the parameters of the pre-trained model serve as initial values. Following this, the proposed methodology utilizes a two-layer encoder, comprising the SwinTransformer Block and Global Attention Block, to extract nuanced features across various dimensional spaces. Finally, the proposed method employs a Classifier Block to ascertain the categorical outcome of skin lesion images. The detailed framework is expounded below.
1. SwinTransformer Block: Derived from the SwinViT model, the SwinTransformer Block initially conducts patching to segment skin lesion images into smaller patches, facilitating localized processing and computational efficiency. Embedding preserves spatial relationships among patches, facilitating comprehensive feature capture. Subsequently, Windowed Multi-Head Self-Attention (W-MSA) and Shifted Windowed Multi-Head Self-Attention (SW-MSA) are employed to capture local features and inter-patch relationships. These attention mechanisms enable contextual comprehension of local information, enhancing feature extraction efficacy. By integrating W-MSA and SW-MSA, contextual dependencies between neighboring patches are effectively captured, bolstering feature extraction. Moreover, employing multiple Swin Blocks within the SwinTransformer Block facilitates the learning of complex, abstract-level features. Each Swin Block comprises layers of W-MSA and SW-MSA. Additionally, patch merging reduces image resolution and integrates multi-scale information, enhancing overall comprehension. Specifically, the attention within each window is computed as

Attention(Q, K, V) = Softmax(QK^T / √d_k) V

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O + b

Here, Attention(·, ·, ·) represents the self-attention mechanism; Q, K, and V are the query, key, and value matrices, respectively; d_k is the dimensionality of the key vectors, and √d_k is used to scale the dot product to prevent gradient vanishing or exploding. The Softmax function converts the dot-product results into a probability distribution. Concat(·) signifies the concatenation of the multi-head attention outputs, W^O is a weight matrix, and b is a bias term.
2. Global Attention Block: The output of this block can be written as z = LN(x ⊕ G(x)), where G(·) denotes the transformation function of the Global Attention Block, ⊕ denotes the residual connection, and LN stands for Layer Normalization, which is used to stabilize the training process.
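As a numerical illustration, the scaled dot-product attention used inside W-MSA and SW-MSA can be sketched as follows (a single head in NumPy, omitting the learned projections and window partitioning):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token similarities, scaled
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights

Q = np.random.default_rng(1).normal(size=(4, 8))   # 4 tokens, dimension 8
out, w = scaled_dot_product_attention(Q, Q, Q)     # self-attention
```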
3. Classifier Block: Incorporated within the model decoder, the Classifier Block commences with layer normalization to stabilize the data distribution and expedite training. Following normalization, the feature vector undergoes dimensionality reduction via adaptive average pooling, compressing the feature sequence from its original length to a fixed length of 1. This compression retains essential features, facilitating efficient data representation. The data then progresses to a linear transformation layer to learn input-output linear relationships. Ultimately, the linear layer output is processed through a Softmax function, yielding a probability distribution for final classification. This distribution reflects the model's confidence or likelihood for each class, enabling classification based on the highest-probability class.
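A minimal NumPy sketch of this Classifier Block pipeline (layer normalization, average pooling over the token sequence, a linear layer, and Softmax). The sequence length and feature dimension below are illustrative assumptions, not the model's actual configuration; only the 7-class output matches the task.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def classifier_block(tokens, W, b):
    """tokens: (seq_len, dim) -> class probabilities of shape (n_classes,)."""
    x = layer_norm(tokens)
    pooled = x.mean(axis=0)            # adaptive average pooling to length 1
    logits = pooled @ W + b            # linear transformation
    e = np.exp(logits - logits.max())
    return e / e.sum()                 # Softmax probability distribution

rng = np.random.default_rng(0)
probs = classifier_block(rng.normal(size=(49, 96)),   # hypothetical 49 tokens x 96 dims
                         rng.normal(size=(96, 7)), np.zeros(7))
pred_class = int(np.argmax(probs))    # highest-probability class
```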

4. Loss Function: The model's classification loss function utilizes cross-entropy loss, quantifying disparities between model-predicted probability distributions and true label distributions. The loss function is expressed as follows:

L = −(1/M) Σ_{i=1}^{M} Σ_{j=1}^{N} y_{ij} log(ŷ_{ij})

Here, L denotes the loss value, y_{ij} represents the category-j value in the true label of sample i (utilizing one-hot encoding), ŷ_{ij} symbolizes the model-predicted probability for category j of sample i, N represents the category count, and M represents the sample count. The summation symbols Σ accumulate over all samples and categories.
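The cross-entropy loss can be computed directly from one-hot labels and predicted probabilities; a small NumPy sketch with made-up values:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -(1/M) * sum_i sum_j y_ij * log(p_ij), for one-hot y_true."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Two samples, three classes (illustrative values).
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = cross_entropy(y_true, y_pred)   # averages -log(0.7) and -log(0.8)
```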

Transfer Learning
In conventional machine learning paradigms, the assumption often prevails that the feature spaces of both training and test datasets remain identical. However, starting the training and testing processes from scratch can be highly time-consuming and resource-intensive [33]. To alleviate this predicament, transfer learning (TL) has emerged as a viable strategy, aiming to achieve precise classification outcomes even with constrained training data. Transfer learning involves leveraging knowledge from one model to improve the performance of another model in a target domain. Its primary objective is to enhance efficiency in the target domain. This approach proves particularly effective when dealing with relatively small target domain datasets, as it can leverage datasets from related source domains [34]. By utilizing transfer learning, this study can address such situations more robustly.
Figure 4 illustrates the TL workflow employed in this study, where original deep models trained on ImageNet are fine-tuned on target datasets like ISIC2018.Pre-trained models (AlexNet, VGG-11, GoogleNet, ResNet50, ViT, SwinViT, and SkinSwinViT) leverage knowledge from ImageNet, and their weights are optimized on comprehensive datasets to enhance feature extraction.Additionally, the prediction layer (fully connected layer) is modified in this paper to accommodate the 7 target categories for training.
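The freeze-and-fine-tune scheme can be illustrated independently of any framework: pre-trained weights stay fixed while only the replaced 7-class prediction layer receives gradient updates. The parameter names, shapes, and the toy SGD step below are hypothetical, not details of the study's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pre-trained parameters; only the new head is trainable.
params = {
    "backbone.weight": {"value": rng.normal(size=(96, 96)), "trainable": False},
    "head.weight":     {"value": rng.normal(size=(96, 7)),  "trainable": True},
}

def sgd_step(params, grads, lr=1e-5):
    """Update only unfrozen parameters, mimicking fine-tuning of the FC head."""
    for name, p in params.items():
        if p["trainable"]:
            p["value"] -= lr * grads[name]

grads = {name: np.ones_like(p["value"]) for name, p in params.items()}
before_backbone = params["backbone.weight"]["value"].copy()
before_head = params["head.weight"]["value"].copy()
sgd_step(params, grads, lr=0.1)   # frozen backbone is untouched
```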

Performance Metrics
To evaluate the performance of individual models such as the proposed SkinSwinViT, this study considers the following performance metrics.
To measure the performance indicators of multiclass classification: after data augmentation, the number of images in each category is the same and no sample-count imbalance remains, so this paper adopts the Macro-average method. In multiclass classification problems, Macro-averaging calculates each indicator per category and then averages that indicator over all categories, which is equivalent to assigning all categories the same weight. The specific formula is as follows:

Accuracy_macro = (1/n) Σ_{i=1}^{n} Accuracy_i

Here, Accuracy_i represents the prediction accuracy for one of the seven categories of skin diseases, n represents the number of classification categories (in this article, n = 7), and the summation symbol Σ accumulates over all categories. The same averaging applies to the other indicators. The indicators referred to in this study, by default, refer to these macro-averaged indicators.
TP (True Positive): refers to the number of instances correctly predicted to be a certain type of skin disease. FN (False Negative): refers to the number of instances of that type incorrectly predicted not to be that type. FP (False Positive): refers to the number of instances of other types incorrectly predicted to be that type of skin disease. TN (True Negative): refers to the number of instances of other types correctly predicted not to be that type. Recall quantifies the frequency with which a classifier accurately predicts a positive outcome among all samples that should have been classified as positive. High precision indicates the test's ability to precisely identify positive samples, thereby mitigating the occurrence of false positives. Elevated specificity signifies the accurate exclusion of negative samples and a diminished risk of misdiagnosis. The F1 score, as the harmonic mean of precision and recall, offers a comprehensive assessment of a classifier's accuracy in predicting positive instances. The per-category metrics are computed as

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
Specificity = TN / (TN + FP)
F1 = 2 × Precision × Recall / (Precision + Recall)

This study employs the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve) value for separate analyses of each category. The ROC curve plots the False Positive Rate (FPR) on the x-axis and sensitivity on the y-axis. Sensitivity measures the correct identification of positive examples, while the false positive rate represents the incorrect identification of negative examples. The formula for calculating the false positive rate is as follows:

FPR = FP / (FP + TN)

The AUC is a metric that measures the overall performance of a classification model by calculating the area under the ROC curve. A higher AUC indicates better classification effectiveness for the model. The AUC value typically falls between 0.5 and 1.0, where 0.5 represents a random classifier, and 1.0 represents a perfect classifier.
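Given a multiclass confusion matrix, the per-class TP, FP, FN, and TN counts and the macro-averaged metrics described above can be computed as in the following sketch (the 2 × 2 matrix is illustrative; the study uses 7 classes):

```python
import numpy as np

def macro_metrics(cm):
    """cm[i, j] = count of samples with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp            # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp            # class samples predicted elsewhere
    tn = cm.sum() - tp - fp - fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Macro-average: unweighted mean over classes.
    return {"recall": recall.mean(), "precision": precision.mean(),
            "specificity": specificity.mean(), "f1": f1.mean()}

cm = np.array([[8, 2],
               [1, 9]])
m = macro_metrics(cm)
```

Note the division-by-zero guard needed in practice when a class is never predicted is omitted here for brevity.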

Experimental Setup
The proposed SkinSwinViT framework is instantiated within the Anaconda environment, utilizing Python 3.9, with essential libraries such as PyTorch, Scikit-Learn, Matplotlib, and NumPy installed on the Linux operating system. The system configuration encompasses an Intel Platinum-8350C processor operating at 2.6 GHz, complemented by 32 GB DDR4 RAM (4 modules) and four NVIDIA RTX 3090 graphics processing units. Training of the SkinSwinViT framework is conducted on the ISIC2018 dataset, augmented as outlined in the data augmentation section to forestall overfitting, bolster classifier efficiency for unobserved images, and mitigate sample imbalance. Furthermore, three optimization strategies, namely Adam, AdamW, and SGD, are employed, with the optimal strategy selected to update SkinSwinViT parameters during training iterations. Training runs for 100 epochs; batch sizes of 4, 8, 16, and 32 are explored, with superior performance observed at a batch size of 4 and a default learning rate of 0.00001.

Experimental Results
In this section, a comprehensive performance comparison of the proposed SkinSwinViT model is carried out against other CNN methods in terms of accuracy measures. The study evaluated the performance of various models, including AlexNet, VGG-11, GoogleNet, ResNet50, ViT, SwinViT, and our SkinSwinViT model.
Table 3 illustrates that the proposed SkinSwinViT model exhibits superior performance on the dataset, achieving an accuracy of 0.9788, representing a 3.44-percentage-point improvement over the second-best model, SwinViT, which attained an accuracy of 0.9444. This notable enhancement underscores the efficacy of SkinSwinViT in the classification of skin diseases, evidencing its aptitude in such tasks. Moreover, SkinSwinViT attains the highest recall rate at 0.9775, surpassing SwinViT's rate of 0.9391, indicative of its superior ability to identify positive samples and heightened sensitivity in disease detection. Notably, SkinSwinViT demonstrates commendable precision, specificity, and F1 score, registering values of 0.9783, 0.9936, and 0.9779, respectively. The elevated precision, specificity, and F1 score affirm the model's proficiency in accurately discerning positive samples within skin disease classification, reflecting a well-balanced performance. Despite a marginal increase in parameter count compared to ResNet50 and SwinViT, SkinSwinViT exhibits exceptional performance and notable generalization capabilities, effectively mitigating this numerical discrepancy. Its remarkable adeptness in image comprehension and feature extraction is complemented by commendable generalizability. With a parameter size of 31 million, SkinSwinViT emerges as a lightweight model with considerable efficacy.

From Figure 5, it is apparent that with an increasing number of iterations, the model's accuracy on the training set exhibits a gradual ascent followed by stabilization, concurrent with a gradual decline and stabilization in the loss function. This trend indicates an augmentation in the model's learning capacity, leading to improved fitting to the training data. However, an exclusive focus on the model's performance solely on the training set may precipitate overfitting issues. Overfitting delineates a scenario wherein a model performs well on the training set but performs poorly when presented with unseen data, due to its excessive complexity or propensity to glean excessive information from noise and intricacies within the training set, consequently resulting in diminished generalization performance.

Figure 6 delineates a discernible trend wherein, with an increasing number of iterations, the model's accuracy on the testing set exhibits a gradual augmentation followed by stabilization, concurrent with a gradual reduction and stabilization in the loss function. This observation signifies a progressive enhancement in the model's predictive prowess and generalization capabilities concerning unfamiliar data during the testing process. Additionally, comparative analysis of various metrics on the testing set reveals improved performance in contrast to the training set, thereby indicating the model's superior performance on the testing set while mitigating overfitting. Notably, as the testing loss diminishes, the model's alignment with the testing data improves; this decline in testing loss underscores the model's heightened generalization performance. Table 4 unequivocally demonstrates the superior performance of the SkinSwinViT model over other models on the testing set, attaining an exemplary accuracy of 0.9906, recall of 0.9906, precision of 0.9916, specificity of 0.9995, and F1 score of 0.9910. Notably, SkinSwinViT exhibits an improvement of more than 1.55% across all metrics compared to SwinViT. This discernible superiority underscores the robust classification and generalization capabilities inherent in the SkinSwinViT model.

In summary, the SkinSwinViT model proposed within this study emerges as the frontrunner on the evaluated dataset, boasting high performance metrics. Moreover, the model parameters are judiciously configured, thereby consuming fewer resources. This exemplary performance underscores the model's efficacy in skin disease classification tasks, positioning it as a promising candidate for an effective CAD system for skin disease diagnosis. The experimental findings furnish compelling evidence in support of further research and application in the realms of dermatological diagnosis and treatment.

SkinSwinViT Performance Analysis
This section presents the performance analysis of the proposed SkinSwinViT model on the considered dataset. Recognizing that accuracy alone does not provide a comprehensive assessment of a model's performance, a confusion matrix was generated to elucidate the model's performance across multiple classes.
Figure 7 presents the confusion matrix comparing SkinSwinViT and SwinViT. It highlights SkinSwinViT's precise performance across the majority of categories, effectively classifying a substantial portion of the samples. Remarkably, the integration of the global attention module significantly enhances classification accuracy while mitigating misclassification. This module facilitates the model in discerning and optimizing pivotal features, thereby improving differentiation between categories. Despite occasional misclassifications observed in the NV, MEL, and BKL classes, possibly attributable to class similarity or inherent noise within the samples, the overall high performance of SkinSwinViT remains evident. Table 5 highlights the notable performance of the SkinSwinViT model, particularly evident in the DF category, where it demonstrates exceptionally high accuracy. Furthermore, the SkinSwinViT model consistently surpasses the SwinViT model across all individual categories, underscoring its enhanced classification capabilities. Specifically, it exhibits superior accuracy in distinguishing NV, MEL, BKL, and other sample types.
1. Impact of Transfer Learning: The experimental findings reveal that SkinSwinViT_A achieves a mere 86.21% accuracy on the limited sample dataset, indicating subpar performance. Transformer-based models typically demand extensive data for effective training and exhibit diminished performance in small-scale tasks. To address this, and inspired by SpotTune's approach, this study freezes the feature extraction layer and exclusively trains the fully connected layers [35]. Notably, both methods yield comparable accuracy, with training solely the fully connected layers being more resource-efficient.
2. Optimizer Selection: The proposed model is trained with various optimizers to evaluate classification performance. Table 7 presents the classification performance of SkinSwinViT trained with each optimizer.
3. Impact of Data Augmentation: Data augmentation is employed to rectify dataset imbalance. Table 8 showcases the performance of SkinSwinViT with and without augmentation. The results demonstrate that augmentation increases the accuracy of SkinSwinViT by 3.17% over the unaugmented baseline. Additionally, this study reveals that horizontal flipping minimally affects classification accuracy, possibly because the dataset was captured from various angles during sampling.
4. The impact of global attention on the proposed SkinSwinViT: The study compares the performance of SkinSwinViT with and without the global attention module. Table 9 outlines the differences in performance metrics.
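The augmentation strategy in item 3 can be illustrated with simple label-preserving transforms. The NumPy sketch below shows flip- and rotation-based oversampling of a minority class; it is illustrative rather than the study's exact pipeline, and the image sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return simple label-preserving variants of one image: flips and rotations."""
    return [
        np.fliplr(image),        # horizontal flip
        np.flipud(image),        # vertical flip
        np.rot90(image, k=1),    # 90-degree rotation
        np.rot90(image, k=2),    # 180-degree rotation
    ]

def balance_class(images, target_count):
    """Oversample a minority class by augmenting randomly chosen images."""
    out = list(images)
    while len(out) < target_count:
        base = images[rng.integers(len(images))]
        out.append(augment(base)[rng.integers(4)])
    return out

# A minority class of 3 dummy 8x8 RGB "images" grown to 12 samples
images = [rng.random((8, 8, 3)) for _ in range(3)]
balanced = balance_class(images, target_count=12)
```

Because flips and 90-degree rotations do not change what lesion an image depicts, the class label carries over unchanged, which is what makes them safe for balancing the training set.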

Comparison with State-of-the-Art Methods
This study undertakes a comparative evaluation of the accuracy of the proposed framework against state-of-the-art methodologies on the ISIC2018 Task 3 dataset. The pertinent outcomes are delineated in Table 10. Crucially, all enumerated SOTA methodologies utilize the identical experimental dataset, ensuring equitable and precise comparisons. Our proposed method attains a remarkable accuracy of 97.88% on the ISIC2018 dataset, surpassing the performance of alternative techniques in terms of accuracy (Acc), precision (Pre), and specificity (Spe). This enhancement underscores the superiority of our SkinSwinViT method over extant state-of-the-art technologies.

Discussion
In this study, we propose a novel approach for skin lesion classification, termed SkinSwinViT, which integrates the Swin Transformer with a global attention mechanism to capture fine-grained local and global features within skin lesion images. Our method demonstrates outstanding performance on the ISIC2018 dataset, surpassing prior research efforts. We juxtapose our findings with data from the existing literature to comprehensively elucidate the progress achieved, as illustrated in Table 10.
Relative to alternative models for skin lesion classification, our proposed SkinSwinViT model represents a clear advance. Specifically, compared to the extended hybrid model with handcrafted features by Sharafudeen et al. (2023) [38], SkinSwinViT significantly enhances predictive accuracy, precision, and specificity, with improvements of 5.9%, 3.7%, and 1.6%, respectively, while maintaining a more parsimonious architecture. Furthermore, compared to the strong BF2SkNet model by Ajmal et al. (2023) [43], SkinSwinViT demonstrates exceptional performance, with increases in accuracy and precision of 0.7% and 2.7%, respectively, under similar complexity conditions. These notable outcomes establish SkinSwinViT as a state-of-the-art method in the field of skin lesion classification. Moreover, although our research primarily focuses on demonstrating the effectiveness of the SkinSwinViT model in multiclass skin lesion classification, we believe its potential extends to other medical domains: the model's ability to capture nuanced local features and global contextual information facilitates accurate image classification across diverse diseases.
Nonetheless, our study has certain limitations. The attention mechanism may not fully capture global dependencies, necessitating further refinement and optimization. Furthermore, the model's robustness needs to be evaluated on larger, more complex medical image datasets to ensure generalizability. To address these limitations, our future research will focus on expanding the classification task and improving the model's capacity to handle larger, more complex medical image datasets. We will also undertake a comprehensive exploration of cross-modal information fusion techniques and investigate model interpretability and visualization approaches to enhance physicians' trust in and acceptance of the model's predictions. The primary objective of our future work is to enhance the performance and effectiveness of the algorithm in facilitating accurate and reliable diagnoses by applying it to expansive and diverse datasets of skin lesions encompassing a broader range of classes. Moreover, we strive to develop a practical diagnostic assistance system that provides invaluable support in dermatology diagnostics. These endeavors will advance the field, improving the model's acceptability and usability in clinical practice.

Conclusions
The classification of dermoscopic images of seven types of skin diseases was performed using a deep neural network model. By incorporating a layered Transformer architecture and a window attention mechanism, the SwinViT model achieved improved prediction accuracy and computational efficiency. However, its ability to capture global dependencies is limited, as it primarily focuses on dependencies within local and adjacent windows. To overcome this limitation and enhance prediction accuracy, this study proposes an improved lightweight model called SkinSwinViT that introduces a global-local attention mechanism to comprehensively consider information from other locations. Significantly, the SkinSwinViT model not only demonstrates exceptional proficiency in the multiclass classification of skin disease lesions but also possesses the ability to model local features and global context, which can provide valuable assistance for image classification tasks in other medical fields and further promote the development of medical image analysis.
In summary, the model proposed in this study holds substantial promise for enhancing the effectiveness and applicability of skin disease image classification models, effectively alleviating the inherent risks of misdiagnosis and missed diagnosis while concurrently fostering improvements in treatment outcomes and the overall quality of life for patients. Furthermore, this model has the capacity to provide medical professionals with more reliable and accurate diagnostic assistance, driving progress in the field of dermatology.

Figure 1. Illustrative Instances of Diverse Skin Lesions from ISIC2018.

Figure 2. Example of data-enhanced image: (a) Original image in the basic training set; (b) Image after data enhancement.

Figure 5. Comparison of train Accuracy and Loss results of multiple models in the training set: (a) Comparison of train Accuracy results of multiple models; (b) Comparison of train Loss results of multiple models.

Figure 6. Comparison of test Accuracy and Loss results of multiple models in the testing set: (a) Comparison of test Accuracy results of multiple models; (b) Comparison of test Loss results of multiple models.

Figure 7. Confusion matrices of SkinSwinViT and SwinViT in the training set: (a) Confusion matrix of SkinSwinViT; (b) Confusion matrix of SwinViT.

Figure 8. ROC curves of SkinSwinViT and SwinViT. A comparison between the two models reveals significant improvements in the AUC values.

Ablations
The ablation analysis comprises several facets: (1) the impact of pre-training on SkinSwinViT; (2) training the model with different optimizers to determine the optimal choice; (3) the impact of augmented versus unaugmented data on the proposed method; (4) the effect of the global attention mechanism on the proposed method.
1. Pre-training Model Impact on SkinSwinViT: Table 6 illustrates the performance comparison across three distinct configurations of SkinSwinViT: SkinSwinViT_A, devoid of pre-training; SkinSwinViT_L, integrating ImageNet pre-training and training all layers; and SkinSwinViT, incorporating ImageNet pre-training with only the fully connected layer trained.
Our main conclusions from this endeavor are as follows: (1) This study effectively addressed the issues of limited data volume and imbalanced samples by employing data augmentation techniques; the results unequivocally confirmed the effectiveness of data augmentation in improving the performance of the seven-class SkinSwinViT lesion recognition model. (2) The proposed local-global hierarchical attention mechanism effectively captures crucial features, enhancing feature representation and diagnostic accuracy. (3) Leveraging a Transformer-based encoder-decoder framework, this study improves the model's representation capacity and scalability while optimizing skin lesion recognition performance through pre-training and fine-tuning techniques. (4) The experimental results demonstrate the exceptional performance of the SkinSwinViT model in the classification of various skin diseases, surpassing SOTA methods in the field; moreover, the model exhibits advantages such as low computational resource requirements and minimal time consumption, thereby validating the effectiveness of our proposed approach.

Author Contributions:
Data curation, K.T. and R.H.; Formal analysis, K.T. and Y.L.; Funding acquisition, J.S., R.C. and M.D.; Investigation, K.T. and R.H.; Methodology, K.T. and J.S.; Project administration, M.D.; Resources, K.T., J.S. and R.H.; Software, K.T. and J.S.; Supervision, J.S., R.C. and Y.L.; Validation, K.T., J.S., R.C., M.D. and Y.L.; Visualization, K.T.; Writing-original draft, K.T., J.S. and M.D.; Writing-review and editing, K.T., J.S., R.C. and M.D. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by a special grant from the program for scientific research start-up funds of Guangdong Ocean University under Grant No. 060302102303, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2023A1515011326, the program for scientific research start-up funds of Guangdong Ocean University under Grant No. 060302102101, the Guangdong Provincial Science and Technology Innovation Strategy under Grant No. pdjh2023b0247, the National College Students Innovation and Entrepreneurship Training Program under Grant No. 202310566022, and the Guangdong Ocean University Undergraduate Innovation Team Project under Grant No. CXTD2023014.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Table 1. Distribution of samples on the ISIC2018 dataset.

Table 2. Sample distribution of the training set after data enhancement.
The Patching and Embedding layer divides the input image $x \in \mathbb{R}^{H \times W \times 3}$ into non-overlapping patches of size 4 × 4, mapping each patch to C dimensions to produce an embedded feature map $z \in \mathbb{R}^{H/4 \times W/4 \times C}$. Subsequently, $z$ is normalized using Layer Normalization and forwarded to the Swin Block for feature extraction. The Swin Transformer Block consists of four stages, with patch merging operations at the conclusion of the first three stages to alter input feature dimensions. [1, 1, 3, 1] Swin Blocks are utilized across the four stages, with channel counts per stage of [C, 2C, 4C, 8C]. The attention mechanism within each Swin Block is detailed as follows: comprising multi-head self-attention mechanisms, layer normalization, and MLP layers, the Global Attention Block integrates spatial and semantic information. A multi-head self-attention mechanism fuses global information, followed by residual connections and layer normalization for feature representation learning. Subsequently, an MLP layer, comprising two linear layers and a nonlinear activation function (e.g., GELU), integrates feature information from different locations to generate a comprehensive global feature representation. This block enables global contextual comprehension through self-attention calculations on all feature vectors. The algorithm for the global attention mechanism is articulated as follows:

$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$
$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$

where $\hat{z}^{l}$ and $z^{l}$ represent the output features of the W-MSA and Multi-Layer Perceptron (MLP) modules, respectively.
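The block described above (multi-head self-attention over all tokens, residual connections, layer normalization, and a two-layer GELU MLP) can be traced in a compact NumPy sketch. The embedding dimension, head count, and random initialization below are illustrative assumptions, and learned LayerNorm affine parameters and dropout are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (no learned affine, for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_self_attention(x, wq, wk, wv, wo, heads):
    # Self-attention computed over ALL tokens, i.e. global context
    n, d = x.shape
    dh = d // heads
    q, k, v = x @ wq, x @ wk, x @ wv
    out = np.empty_like(x)
    for h in range(heads):
        s = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        out[:, s] = attn @ v[:, s]
    return out @ wo

def global_attention_block(x, wq, wk, wv, wo, w1, w2, heads=2):
    # z_hat = MSA(LN(z)) + z : attention sublayer with residual connection
    x = multi_head_self_attention(layer_norm(x), wq, wk, wv, wo, heads) + x
    # z = MLP(LN(z_hat)) + z_hat : two linear layers with a GELU in between
    return gelu(layer_norm(x) @ w1) @ w2 + x

d = 8                                        # embedding dimension (illustrative)
wq, wk, wv, wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
w1 = rng.standard_normal((d, 4 * d)) * 0.1   # MLP expansion
w2 = rng.standard_normal((4 * d, d)) * 0.1   # MLP projection
tokens = rng.standard_normal((16, d))        # 16 patch embeddings
out = global_attention_block(tokens, wq, wk, wv, wo, w1, w2)
```

Because the attention weights couple every token with every other token, each output embedding mixes information from the whole image, in contrast to window attention, which restricts this mixing to local windows.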

Table 3. Performance comparison between SkinSwinViT and other models on the training set.

Table 4. Comparison between SkinSwinViT and other models on the testing set.

Table 5. Performance of the model on various skin diseases in the training set.

Table 6. Performance Evaluation of Pre-trained Models.

Table 7. Performance Evaluation under Different Optimizers. Following 100 epochs of training, the Adam optimizer outperforms SGD. Adaptive optimizers such as Adam are more conducive to Transformer models, as advocated by the Swin Transformer authors. AdamW, which integrates decoupled weight decay, causes parameters with excessively large values to decay more rapidly.

Table 8. Performance with and without data augmentation.

Table 9. Comparison of metrics with and without global attention. The integration of the global attention module into the SkinSwinViT model yields superior metrics on the dataset. The experimental results indicate that the global attention mechanism enables better modeling of global context features, enhancing the model's deep representation capabilities and enabling more accurate feature identification and understanding in images.

Table 10. Comparison of models trained on the ISIC2018 Task 3 dataset.