Melanoma Detection Using XGB Classifier Combined with Feature Extraction and K-Means SMOTE Techniques

Melanoma, a very severe form of skin cancer, spreads quickly and has a high mortality rate if not treated early. Recently, machine learning, deep learning, and other related technologies have been successfully applied to computer-aided diagnostic tasks of skin lesions. However, some issues in terms of image feature extraction and imbalanced data need to be addressed. Based on a method for manually annotating image features by dermatologists, we developed a melanoma detection model with four improvement strategies, including applying the transfer learning technique to automatically extract image features, adding gender and age metadata, using an oversampling technique for imbalanced data, and comparing machine learning algorithms. According to the experimental results, the improved strategies proposed in this study have statistically significant performance improvement effects. In particular, our proposed ensemble model can outperform previous related models.


Introduction
Malignant melanoma (MM) is the most severe form of skin cancer; although rare, it has a high mortality rate. According to GLOBOCAN statistics, in 2020, there were approximately 325,000 cases of melanoma skin cancer worldwide, and melanoma accounted for 1.7% of global cancer diagnoses across all sites. The global age-standardized incidence rate is 3.8/100,000 for males and 3.0/100,000 for females. The cumulative lifetime risk was 0.42% for males and 0.33% for females [1]. MM spreads easily throughout the body and can metastasize to organs such as the brain, liver, and kidneys. Once it has spread, the survival rate is less than 50%. If MM is discovered early and resected, the 5-year survival rate is as high as 90 to 99%; if it is detected late, the survival rate drops to approximately 15 to 20%. The actual cause of MM is unclear. At present, the medical community recognizes exposure to ultraviolet light as a risk factor. MM is infrequent in Asians, in whom it occurs more commonly on the palms and soles, which are not exposed to sunlight and hence are unrelated to ultraviolet exposure. Recently, the incidence and mortality of MM have been increasing. Unlike other types of cancer, MM commonly causes death in young age groups. Furthermore, a delay in medical treatment worsens the prognosis and can lead to metastasis and even death; therefore, early diagnosis and treatment are essential [2,3].
In clinical diagnosis, it is difficult for dermatologists to distinguish early MM from a mole. South Queensland, Australia, has the highest MM incidence in the world. Edith Cowan University in Australia developed a blood test for MM antibodies, with an accuracy of 79% [4]; however, testing and time costs remain limitations. Dermoscopy is a key technique for the initial diagnosis of MM. Therefore, artificial intelligence (AI) models in computer-aided diagnosis (CAD) systems that help dermatologists interpret dermoscopy images could help to reduce medical costs.
Machine learning (ML) classifiers have been employed for the automatic diagnosis of skin lesions. Before modeling, these classifiers are given a set of handcrafted image features, such as the lesion-related features that dermatologists pay attention to. Recently, in most computer vision tasks, deep learning (DL) convolutional neural networks (CNNs) have been able to automatically extract high-level image features and significantly improve the classification performance. Therefore, CNN-based CAD systems have recently been used to detect various diseases [5,6].
According to the latest review paper [7], on the topic of using neural networks to detect melanoma, the relevant architectures published in 2018-2021 were classified into the following four techniques. 1. Using a convolutional neural network; 2. Using multiple convolutional neural networks; 3. Using a convolutional neural network combined with other classifiers; 4. Using other techniques, such as combining ABCDE rules with traditional machine learning algorithms.
To test the performance of AI models on dermoscopy images, many researchers use public databases (such as PH2, MED-NODE, and ISIC). This article reviewed a total of 25 recent articles on MM CAD, published between 2016 and 2022, and listed the lowest and highest values of the six indicators used to evaluate efficacy, as shown in Table 1. For example, Warsi et al. [9] used a 3D color texture feature (CTF) and a multilayer neural network model for the binary classification of MM on a total of 200 dermoscopy images in the PH2 dataset. Based on the holdout method, 70% of the images were used as the training set, 15% as the validation set, and 15% as the test set. Their results showed the best performance on the PH2 dataset and reached 97.5% accuracy (ACC), 98.1% sensitivity (SEN), and 93.84% specificity (SPE). Iqbal et al. [14] proposed a new deep convolutional neural network (DCNN) model with multiple filter sizes, the classification of skin lesions network (CSLNet) architecture. Through data pre-processing and the data augmentation of ISIC-17, ISIC-18, and ISIC-19 images, it achieved 96.4% AUC, 93.25% ACC, 93.25% SEN, 90.64% SPE, 93.97% precision (PRE), and 94.47% F1 on the ISIC 2017 dataset, using the 7:1:2 holdout method.
To evaluate the performance of the MM prediction models using an oversampling technique, Kalwa et al. [28] used 200 dermoscopy images for the MM binary classification. By combining image feature extraction (FE), SVM, and synthetic minority oversampling technique (SMOTE) methods, the AUC was increased from 0.720 to 0.850. Magalhaes et al. [29] used 287 infrared thermography skin images for MM binary classification. Using an ensemble model of image FE, random forest (RF), SVM, and SMOTE methods, the recall increased from 0.473 to 0.696.
The contributions of this study are listed as follows:
1. Dermoscopy images (2299) were used for MM CAD, a dermatologist handcrafted feature method was used as a comparison base, and four classification efficiency improvement strategies were proposed: (1) a comparison of different transfer learning techniques for automatic image FE; (2) the addition of the metadata of gender and age; (3) a comparison of the class balance of the training data with different oversampling techniques; and (4) a comparison of the classification performance of different ML algorithms. According to the experimental results, the four proposed strategies are statistically significant for MM detection;
2. We combined the DL and ML methods to automatically extract the features directly from the dermoscopy images and perform benign and MM diagnosis. The experimental results show that our proposed model combining metadata, K-means SMOTE, and an extreme gradient boosting (XGB) classifier can achieve higher classification performance and predictability than using only the MELA-CNN feature extractor.

MM Dataset
In this study, we integrated the ISIC Challenge 2018 (ISIC2018) and the ISIC Challenge 2019 (ISIC2019) datasets [30][31][32][33] for the binary classification of benign and MM. The ISIC2018 dataset contains five handcrafted features provided by dermatologists: pigment networks; negative networks; streaks; globules; and milia-like cysts. Meanwhile, the ISIC2019 dataset contains two pieces of basic patient data: age and gender. There are 2299 records in this dataset, including 1849 benign and 450 MM. Because of the imbalanced data, subsequent processing is performed using oversampling techniques.

FE Techniques
FE is a preprocessing procedure in data mining. To evaluate the impact of the dermatologist handcrafted features [30] and automatic DL FE [34] on the classification performance of an ML algorithm for predicting MM, we compared the following five FE techniques.
(1) Handcraft: We employed five handcrafted characteristics provided by dermatologists [30]: pigment networks; negative networks; streaks; globules; and milia-like cysts. A pigment network is a grid comprising many brown lines crossing each other; a negative network is a curve formed by many hyperpigmented cell connections; a streak comprises pigmented projections surrounding a melanocytic lesion; a globule comprises multiple brown circles; a milia-like cyst comprises many white, yellowish circles or ovals;
(2) VGG16: VGG16 is a DL CNN model proposed by Simonyan et al. [35]. It was trained on the ImageNet dataset of one million images to classify one thousand classes. VGG16 takes 224 × 224 RGB images as the input and comprises 13 convolutional layers and 3 fully connected layers, with the rectified linear unit (ReLU) as the nonlinear activation function. All of the convolutional layers use small 3 × 3 kernels to avoid too many parameters;
(3) InceptionV3: The Inception series includes InceptionV1, InceptionV2, InceptionV3, InceptionV4, and the InceptionResNet series. InceptionV3 was proposed by Szegedy et al. [36] as an improved InceptionV2. It was trained on the ImageNet dataset of one million images to classify one thousand classes. InceptionV3 takes 224 × 224 RGB images as input and comprises 47 layers. In addition, this model adopts the batch normalization of InceptionV2 to accelerate the model training. This DL model can automatically extract 2048 features from dermoscopy images;
(4) InceptionResNetV2: InceptionResNetV2 is an Inception module-based DL model. It uses 299 × 299 RGB images as input. In addition, it replaces the pooling layers in the Inception modules A, B, and C with ResNet connections to accelerate the training [37]. This DL model can automatically extract 1536 features from dermoscopy images;
(5) MELA-CNN: Based on the transfer learning technique [34], we used the InceptionResNetV2 architecture as the backbone to develop MELA-CNN (Figure 1). After retrieving the feature maps of the average pooling layer of InceptionResNetV2, a fully connected layer of 256 nodes is added, and ReLU is used. Further, batch normalization and Sigmoid layers are introduced, and MELA-CNN trained weights are obtained after the fine-tuning process using the target dataset. This DL model can automatically extract 256 features from dermoscopy images.

SMOTE
Because our datasets are from the medical field, a considerable imbalance between the numbers of negative and positive samples is common. Therefore, we employed data oversampling to balance the classes and prevent the classifier from being biased during training. Chawla et al. [38] proposed SMOTE, which generates synthetic minority samples by interpolating between a minority sample and one of its randomly selected k-nearest neighbors until the minority class has the same number of samples as the majority class. Because SMOTE is prone to generating noise that affects the prediction performance of the classifier, Douzas et al. [39] proposed K-means SMOTE, which combines SMOTE with k-means clustering. First, the data are grouped using the k-means method, and the clusters in which the minority class accounts for more than 50% of the samples are selected. Then, the number of samples to be generated is calculated, and more samples are assigned to the clusters in which the minority samples are sparse. Finally, SMOTE is performed within each selected cluster until the number of minority samples equals the number of majority samples, which solves the data imbalance while mitigating the tendency of SMOTE to generate noise.
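The interpolation step of SMOTE can be illustrated with a minimal sketch in plain Python. This is not the implementation used in this study; the function name `smote_sketch`, its parameters, and the toy points are assumptions introduced only for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical 2-D minority samples (corners of the unit square)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why SMOTE can place points in regions dominated by the majority class when the minority samples are scattered. K-means SMOTE restricts this interpolation to the selected clusters.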

XGB
XGB, proposed by Tianqi Chen et al. [40], is based on the concept of gradient boosting decision tree (GBDT). GBDT is a gradient boosting algorithm based on a decision tree. Gradient boosting is an ensemble learning model that mainly trains the multiple weak classifiers, assembling them into a stronger classifier. The goal is to minimize the loss function and increase the weight of the misclassified classes by computing negative gradients to improve the next iteration of the training.
Compared with GBDT, XGB adds a regularization term to the objective to smooth the loss function, reduce the model complexity, and avoid overfitting. In addition, it uses an approximation algorithm to find the split points, which makes gradient boosting more efficient and scalable. Further, missing or sparse values are routed to a designated default branch at each split, which improves the efficiency of the algorithm. Finally, to accelerate the model operation, XGB supports parallel computation and early stopping: when the prediction result no longer improves, the addition of trees can be stopped in advance to shorten the training. XGB can also improve the model classification accuracy.
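The core boosting loop that XGB builds on can be sketched as follows. This is plain gradient boosting with one-split decision stumps and squared loss, written as an assumption-laden illustration; it omits the regularization, approximate split finding, and missing-value handling that distinguish XGB, and the names `fit_stump` and `boost` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit weak learners one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return base, stumps, preds


# Hypothetical toy data: the target switches from 0 to 1 between x = 3 and x = 4
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
base, stumps, preds = boost(xs, ys)
mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
```

Each round fits a weak learner to what the current ensemble still gets wrong, and the learning rate controls how much of that correction is added.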

Evaluation Metrics
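To evaluate the performance of the different models for binary classification, we employed five metrics computed from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Accuracy (ACC): The proportion of all diagnosis results that are correct.
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision (PRE): The proportion of positive diagnosis results that are true positives.
$$\mathrm{PRE} = \frac{TP}{TP + FP}$$
Recall (REC): The proportion of actual positive cases that are correctly diagnosed, which is also called the true positive rate (TPR).
$$\mathrm{REC} = \mathrm{TPR} = \frac{TP}{TP + FN}$$
F1-score: The harmonic mean of PRE and REC.
$$\mathrm{F1} = \frac{2 \times \mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$$
AUC: The area under the curve of TPR plotted against FPR. FPR is the false positive rate, which refers to the proportion of false positives in the actual disease-free population.
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
The higher the value of the above five indicators, the better the classification performance of the model. When ACC and PRE are used to evaluate a model on a class-imbalanced dataset, the assessment may be biased, because a model with numerous FNs can still score well. In this study, we aimed to develop a model that can effectively detect patients with MM. Therefore, we used REC, F1-score, and AUC as the main evaluation criteria for the model performance.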
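The following sketch computes the five indicators in plain Python. The function names and the confusion-matrix counts are hypothetical and are not results from this study; the counts only mimic a 1849/450 class split to show how ACC can stay high while REC collapses.

```python
def confusion_metrics(tp, fp, tn, fn):
    """ACC, PRE, REC, and F1 from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {"ACC": acc, "PRE": pre, "REC": rec, "F1": f1}


def auc_from_scores(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical model that flags very few lesions as MM
m = confusion_metrics(tp=15, fp=5, tn=1844, fn=435)
auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With these made-up counts, ACC exceeds 0.8 and PRE is 0.75 even though only 15 of 450 positive cases are found, which illustrates why REC and F1-score are more informative for the minority class.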

Stratified K-Fold Validation
We employed the stratified K-fold method, an improvement of the K-fold cross-validation method, to perform 10-fold stratified cross-validation. The K-fold cross-validation method divides the data into k mutually exclusive groups of equal size and then repeats the training and testing k times. Each time, one group is used as the test data, and the others are used as the training data to verify the accuracy. Finally, the average of the k accuracy values is used as the final accuracy. The stratified K-fold method differs in that each fold is drawn according to the class ratio. Because the method ensures that the proportion of the two classes in each fold matches that of the original dataset, it is suitable for imbalanced data classification.
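A stratified split can be sketched in a few lines of plain Python. The function `stratified_folds` is a hypothetical helper, not the routine used in this study; only the class counts (1849 benign and 450 MM) are taken from the dataset description.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of indices, each preserving the class ratio approximately."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [0] * 1849 + [1] * 450  # class counts reported for the dataset
folds = stratified_folds(labels, k=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
fold_sizes = [len(fold) for fold in folds]
```

In each of the k rounds, one fold serves as the test set and the remaining k - 1 folds form the training set. Because each class is distributed separately, every fold receives 45 MM records here, which keeps the MM share of each test fold close to that of the full dataset.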

Paired T-Test
To evaluate whether the difference in the MM detection ability using the proposed enhancement strategy is statistically significant, we used the paired t-test to compare the predictive performance of the two models:

$$t = \frac{\bar{d}}{S_d / \sqrt{n}}$$

where $\bar{d}$ denotes the mean of the differences between paired data; $S_d$ denotes the standard deviation of the differences between paired data; and $n$ denotes the number of pairs of data. The null hypothesis is that the mean difference in the 10-fold validated REC or F1-score between the two models is 0. A result of p < 0.05 indicates a statistically significant difference in the classification performance between the two models.
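The test statistic can be computed directly from the per-fold scores. The sketch below is a minimal illustration in plain Python; the function name and the per-fold values are made up and are not results from this study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical per-fold scores of two models on the same 10 folds
model_a = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81, 0.84, 0.78, 0.86, 0.82]
model_b = [0.75, 0.74, 0.77, 0.73, 0.78, 0.74, 0.76, 0.72, 0.79, 0.75]
t_value, dof = paired_t_statistic(model_a, model_b)
```

The p-value follows from the t distribution with n - 1 degrees of freedom; SciPy's `scipy.stats.ttest_rel` returns the same statistic together with the p-value.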

Proposed Framework
In this study, we integrated the ISIC2018 dermoscopy image data [30,31] and the ISIC2019 patient age and gender basic data [32,33] to form a research dataset for developing an MM detection model. The overall research architecture is shown in Figure 2. First, five FE methods were applied to the dermoscopy images: VGG16, InceptionResNetV2, InceptionV3, MELA-CNN, and the dermatologist handcrafted method. Then, we merged the optimal image features and metadata. Finally, we compared different oversampling techniques with different ML algorithms to find the optimal MM detection model. The proposed model architecture is shown in Figure 3. Based on the transfer learning technique [34], MELA-CNN is developed to automatically extract image features. In Figure 3, the first, second, and third block diagrams depict InceptionResNetV2, MELA-CNN, and the optimal MM detection model proposed in this study, respectively. The overall architecture of the proposed model uses InceptionResNetV2 as the backbone and applies a fine-tuning process to train MELA-CNN for automatic image FE. Then, by combining the two metadata items with the optimal image features, 258 features are obtained. In addition, we used K-means SMOTE for class balance. Finally, we employed XGB for MM detection.
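The feature-merging step can be pictured as a simple concatenation. In the sketch below, the helper `merge_features` and the numeric encoding of gender are assumptions made for illustration; the source does not state how the metadata were encoded.

```python
def merge_features(image_features, age, gender):
    """Append age and an encoded gender to the 256 image features."""
    if len(image_features) != 256:
        raise ValueError("expected 256 image features")
    # Assumed encoding; the paper does not specify how gender is represented.
    gender_code = {"male": 0.0, "female": 1.0}[gender]
    return list(image_features) + [float(age), gender_code]


# Hypothetical record: placeholder image features plus patient metadata
row = merge_features([0.0] * 256, age=55, gender="female")
```

The resulting 258-value row is what the oversampling step and the classifier would then operate on.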

Experimental Result
In this study, 2299 images were manually annotated by dermatologists in the ISIC2018 and ISIC2019 datasets to train and test the optimal classification model. In the process, the stratified K-fold method was used for 10-fold cross-validation, and data were extracted according to the proportion of categories. Then, we put them into each fold, performed 10 rounds of training and testing, and obtained the following results.

FE Techniques
Five techniques were used for the FE of dermoscopy images (the dermatologist handcrafted method, VGG16, InceptionResNetV2, InceptionV3, and MELA-CNN), and the numbers of features after extraction were 5, 512, 1536, 2048, and 256, respectively. Table 2 summarizes the results of the five techniques combined with XGB to compare their performance differences. MELA-CNN performed best, with an F1-score of 0.756. Meanwhile, the dermatologist handcrafted method had the worst performance, with an F1-score of only 0.064. The F1-scores of VGG16, InceptionResNetV2, and InceptionV3 were 0.282, 0.309, and 0.295, respectively. Figure 4 compares the F1-scores of the five FE techniques combined with XGB. MELA-CNN clearly outperforms the other techniques.

Metadata
To evaluate how adding metadata (age and gender) to the dermoscopy image features changes the predictability of the diagnostic model, we employed XGB for the model training and used the F1-score as the main evaluation metric. The results in Table 3 show that the F1-score of the five features obtained using the dermatologist handcrafted method reaches 0.415 after adding metadata. Compared with the result without metadata (0.064), the F1-score increased by 35.1 percentage points. For the 256 image features extracted by MELA-CNN, adding metadata increased the F1-score from 0.756 to 0.800, i.e., by 4.4 percentage points. Figure 5 compares the F1-scores of the 5 and 256 features with and without metadata. This figure clearly shows the relevance of the metadata: the classification performance of both models (the dermatologist handcrafted method and MELA-CNN) improved.

SMOTE
Because of the class imbalance in our datasets, we used 10 oversampling techniques to balance the two classes and trained XGB on the result to assess the difference in performance. In this study, the oversampling technique was applied only to the training set, and the test set was kept in its original composition. Table 4 compares the test-set performance of the models trained with each of the 10 oversampling techniques against that of the model trained on the original data without sampling. The results show that the F1-score was only 0.800 without an oversampling technique. After using K-means SMOTE, which is the optimal oversampling method, the F1-score reached 0.861. Figure 6 compares the F1-scores obtained with the 258 features under the 10 oversampling techniques. K-means SMOTE shows the most obvious improvement in test performance over the original imbalanced training dataset.
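The rule that only the training set is oversampled can be sketched as follows. Random duplication is used here as a simple stand-in for the oversampling techniques compared in this study; the helper `oversample_training`, the assumption that label 1 marks the minority class, and the toy fold are all hypothetical.

```python
import random


def oversample_training(train_rows, train_labels, seed=0):
    """Duplicate minority rows at random until both classes are the same size."""
    rng = random.Random(seed)
    # Assumes label 1 is the minority (MM) class and label 0 the majority.
    minority = [i for i, y in enumerate(train_labels) if y == 1]
    majority = [i for i, y in enumerate(train_labels) if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(train_labels))) + extra
    return [train_rows[i] for i in keep], [train_labels[i] for i in keep]


# Hypothetical fold: 9 training rows (7 benign, 2 MM) and 3 untouched test rows
train_rows = [[float(i)] for i in range(9)]
train_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1]
test_labels = [0, 0, 1]
balanced_rows, balanced_labels = oversample_training(train_rows, train_labels)
```

Keeping the test fold untouched means the reported scores reflect the original class ratio, and no synthetic or duplicated sample can leak into the evaluation.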

ML Algorithms (Classifiers)
In this study, we compared the performance of 13 ML algorithms for MM classification, using the optimal results obtained using K-means SMOTE: XGB classifier, histogram-based gradient boosting (HistGB classifier), SVM, gradient boosting, RF, multilayer perceptron (MLP), Gaussian naive Bayes (Gaussian NB), logistic regression, bagging classifier, stochastic gradient descent logistic regression (SGD-LR), adaptive boosting (AdaBoost), decision tree, and K-neighbors classifier. Table 5 summarizes the results of the 13 ML algorithms for MM diagnosis. The F1-score of XGB is 0.861, which is the optimal classification performance. Figure 7 shows the F1-score performance comparison chart of the algorithms. Clearly, XGB significantly outperforms all of the other ML algorithms.

Effect of FE and Metadata
In this study, the dermatologist handcrafted method was used as a comparison basis to discuss the differences in the improvement of the performance for MM detection by four strategies: (1) using the automatic image FE method; (2) adding metadata; (3) using SMOTE; and (4) using different ML algorithms. Figures 8 and 9 show the ROC and PRE-REC (PR) curves for four comparison FE techniques.
MELA-CNN is an automatic image FE technique; it could achieve the optimal classification performance because it performed the fine-tuning process on the target dataset. Compared with the handcrafted method, MELA-CNN has 31.9% and 47.2% increases in the AUC and the PR curve area, respectively. The results show that using MELA-CNN improves the predictability of the model compared with the handcrafted method.
Moreover, adding the metadata to the 256 features could further improve the predictability of the model. The AUC and PR curve area increased to 0.865 and 0.827, respectively. Age and gender had good discriminative power in the MM diagnosis. Finally, K-means SMOTE was used to tackle the prediction bias caused by the imbalance of the data classes, and the AUC improved to as much as 0.970. These results again show that the proposed improvement strategies can effectively improve MM predictability.

Effect of Oversampling Techniques
As mentioned above, using MELA-CNN for FE and adding the metadata to obtain 258 features can improve the prediction performance. Therefore, this configuration is used as the basis, and K-means SMOTE, SMOTE, and RandomOverSampler are used as representative methods to compare the impact of the different oversampling techniques on classification performance. Compared with SMOTE, which oversamples from randomly selected k-nearest neighbors without grouping, K-means SMOTE first clusters the data and then oversamples within clusters that have a high share of minority-class samples. Using K-means SMOTE yields better results; its AUC and PR curve area reach 0.970 and 0.924, respectively. Figures 10 and 11 depict the ROC and PR curves for the performance comparison of the three oversampling methods. The results show that K-means SMOTE has the best classification performance and discernibility.

Effect of ML Algorithms (Classifiers)
As mentioned above, using 258 features with K-means SMOTE technology can achieve an optimal performance. Based on K-means SMOTE, we compared the differences in the prediction performance of different classifiers. Figures 12 and 13 depict the ROC and PR curves for the performance comparison of three classifiers. Clearly, XGB has the best predictability and discernibility.
XGB is an ensemble learning classifier that combines multiple weak learners and uses boosting to train and revise them successively to improve the prediction performance. It can use random sampling without replacement to generate different training subsets from the original training dataset, and it combines the outputs of all of the trees to make the final prediction. Compared with the Gaussian NB and K-neighbors classifiers, XGB has the best performance, with an AUC of 0.970 and a PR curve area of 0.924.

Significance Test for Performance Improvement
To assess whether the performance gains from the improvement strategies proposed in this study are statistically significant, we applied XGB with stratified 10-fold cross-validation to four feature sets: the five handcrafted features provided by dermatologists, the 256 features extracted by MELA-CNN, the 258 features obtained after adding the age and gender metadata, and the same 258 features with K-means SMOTE.
Tables 6-8 summarize the paired t-test results for REC for 5 versus 256 features, 256 versus 258 features, and 258 features versus 258 features with K-means SMOTE, respectively. Tables 9-11 list the corresponding paired t-test results for the F1-score. The null hypothesis is that the difference in 10-fold REC or F1-score between the two models is 0. In Tables 6-11, all tests yielded a p-value of less than 0.05, confirming the following three findings: (1) the MELA-CNN feature extractor significantly improves performance over the dermatologists' handcrafted features; (2) adding the age and gender metadata yields a further statistically significant improvement in classification performance; and (3) K-means SMOTE yields a statistically significant improvement in the predictive power of the model.
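A minimal sketch of such a paired t-test on 10 per-fold scores follows. The fold values are invented for illustration and are not the results in Tables 6-11, and the function and variable names are hypothetical. To stay within the Python standard library, the sketch compares the t statistic with the tabulated two-sided 5% critical value for 9 degrees of freedom rather than computing a p-value; a routine such as SciPy's `scipy.stats.ttest_rel` would return the p-value directly.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 9 degrees of freedom
# (10 folds minus 1), taken from standard t tables.
T_CRIT_DF9 = 2.262


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two equal-length lists of per-fold scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-fold REC values, NOT the study's results.
rec_with_smote = [0.86, 0.88, 0.85, 0.87, 0.89, 0.84, 0.86, 0.88, 0.87, 0.85]
rec_without_smote = [0.80, 0.83, 0.79, 0.82, 0.84, 0.80, 0.81, 0.82, 0.83, 0.78]

t_stat = paired_t_statistic(rec_with_smote, rec_without_smote)
significant = abs(t_stat) > T_CRIT_DF9  # reject H0 (mean difference = 0)?
print(f"t = {t_stat:.2f}, significant at 0.05: {significant}")
```

The test is paired because both models are evaluated on the same 10 folds, so the fold-wise differences, not the two sets of raw scores, are what is tested against zero.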

Performance Comparison with Previous Related Studies
We compared five evaluation metrics of our results with those of two previous studies [28,29] (Tables 12 and 13). Using the 7:3 holdout method, Kalwa et al. [28] performed binary MM classification on 200 dermoscopy images; by combining image FE, SVM, and SMOTE, they increased the AUC from 0.720 to 0.850. In this study, binary MM classification was performed on 2299 dermoscopy images. The proposed model, which combines MELA-CNN, metadata, K-means SMOTE, and XGB, was compared with models using the five handcrafted image features provided by dermatologists, 1536 features extracted by traditional transfer learning, 256 features extracted by MELA-CNN, and 258 features obtained by adding metadata. As a result, the AUC increased from 0.585 to 0.971, the F1-score from 0.056 to 0.890, and the REC from 0.030 to 0.867.
Using the 8:2 holdout method, Magalhaes et al. [29] performed binary MM classification on 287 infrared thermography images. By combining image FE and an integrated model of RF and SVM with SMOTE, they increased the REC from 0.473 to 0.696. In this study, the AUC increased from 0.621 to 0.981, the F1-score from 0.063 to 0.905, and the REC from 0.033 to 0.878. The results in Tables 12 and 13 indicate that the proposed model achieves better classification and prediction performance than these previous models.
We also compared the performance of the proposed model with other work in the literature. Table 14 summarizes these methods by dataset, imbalance ratio (IR) between Non-Me and Me samples, classification method, validation method, and test-set performance. Because the studies use different datasets and performance metrics, valid comparisons are difficult. Nevertheless, the method proposed in this study performs well relative to the reported results.
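The role of IR, REC, and the F1-score in these comparisons can be illustrated with a small sketch. It assumes that IR is the ratio of majority (Non-Me) to minority (Me) samples, which is one common definition and may differ from the one used in Table 14. The counts are invented for illustration and the function names are hypothetical.

```python
def imbalance_ratio(n_majority, n_minority):
    """IR as majority count over minority count (assumed definition)."""
    return n_majority / n_minority


def recall(tp, fn):
    """REC: fraction of true Me samples that the model detects."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion counts."""
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up test set: 1000 Non-Me and 100 Me samples (not the study's data).
# The hypothetical model labels almost everything as Non-Me.
tp, fn = 5, 95    # Me samples: detected / missed
tn, fp = 995, 5   # Non-Me samples: correct / false alarms

ir = imbalance_ratio(tn + fp, tp + fn)
print(f"IR = {ir:.1f}")
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"REC = {recall(tp, fn):.3f}, F1 = {f1_score(tp, fp, fn):.3f}")
```

In this toy case the accuracy looks high while REC and the F1-score stay close to zero, which is why the comparisons above are reported in terms of AUC, F1-score, and REC rather than accuracy alone.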

Conclusions
Recently, the incidence of skin cancer has increased globally. The accurate classification of skin lesions directly influences the accurate and prompt diagnosis of skin cancer. MM is a highly lethal skin cancer that can rapidly metastasize, and eventually cause death if not detected early and treated properly.
Based on the method of expert manual annotation, an AI model for CAD of MM was developed for 2299 dermoscopy images in this study. We proposed four improvement strategies: (1) comparing different transfer learning techniques for automatic image FE; (2) adding the metadata of gender and age; (3) comparing different oversampling techniques for the class balancing of training data; and (4) comparing the classification performance of different ML algorithms. According to the experimental results, the proposed improvement strategies have a statistically significant effect on performance improvement.
Through analysis and comparison of the experimental results, we demonstrated an effective combination of DL and ML methods that automatically extracts features from dermoscopy images and distinguishes benign lesions from MM. The experimental results also show that the proposed model, which uses the MELA-CNN feature extractor plus metadata combined with K-means SMOTE and XGB, achieves better classification and prediction performance than the previous related models. The performance metrics and the statistical tests in this study both support the classification performance of the proposed MM detection model.
However, before the model can be applied clinically, its detection capability must be tested on a larger number of cases and on more categories of skin lesions so that the AI model can be optimized. The next stage of this work is to build on the proposed method to develop a computer-aided diagnosis system for melanoma with a user-friendly interface that supports dermatologists in clinical practice and provides an interpretation of each automatic diagnosis.

Institutional Review Board Statement: The study protocol was approved by the Ethics Committee for Human Genome and Gene Analysis at Nagasaki University (#120221).