Melanoma Classification from Dermoscopy Images Using Ensemble of Convolutional Neural Networks

Human skin is the most exposed part of the body and needs constant protection from heat, light, dust, and harmful radiation, such as UV rays. Skin cancer is one of the most dangerous diseases found in humans. Melanoma is a form of skin cancer that begins in the melanocytes, the cells that produce the pigment in human skin. Early detection and diagnosis of skin cancers such as melanoma are necessary to reduce the death rate due to skin cancer. In this paper, the classification of acral lentiginous melanoma, a subtype of melanoma, against benign nevi is carried out. The proposed stacked ensemble method for melanoma classification uses different pre-trained models, such as Xception, Inception-V3, InceptionResNet-V2, DenseNet121, and DenseNet201, by employing transfer learning and fine-tuning. The pre-trained CNN architectures for transfer learning are selected from among the models with the highest top-1 and top-5 accuracies on ImageNet. A novel stacked ensemble-based framework is presented that improves generalizability and increases robustness by fusing fine-tuned pre-trained CNN models for acral lentiginous melanoma classification. The performance of the proposed method is evaluated on a Figshare benchmark dataset. The impact of applying different augmentation techniques has also been analyzed through extensive experimentation. The results confirm that the proposed method outperforms state-of-the-art techniques and achieves an accuracy of 97.93%.


Introduction
Skin is the outermost layer of the human body, protecting it from heat, light, dust, and other harmful radiation, such as ultraviolet rays. Human skin is made up of two layers, the dermis and the epidermis. The outermost layer of the skin, the epidermis, is composed of three types of cells: the scaly, flat squamous cells on the surface, and the basal cells and melanocytes, which protect the skin from damage and give it its color. Many diseases can harm the skin, and cancer is one of the most aggressive and deadly among them. Melanoma and non-melanoma are the two best-known types of skin cancer [1]. Melanoma, also called malignant melanoma, is the deadliest and most severe skin cancer, accounting for the majority of skin cancer deaths; its growth starts in the melanocytes present on the outermost layer of the skin. It can grow and invade nearby healthy cells, a process commonly known as metastasis. Malignant melanoma has four major subtypes: superficial spreading melanoma, nodular melanoma, lentigo maligna melanoma, and acral lentiginous melanoma [2]. Acral lentiginous melanoma is most commonly found in people with darker skin, such as those of Hispanic, African, and Asian ancestry, and occurs more frequently in women than in men [3]. One reason for the increase in melanoma cases is UV exposure from sunlight and sunburn. Acral lentiginous melanoma appears as a small (about 6 mm) flat spot of discolored skin, often black or dark brown. It usually grows on the soles, palms, or under the nails, and occurs mainly on the back in men and on the fingers and legs in women [4]. It is hard to diagnose because it is difficult to differentiate an acral melanoma from an acral nevus; as a result, it is usually identified at the later stages of melanoma development, which reduces patient survival rates [5].
Melanoma is a curable disease if it is diagnosed at an early stage [6]. Early diagnosis techniques for melanoma include biopsy, pathology reports, and medical imaging analysis, such as dermoscopy. Dermoscopy is a non-invasive imaging technique commonly used to diagnose melanoma early and improve survival chances. In dermoscopy, a magnified, high-resolution image of the cancerous region is taken to locate the region on the skin, which dermatologists then analyze for melanoma detection [7]. The analysis of dermoscopy images by dermatologists is expensive and requires a high level of expertise to determine the disease precisely [8]. This has raised the need for accurate computer-aided diagnosis techniques that could assist in the early detection of melanoma from dermoscopy images. However, this is a challenging task for several reasons. First, cancerous and non-cancerous cells may be highly similar in appearance, making it hard to discriminate between melanoma and non-melanoma skin cancer. Second, it is difficult to segment the skin lesion from normal skin regions because of low contrast. Third, melanoma presentation varies visually across people with different skin conditions. Fourth, the high intra-class variation of melanoma size, color, shape, and location in dermoscopic images makes melanoma hard to detect. In addition, artifacts such as color calibration charts, hair, ruler marks, and veins cause blurriness and occlusions, making the problem more complicated [1,9,10]. Numerous automated techniques have been proposed in recent years to assist dermatologists in melanoma diagnosis. These techniques include traditional machine learning and deep learning-based methods [5,11]. Recently, deep learning-based methods have produced excellent results in medical image analysis tasks, such as segmentation, detection, and classification.
Hence, more attention is being paid to deep learning-based methods for melanoma detection. This research proposes a transfer learning-based approach for acral lentiginous melanoma identification from dermoscopy images. The main contributions of this paper are as follows:
• A novel stacked ensemble framework based on transfer learning is presented to address the task of acral lentiginous melanoma classification;
• Extensive experiments have been performed on the benchmark dataset with and without data augmentation to show the impact of data augmentation on the accuracy of the proposed model;
• The proposed method outperforms state-of-the-art methods for acral lentiginous melanoma classification.
The rest of the paper is organized as follows: Section 2 presents an extensive literature review of existing studies based on deep learning, transfer learning, and deep ensemble learning. Section 3 elaborates the proposed stacked ensemble approach for the classification of acral lentiginous melanoma. Section 4 details the experiments performed on the dermoscopy imaging dataset. Finally, the paper is concluded in Section 5.

Background
In this section, a brief introduction to transfer learning is given, followed by an overview of each pre-trained CNN architecture used in the methodology. Deep learning-based models can achieve promising results when large datasets are available for training. However, it is not always possible to increase the number of training samples in some domains, such as medical imaging, due to the scarcity of data. In these domains, transfer learning can be useful. In transfer learning, a model trained on a large dataset, such as ImageNet, can be reused for similar applications in domains with comparatively smaller datasets, as shown in Figure 1. The knowledge gained by training on an extremely large dataset with thousands of classes is transferred to similar problems through weight sharing. In this way, weights that have already been trained and adjusted can be reused for another problem, such as the classification of acral lentiginous melanoma in this case. Transfer learning has been successfully and widely used in applications such as video analytics, automation, manufacturing, medical imaging, and baggage screening [12].
In this paper, different pre-trained models, including VGG16 [13], Xception [14], InceptionResNetV2 [15], DenseNet121 [16], DenseNet169 [16], and DenseNet201 [16], are fine-tuned for melanoma classification. Instead of designing a CNN architecture from scratch, the proposed methodology fine-tunes a few top layers while the weights in the early layers are frozen. The early layers of any CNN-based model are responsible for extracting low-level features, such as edges, lines, and blobs. The efficient extraction of these low-level features is extremely important for any image classification problem. Since the weights of pre-trained deep CNN architectures are already highly optimized on a large dataset, the proposed methodology fine-tunes only the top layers to optimize high-level features while keeping the initial layers frozen. An ensemble of the models mentioned above is then created to achieve the best results. Background information on each model is presented in the subsequent sections.
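As an illustration of this freeze-and-fine-tune strategy, the Keras sketch below loads Xception without its classification top and leaves only the last few layers trainable. The layer cut-off (`n_trainable_layers`) and the single-unit sigmoid head are illustrative choices, not taken from the paper, and `weights=None` is used only so the sketch runs without downloading weights; a real transfer learning run would pass `weights="imagenet"`.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

def build_finetune_model(n_trainable_layers=10):
    # In practice weights="imagenet" transfers ImageNet knowledge;
    # None is used here so the sketch runs without downloading weights.
    base = Xception(weights=None, include_top=False,
                    input_shape=(224, 224, 3))
    # Freeze early layers (low-level edge/line/blob detectors) and
    # leave only the last few layers trainable for fine-tuning.
    for layer in base.layers[:-n_trainable_layers]:
        layer.trainable = False
    x = layers.GlobalAveragePooling2D()(base.output)
    out = layers.Dense(1, activation="sigmoid")(x)  # melanoma vs. nevus
    return models.Model(base.input, out)

model = build_finetune_model()
```

Only the unfrozen top layers and the new head are updated during training, so the optimized low-level filters learned on ImageNet are preserved.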

Pre-Trained Xception Model
The first model chosen for the methodology is the pre-trained Xception network [14], also known as an extreme version of Inception. Xception is a deep CNN architecture developed by Google researchers with a total depth of 71 layers. It is a modified version of the Inception-V3 architecture and has surpassed VGG16, ResNet, and Inception-V3 in many classification tasks. It consists of modified depthwise separable convolution and max-pooling layers, all linked together as a residual network. The modified depthwise separable convolutions in Xception consist of pointwise convolutions (1 × 1 convolution) followed by depthwise convolutions (n × n convolution). The idea of modified depthwise separable convolutions is illustrated in Figure 2. The Xception architecture comprises three important sections: the entry flow, the middle flow, and the exit flow, as shown in Figure 3. The input image is passed into the entry flow, then through the middle flow, which is repeated eight times, and finally into the exit flow for classification.
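The parameter savings that motivate depthwise separable convolutions can be checked with simple arithmetic. The sketch below (bias terms ignored; standard depthwise-then-pointwise order, while Xception's pointwise-first variant gives an analogous count) compares a standard 3 × 3 convolution with its separable counterpart:

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: every output channel mixes all input
    # channels over the full k x k window.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise separable convolution: per-channel k x k spatial
    # filtering, then a 1 x 1 pointwise convolution to mix channels.
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

standard = conv_params(3, 256, 256)             # 589,824 weights
separable = separable_conv_params(3, 256, 256)  # 67,840 weights
ratio = standard / separable                    # roughly 8.7x fewer parameters
```

This factoring of spatial filtering from channel mixing is what lets Xception stay deep (71 layers) without a proportional growth in parameters.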

Pre-Trained InceptionResNet-v2 Model
The second model is the pre-trained InceptionResNet-V2, which is based on Inception networks and has 164 layers. It integrates residual connections, as in the ResNet [17] architectures, to increase performance at low computational cost. Batch normalization is added to each block after the summation of residual connections. To stabilize the training process, residual activations are scaled down before being added to the activations of the previous layers. For this work, the top two blocks of this model are fine-tuned and their weights are updated. A global average pooling layer is applied, followed by four fully connected layers with 1024, 512, 256, and 128 units, respectively, each with ReLU activation. The sigmoid activation function is used in the last layer for binary classification, as shown in Figure 4.
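A sketch of this classification head is given below. The tiny convolutional base is a stand-in so the example is fast and self-contained; a real run would attach the head to InceptionResNet-V2 with its early blocks frozen. The single sigmoid unit is an illustrative choice for binary output, while the optimizer and loss settings follow the experimental setup described later (Adam, learning rate 0.0001, binary cross-entropy).

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

def attach_classification_head(base):
    # Head described in the text: global average pooling, then dense
    # layers of 1024/512/256/128 units with ReLU, then a sigmoid output.
    x = layers.GlobalAveragePooling2D()(base.output)
    for units in (1024, 512, 256, 128):
        x = layers.Dense(units, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # melanoma vs. nevus
    model = models.Model(base.input, out)
    model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical tiny stand-in base, used here instead of
# InceptionResNet-V2 only to keep the sketch self-contained.
inp = layers.Input(shape=(224, 224, 3))
feat = layers.Conv2D(32, 3, strides=4, activation="relu")(inp)
model = attach_classification_head(models.Model(inp, feat))
```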

Pre-Trained DenseNet121 Model
The third pre-trained model is DenseNet121 [16]. Compared to other state-of-the-art deep CNN architectures, DenseNets simplify the connectivity pattern by ensuring information flow between layers. They exploit network potential through feature reuse instead of drawing representational power from extremely deep or wide architectures. DenseNets require fewer parameters than an equivalent traditional CNN, as there is no need to learn redundant feature maps. The feature maps in DenseNets are concatenated after each dense block and act as the input to the next dense block. This model contains four dense blocks separated by transition blocks. The top layer containing dense blocks is fine-tuned, and the weights are updated. A global average pooling layer followed by fully connected layers with 1024, 512, and 256 units, respectively, with ReLU activation is added on top of the pre-trained model. Lastly, a sigmoid layer with two units is used as the output layer. The proposed methodology for fine-tuning pre-trained models is shown in Figure 5.

Pre-Trained DenseNet201 Model
The last pre-trained model is DenseNet201, which has a depth of 201 layers. The fine-tuning of this model is also carried out by unfreezing dense block 4. A global average pooling layer followed by fully connected layers with 1024, 512, and 256 units, respectively, with ReLU activation is added on top of the model. Lastly, a sigmoid layer with two units is used as the output layer. The proposed methodology for fine-tuning this model is shown in Figure 5.

Related Work
Skin cancer is a life-threatening disease that must be classified and diagnosed at an early stage. Before the deep learning era, classical machine learning approaches that depend on hand-crafted feature engineering were used. In recent years, the emergence of deep learning in medical imaging has enabled models to learn complex features automatically. This section presents a comprehensive literature review of existing methods based on deep learning, transfer learning, and ensemble learning.

Deep Learning-Based Techniques
In [18], the authors surveyed about 19 studies on skin lesion classification that use CNN-based classifiers and compared their performance with that of clinicians. These experiments were conducted on single images of suspicious lesions. In [19], the authors surveyed automatic skin cancer detection and the application of image processing and machine learning to cancer detection. In [20], the authors surveyed the integration of patient data into skin lesion classification using CNNs. Another study [21] surveyed the latest research efforts in skin lesion detection and classification through CNNs, transfer learning, and ensemble approaches. Several deep learning-based techniques have been proposed for skin cancer detection using dermoscopic images. For example, a CNN-based approach was proposed in [5], in which the researchers created their own dataset, used data augmentation to enhance it, and achieved 80.23% accuracy. For better results, another study used multiple CNN models for melanoma classification [9]. The authors trained the VGG-16 and VGG-19 pre-trained models on their dataset and achieved an accuracy of 76%, which was not good enough. To cope with this issue, ref. [22] used deep learning architectures, focusing mainly on lesion attribute detection, lesion boundary segmentation, and lesion diagnosis. They used multiple pre-trained models, such as AlexNet [23], Xception, ResNet [17], and VGGNet, and obtained the best accuracy of 92.74% with ResNet. Another study [24] used deep learning models for three main tasks, segmentation, feature extraction, and classification, all performed on the ISIC-2017 dataset. Experimental results show promising accuracy: 75% for segmentation and 91% for classification. Several studies [25][26][27][28] applied deep learning with different architectures and algorithms to the well-known ISIC dataset for lesion classification.
Training deep learning-based models from scratch is time-consuming and requires substantial computational resources. To overcome this issue, ref. [29] used pre-trained models such as VGG-16, AlexNet, and ResNet for classification and achieved 83.83% accuracy on the ISIC 2017 dataset. Another study [30] trained VGG-16, VGG-19, and a DCNN with different types of data augmentation on the HAM10000 dataset [31]. Other studies, such as [32], used a CNN with a GAN to improve performance on the ISIC dataset [33]; the GAN was used to generate synthetic medical images to overcome the deficiency of data, achieving 71% accuracy. In another study [34], melanoma skin cancer was detected using machine learning and imaging biomarker cues on datasets provided by IBC, achieving 77% accuracy. Furthermore, ref. [35] used pixel-based fusion and multilevel feature reduction in two experiments on the ISBI-2016 and ISIC-2017 datasets, for segmentation and classification, and achieved an accuracy of 95% for melanoma classification. In [36], additional features of skin lesion images were extracted to classify the melanoma type and decrease the false-positive rate. The authors applied SVM, neural network, and random forest classifiers on the heraldic13 dataset, and the highest accuracy of 90% was achieved with the random forest classifier.

Transfer Learning-Based Techniques
Transfer learning-based techniques have achieved high accuracy and significantly reduced the need for large datasets in different classification tasks. For instance, ref. [37] utilized a transfer learning-based method for skin lesion classification and achieved 85.8% accuracy. Another study [38] proposed a two-stage framework: in the first stage, the inter-class difference of the data distribution was analyzed, while in the second stage, a deep CNN was trained on the ISIC-2016 dataset, achieving an F-score of 94%. In several other studies, such as [10,39,40], transfer learning was applied using AlexNet for classification on the HAM10000 dataset, achieving an accuracy of 96.87%. Other studies [37,41] used transfer learning with VGG-16 for feature extraction, and SVM, decision tree, linear discriminant analysis, and K-nearest neighbor algorithms for classification on the HAM10000 and ISIC datasets.

Ensemble Learning-Based Techniques
Recent studies have focused on making ensembles of different models to achieve high accuracy on dermoscopic images. The ensemble technique has proven successful in increasing the overall accuracy in different applications. An ensemble of deep neural network models, such as AlexNet, VGGNet, and GoogLeNet, was used in [42] for skin cancer classification and achieved 84.8% accuracy on the ISIC 2017 dataset. Other studies [43][44][45] used ensembles of different techniques for classification on the ISIC 2017 dataset and achieved an accuracy of 76%. Recent studies [5,11,22] suggest that little attention has been paid to diagnosing acral melanoma because of its infrequent occurrence. Prior research [7,24,46] mainly focused on classifying skin lesion images into some cancer types and did not provide further information about the subtype of cancer. For example, the study [47] classified skin lesions into melanoma and non-melanoma, and another study [1] focused on classifying skin lesions into different categories. The classification of melanoma into subtypes is very important for better diagnosis and can increase patient survival rates [5,11]. This work focuses on acral lentiginous melanoma detection from dermoscopy images.

Methodology
This section elaborates the methodology adopted to classify acral melanoma and benign nevi from dermoscopy images. Figure 6 shows a block diagram of the proposed method. First, data augmentation is applied to increase the number of training samples in each category. Second, the pre-trained models VGG16, Inception-V3, Xception, InceptionResNetV2, DenseNet121, DenseNet169, and DenseNet201 are fine-tuned to adapt them to the classification task. Finally, ensemble learning is applied to detect acral lentiginous melanoma by creating a stacked ensemble of the fine-tuned models. These steps are discussed in the subsequent sections.

Preprocessing
Before passing images into the CNN for training, preprocessing is applied to the dermoscopy images. First, because the images differ in dimensions, all images are resized to a fixed 224 × 224 dimension using the OpenCV library to make them compatible with the pre-defined input tensor shapes of the selected pre-trained CNN architectures. It has been observed that resizing images has little impact on the prediction capability of the model, whereas keeping the default image size would greatly increase the total number of parameters, making the model computationally expensive. The color channels of the images are then converted from BGR to RGB format. Finally, all images are normalized to scale the pixel intensity values from 0-255 to 0-1. The class labels are encoded as 0 and 1 for acral melanoma and benign nevi, respectively. In medical imaging, and especially for the classification of acral lentiginous melanoma, dataset samples are not numerous enough to train deep learning-based models. The literature suggests that data augmentation techniques, such as image translation, rotation, shearing, mirroring, width shift, height shift, and horizontal and vertical flipping, can be applied to dermoscopy images to increase the number of samples [48,49]. Five augmentation techniques are applied in this work: rotation, width shift, height shift, vertical flipping, and horizontal flipping.
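The preprocessing steps above can be sketched as follows. To keep the example dependency-free, resizing is done with nearest-neighbour NumPy indexing rather than the OpenCV `cv2.resize` call used in practice; the channel swap and normalization match the description.

```python
import numpy as np

def preprocess(img_bgr):
    """Resize to 224x224, convert BGR -> RGB, scale pixels to [0, 1].

    Nearest-neighbour resizing via index arrays stands in for
    cv2.resize so the sketch needs only NumPy.
    """
    h, w = img_bgr.shape[:2]
    rows = np.arange(224) * h // 224    # source row for each target row
    cols = np.arange(224) * w // 224    # source column for each target column
    resized = img_bgr[rows][:, cols]
    rgb = resized[..., ::-1]            # BGR -> RGB channel swap
    return rgb.astype("float32") / 255.0

# A fake 300x400 BGR image with 8-bit intensities.
img = np.random.randint(0, 256, (300, 400, 3), dtype=np.uint8)
x = preprocess(img)
```

The five augmentations listed above can then be applied on the fly, for example with Keras's `ImageDataGenerator` using its `rotation_range`, `width_shift_range`, `height_shift_range`, `horizontal_flip`, and `vertical_flip` parameters.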

Stacked Ensemble of Fine-Tuned Pre-Trained CNN Architectures
Pre-trained deep CNN architectures have different depths and network structures; thus, their performance varies across problems. Each pre-trained model has its strengths and limitations when applied to medical images. Multiple models are trained on the same dataset, predictions are made by each model, and the results are combined using the stacked ensemble learning method to achieve the best performance. Ensemble learning can reduce variance and significantly improve performance [50]. The simplest way to combine the predictions of multiple trained models is to average the predictions made by each model on the same training and testing data. An averaging ensemble combines the predictions from multiple trained models with equal weight [50]. The weighted average ensemble technique, also known as model blending, instead assigns each model's predictions a weight that is optimized using validation data [51]. Stacked generalization, or stacking, is a modified version of the averaging ensemble that involves training a new model formed by combining multiple sub-models. The proposed methodology performs ensemble learning by stacking four fine-tuned, pre-trained models: Xception, InceptionResNet-V2, DenseNet121, and DenseNet201, as shown in Figure 7.
In this regard, each pre-trained model is fine-tuned, retrained, evaluated, and saved independently, as shown in Figures 3-5 for Xception, InceptionResNet-V2, DenseNet121, and DenseNet201, respectively. These saved models are then loaded independently and combined into a new architecture through a stacking mechanism. This design is efficient in terms of complexity because late fusion, rather than early fusion, is used to stack the different models. The top layers of the newly formed stacked ensemble model consist of global average pooling followed by a fully connected layer with 10 neurons and a sigmoid activation function for classification.
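A minimal sketch of this late-fusion stacking is given below. Tiny stand-in members replace the four saved fine-tuned networks so the example is self-contained and fast, and the fusion head here is a single sigmoid unit on the concatenated member predictions, a simplification of the head described above.

```python
from tensorflow.keras import layers, models

def tiny_member():
    # Stand-in for a saved fine-tuned model (the paper stacks Xception,
    # InceptionResNet-V2, DenseNet121, and DenseNet201).
    inp = layers.Input(shape=(224, 224, 3))
    x = layers.Conv2D(8, 3, strides=8, activation="relu")(inp)
    x = layers.GlobalAveragePooling2D()(x)
    return models.Model(inp, layers.Dense(1, activation="sigmoid")(x))

def stacked_ensemble(members, input_shape=(224, 224, 3)):
    # Late fusion: freeze every member, feed one shared input through
    # all of them, and train only a small head on the concatenated
    # predictions.
    inp = layers.Input(shape=input_shape)
    preds = []
    for m in members:
        m.trainable = False          # members stay fixed after loading
        preds.append(m(inp))
    merged = layers.Concatenate()(preds)
    out = layers.Dense(1, activation="sigmoid")(merged)
    return models.Model(inp, out)

ensemble = stacked_ensemble([tiny_member() for _ in range(4)])
```

Because the members are frozen, only the small fusion head is trained in the post-training step, which keeps the stacking stage cheap.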

Experimentations and Results
This section presents the experimental setup adopted to classify acral lentiginous melanoma against benign nevi and compares the proposed method with state-of-the-art methods.

Experimental Setup
Different hyper-parameters are tuned to train the model for high accuracy, as shown in Table 1, and binary cross-entropy is used as the loss function. Adam [52] is used as the optimizer for all pre-trained models, and the learning rate is set to 0.0001 (1 × 10⁻⁴) with a batch size of 32. The number of epochs varies across the top four fine-tuned pre-trained CNN architectures. The ensemble of these pre-trained models is trained for ten epochs. All these hyper-parameters are chosen empirically.

Table 1. Hyper-parameter values.

Hyper-Parameter     Value
Optimizer           Adam
Learning Rate       0.0001
Loss Function       Binary cross-entropy

The dataset used for the experimentation has been taken from [53]. It consists of 724 dermoscopy images; of these, 350 belong to the acral lentiginous melanoma class and 347 to the benign nevi class. Sample images from this dataset are shown in Figure 8. Two experiments are carried out to compare the impact of data augmentation on the performance of the pre-trained CNN architectures. Table 3 shows the performance of the fine-tuned pre-trained models without data augmentation, while Table 4 shows their performance with data augmentation. Data augmentation increases the overall performance of the fine-tuned pre-trained models; in the case of the Xception model in particular, accuracy increases from 90% to 95%, a significant gain. Data augmentation also helps avoid the overfitting that arises when models are trained on small datasets. To validate the performance of the proposed model, accuracy, precision, recall, F1 score, sensitivity, and specificity are used as performance metrics, as shown in Table 5. Accuracy is the standard metric for classification problems and is defined by:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The results confirm that the proposed stacked ensemble model successfully classifies acral melanoma and benign nevi: benign nevi are classified most accurately, with an accuracy of 98.79%, while acral melanoma is classified with an accuracy of 96.77%. Sample classification results for acral melanoma and benign nevi produced by the proposed model are shown in Figure 9.
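The performance metrics reported here all derive from the four confusion-matrix counts. The sketch below computes them from hypothetical counts (not the paper's results), treating melanoma as the positive class:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall/sensitivity, specificity, and F1
    score from raw confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical counts: 90 melanomas caught, 95 nevi correctly
# rejected, 5 false alarms, 10 misses.
m = classification_metrics(tp=90, tn=95, fp=5, fn=10)
# m["accuracy"] == 0.925
```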

Comparison with State-of-the-Art Methods
The proposed method of fine-tuning the pre-trained Xception, InceptionResNet-V2, DenseNet121, and DenseNet201 models achieved overall test accuracies of 95.17%, 95.17%, 94.48%, and 95.86%, respectively. The stacked ensemble of these models is generated to increase the overall performance. As shown in Table 6, the proposed ensemble technique obtained a test accuracy of 97.93%, a significant improvement. The confusion matrix of the proposed model is shown in Figure 10. The confusion matrix, also known as the error matrix, is a tabular layout that relates the ground truth class to the predicted class, showing the model's performance for each class. The ground truths are shown along the y-axis and the predicted class labels along the x-axis of the confusion matrix.
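The layout described here (ground truth on the y-axis, predictions on the x-axis) can be built directly from label lists; the labels below are illustrative, not the paper's:

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows = ground truth class, columns = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# 0 = acral melanoma, 1 = benign nevus (labels as encoded earlier)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
# cm == [[2, 1], [1, 3]]: 2 melanomas and 3 nevi correct,
# 1 melanoma missed and 1 nevus flagged.
```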
The comparison of the performance of the proposed model with existing methods is shown in Table 7, which confirms that the proposed method outperforms state-of-the-art methods.

Conclusions
This research proposed a stacked ensemble-based method for the classification of acral lentiginous melanoma, the most common type of melanoma in Asians. Four pre-trained models, i.e., Xception, InceptionResNet-V2, DenseNet121, and DenseNet201, are fine-tuned and ensembled to achieve excellent results. The ensemble-based approach significantly outperformed all four individual models in terms of accuracy on the acral melanoma dataset. As the dataset is not large, data augmentation and transfer learning are applied to train all of these models. The proposed model achieved 97.83% sensitivity, 97.50% specificity, and 97.93% accuracy for the classification of acral melanoma and benign nevi dermoscopy images. It is concluded that the proposed method can help dermatologists identify skin lesions effectively. As future work, this technique can be extended to other skin cancer diseases. In addition, segmentation of skin lesions can be considered to assist dermatologists in identifying the affected skin region.

Conflicts of Interest:
The authors declare no conflict of interest.