Multiclass Skin Cancer Classification Using Ensemble of Fine-Tuned Deep Learning Models

Abstract: Skin cancer is a widespread disease associated with eight diagnostic classes. Diagnosing multiple types of skin cancer is a challenging task for dermatologists because the classes are phenotypically similar; the average accuracy of multiclass skin cancer diagnosis is 62% to 80%. The classification of skin cancer using machine learning can therefore be beneficial in the diagnosis and treatment of patients. Several researchers have developed skin cancer classification models for the binary case but could not extend the research to multiclass classification with better performance ratios. We have developed deep learning-based ensemble classification models for multiclass skin cancer classification. Experimental results show that although the individual deep learners perform well, developing an ensemble is still a meaningful approach, since it further enhances classification accuracy. The accuracies of the individual learners ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19 are 72%, 91%, 91.4%, 91.7%, and 91.8%, respectively, whereas the proposed majority voting and weighted majority voting ensemble models reach 98% and 98.6%, respectively. The accuracy of the proposed ensemble models is higher than that of both the individual deep learners and the dermatologists' diagnostic accuracy. The proposed ensemble models are compared with recently developed skin cancer classification approaches, and the results show that they outperform recently developed multiclass skin cancer classification models.


Introduction
Cancer can start almost anywhere in the human body and can cause death if not diagnosed and treated in a timely fashion. Skin cancer is a common type of cancer, as more than three million Americans are diagnosed with skin cancer each year (https://www.skincancer.org/skin-cancer-information/skin-cancer-facts, accessed on: 24 October 2021). If skin cancer is diagnosed early, it can usually be treated. There are eight categories of skin cancer: melanoma (MEL), melanocytic nevi (NV), basal cell carcinoma (BCC), benign keratosis lesions (BKL), actinic keratosis (AK), dermatofibroma (DF), squamous cell carcinoma (SCC), and vascular lesions (VASC) [1]. MEL is the most dangerous type of cancer, as it spreads to other organs very rapidly. It develops in skin cells called melanocytes. MEL is less common than other categories of skin cancer. NV are pigmented moles and vary in color across different skin tones. They mostly develop during childhood and early adult life, with the number of moles increasing up to the age of 30 to 40; thereafter, the number of naevi tends to decrease. BCC develops in skin cells called basal cells, which produce new skin cells as old ones die off. AK is a pre-cancer that develops on skin affected by chronic exposure to ultraviolet (UV) rays. BKL is one of the common benign neoplasms of the skin. DF occurs at all ages and in people of every ethnicity; it is not clear whether DF is a reactive process or a neoplasm [2]. The lesions are made up of proliferating fibroblasts. Vascular lesions are relatively common abnormalities of the skin and underlying tissues. SCC is the most frequently occurring form of skin cancer after melanoma and usually results from exposure to UV rays.
In the literature, machine learning approaches such as the support vector machine (SVM) [2], neural networks [3], the naïve Bayes classifier [4], and decision trees [5] have been used for skin cancer classification. The problem with these machine learning approaches is the requirement of human-engineered features. In the last decade, deep learning approaches such as convolutional neural networks (CNN) became popular due to their ability to extract features automatically [6][7][8][9], and have been extensively used in research [10][11][12][13]. Dorj et al. [14] worked on skin cancer classification using deep CNN. Romero et al. in [15] performed melanoma cancer classification on dermoscopy images using CNN. The technique has an accuracy of 81.3% on the International Skin Imaging Collaboration (ISIC) archive dataset. Jinnai et al. [16] carried out pigmented skin lesion classification using clinical images and a faster region-based CNN. The classification accuracy of the method was compared with the diagnostic accuracy of ten board-certified dermatologists. Esteva et al. [11] performed multiclass skin cancer classification on dermoscopy images with different variants of CNN. Adegun et al. in [17] developed a probabilistic model to achieve better performance of a fully convolutional network-based deep learning system for the analysis and segmentation of skin lesion images. The probabilistic model achieved an accuracy of 98%.
Recently, researchers have proposed ensemble methods to enhance classification performance [18][19][20]. Bajwa et al. in [21] developed an ensemble model using ResNet-152 [22], DenseNet-161 [22], SE-ResNeXt-101 [23], and NASNet [23] for the classification of seven classes of skin cancer on the ISIC dataset and achieved an accuracy of 93%. The ensemble is a machine learning method that combines the decisions of several individual learners to increase classification accuracy [24]. An ensemble model exploits the diversity of its individual models to make a combined decision; therefore, it is expected to increase classification accuracy [25,26]. Binary class skin cancer classification has been performed in [15,[27][28][29], but many researchers could not address multiclass classification with better results. The recent approaches developed in [11,19,[30][31][32] for multiclass skin cancer classification have also failed to achieve high accuracy. In this research, heterogeneous ensemble models with improved performance are developed for multiclass skin cancer classification using majority voting and weighted majority voting. The ensemble models are developed using diverse types of learners with various properties to capture the morphological, structural, and textural variations present in the skin cancer images for better classification. The proposed ensemble methods perform better than both the individual deep learning models and deep learning-based ensemble models proposed in the literature for multiclass skin cancer classification.
The following contributions are made in this research work:
• Five pre-trained models are developed, and their decisions are combined using majority voting and weighted majority voting for the classification of eight different classes of skin cancer.
• The pre-trained models with different structural properties are trained to capture the morphological, structural, and textural variations present in the skin cancer images with the following ideas: residual learning, extraction of more complex features, improvement in the declined accuracy caused by the vanishing gradient, feature invariance through residual learning, and extraction of the fine detail present in the image.
• The proposed ensemble methods perform better than expert dermatologists and previously proposed deep learning-based ensemble models for multiclass skin cancer classification.
• A comparative study is conducted for the performance analysis of five fine-tuned deep learning models and their ensemble models on the ISIC dataset to determine the model with better performance.
• In our proposed method, no extensive pre-processing has been performed on the images, and no lesion segmentation has been carried out, making the work more generic and reliable.
The rest of the paper is organized as follows: Section 2 presents related work. Section 3 describes the proposed method. Ensemble methods are discussed in Section 4, whereas Section 5 discusses different deep neural network models followed by individual models in Section 6. The quality measures used to measure the performance of the proposed study are presented in Section 7. Section 8 discusses the results, followed by the conclusion.

Related Work
Skin cancer is usually diagnosed through physical examination of the skin or with the help of a biopsy. The detection of skin cancer through physical examination requires a great degree of experience and expertise, and biopsy-based examination is a tedious and time-consuming task, as it also requires expert pathologists. Currently, macroscopic and dermoscopy images are used by dermatologists during the detection procedure of skin cancer. However, even with dermoscopy images, accurate skin cancer detection is a challenging task, as multiple skin cancers may appear similar in initial appearance. Furthermore, even expert dermatologists have limited exposure to some types of skin cancer over their careers. Expert dermatologists have a skin cancer detection accuracy of 62% to 80% [33,34]. The reports on the diagnostic accuracy of clinical dermatologists have shown 62% accuracy with three to five years of clinical experience. Dermatologists with more than 10 years of experience have a diagnostic accuracy of 80%, whereas diagnostic performance falls even further for dermatologists with less experience [34]. Therefore, dermoscopy in the hands of an inexperienced dermatologist may reduce diagnostic accuracy [33,35,36].
In the early 1980s, computer-aided diagnosis (CAD) systems were developed to assist dermatologists with the challenges faced during the process of diagnosis [37]. Initially, CADs were developed using dermoscopy images for the binary classification of melanoma and benign lesions [37]. Since then, much research has been carried out to solve this challenging problem. Several studies [37][38][39] have been performed using manual evaluation based on the ABCD method proposed by Nachbar et al. in [40]. Moreover, machine learning classifiers have been developed with handcrafted features. These classifiers include the support vector machine (SVM) [2], naïve Bayes [4], k-nearest neighbor [41], logistic regression [42], and artificial neural networks [3]. However, the high intraclass and low interclass variation in MEL has caused the unsatisfactory performance of handcrafted-feature-based cancer detection [43].
With the advent of deep learning, CNN caused a breakthrough in the solution of many problems, including skin cancer classification. CNN gave higher detection accuracy and also reduced the burden of hand-crafted feature extraction by extracting features automatically [44]. As CNNs require huge datasets for better feature learning at abstract levels [45], transfer learning has been introduced to meet the limitation of huge dataset requirements [20,32], whereby a model trained for one task is reused for another task. Khalid et al. proposed a skin lesion detection technique by developing a transfer learning-based AlexNet model for the classification of three skin lesions in [46]. To develop the proposed method, data augmentation was applied to enlarge the dataset, achieving an accuracy of 98.61%. Kawahara et al. developed a skin cancer detection technique using a dataset of 1300 images to construct a linear classifier on features extracted by a CNN in [47]. The technique does not require preprocessing or skin lesion segmentation. They carried out classifications of five and 10 classes and achieved accuracies of 85.8% and 81.9%, respectively. In [32], a novel CNN architecture was developed. The architecture relies on multiple tracts to perform skin lesion classification. The authors used a pretrained CNN for single resolution, retrained it for multi-resolution on publicly available datasets, and obtained an accuracy of 79.15% for ten classes. In [48], Ali et al. developed a skin lesion classification approach using deep convolutional neural networks (DCNN) to classify benign and malignant skin lesions. To develop the proposed method, the authors carried out preprocessing consisting of noise removal by filtering, input normalization, and data augmentation, and achieved an accuracy of 91.93%.
Ensemble methods are developed to enhance classification performance. Ensemble methods exploit the diversity of individual models to obtain higher accuracy. Recently, multiclass skin cancer classification techniques have been developed in the literature using ensemble approaches. Harangi et al. [18] showed how an ensemble of CNN models can be developed to enhance skin cancer classification accuracy; for three classes of skin cancer, they achieved accuracies of 84.2%, 84.8%, 82.8%, and 81.4% for GoogleNet, AlexNet, ResNet, and VGGNet, respectively, and obtained an accuracy of 83.8% with the ensemble of GoogleNet, AlexNet, and VGGNet. In [20], Nyiri and Kiss developed different ensemble methods using CNNs. To develop the proposed technique, the authors performed preprocessing on the ISIC2017 and ISIC2018 datasets using different preprocessing methods and achieved an accuracy of 93.8%. Previous work for skin cancer classification based on dermoscopy images not only lacks generality but also has lower accuracy for multiclass classification [11,19,32]. In this paper, we propose multiclass skin cancer classification using diverse types of learners with various properties to capture the morphological, structural, and textural variations present in the skin cancer images for better classification. The proposed ensemble models perform better than both the individual deep learning models and deep learning-based ensemble models proposed in the literature for multiclass skin cancer classification.

Proposed Methodology
The proposed work is performed in two stages. In the first stage, we have developed five diverse deep learning-based models, ResNet, Inception V3, DenseNet, InceptionResNet V2, and VGG-19, using transfer learning with the ISIC 2019 dataset. These five pre-trained models with different structural properties were selected to capture the morphological, structural, and textural variations present in the skin cancer images with the following ideas: residual learning, extraction of more complex features, improvement in the declined accuracy caused by the vanishing gradient, feature invariance through residual learning, and extraction of the fine detail present in the image. In the second stage, two ensemble models have been developed. For ensemble model development, the decisions of the deep learners have been combined using majority voting and weighted majority voting to classify the eight different categories of skin cancer. Figures 1 and 2 show the overall block diagram of the proposed system. The class-wise composition of the dataset is given in Table 1, from which it is observed that the distribution of data samples across classes varies. For example, the melanocytic nevi (NV) class consists of 12,875 images, the melanoma class consists of 4522 images, and basal cell carcinoma (BCC) consists of 3323 images. To prepare the dataset for the development of the proposed ensemble models, 1500 images have been randomly selected from each of the NV, BCC, melanoma, and BKL classes. From the remaining four classes, all available images in the ISIC repository have been added to the dataset. Thus, the dataset consists of 7487 images. It has then been split into two parts: a training and a test dataset. The test dataset has been formed by taking 25% of the total dataset (1797 images), leaving 5690 images for training. Figure 3 shows sample images of the eight different classes of skin cancer.
In the proposed approach, images have been resized to 224 × 224 × 3.
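The dataset preparation described above (capping the four large classes at 1500 randomly selected images and holding out 25% of each class for testing) can be sketched as below. This is a minimal sketch: the counts for the four smaller classes are illustrative placeholders, not figures reported in the paper, and `build_dataset` is a hypothetical helper name.

```python
import numpy as np

def build_dataset(class_counts, cap=1500, test_frac=0.25, seed=42):
    """Assemble a capped index set per class and split it into train/test.

    class_counts: dict mapping class name -> number of available images.
    Classes larger than `cap` are randomly subsampled to `cap` images,
    as done for the NV, BCC, melanoma, and BKL classes.
    """
    rng = np.random.default_rng(seed)
    train_idx, test_idx = {}, {}
    for name, n in class_counts.items():
        chosen = rng.permutation(n)[:min(n, cap)]   # random subsample
        n_test = int(round(len(chosen) * test_frac))
        test_idx[name] = chosen[:n_test]            # held-out test images
        train_idx[name] = chosen[n_test:]           # remaining for training
    return train_idx, test_idx

# NV/MEL/BCC sizes are from the paper; the other counts are placeholders.
counts = {"NV": 12875, "MEL": 4522, "BCC": 3323, "BKL": 1500,
          "AK": 867, "DF": 239, "SCC": 628, "VASC": 253}
train, test = build_dataset(counts)
```

After this split, each image would be loaded and resized to 224 × 224 × 3 before being fed to the networks.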

Ensemble Methods
The motivation behind the development of ensemble models with diverse learners is to deal with the complexity of the multiclass problem by utilizing the pattern extraction capabilities of CNNs and improving generalization with the help of ensemble systems. In machine learning models, as the number of classes increases, the complexity of the model increases, resulting in a decrease in accuracy. Ensemble methods combine the results of individual learners to enhance accuracy by exploiting their diversity and improving the generalization of the learning system. Machine learning models are bounded by their hypothesis spaces due to bias and variance. Ensemble techniques aggregate the decisions of individual learners to overcome the limitation of a single learner, which may have a limited capacity to capture the distribution (variance) of the data. Therefore, making a decision by aggregating multiple diverse learners may improve robustness as well as reduce bias and variance. Ensemble learning employs various techniques to generate a robust and accurate combined model by aggregating the base learners. The combining strategies may consist of voting, averaging, cascading, or stacking. Voting strategies include majority voting and weighted majority voting, whereas averaging strategies include averaging and weighted averaging. In this work, we have developed ensemble models using majority voting, weighted majority voting, and weighted averaging strategies. The basis of ensemble learning is diversity: the ensemble model may fail to achieve better performance if there is no diversity in the base learners [51]. There are various methods to incorporate diversity, such as (1) the exploitation of feature spaces, (2) taking subsamples from the training dataset, and (3) the selection of diverse base/individual learners [52].
In the proposed ensemble systems, diversity is incorporated by combining individual learners with various properties to capture the morphological, structural, and textural variations present in the skin cancer images for better classification.

Deep Neural Network Models
To develop the proposed ensemble models, five deep neural network models, namely ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19, have been developed by fine-tuning the model parameters. Short descriptions of the models are given below.

ResNet
ResNet introduces residual learning and was chosen as a component model of the ensemble. It constructs a deep network with a large number of layers that keep learning residuals to match the predicted labels with the actual labels. The essential components of the model are convolution, pooling, and fully connected layers stacked one over the other. The identity connection among the layers distinguishes the residual network from a plain network. The residual block of ResNet is shown in Figure 4. To skip one or more layers, ResNet introduces the "skip connection" (identity shortcut connection) into the model. The residual block of the ResNet model can be represented mathematically by Equation (1):

Y = F(X) + X, (1)

where X and Y are the input and output, respectively, and F is the function applied to the input of the residual block.
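The residual mapping Y = F(X) + X of Equation (1) can be illustrated with a minimal numpy sketch; the two-layer function F below is a simplified stand-in for the convolutional layers of a real residual block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Y = relu(F(X) + X): the block learns a residual F, and the
    skip connection adds the unchanged input back to it."""
    f = relu(x @ w1) @ w2   # F(X): two linear maps with a ReLU in between
    return relu(f + x)      # identity shortcut connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.01
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)
```

Note that when the residual weights are zero, the block reduces to (the activation of) the identity, which is why very deep stacks of such blocks remain trainable.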

Inception V3
The motivation for choosing the Inception neural network as a component model of the ensemble is the inception module, which consists of 1 × 1 filters followed by convolutional layers of different sizes. Due to this, the Inception network is able to extract more complex features. Inception V3 is inspired by GoogleNet. It is the third edition of Google's Inception architecture, built from symmetric and asymmetric blocks, including convolution, average pooling, max pooling, dropout, and fully connected layers. Batch normalization is used extensively throughout the architecture.
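The 1 × 1 filters mentioned above act as a per-pixel linear map across channels: they reduce channel dimensionality cheaply before the larger convolutions. A minimal numpy sketch of this channel-mixing view (shapes are illustrative):

```python
import numpy as np

def conv1x1(feature_map, weights):
    """A 1x1 convolution mixes/reduces channels at each spatial position
    independently; it never looks at spatial neighbors."""
    # feature_map: (H, W, C_in), weights: (C_in, C_out)
    return feature_map @ weights

x = np.ones((5, 5, 64))            # 64-channel feature map
w = np.full((64, 16), 1.0 / 64.0)  # reduce 64 channels to 16
y = conv1x1(x, w)                  # shape (5, 5, 16)
```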

DenseNet
DenseNet is chosen as a component model of the ensemble because it mitigates the accuracy degradation caused by the vanishing gradient. In deep neural networks, information may vanish before it reaches the last layer due to the long path between the input and output layers. In the DenseNet model, every layer receives additional information from all preceding layers and then passes its feature maps to all subsequent layers. Feature maps are concatenated, so each layer receives the "collective knowledge" of all preceding layers.
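The dense connectivity described above can be sketched as follows; each toy layer stands in for DenseNet's conv layers, and the channel counts are illustrative:

```python
import numpy as np

def dense_block(x, layer_fns):
    """Each layer sees the channel-wise concatenation of the input and
    every preceding layer's output (the 'collective knowledge')."""
    features = [x]
    for fn in layer_fns:
        out = fn(np.concatenate(features, axis=-1))
        features.append(out)
    return np.concatenate(features, axis=-1)

def make_layer(c_in, growth=4, seed=0):
    """Toy layer: maps its concatenated input to `growth` new channels."""
    w = np.random.default_rng(seed).standard_normal((c_in, growth)) * 0.1
    return lambda t: np.maximum(t @ w, 0.0)

x = np.ones((2, 2, 8))
layers = [make_layer(8), make_layer(12), make_layer(16)]
y = dense_block(x, layers)   # output channels: 8 + 4 + 4 + 4 = 20
```

Because every layer's output is kept and concatenated, gradients have short paths back to the input, which is the property motivating DenseNet's selection here.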

InceptionResNet V2
This is a variant of the Inception V3 model that incorporates the main idea of the ResNet model. The residual connections simplify the blocks, which facilitates the development of deeper networks. The study in [53] shows that residual connections play an essential role in accelerating the training of Inception networks.

VGG-19
VGG-19 was developed by the Visual Geometry Group, and the number 19 stands for the number of layers with trainable weights. It is a simple network, as the model is made up of sixteen convolutional layers and three fully connected layers. VGG uses very small 3 × 3 convolution filters and uses 2 × 2 max pooling for downsampling. The VGG-19 model has approximately 143 million parameters learned from the ImageNet dataset.

Development of Individual Models
A brief description of the architecture of the five deep learning models is given in Section 5. In this section, the training and fine-tuning details of the individual models are provided. DenseNet is fine-tuned by adding a fully connected layer of eight neurons with the SoftMax activation function to classify the eight classes of skin cancer. It is trained for 50 epochs using the Adam optimizer with a learning rate of 1e-4. InceptionResNetV2 is fine-tuned by adding two dense layers of 512 and eight neurons with ReLU and SoftMax activation functions, respectively. GlobalAvgPool2D pooling is applied to downsample the feature maps. The model is trained for 50 epochs using a stochastic gradient descent (SGD) optimizer with a learning rate of 0.001 and a batch size of 25. VGG-19 is fine-tuned by applying GlobalAvgPool2D to downsample the feature maps and adding two dense layers of 512 and eight neurons with ReLU and SoftMax activation functions, respectively. The model is trained for 50 epochs with a learning rate of 1e-4, an SGD optimizer, and a categorical cross-entropy loss function. After retraining and fine-tuning the individual models, the test dataset S_ts = {⟨Z^(i), t^(i)⟩}_{i=1}^{m}, with m = 1797, is used to validate the trained component models.
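The classification head described above (GlobalAvgPool2D followed by Dense(512, ReLU) and Dense(8, SoftMax), as used for InceptionResNetV2 and VGG-19) can be sketched as a numpy forward pass. The backbone feature-map size and the random weights are illustrative only; in the actual work these weights are learned during fine-tuning:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classification_head(feature_maps, w1, b1, w2, b2):
    """GlobalAvgPool2D -> Dense(512, ReLU) -> Dense(8, SoftMax):
    the head added on top of the pre-trained backbone."""
    pooled = feature_maps.mean(axis=(1, 2))      # global average pooling
    hidden = np.maximum(pooled @ w1 + b1, 0.0)   # Dense(512) with ReLU
    return softmax(hidden @ w2 + b2)             # Dense(8) with SoftMax

rng = np.random.default_rng(1)
fmap = rng.standard_normal((2, 7, 7, 1024))      # toy backbone output
w1, b1 = rng.standard_normal((1024, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.standard_normal((512, 8)) * 0.01, np.zeros(8)
probs = classification_head(fmap, w1, b1, w2, b2)
```

Each row of `probs` is a distribution over the eight skin cancer classes.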

Development of Ensemble Models for Skin Cancer Classification
In this stage, the individual models trained with different parameters are combined using different combination rules. Details of the combination rules can be found in [54]. Many empirical studies show that simple combination rules, such as majority voting and weighted majority voting, yield remarkably improved performance. These rules are effective for constructing ensemble decisions based on class labels. Therefore, for the current multiclass classification, majority voting, weighted majority voting, and weighted averaging rules are applied to combine the decisions of the individual models. For the weighted averaging ensemble, equal weights are assigned to each model: the final softmax outputs of all learners are averaged as (1/N) ∑ p_i, where N is the number of learners and p_i is the softmax output of the ith learner. For weighted majority voting, the weight of each model can be set proportional to its classification accuracy on the training/test dataset [55]. Therefore, for the weighted majority-based ensemble, the weights (W_ResNet, W_Inception, W_DenseNet, W_InceptionResNet, W_VGG) are empirically estimated for each learner with respect to its average accuracy on the test dataset. The obtained weights W_k, k = 1, ..., 5, are normalized so that they add up to 1. This normalization does not affect the decision of the weighted majority-based ensemble.
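The weighted averaging rule and the weight normalization described above can be sketched as below; the per-model weights are illustrative, and as noted in the text, rescaling the weights does not change the ensemble decision:

```python
import numpy as np

def weighted_average_ensemble(softmax_outputs, weights):
    """Combine per-model softmax outputs of shape (K, n_samples, n_classes)
    with K weights; equal weights give plain averaging."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize to sum to 1
    probs = np.tensordot(w, softmax_outputs, axes=1)
    return probs.argmax(axis=-1), probs

# Toy softmax outputs of 5 learners on 3 samples over 8 classes.
rng = np.random.default_rng(2)
raw = rng.random((5, 3, 8))
outputs = raw / raw.sum(axis=-1, keepdims=True)
accs = [0.92, 0.72, 0.92, 0.91, 0.91]             # accuracy-derived weights
pred, probs = weighted_average_ensemble(outputs, accs)
```

Since the argmax is invariant to a positive rescaling of the weights, normalizing them only makes the combined output a proper probability distribution.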
The ensemble decision map is constructed by stacking the decision values of the individual learners for each image Z^(i) in the test dataset, i.e., d^(i) = [d_1^(i), ..., d_5^(i)]. The ensemble decision values are obtained for the two well-known ensemble methods of majority voting and weighted majority voting. For each image, the vote given to the jth class is computed using the indicator function Δ(d_k^(i), c_j), which matches the predicted value of the kth individual model with the corresponding class label as in Equation (2):

Δ(d_k^(i), c_j) = 1 if d_k^(i) = c_j, and 0 otherwise. (2)

The total votes votes_j^(i) received from the individual models for the jth class are obtained using majority voting as in Equation (3):

votes_j^(i) = ∑_{k=1}^{5} Δ(d_k^(i), c_j). (3)

With the weighted majority voting rule, however, the votes for the jth class are obtained over the learners k = 1 to 5 as in Equation (4):

votes_j^(i) = ∑_{k=1}^{5} W_k Δ(d_k^(i), c_j). (4)

The ensemble decision class values l_Ens^(i) are obtained using the majority voting and weighted majority voting rules as in Equations (5) and (6):

l_MV^(i) = argmax_j votes_j^(i), with the votes of Equation (3), (5)
l_WMV^(i) = argmax_j votes_j^(i), with the votes of Equation (4). (6)
The image is assigned to the class that receives the maximum votes.
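The voting rules of Equations (2)–(6) can be sketched directly; the example labels and weights below are illustrative:

```python
import numpy as np

def vote_counts(labels, weights, n_classes):
    """votes_j = sum_k W_k * Δ(d_k, c_j); W_k = 1 gives plain majority voting."""
    votes = np.zeros(n_classes)
    for d_k, w_k in zip(labels, weights):
        votes[d_k] += w_k          # the indicator Δ picks out the voted class
    return votes

def majority_vote(labels, n_classes=8):
    return int(vote_counts(labels, [1.0] * len(labels), n_classes).argmax())

def weighted_majority_vote(labels, weights, n_classes=8):
    return int(vote_counts(labels, weights, n_classes).argmax())

preds = [2, 2, 5, 5, 5]                # five learners' predicted class labels
w = [0.30, 0.30, 0.15, 0.15, 0.10]     # normalized per-model weights
```

Here plain majority voting picks class 5 (three votes against two), while weighted majority voting picks class 2, whose two voters carry 0.6 of the total weight; this is how accurate learners get more say in the final decision.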

Performance Measures
The classification performance of the five deep learners and proposed ensemble models has been evaluated using the following quality measures.

Accuracy
Accuracy is a performance measure that indicates the overall performance of a classifier as the number of correct predictions divided by the total number of predictions. It shows the ability of the learning models to correctly classify the image data samples. It is computed as in Equation (7):

Accuracy = (TP + TN) / (TP + TN + FP + FN), (7)

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative.

Precision
Precision is a performance measure that shows what proportion of the samples predicted as positive are truly positive. It evaluates the ability of the classifier to avoid labeling negative data samples as positive. It is calculated as in Equation (8):

Precision = TP / (TP + FP). (8)

Recall
Recall is a classification measure that shows how many of the truly relevant results are returned. It reflects the proportion of all positive class data samples that the learner predicts as positive. It is calculated as in Equation (9):

Recall = TP / (TP + FN). (9)

F1 Score
The F1 score is calculated from precision and recall and can be considered their harmonic mean. Its value ranges between 0 and 1; the best value is 1 and the worst is 0. It is computed as in Equation (10):

F1 = 2 × (Precision × Recall) / (Precision + Recall). (10)
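Equations (7)–(10) follow directly from the four confusion counts; the counts below are an illustrative example, and for the multiclass case these measures are computed per class:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from the confusion counts,
    per Equations (7)-(10)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 90 true positives, 10 false positives, 880 true negatives,
# 20 false negatives out of 1000 test samples.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```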

Results and Discussion
The performance of the proposed models has been evaluated using the measures of accuracy, precision, recall, F1 score, and support. Tables 2 and 3 show the comparative classification performance of the individual deep learners ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19 and of the proposed ensemble models. It is observed from the tables that the ensemble models outperform the individual models in terms of precision, recall, F1 score, and accuracy. The accuracies of the individual learners ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19 are 92%, 72%, 92%, 91%, and 91%, respectively.
However, the accuracies of the majority voting, weighted averaging-based, and weighted majority voting-based ensemble models are 98%, 98.2%, and 98.6%, respectively. Figure 5 shows that the accuracy of the ensemble approach is much higher than that of the individual models.

Table 4 shows the performance comparison of the individual deep learning models developed in [10][11][12]19,31,32,47,56,57] and the deep learning models developed in the proposed work for eight classes of skin cancer. It is observed from the table that the individual fine-tuned deep learning models perform better than the individual deep learning models developed in [13,32,47,57]. Table 4 shows classification results for different numbers of classes. Usually, in machine learning models, as the number of classes increases, classification accuracy decreases due to the increased model complexity. Table 4 shows that the individual models developed for eight classes perform comparably to models developed for fewer classes. The comparison has been made with classification models that use the ISIC or HAM dataset employed in the ISIC 2018 challenge (Task 3) and available at (https://challenge2018.isic-archive.com/, accessed on: 10 October 2021).

Table 5 shows the performance comparison of the proposed ensemble models with the recent deep learning-based ensemble models proposed in [18,19,21,31,49]. It is observed from the table that the majority voting, weighted averaging, and weighted majority ensemble models have accuracies of 98%, 98.2%, and 98.6%, respectively, which is much higher than the ensemble models proposed in [18,19,21,31,49]. It is observed from the literature that as the number of classes increases, classification accuracy decreases. The previous works carried out in [18,19,31,49] have lower accuracy compared to the proposed ensemble models.
Our ensemble models have outperformed both the dermatologists and the recently developed deep learning-based models for multiclass skin cancer classification without extensive pre-processing. Figure 6 shows the training accuracy of individual deep learning models. Confusion matrices of individual and ensemble models are shown in Figure 7. The motivation for adopting the ensemble learning models is that they improve the generalization of the learning systems. Machine learning models are bounded by the hypothetical spaces that have bias and variance. The ensemble models combine the decision of individual weak learners to overcome the problem of the single learner that may have a limited capacity to capture the distribution (causing variance error) present in the data. Our results show that making a final decision by consulting multiple diverse learners may help in improving the robustness as well as reducing the bias and variance error.

Conclusions
Much research has been performed on the classification of skin cancer, but most studies could not be extended to the classification of multiple classes of skin cancer with high performance. In this work, better-performing heterogeneous ensemble models were developed for multiclass skin cancer classification using majority voting and weighted majority voting. The ensemble models were developed using diverse types of learners with various properties to capture the morphological, structural, and textural variations present in the skin cancer images for better classification. It is observed from the results that the proposed ensemble models outperform both dermatologists and the recently developed deep learning methods for multiclass skin cancer classification. The study shows that the performance of convolutional neural networks for the classification of skin cancer is promising, but the accuracy of individual classifiers can still be enhanced through the ensemble approach. The accuracies of the ensemble models are 98% and 98.6%, which shows that the ensemble approach classifies the eight different classes of skin cancer more accurately than the individual deep learners. Moreover, the proposed ensemble models perform better than recently developed deep learning approaches for multiclass skin cancer classification.

Conflicts of Interest:
The authors declare no conflict of interest.