1. Introduction
Cancer can cause death if not diagnosed and treated in a timely fashion and can start almost anywhere in the human body. Skin cancer is a common type of cancer, as more than three million Americans are diagnosed with skin cancer each year (
https://www.skincancer.org/skin-cancer-information/skin-cancer-facts, accessed on: 24 October 2021). If skin cancer is diagnosed early, it can usually be treated. There are eight categories of skin cancer: melanoma (MEL), melanocytic nevi (NV), basal cell carcinoma (BCC), benign keratosis lesions (BKL), actinic keratosis (AK), dermatofibroma (DF), squamous cell carcinoma (SCC), and vascular lesions (VASC) [
1]. MEL is the most dangerous type of skin cancer, as it spreads to other organs very rapidly. It develops in skin cells called melanocytes. MEL is less common than the other categories of skin cancer. NV are pigmented moles that vary in color across skin tones. They mostly develop during childhood and early adult life, as the number of moles increases up to the age of 30 to 40; thereafter, the number of naevi tends to decrease. BCC develops in skin cells called basal cells, which produce new skin cells as the old ones die off. AK is a pre-cancer that develops on skin affected by chronic exposure to ultraviolet (UV) rays. BKL is one of the common benign neoplasms of the skin. DF occurs at all ages and in people of every ethnicity. It is not clear whether DF is a reactive process or a neoplasm [
2]. The lesions are made up of proliferating fibroblasts. Vascular lesions are relatively common abnormalities of the skin and underlying tissues. SCC is one of the most frequently occurring forms of skin cancer and usually results from exposure to UV rays.
In literature, machine learning approaches such as support vector machine (SVM) [
2], neural networks [
3], naïve Bayes classifier [
4], and decision trees [
5] have been used for skin cancer classification. The limitation of these machine learning approaches is their reliance on human-engineered features. In the last decade, deep learning approaches such as convolutional neural networks (CNNs) became popular due to their ability to extract features automatically [
6,
7,
8,
9], and have been extensively used in research [
10,
11,
12,
13]. Dorj et al. [
14] worked on skin cancer classification using deep CNN. Romero et al. in [
15] performed melanoma cancer classification on dermoscopy images using a CNN. The technique has an accuracy of 81.3% on the International Skin Imaging Collaboration (ISIC) archive dataset. Jinnai et al. [
16] carried out pigmented skin lesion classification using clinical images and a faster region-based CNN. The classification accuracy of the method was compared with the diagnostic accuracy of ten board-certified dermatologists. Esteva et al. [
11] performed multiclass skin cancer classification using dermoscopy images with the different variants of CNN. Adegun et al. in [
17] developed a probabilistic model to improve the performance of a fully convolutional network-based deep learning system for the analysis and segmentation of skin lesion images. The probabilistic model achieved an accuracy of 98%.
Recently, researchers have proposed ensemble methods to enhance classification performance [
18,
19,
20]. Bajwa et al. in [
21] developed an ensemble model using ResNet-152 [
22], DenseNet-161 [
22], SE-ResNeXt-101 [
23], and NASNet [
23] for the classification of seven classes of skin cancer using the ISIC dataset, achieving an accuracy of 93%. An ensemble is a machine learning method that combines the decisions of several individual learners to increase classification accuracy [
24]. The ensemble model exploits the diversity of individual models to make a combined decision; therefore, it is expected that the ensemble model increases classification accuracy [
25,
26]. Binary skin cancer classification has been performed in [
15,
27,
28,
29], but multiclass classification with comparably good results has remained largely unaddressed. The recent approaches developed in [
11,
19,
30,
31,
32] for multiclass skin cancer classification have also struggled to achieve high accuracy. In this research, heterogeneous ensemble models with improved performance are developed for multiclass skin cancer classification using majority voting and weighted majority voting. The ensemble models are built from diverse types of learners with different properties to capture the morphological, structural, and textural variations present in skin cancer images. The proposed ensemble methods perform better than both the individual deep learning models and the deep learning-based ensemble models proposed in the literature for multiclass skin cancer classification.
The following contributions are made in this research work:
Five pre-trained models are developed, and their decisions are combined using majority voting and weighted majority voting for the classification of eight different classes of skin cancer.
The pre-trained models with different structural properties are trained to capture the morphological, structural, and textural variations present in skin cancer images, motivated by the following ideas: residual learning, extraction of more complex features, mitigation of the accuracy degradation caused by vanishing gradients, feature invariance through residual learning, and extraction of the fine details present in the images.
The proposed ensemble methods perform better than expert dermatologists and previously proposed deep learning-based ensemble models for multiclass skin cancer classification.
A comparative study is conducted for the performance analysis of five fine-tuned deep learning models and their ensemble models on the ISIC dataset to determine the model with better performance.
In the proposed method, no extensive pre-processing has been performed on the images and no lesion segmentation has been carried out, which makes the work more generic and reliable.
The rest of the paper is organized as follows:
Section 2 presents related work.
Section 3 describes the proposed method. Ensemble methods are discussed in
Section 4, whereas
Section 5 discusses different deep neural network models followed by individual models in
Section 6. The quality measures used to measure the performance of the proposed study are presented in
Section 7.
Section 8 discusses the results, followed by the conclusion.
2. Related Work
Skin cancer is usually diagnosed through a physical examination of the skin or with the help of a biopsy. Detection of skin cancer through physical examination requires a great degree of experience and expertise, and biopsy-based examination is a tedious and time-consuming task that also requires expert pathologists. Currently, macroscopic and dermoscopy images are used by dermatologists during the skin cancer detection procedure. Even with dermoscopy images, however, accurate skin cancer detection is a challenging task, as multiple skin cancers may appear similar in their initial presentation. Furthermore, even expert dermatologists gain only limited exposure to the different types of skin cancer over their careers. Expert dermatologists have a skin cancer detection accuracy of 62% to 80% [
33,
34]. Reports on the diagnostic accuracy of clinical dermatologists show 62% accuracy for those with three to five years of clinical experience, whereas dermatologists with more than 10 years of experience reach a diagnostic accuracy of 80%; diagnostic performance falls even further for dermatologists with less experience [
34]. Therefore, dermoscopy in the hands of an inexperienced dermatologist may reduce diagnostic accuracy [
33,
35,
36].
In the early 1980s, computer-aided diagnosis (CAD) systems were developed to assist dermatologists with the challenges faced during the diagnostic process [
37]. Initially, CAD systems used dermoscopy images for the binary classification of melanoma versus benign lesions [
37]. Since then, much research has been carried out to solve this challenging problem. Several studies [
37,
38,
39] have been performed using the manual evaluation methods developed using the ABCD method proposed by Nachbar et al. in [
40]. Moreover, machine learning classifiers have been developed with handcrafted features. These classifiers include support vector machine (SVM) [
2], naïve Bayes [
4], k-nearest neighbor [
41], logistic regression [
42], and artificial neural networks [
3]. However, the high intraclass and low interclass variations of MEL have led to unsatisfactory performance of handcrafted-feature-based cancer detection [
43].
With the advent of deep learning, CNNs brought a breakthrough in the solution of many problems, including skin cancer classification. CNNs give higher detection accuracy and reduce the burden of hand-crafted feature extraction by extracting features automatically [
44]. As CNNs require huge datasets for better feature learning at abstract levels [
45], transfer learning has been introduced to overcome the requirement for huge datasets [
20,
32], where a model trained for one task is reused for another task. Khalid et al. proposed a skin lesion detection technique by developing a transfer learning-based AlexNet model for the classification of three skin lesions in [
46]. To develop the proposed method, data augmentation was applied to enlarge the dataset, and the method achieved an accuracy of 98.61%. Kawahara et al. developed a skin cancer detection technique using a dataset of 1300 images to construct a linear classifier on features extracted by a CNN in [
47]. The technique does not require preprocessing or skin lesion segmentation. They carried out classifications of five and 10 classes and achieved an accuracy of 85.8% and 81.9%, respectively. In [
32], a novel CNN architecture has been developed. The architecture relies on multiple tracts to perform skin lesion classification. The authors used a CNN pretrained at a single resolution and retrained it for multiple resolutions on publicly available datasets, obtaining an accuracy of 79.15% for ten classes. In [
48], Ali et al. developed a skin lesion classification approach using a deep convolutional neural network (DCNN) to classify benign and malignant skin lesions. The authors carried out preprocessing consisting of filter-based noise removal, input normalization, and data augmentation, and achieved an accuracy of 91.93%.
Ensemble methods are developed to enhance classification performance by exploiting the diversity of individual models. Recently, multiclass skin cancer classification techniques based on ensemble approaches have been developed in the literature. Harangi et al. [
18] investigated how an ensemble of CNN models can be developed to enhance skin cancer classification accuracy. They developed an ensemble model for three classes of skin cancer; the individual GoogleNet, AlexNet, ResNet, and VGGNet models achieved accuracies of 84.2%, 84.8%, 82.8%, and 81.4%, respectively, and the ensemble of GoogleNet, AlexNet, and VGGNet achieved an accuracy of 83.8%. In [
20], Nyiri and Kiss developed different ensemble methods using CNNs. The authors applied different preprocessing methods to the ISIC2017 and ISIC2018 datasets and achieved an accuracy of 93.8%. In [
49], Shahin et al. carried out skin lesion classification using an ensemble of deep learners, aggregating the decisions of ResNet50 and Inception V3 to classify seven skin cancer classes with an accuracy of 89.9%. In [
19], Majtner et al. developed an ensemble of the VGG16 and GoogleNet architectures using the ISIC 2018 dataset. To develop the ensemble, the authors carried out data augmentation and colour normalization on the dataset. The proposed method achieved an accuracy of 80.1%. In [
50], Rahman et al. developed a multiclass skin cancer classification approach using a weighted-averaging ensemble of ResNeXt, SeResNeXt, ResNet, Xception, and DenseNet as individual models for the classification of seven classes of skin cancer, with an accuracy of 81.8%.
Previous work on skin cancer classification based on dermoscopy images not only lacks generality but also shows lower accuracy for multiclass classification [
11,
19,
32]. In this paper, we propose a multiclass skin cancer classification method that uses diverse types of learners with different properties to capture the morphological, structural, and textural variations present in skin cancer images. The proposed ensemble models perform better than both the individual deep learning models and the deep learning-based ensemble models proposed in the literature for multiclass skin cancer classification.
3. Proposed Methodology
The proposed work is performed in two stages. In the first stage, five diverse deep learning models, namely ResNet, Inception V3, DenseNet, InceptionResNet V2, and VGG-19, have been developed using transfer learning with the ISIC 2019 dataset. Five pre-trained models with different structural properties have been selected to capture the morphological, structural, and textural variations present in skin cancer images, motivated by the following ideas: residual learning, extraction of more complex features, mitigation of the accuracy degradation caused by vanishing gradients, feature invariance through residual learning, and extraction of the fine details present in the images. In the second stage, two ensemble models have been developed by combining the decisions of the deep learners using majority voting and weighted majority voting to classify the eight different categories of skin cancer.
Figure 1 and
Figure 2 show the overall block diagram of the proposed system.
ISIC developed an international repository of dermoscopy images known as the ISIC Archive (
https://www.isic-archive.com, accessed on 24 October 2021) for technical research. The ISIC 2019 repository contains a training dataset consisting of 25,331 dermoscopy images across eight different categories. Details of the dataset and the distribution of data samples for each class are shown in
Table 1. It is observed from
Table 1 that the distribution of data samples varies across the classes. For example, the melanocytic nevi (NV) class consists of 12,875 images, the melanoma class consists of 4522 images, and basal cell carcinoma (BCC) consists of 3323 images. To prepare the dataset for the development of the proposed ensemble models, 1500 images have been randomly selected from each of the NV, BCC, melanoma, and BKL classes. From the remaining four classes, all available images in the ISIC repository have been added to the dataset. Thus, the dataset consists of 7487 images. It has then been split into training and test sets: the test set has been formed by taking 25% of the total dataset and consists of 1797 images, leaving 5690 images for training.
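The per-class sampling and the train/test split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `build_dataset`, the per-class cap, and the seed are assumptions:

```python
import random

def build_dataset(class_images, cap=1500, test_frac=0.25, seed=42):
    """Cap the largest classes at `cap` randomly sampled images, keep the
    smaller classes whole, then split the pooled data into train and test."""
    rng = random.Random(seed)
    pooled = []
    for label, images in class_images.items():
        chosen = rng.sample(images, cap) if len(images) > cap else list(images)
        pooled.extend((img, label) for img in chosen)
    rng.shuffle(pooled)
    n_test = round(len(pooled) * test_frac)
    return pooled[n_test:], pooled[:n_test]  # (train, test)
```

With the actual ISIC 2019 class counts, capping the four largest classes at 1500 images and keeping the rest yields the 7487-image pool reported above.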
Figure 3 shows the sample images of eight different classes of skin cancer. In the proposed approach, images have been resized to 224 × 224 × 3.
5. Deep Neural Network Models
To develop the proposed ensemble models, five deep neural network models, namely ResNet, Inception V3, DenseNet, InceptionResNet V2, and VGG-19, have been developed by fine-tuning the model parameters. Brief details of the models are described below.
5.1. ResNet
ResNet introduces residual learning, which is the reason it was chosen as a component model of the ensemble. It constructs a deep network with a large number of layers that learn residuals in order to match the predicted labels with the actual labels. The essential components of the model are convolution and pooling layers stacked one over the other, followed by fully connected layers. The identity connections among the layers are what differentiate a residual network from a plain network. The residual block of ResNet is shown in
Figure 4. To skip one or more layers, ResNet introduces “skip connections”, also called “identity shortcut connections”, into the model. The residual block of the ResNet model can be represented mathematically by Equation (1):
Y = F(X) + X,
(1)
where X and Y are the input and output, respectively, and F is the function applied to the input given to the residual block.
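The residual computation Y = F(X) + X can be illustrated numerically with a toy two-layer transform standing in for F. The function and weight names here are illustrative, not the ResNet implementation:

```python
import numpy as np

def residual_block(x, weights, activation=np.tanh):
    """Toy residual block: Y = F(X) + X. The skip connection adds the input
    back unchanged, so the layers only have to learn the residual F(X)."""
    h = activation(x @ weights["w1"])  # first layer of the residual branch
    f = h @ weights["w2"]              # residual branch output F(X)
    return f + x                       # identity shortcut connection
```

A useful sanity check: if the branch weights are all zero, F(X) = 0 and the block is exactly the identity, which is what makes very deep residual networks easy to optimize.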
5.2. Inception V3
The motivation for choosing the inception neural network as a component model of the ensemble is the inception module, which consists of 1 × 1 filters followed by convolutional layers of different sizes; this enables the network to extract more complex features. Inception V3 is inspired by GoogLeNet and is the third edition of Google's Inception architecture, built from symmetric and asymmetric blocks including convolution, average pooling, max pooling, dropout, and fully connected layers. Batch normalization is used extensively throughout the architecture.
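The branch-and-concatenate idea of the inception module can be shown schematically, with plain matrix products standing in for the 1 × 1 and larger convolutions. This is a toy sketch on flat feature vectors, not the Inception V3 code:

```python
import numpy as np

def inception_module(x, branch_weights):
    """Toy inception module: apply several parallel transforms to the same
    input (stand-ins for 1x1, 3x3, 5x5 convolution branches) and concatenate
    their outputs along the channel axis."""
    branches = [x @ w for w in branch_weights]  # one linear map per branch
    return np.concatenate(branches, axis=-1)    # channel-wise concatenation
```

The output width is the sum of the branch widths, mirroring how inception modules widen the feature map by stacking branch outputs.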
5.3. DenseNet
DenseNet is chosen as a component model of the ensemble because it mitigates the accuracy degradation caused by vanishing gradients. In deep neural networks, information may vanish before it reaches the last layer due to the long path between the input and output layers. In the DenseNet model, every layer receives additional inputs from all preceding layers and passes its own feature maps to all subsequent layers. This concatenation of information gives each layer a “collective knowledge” from all preceding layers.
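The dense connectivity pattern, where each layer consumes the concatenation of all earlier feature maps, can be sketched as follows. This is a toy illustration on flat feature vectors, not the DenseNet implementation:

```python
import numpy as np

def dense_block(x, layer_fns):
    """Toy dense block: each layer receives the concatenation of the input
    and all preceding layer outputs, and its own output is appended to the
    running list for the layers after it."""
    features = [x]
    for fn in layer_fns:
        out = fn(np.concatenate(features, axis=-1))  # "collective knowledge"
        features.append(out)
    return np.concatenate(features, axis=-1)
```

Because every output is kept, gradients have short paths back to early layers, which is the property that counters vanishing gradients.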
5.4. InceptionResNet V2
This is a variant of the Inception V3 model that incorporates the main idea of the ResNet model. It simplifies the residual block, which facilitates the development of deeper networks. The study in [
53] shows that the residual connections play an essential role in accelerating the training of the inception network.
5.5. VGG-19
VGG-19 was developed by the Visual Geometry Group, and the number 19 stands for the number of layers with trainable weights. It is a simple network, made up of sixteen convolutional layers and three fully connected layers. VGG uses very small 3 × 3 convolution filters and 2 × 2 max pooling for downsampling. The VGG-19 model has approximately 143 million parameters learned from the ImageNet dataset.
6. Development of Individual Models
The brief architecture of the five deep learning models is given in
Section 5. In this section, the training and fine-tuning details of the individual models are provided. First, the training dataset {(Z_i, y_i)}, i = 1, …, n, has been used to train and optimize the parameters of the individual models, where Z_i represents an image and y_i its class label. The training dataset consists of n = 5690 images. It is used to develop the classifiers of ResNet, InceptionV3, InceptionResNetV2, DenseNet, and VGG-19. To train and fine-tune the ResNet model, global average pooling (GlobalAvgPool2D) is applied to downsample the feature maps so that all spatial regions contribute to the output, and a fully connected layer containing eight neurons with the SoftMax activation function is added to classify the eight different classes. The ResNet model is trained with 50 epochs, the adaptive moment estimation (Adam) optimizer for quick optimization, a learning rate of 1e-4, and the categorical cross-entropy loss function. Inception V3 is fine-tuned by applying GlobalAvgPool2D to downsample the feature maps and adding two dense layers at the end containing 1028 and eight neurons with rectified linear unit (ReLU) and SoftMax activation functions, respectively. The model is trained using 50 epochs, a learning rate of 0.001, and the RMSprop optimizer, which uses plain momentum and maintains a moving average of the squared gradients to scale the updates.
DenseNet is fine-tuned by adding a fully connected layer containing eight neurons with the SoftMax activation function to classify the eight classes of skin cancer. It is trained using 50 epochs, the Adam optimizer, and a learning rate of 1e-4. InceptionResNetV2 is fine-tuned by adding two dense layers containing 512 and eight neurons with ReLU and SoftMax activation functions, respectively; GlobalAvgPool2D pooling is applied to downsample the feature maps. The model is trained with 50 epochs, the stochastic gradient descent (SGD) optimizer, and a learning rate of 0.001 with a batch size of 25. VGG-19 is fine-tuned by applying GlobalAvgPool2D to downsample the feature maps and adding two dense layers containing 512 and eight neurons with ReLU and SoftMax activation functions, respectively. The model is trained with 50 epochs, a learning rate of 1e-4, the SGD optimizer, and the categorical cross-entropy loss function. After retraining and fine-tuning the individual models, the test dataset of m = 1797 images is used to validate the trained component models.
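The fine-tuned classification head shared by the models above, global average pooling followed by a dense SoftMax layer over eight classes, can be sketched numerically. This is a toy NumPy illustration of the forward pass only; the name `classification_head` and the shapes are assumptions:

```python
import numpy as np

def classification_head(feature_maps, w, b):
    """Toy head: global average pooling over the spatial dimensions, then a
    dense layer with SoftMax over the 8 skin cancer classes.
    `feature_maps` has shape (batch, height, width, channels)."""
    pooled = feature_maps.mean(axis=(1, 2))        # GlobalAvgPool2D
    logits = pooled @ w + b                        # Dense(8)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)   # SoftMax probabilities
```

Global average pooling collapses each channel to a single value, so every spatial region of the feature map contributes to the class scores, as described above.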
Development of Ensemble Models for Skin Cancer Classification
In this stage, individual models trained using different parameters are combined using different combination rules. The details of different combination rules can be found in [
54]. Many empirical studies show that simple combination rules, such as majority voting and weighted majority voting, give remarkably improved performance. These rules are effective for constructing ensemble decisions based on class labels. Therefore, for the current multiclass classification, the majority voting, weighted majority voting, and weighted averaging rules are applied to combine the decisions of the individual models. For the weighted averaging ensemble, the same weight is assigned to every single model: the final SoftMax outputs of all the learners are averaged as (1/N) Σ_k p_k, where N is the number of learners and p_k is the SoftMax output of the kth learner. For weighted majority voting, the weight of each model can be set proportional to the classification accuracy of that learner on the training/test dataset [
55]. Therefore, for the weighted majority-based ensemble, a weight w_k is empirically estimated for each learner k with respect to its average accuracy on the test dataset. The obtained weights w_k are normalized so that they add up to 1. This normalization does not affect the decision of the weighted majority-based ensemble.
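The averaging of SoftMax outputs and the weight normalization described above can be sketched as follows. The helper name `averaged_softmax` is an assumption; with no weights it reduces to plain averaging, and rescaling all weights by the same factor leaves the winning class unchanged:

```python
import numpy as np

def averaged_softmax(probs, weights=None):
    """Combine per-learner SoftMax outputs of shape
    (n_learners, n_samples, n_classes) into one (n_samples, n_classes)
    array. Equal weights give the weighted-averaging ensemble."""
    probs = np.asarray(probs, dtype=float)
    n = probs.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, float)
    w = w / w.sum()                        # normalize weights to sum to 1
    return np.tensordot(w, probs, axes=1)  # weighted average over learners
```

Because the normalization only rescales all class scores by a common factor, the argmax (and hence the ensemble decision) is unaffected, matching the remark above.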
The ensemble decision map is constructed by stacking the decision values of the individual learners for each image Z in the test dataset, i.e., d_1(Z), d_2(Z), d_3(Z), d_4(Z), and d_5(Z). The ensemble decision values are obtained for the two well-known ensemble methods of majority voting and weighted majority voting. For each image, the vote given to the jth class is computed using an indicator function I(·), which matches the predicted value of the kth individual model with the corresponding class label, as in Equation (2):
I(d_k(Z) = j) = 1 if d_k(Z) = j, and 0 otherwise.
(2)
The total votes V_j received from the individual models for the jth class are obtained using majority voting as in Equation (3):
V_j = Σ_{k=1}^{5} I(d_k(Z) = j).
(3)
With the weighted majority voting rule, however, the votes for the jth class are obtained for the learners k = 1 to 5 as in Equation (4):
V_j^w = Σ_{k=1}^{5} w_k I(d_k(Z) = j).
(4)
The ensemble decision class values C_MV and C_WMV are obtained using the majority voting and weighted majority voting rules as in Equations (5) and (6):
C_MV = argmax_j V_j,
(5)
C_WMV = argmax_j V_j^w.
(6)
The image is assigned to the class that receives the maximum votes.
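The vote counting of Equations (2)–(6) can be sketched in a few lines. The helper name `ensemble_vote` is an assumption, not the authors' code: with unit weights it implements majority voting, and with accuracy-proportional weights it implements weighted majority voting:

```python
import numpy as np

def ensemble_vote(predictions, n_classes, weights=None):
    """Combine per-learner class labels of shape (n_learners, n_samples).
    Each learner k adds weight w_k to the class it predicted (the indicator
    term w_k * I(d_k(Z) = j)); the class with the maximum votes wins."""
    predictions = np.asarray(predictions)
    k, m = predictions.shape
    w = np.ones(k) if weights is None else np.asarray(weights, float)
    votes = np.zeros((m, n_classes))
    for i in range(k):
        votes[np.arange(m), predictions[i]] += w[i]  # accumulate V_j
    return votes.argmax(axis=1)                      # argmax over classes
```

Note that ties are broken toward the lowest class index by `argmax`; a production implementation would need an explicit tie-breaking rule.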
8. Results and Discussion
The performance of the proposed models has been evaluated using the measures of accuracy, precision, recall, F1-score, and support.
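These per-class measures can be computed from a confusion matrix as follows. This is a minimal sketch, assuming the helper name `per_class_metrics` (support is the number of true samples per class):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Per-class precision, recall, F1-score, and support derived from the
    confusion matrix cm, where cm[t, p] counts true class t predicted as p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # TP / predicted count
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # TP / true count
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1, cm.sum(axis=1)     # support per class
```

Overall accuracy is the trace of the confusion matrix divided by the total number of samples.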
Table 2 and
Table 3 show the comparative classification performance of the individual deep learners, ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19, and the proposed ensemble models. It is observed from the tables that the ensemble models outperform the individual models in terms of precision, recall, F1-score, and accuracy. The accuracy of the individual learners ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19 is 92%, 72%, 92%, 91%, and 91%, respectively.
However, the accuracy measures of the majority voting, weighted averaging based ensemble, and weighted majority voting-based ensemble models are 98%, 98.2%, and 98.6%, respectively.
Figure 5 shows that the accuracy of the ensemble approach is much higher than the individual models.
Table 4 shows the performance comparison of the individual deep learning models developed in [
10,
11,
12,
19,
31,
32,
47,
56,
57] and the deep learning models developed in the proposed work for eight classes of skin cancer. It is observed from the table that the individual fine-tuned deep learning models perform better than the individual deep learning models developed in [
13,
32,
47,
57].
Table 4 shows classification results with different numbers of classes. Usually, in machine learning models, as the number of classes increases, the classification accuracy decreases due to the increased model complexity. It is shown in
Table 4 that the individual models developed for eight classes perform comparably to the models developed for fewer classes. The comparison has been made with classification models that use the ISIC or HAM10000 dataset used in the ISIC 2018 challenge (Task 3), which is available at (
https://challenge2018.isic-archive.com/, accessed on: 10 October 2021).
Table 5 shows the performance comparison of the proposed ensemble model with the recent deep learning-based ensemble models proposed in [
18,
19,
21,
31,
49]. It is observed from the table that the majority voting, weighted averaging, and weighted majority ensemble models have an accuracy of 98%, 98.2%, and 98.6%, respectively, which is much higher than the ensemble models proposed in [
18,
19,
21,
31,
49]. It is observed from the literature that in classification, as the number of classes increases, the classification accuracy decreases. The previous works carried out in [
18,
19,
31,
49] have lower accuracy compared to the proposed ensemble models. Our ensemble models outperform both dermatologists and the recently developed deep learning-based models for multiclass skin cancer classification, without extensive pre-processing.
Figure 6 shows the training accuracy of individual deep learning models. Confusion matrices of individual and ensemble models are shown in
Figure 7. The motivation for adopting ensemble learning models is that they improve the generalization of learning systems. Machine learning models are bounded by their hypothesis spaces, which carry bias and variance. Ensemble models combine the decisions of individual weak learners to overcome the limitation of a single learner, which may have limited capacity to capture the distribution present in the data (causing variance error). Our results show that making a final decision by consulting multiple diverse learners helps improve robustness as well as reduce bias and variance error.