1. Introduction
Skin cancer is considered one of the most dangerous types of cancer in the world [1,2], and the number of deaths resulting from this disease is increasing daily [3,4]. Moreover, it is one of the fastest-spreading types of cancer [5]. However, treatment is possible if it is detected in its early stages [6]. According to recent statistics, 20% of skin cancer cases are discovered at a point where survival is no longer possible due to disease progression [7]. Worldwide, approximately 50,000 people die each year from skin cancer [7,8], which represents 0.7% of all cancer-related deaths [8]. The estimated cost of treatment is approximately USD 30 million, which makes it prohibitively expensive for many patients [5].
Doctors use multiple methods to detect skin cancer [9]. Visual inspection is the initial way to identify the possibility of the disease [10,11]. The American Center for the Study of Dermatology developed a guide to the possible shape of melanoma called ABCD (asymmetry, border, color, diameter) [2,12,13], which is used by doctors for initial screening. If a suspicious skin lesion is found, the doctor takes a biopsy of the visible lesion [14] and examines it microscopically to determine whether it is benign or malignant and, if malignant, the type of skin cancer [15]. Dermoscopy is another technique that doctors use to diagnose skin cancer [16]. It involves taking bright pictures of the skin lesion, which appears in the form of dark spots [17]. However, this method faces many difficulties, the most important of which is the inability to determine the nature of the lesion due to surrounding conditions such as the presence of hair or blood vessels, incorrect lighting, failure to capture the true shape of the spot, and the similarity in spot shape between cancerous and non-cancerous diseases [18,19]. Moreover, some people may ignore skin lesions due to poverty, lack of access to proper healthcare, or misdiagnosis. Given an image of a skin lesion, the goal of this work is to easily and automatically classify the image as benign or possibly cancerous. Such a system can be deployed as an easy-to-use smartphone application.
The contributions of this paper are as follows:
Develop an artificial intelligence-based screening system for skin cancer (melanoma and non-melanoma) using dermoscopic images of the skin lesions as input. Such a system can aid in clinical screening tests, reduce errors, and improve early diagnosis;
Implement transfer learning of 13 deep convolutional neural networks models for the classification of skin lesion images into seven categories, including melanoma, benign keratosis-like lesions, and five other non-melanoma cancers;
Evaluate the classification performance of all models using common relevant metrics; training behavior and time requirements are also reported.
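The transfer-learning approach listed in the contributions keeps a pretrained network's feature-extraction layers fixed and retrains only a new classification head. A minimal toy sketch of that idea in plain Python (the "frozen" feature extractor, the synthetic data, and all hyper-parameters here are illustrative assumptions, not the actual networks or data used in this paper):

```python
import math
import random

random.seed(0)

def frozen_features(x):
    # Stand-in for a frozen pretrained feature extractor: its "weights"
    # are fixed and never updated during training.
    return [x[0] + x[1], x[0] * x[1]]

# Toy binary-labeled data: label 1 when the two coordinates sum to more than 1.
data = []
for _ in range(200):
    x = (random.random(), random.random())
    data.append((x, 1 if x[0] + x[1] > 1 else 0))

# Only the new classification "head" (w, b) is trained.
w, b = [0.0, 0.0], 0.0
lr = 0.1
for _ in range(50):                        # a few epochs of plain SGD
    for x, y in data:
        f = frozen_features(x)
        z = w[0] * f[0] + w[1] * f[1] + b
        p = 1.0 / (1.0 + math.exp(-z))     # sigmoid activation
        g = p - y                          # logistic-loss gradient
        w[0] -= lr * g * f[0]
        w[1] -= lr * g * f[1]
        b -= lr * g

def predict(x):
    f = frozen_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

accuracy = sum(predict(x) == y for x, y in data) / len(data)
```

Only the head parameters change during training, which is what makes transfer learning far cheaper than training the full network from scratch.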
The remainder of this paper is organized as follows: the related work is discussed in Section 2; the dataset, deep learning models, performance evaluation metrics, and experimental setup are explained in detail in Section 3; Section 4 presents the performance evaluation results along with a comparison to the related literature and a discussion of the models; and we conclude in Section 5.
2. Related Work
Recent advances in artificial intelligence (AI) during the past decade, specifically in the field of deep learning and convolutional neural networks (CNNs), have opened the door to the development of reliable image-based medical systems for screening and diagnosis [20]. The research landscape has recently witnessed a shift from image segmentation (i.e., separation of relevant areas in the image) and explicit feature extraction toward automated classification using deep learning. The literature on skin cancer detection/screening has followed a similar trajectory, with the traditional approach of image processing to remove irrelevant artifacts (e.g., hair) being superseded by sophisticated deep learning techniques. Such techniques do not require explicit feature extraction and are generally robust to noise factors that affect images (e.g., light intensity, color, translation, reflection, etc.) [21]. However, they tend to be computationally intensive [22].
Li et al. [1] proposed digital hair removal (DHR) to filter hair out of skin lesion images and analyzed the effect of hair removal using intra-structural similarity (Intra-SSIM). In another study, Liu et al. [23] developed a new deep learning method to segment lesion images according to regions of interest (ROI). They used a new mid-level feature representation, where pre-trained neural networks (e.g., ResNet and DenseNet) were used to extract information from the ROI. Similarly, Pour and Seker [24] used convolutional neural networks for the segmentation of lesions and dermoscopic features. They used the CIELAB color space in addition to the RGB color channels instead of excessive augmentation or a pretrained model. Almansi et al. [25] proposed a new segmentation methodology using full-resolution convolutional networks (FrCN). They worked on the images without pre-/post-processing, and their results showed that FrCN yielded better results than the other deep learning segmentation approaches. Dash et al. [26] proposed a new segmentation method based on a deep fully convolutional network comprising 29 layers. Xie et al. [27] proposed segmenting dermoscopy images with a convolutional neural network that uses an attention mechanism to preserve edge details. Serte and Demirel [28] proposed a novel Gabor wavelet-based deep learning model for the classification of melanoma and seborrheic keratosis. This model builds on an ensemble of seven Gabor wavelet-based CNN models and fuses the Gabor wavelet-based model with an image-based CNN model. The performance evaluation results showed that the ensemble of the image-based and Gabor wavelet-based models outperformed the individual models, as well as the ensemble of only Gabor wavelet-based CNN models.
Deep transfer learning has been widely deployed in the medical imaging literature for powerful, automatic, and internal (i.e., implicit) feature extraction. In this regard, Manzo et al. [29] employed a three-step approach for melanoma detection. In the first step, the images are resized appropriately and the dataset is balanced. Deep transfer learning is then used for feature extraction, and the extracted features feed an ensemble of traditional classification algorithms, including support-vector machines (SVM), logistic label propagation (LLP), and k-nearest neighbors (KNN). Jain et al. [30] compared six different transfer learning networks for multiclass lesion classification. However, their reported results relied upon increasing the size of the dataset by augmentation. Augmentation is typically used to introduce changes into the input images without duplication; placing several augmented copies of the same image in the dataset therefore produces biased results that do not represent the actual performance [21].
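The pitfall just described can be made concrete: if augmented copies are created before the train/validation split, near-duplicates of the same source image can land on both sides of the split and inflate validation scores. A minimal sketch (the image IDs and the "flip" augmentation are placeholders for illustration):

```python
import random

random.seed(1)

images = ["img%d" % i for i in range(100)]

def augment(img):
    # Placeholder augmentation: each image yields itself plus a flipped copy.
    return [img, img + "_flipped"]

# WRONG: augment first, then split. Copies derived from the same source
# image can end up in both the training and validation sets.
pool = [a for img in images for a in augment(img)]
random.shuffle(pool)
train_bad, val_bad = pool[:140], pool[140:]
leaked = ({v.replace("_flipped", "") for v in val_bad}
          & {t.replace("_flipped", "") for t in train_bad})

# RIGHT: split the original images first, then augment only the training set.
random.shuffle(images)
train_imgs, val_imgs = images[:70], images[70:]
train_good = [a for img in train_imgs for a in augment(img)]
no_leak = set(val_imgs) & {t.replace("_flipped", "") for t in train_good}
```

With the first ordering, `leaked` is non-empty (validation images share sources with training images); with the second, `no_leak` is empty by construction.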
4. Results and Discussion
The related work in the literature has already established that high performance is achievable in binary (i.e., benign vs. melanoma) or ternary (i.e., benign vs. melanoma vs. non-melanoma) classification of skin lesion images. The goal of our experiments was to evaluate the ability of transfer learning with the deep convolutional network models to correctly classify skin lesion images into one of the seven aforementioned categories in the dataset. The training was repeated 10 times to account for variability in the random split of images into training and validation sets, and the mean values are reported. In addition, due to the high computational cost of deep learning models, the training and validation times are also included in the results.
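The repeated-split protocol used in these experiments can be sketched as follows (the per-run evaluation below is a placeholder returning a dummy metric; in the actual experiments each run trains and validates one model):

```python
import random
import statistics

def run_once(data, train_frac, rng):
    # Randomly split the data into training and validation parts.
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, val = shuffled[:cut], shuffled[cut:]
    # Placeholder metric: the fraction of even items in the validation set,
    # standing in for a trained model's validation accuracy.
    return sum(v % 2 == 0 for v in val) / len(val)

rng = random.Random(42)
data = list(range(1000))
accuracies = [run_once(data, 0.7, rng) for _ in range(10)]  # 10 repetitions
mean_acc = statistics.mean(accuracies)
```

Reporting the mean over independent random splits reduces the chance that a single lucky or unlucky split dominates the reported numbers.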
Table 1 shows the mean overall performance metrics over 10 runs of each of the 13 deep learning models when using 70% of the data for training. All models achieved comparable accuracy values, with Resnet101 performing best at 76.7%. The sample confusion matrix with row and column summaries in Figure 2 provides further insight into the results. First, because the number of images per class is imbalanced and the smaller classes achieve lower accuracies, the F1 scores are lower than the accuracy values. The NV class, with the largest number of images, achieved the highest precision (92.5%; see the NV column summary) and the highest recall (82.5%; see the NV row summary). In comparison, the melanoma class was detected with 71% sensitivity (i.e., recall) but only 43.1% precision. The other classes show less precision/recall variation.
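The per-class precision and recall figures read off the confusion matrix follow the usual definitions: column-wise for precision and row-wise for recall. A small helper illustrating the computation (the 3-class matrix below is a made-up example, not data from this study):

```python
def per_class_metrics(cm):
    """cm[i][j] = number of class-i samples predicted as class j."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        col = sum(cm[i][k] for i in range(n))   # all samples predicted as k
        row = sum(cm[k][j] for j in range(n))   # all samples truly in k
        precision = tp / col if col else 0.0
        recall = tp / row if row else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

# Toy confusion matrix for three classes (rows: true, columns: predicted).
cm = [
    [50,  5,  5],
    [10, 30, 10],
    [ 0, 10, 40],
]
metrics = per_class_metrics(cm)
```

Because precision and recall are computed per class, a small class with a handful of errors can drag the macro F1 score well below the overall accuracy, which is exactly the imbalance effect observed in the results.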
Figure 3 shows a sample training/validation progress curve for Resnet101 with a 70/30 data split. Two observations stand out: first, the model is unable to achieve a consistently reduced loss and high validation accuracy, even when the number of epochs is increased (not reported here); second, due to the small number of images in most classes (deep learning requires large datasets [43]), there is an obvious gap between the training and validation performance (i.e., overfitting, or an inability to generalize to the validation data).
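A common guard against the kind of overfitting gap visible in these curves is early stopping: halt training once the validation loss has not improved for a fixed number of epochs (the "patience"). A minimal sketch, with made-up loss values for illustration (early stopping is a standard remedy and is not claimed to be part of this paper's setup):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop: the first
    epoch at which the best validation loss has gone `patience` epochs
    without improving, or the last epoch if that never happens."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Made-up validation losses: improvement stalls after epoch 3.
losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.72]
stop = early_stop_epoch(losses)
```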
Table 2 shows the mean overall performance metrics over 10 runs of each of the 13 deep learning models when using 80% of the data for training. The 10% increase in the size of the training set did not have a significant effect on the performance metrics, with the best F1 score being 66.1% (DenseNet201). The confusion matrix in Figure 4 shows that a major source of errors was the misclassification of NV images as melanoma. Most classes achieved relatively high precision but low recall. Moreover, the same training and overfitting trends appear in Figure 5.
A further 10% increase in training data brought the percentage of training images to 90% of the dataset.
Table 3 shows the mean overall performance metrics over 10 runs of each of the 13 deep learning models. Three of the models (DenseNet201, DarkNet53, and ResNet101) achieved an accuracy above 80%, with a corresponding F1 score of 74.4% for DenseNet201. The table shows a steady improvement across all metrics for most models with the larger training set, except for the small SqueezeNet model. Generally, deep learning models, unlike traditional machine learning, benefit from larger datasets [44], which may explain the improved performance. The sample confusion matrix for DarkNet-53 in Figure 6 shows considerably better performance, with most off-diagonal entries containing one or no misclassifications. However, the training/validation progress curve in Figure 7 still shows signs of overfitting.
Although the increased size of the training dataset showed promise, much remains to be done to reach a reliable diagnosis system that surpasses screening requirements. Some of the results were also affected by the small number of images in each class. For example, in Figure 6, the DF class had 11 images, VASC had 14 images, and AKIEC had 32 images. Such numbers are extremely low for an effective deep learning model, and single errors have a profound effect on the overall performance indices.
To assess the computational cost of training the deep learning models, the time required for each model was recorded for each data-split strategy; see Table 4. In general, the required time grows roughly linearly, increasing by less than 10% with each increase in the size of the training set. SqueezeNet is the fastest model, but DarkNet-53 best combines classification prowess with training speed, followed by Resnet101.
A comparison to the related literature is shown in Table 5. Although the referenced studies achieve high performance values, they tackle a far easier problem by classifying a smaller number of classes (two or three). Moreover, some of these studies require explicit feature extraction, which is not needed with deep transfer learning. Others, including Pezhman Pour and Seker [24] and Li et al. [1], do not address the classification problem directly but rather focus on processing techniques for lesion segmentation (i.e., separation of the lesion from other artifacts in the image) and hair removal from lesion images, respectively.
Special Cases
Further investigation of the classification performance and training behavior was conducted in order to shed light on shortcomings, as follows:
Maximum number of epochs. Increasing the number of epochs requires more training time and may yield better performance if the model has more room to learn, especially on large datasets. However, an exaggerated value for this hyper-parameter may lead to overfitting. Three models were retrained with the maximum number of epochs set to 50: Resnet101 with a 70/30 data split, DenseNet201 with an 80/20 data split, and DarkNet-53 with a 90/10 data split. In comparison to the values in Table 1, Table 2 and Table 3, the F1 score for Resnet101 improved slightly to 67.2% (from 64.3%); DenseNet201 performed slightly worse, with an F1 score of 63.7%, down from 66.1% in Table 2 (i.e., the model started to overfit the training data); and Darknet-53 improved to an F1 score of 83.1%. The other performance metrics showed trends similar to the F1 score. Figure 8, Figure 9 and Figure 10 show the corresponding confusion matrices;
Classifying a smaller number of skin cancer types. Since the dataset is highly imbalanced, with some classes having significantly fewer images (e.g., 115 DF and 142 VASC), it is worthwhile to explore several subsets of the classification problem as follows:
- Eliminate the DF and VASC classes and perform 5-class classification. The same three models and data splits as in the previous case were used, with a maximum number of epochs of 10. Surprisingly, in comparison to Table 1, Table 2 and Table 3, the F1 score displayed very little change (Resnet101: 64.8%, DenseNet201: 65.2%, and DarkNet-53: 67.1%), a trend mirrored by the other performance metrics;
- Eliminate the BCC (514 images), AKIEC (327 images), DF, and VASC classes and perform 3-class classification. Resnet101 (70/30 data split), DenseNet201 (80/20 data split), and DarkNet-53 (90/10 data split) were used with a maximum number of epochs of 10. The easier classification problem resulted in improved F1 scores for Resnet101 and DarkNet-53 of 71.1% and 72.8%, respectively. However, DenseNet201 performed worse at 62.3%, probably due to overfitting;
- Using the same setup as above, perform pair-wise 2-class classification on the three classes NV, MEL, and BKL. For the MEL vs. BKL classification, the F1 scores were 80.6% for Resnet101, 73.44% for DenseNet201, and 83.7% for DarkNet-53. For the NV vs. MEL classification, all models performed poorly, with F1 scores of 58.8% for Resnet101, 55.13% for DenseNet201, and 63.4% for DarkNet-53. Although both classes have a good number of images, the similarities between the two types appear too difficult to spot. Moreover, the lack of proper image cropping (i.e., eliminating useless parts of the image while keeping the lesion) contributed to this, as the background consumes a significant part of the image representation, especially since these algorithms require a scaled-down copy of the input, as mentioned in Section 3. The last pair-wise classification problem is NV vs. BKL, for which Resnet101 achieved an F1 score of 72.8% (93% accuracy), DenseNet201 reported a 71.8% F1 score and 91.9% accuracy, and DarkNet-53 managed a 70.0% F1 score and 89.9% accuracy.
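Restricting the problem to a subset of classes, as in the experiments above, amounts to filtering the labeled dataset and remapping the remaining labels to consecutive indices. A small sketch (the class abbreviations match the dataset; the sample image IDs are made up):

```python
def filter_classes(samples, keep):
    """Keep only samples whose label is in `keep`, remapping labels to
    consecutive indices in the order given."""
    index = {label: i for i, label in enumerate(keep)}
    return [(x, index[y]) for x, y in samples if y in index]

# Toy labeled dataset: (image id, class abbreviation).
samples = [
    ("a", "NV"), ("b", "MEL"), ("c", "DF"), ("d", "BKL"),
    ("e", "VASC"), ("f", "NV"), ("g", "AKIEC"), ("h", "BCC"),
]

# 3-class subset as in the experiments: NV vs. MEL vs. BKL.
subset = filter_classes(samples, ["NV", "MEL", "BKL"])
```

Remapping the labels matters because the classification head of the retrained network expects class indices 0..k-1 for the k retained classes.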
Surprisingly, lowering the number of classes did not, in general, result in improved performance. Although deep transfer learning has been effective in many medical and image-based applications, its application in this scenario appears to require further investigation and probably larger datasets.
5. Conclusions
Skin cancer in both its melanoma and non-melanoma types is common and leads to many deaths worldwide every year. Early diagnosis has been shown to drastically reduce therapy time, cost, and the suffering associated with prolonged traditional treatment methods (e.g., chemotherapy). However, accurate screening/diagnosis requires specialist knowledge of the different types of cancers and how they appear in the form of skin lesions. Some people may ignore such lesions due to ignorance, indifference, cost, or delays in scheduling doctor appointments. Recently, the fields of deep learning and artificial intelligence have opened the door to the development of reliable image-based medical systems for screening and diagnosis. In this paper, we used a well-known dermoscopy dataset of seven common types of cancerous skin lesions, utilized recent advances in the design of deep convolutional neural networks, and applied deep transfer learning to the screening/diagnosis of skin lesion images. Such an approach has the capability to achieve high accuracies that reduce the burden on specialists. Moreover, it can be easily implemented and used in real-life applications because it eliminates explicit feature extraction and manual image processing. Future work will focus on improving the balance of the dataset by collecting dermoscopy images of underrepresented skin lesion types and making them publicly available to the research community.