A Study on Multiple Factors Affecting the Accuracy of Multiclass Skin Disease Classification

Abstract: Diagnosis of skin diseases by human experts is a laborious task prone to subjective judgment. Aided by computer technology and machine learning, it is possible to improve the efficiency and robustness of skin disease classification. Deep transfer learning using off-the-shelf deep convolutional neural networks (CNNs) has huge potential in the automation of skin disease classification tasks. However, complicated architectures seem to be too heavy for the classification of only a few skin disease classes. In this paper, multiple factors are investigated in order to study potential ways to improve the classification accuracy of skin diseases. First, two different off-the-shelf architectures, namely AlexNet and ResNet50, are evaluated. Then, transfer learning is compared with training from scratch. In order to reduce the complexity of the network, the effects of shortening the depths of deep CNNs are investigated. Furthermore, different data augmentation techniques based on basic image manipulation are compared. Finally, the choice of mini-batch size is studied. Experiments were carried out on the HAM10000 skin disease dataset. The results show that the ResNet50-based model is more accurate than the AlexNet-based model. The transferred knowledge from the ImageNet database helps to improve the accuracy of the model. The reduction in stages of the ResNet50-based model can reduce complexity while maintaining good accuracy. Additionally, the use of different types of data augmentation techniques and the choice of mini-batch size can also affect the classification accuracy of skin diseases.


Introduction
Artificial intelligence (AI), which has profoundly changed our everyday lives, has been extensively studied for several decades [1][2][3][4]. Some prominent applications are: AI-empowered autonomous driving, which has been employed in numerous electric vehicles from various automakers such as Tesla and Ford [5]; AlphaGo developed by Google using artificial neural networks [6]; and the prevailing TikTok app, which has succeeded greatly due to its recommendation algorithms [7].
An important aspect of AI is machine learning. Artificial neural networks, especially convolutional neural networks (CNNs), are extending the reach of machine learning to a broad range of applications [8,9]. CNNs are particularly useful for analyzing visual imagery. The convolution kernels used in CNNs are extremely useful for the extraction of image features. Classifying images is a challenging task. However, with the aid of powerful graphic processing units, the classification of images has become efficient using deep CNN architectures. Numerous off-the-shelf architectures are available for use and can be easily adjusted to suit new tasks by using deep transfer learning [10,11].
Training deep CNNs usually requires large annotated image datasets. However, in specific areas such as the medical image domain, large public datasets are not always available, and collecting and annotating samples can be difficult and costly. Therefore, data augmentation techniques have been extensively studied.
In this paper, multiple factors are studied for potential improvement in the classification accuracy of skin diseases using deep transfer learning. Due to the good generalization ability of deep CNNs, the off-the-shelf architectures AlexNet and ResNet50 are compared for use in the building of transfer learning models for skin disease classification. These deep CNN architectures are initially trained to classify 1000 classes in the ImageNet database. However, as only a few classes are needed for classification in the skin disease dataset, complex deep CNNs seem too heavy. Therefore, the effects of trimming the depths of deep CNNs to reduce the complexity and number of parameters in the networks are evaluated. Comparison is made between using transfer learning and training from scratch. Additionally, different image manipulation techniques for data augmentation are compared. The choice of mini-batch sizes is also discussed. The aforementioned factors are studied through experiments on the HAM10000 skin disease dataset.

Related Work
Skin diseases are common types of illness worldwide. It is important to diagnose and treat skin disease early, as some severe skin diseases might cause death [12]. With the help of deep neural networks, skin disease classification has been actively studied. A multiclass skin disease classification scheme was proposed using pretrained AlexNet for feature extraction and a support vector machine with error-correcting output codes as the classifier, and it achieved 86% accuracy on five skin lesion categories [13]. However, the dataset was not balanced for each class, and further work is needed. The combination of four popular machine learning algorithms, namely artificial neural networks, linear discriminant analysis (LDA), naïve Bayes, and support vector machines (SVMs), with two feature sets, namely color and texture, was explored [14]. It was found that LDA and SVMs show better accuracy. Automatic classification of clinical skin disease images was proposed using high-level position information [15]. Instead of hand-crafted features, the work used high-level position information to generate better deep visual features and outperformed state-of-the-art clinical skin disease classification methods. A methodology using the rough set method to extract the best features and a feedforward neural network to predict the existence of skin disease was proposed [16]. Morphological- and wavelet-based fractal texture features were used along with stacked auto-encoder-based features to classify four skin diseases [17]. The combined feature set greatly improved the accuracy of identifying melanoma, nevus, basal cell carcinoma, and seborrheic keratosis, with at least 96% accuracy. A survey of skin disease classification from images was conducted [18]. Traditional techniques and deep learning-based skin disease classification were compared, and it was concluded that the deep learning approach is more efficient and faster for extracting features.

Skin Disease Classification
In this section, the multiclass skin disease dataset studied in this paper is first introduced. Then, the potential ways to improve the classification accuracy of skin diseases are discussed, including the choice of the network architecture, the depths of the deep network, the use of transfer learning, the types of data augmentation techniques, and the selection of mini-batch size.

Skin Disease Dataset
The skin disease dataset used in this paper is from the HAM10000 dataset [19]. It includes a representative collection of important diagnostic categories of pigmented lesions. It is a collection of dermatoscopic images of pigmented skin lesions released to tackle the lack of diversity and small size of available dermatoscopic image datasets to train neural networks for automated classification. The goal of this paper is to classify four skin disease classes, namely basal cell carcinoma, benign keratosis, melanoma, and melanocytic

Choice of Network Architectures
Using off-the-shelf deep networks, which have been proven effective in massive and challenging datasets, is common practice in image classification. In this paper, two network architectures, AlexNet and ResNet50, were used to construct our models for skin disease classification.
AlexNet [20] is a well-known CNN that won the ImageNet contest in 2012. It achieves 37.5% top-1 error rates and 17.0% top-5 error rates on test data. It consists of five convolutional layers and three fully connected layers, as shown in Figure 2. It has over 60 million parameters and 65,000 neurons. The network uses ReLU nonlinearity to accelerate the training and employs various techniques to prevent overfitting, such as using dropout and data augmentation.

ResNet [21] is a very deep residual learning framework developed to reduce the effort of network training. It is easy to optimize and can gain accuracy from increased network depth. The original ResNet with 152 layers won first place in the ILSVRC 2015 classification task and achieves 3.57% error on an ImageNet test set. ResNet introduces a shortcut connection technique to overcome the degradation problem, i.e., the accuracy gets saturated and then degrades rapidly when the network depth increases. ResNet50, a smaller version of ResNet, consists of four stages, as shown in Figure 3. It was initially trained to classify 1000 classes on the ImageNet dataset and has a fully connected layer with a size of 1000.

Transfer Learning
Transfer learning is the approach of using pretrained off-the-shelf deep networks as the starting point for a model on a new task, so that the features learned from the ImageNet database are transferred to that task. It is highly effective and easier to train, as only a small number of training images is required, whereas training the network completely from scratch with randomly initialized weights is much harder.
In this paper, the AlexNet architecture is used first. The 1000-way fully connected (FC) layer is replaced with a four-way FC layer for the classification of four skin diseases. The constructed model is named Model I. Similarly, the original ResNet50 model is modified by replacing its last FC layer, as shown in Figure 4a. This four-stage ResNet50 network is denoted by Model II.
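To give a sense of how small the replaced classifier is, the following sketch counts the parameters of the final FC layer before and after the swap. It assumes the standard input widths of the last FC layer (4096 for AlexNet, 2048 for ResNet50), which are not stated in this paper; the helper name `fc_params` is purely illustrative.

```python
def fc_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Assumed input widths of the last FC layer in the standard architectures.
HEAD_INPUT = {"AlexNet (Model I)": 4096, "ResNet50 (Model II)": 2048}

for name, width in HEAD_INPUT.items():
    original = fc_params(width, 1000)  # 1000-way ImageNet head
    replaced = fc_params(width, 4)     # four-way skin disease head
    print(f"{name}: {original:,} -> {replaced:,} parameters")
```

Under these assumptions, only the few thousand parameters of the new head start from random values; everything else starts from the ImageNet weights.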

Network Depth
Even though ResNet50 is greatly simplified from ResNet152 by reducing the number of layers, it is still a complex network. Model II, built directly from ResNet50, still seems heavy for a task that only requires classifying four different skin diseases. Therefore, we reduced the number of layers by removing the last stage, and Model III was constructed with only three stages, as shown in Figure 4b. Further, we constructed Model IV with only two stages, as shown in Figure 4c.
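The effect of dropping stages can be sketched with a rough layer count. The sketch below assumes the standard ResNet50 layout (3, 4, 6, and 3 bottleneck blocks per stage, with stage output widths of 256, 512, 1024, and 2048) and assumes that the new four-way FC layer is attached after pooling the last retained stage; the exact truncation used for Models III and IV may differ. Projection shortcuts are not counted, following the usual "50-layer" convention.

```python
# Standard ResNet50 layout (assumed): bottleneck blocks per stage and
# the channel width each stage outputs.
BLOCKS_PER_STAGE = [3, 4, 6, 3]
STAGE_WIDTH = [256, 512, 1024, 2048]
CONVS_PER_BLOCK = 3  # 1x1, 3x3, 1x1 convolutions in a bottleneck block


def truncated_resnet50(num_stages: int, num_classes: int = 4):
    """Return (weighted layer count, FC input width, FC parameter count)."""
    if not 1 <= num_stages <= 4:
        raise ValueError("num_stages must be between 1 and 4")
    # One stem convolution plus three convolutions per retained block.
    convs = 1 + CONVS_PER_BLOCK * sum(BLOCKS_PER_STAGE[:num_stages])
    width = STAGE_WIDTH[num_stages - 1]
    fc = width * num_classes + num_classes
    return convs + 1, width, fc  # +1 for the FC layer


for label, stages in [("Model II", 4), ("Model III", 3), ("Model IV", 2)]:
    layers, width, fc = truncated_resnet50(stages)
    print(f"{label}: {layers} weighted layers, FC input {width}, FC params {fc}")
```

Under these assumptions, removing one stage takes the network from 50 to 41 weighted layers, and removing two stages takes it to 23.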

Data Augmentation Techniques
Data augmentation is commonly used in deep learning when there are limited data. It includes various techniques that can enhance the size and quality of the training dataset. It is highly useful to mitigate the overfitting problem. Through data augmentation, more information can be extracted from the original dataset. Basic image manipulation techniques for data augmentation include geometric transformations, color space transformations, and kernel filters, which are applied to images in the input space. Deep-learning-based augmentation techniques are also widely studied. A typical example is the generative adversarial network, which can generate plausible new images from original images. In this paper, multiple image manipulation techniques, including random scale, random rotation, random reflection, random shear, and a combination of different techniques, are compared for skin disease image classification.
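The geometric techniques listed above can each be written as a 2 × 2 linear map on pixel coordinates, and a combination of techniques is the product of these maps. The following is a minimal sketch, not the implementation used in the experiments. The parameter ranges are those reported later in the experimental section; the 0.5 reflection probability, the composition order, and all function names are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def sample_params(rng):
    """Draw one set of augmentation parameters."""
    return {
        "scale": rng.uniform(0.6, 1.4),
        "rotation_deg": rng.uniform(0.0, 360.0),
        "flip_lr": rng.random() < 0.5,  # assumed probability
        "flip_tb": rng.random() < 0.5,  # assumed probability
        "shear_x_deg": rng.uniform(0.0, 45.0),
        "shear_y_deg": rng.uniform(0.0, 45.0),
    }


def affine_matrix(p):
    """Compose scale, reflection, shear, and rotation into one 2x2 matrix."""
    s = p["scale"]
    scale = ((s, 0.0), (0.0, s))
    flip = ((-1.0 if p["flip_lr"] else 1.0, 0.0),
            (0.0, -1.0 if p["flip_tb"] else 1.0))
    shear_x = ((1.0, math.tan(math.radians(p["shear_x_deg"]))), (0.0, 1.0))
    shear_y = ((1.0, 0.0), (math.tan(math.radians(p["shear_y_deg"])), 1.0))
    t = math.radians(p["rotation_deg"])
    rot = ((math.cos(t), -math.sin(t)), (math.sin(t), math.cos(t)))
    # Applied right to left: scale, flip, vertical shear, horizontal shear, rotate.
    m = matmul(flip, scale)
    m = matmul(shear_y, m)
    m = matmul(shear_x, m)
    return matmul(rot, m)


rng = random.Random(0)
params = sample_params(rng)
print(params)
print(affine_matrix(params))
```

Applying the two shears as separate factors keeps the composed map invertible for every sampled angle, so an augmented image never collapses onto a line.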

Batch Normalization and Mini-Batch Size
In order to mitigate the overfitting problem, where the network performs very well on the training set but poorly on the test set, batch normalization layers are used. Batch normalization can help train the network faster and in a more stable manner, makes training less sensitive to the initial random weights, and is said to address the internal covariate shift problem. In Models II, III, and IV, a batch normalization layer is attached to each of the convolutional layers. The skip connection in ResNet50, as well as in Models II, III, and IV, is used to mitigate the gradient explosion problem caused by using batch normalization. In training the deep neural network, batch normalization standardizes the inputs to a layer for each mini-batch. By transforming the data to have a mean of 0 and a standard deviation of 1, the distribution of the inputs does not change dramatically during the weight updates, and the training of the network can be stabilized and accelerated. The generalization error can also be reduced.
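The standardization step described above can be sketched for a single feature over one mini-batch. This is a simplified illustration only: it omits the learnable scale and shift parameters and the running statistics that a full batch normalization layer maintains, and the function name and the `eps` value are assumptions.

```python
import math


def batch_standardize(values, eps=1e-5):
    """Standardize one feature over a mini-batch: (x - mean) / sqrt(var + eps)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased (population) variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_standardize(batch)
print(normalized)
```

The output has a mean of 0 and a variance very close to 1, whatever the scale of the original activations.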
The choice of the mini-batch size, a hyperparameter, is also important. If the mini-batch size is too small, the statistics of individual mini-batches will differ largely from those of the whole dataset. The resulting differences in the standardized inputs between training and inference can lead to noticeable differences in performance. In this paper, different mini-batch sizes are compared for the evaluation of the deep networks on a skin disease dataset.
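The sensitivity of mini-batch statistics to the batch size can be illustrated on synthetic data. The sketch below draws random mini-batches from a stand-in population of 1500 Gaussian values (an assumption; no real activations are used) and measures how much the batch mean fluctuates for several batch sizes.

```python
import random
import statistics


def spread_of_batch_means(data, batch_size, num_batches, rng):
    """Standard deviation of the means of randomly drawn mini-batches."""
    means = [
        statistics.fmean(rng.sample(data, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.pstdev(means)


rng = random.Random(0)
# Synthetic stand-in for one activation over a 1500-image training set.
data = [rng.gauss(0.0, 1.0) for _ in range(1500)]

for size in (5, 40, 80):
    spread = spread_of_batch_means(data, size, 500, rng)
    print(f"mini-batch size {size:2d}: spread of batch means = {spread:.3f}")
```

The spread shrinks roughly with the square root of the batch size, so the statistics used for standardization are noisier for small mini-batches.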

Experimental Results and Discussion
In order to evaluate the effects of different factors on the classification accuracy of skin diseases, experiments were carried out on the custom four-class HAM10000 dataset. A four-fold cross-validation setup was used for all experiments to reduce bias in the estimation of the models' performance. Each fold contained 1500 images in the training set and 500 images in the validation set.
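The fold sizes above imply 2000 images in total, with each image used for validation exactly once. A minimal sketch of such a split follows; the shuffling, the seed, and the absence of class stratification are assumptions, since the splitting procedure is not described here.

```python
import random


def four_fold_splits(num_images=2000, num_folds=4, seed=0):
    """Yield (train_indices, val_indices) for each cross-validation fold."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    fold_size = num_images // num_folds
    for k in range(num_folds):
        val = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, val


for fold, (train, val) in enumerate(four_fold_splits(), start=1):
    print(f"fold {fold}: {len(train)} training images, {len(val)} validation images")
```

The reported accuracy for each model is then the average of the four validation accuracies.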
The choice of the base architecture was first evaluated. The AlexNet-based Model I was tested. The images were resized to 227 × 227 pixels to fit the input size of the AlexNet input layer. The transferred layers were frozen by assigning small learning rates to ensure only the newly added FC layer was trained and the features learned from the ImageNet database could be appropriately transferred. For each fold of cross-validation, the Adam solver was used to train the network for 10 epochs. A mini-batch size of 10 was used. No data augmentation was applied during training. The obtained average cross-validation accuracy was 0.7100, as listed in Table 1.
An experiment was carried out on Model I again with the same experimental setup as above, except that all the weights in the transferred layers were removed, and the model was trained from scratch. The average cross-validation accuracy obtained was 0.6640.
The ResNet50-based Model II was then tested. The four-stage Model II was trained using the Adam solver for 10 epochs for each fold, with a mini-batch size of 10. Transfer learning was used to transfer the weights of ResNet50 pretrained on the ImageNet database. No data augmentation techniques were used. The average classification accuracy for four folds was 0.7640. The Model II architecture with all the pretrained weights removed was trained from scratch and showed an average accuracy of 0.6605. Because the ResNet50-based Model II clearly has higher accuracy than the AlexNet-based Model I when using transfer learning, the former was chosen to be further studied for network depth reduction and potential accuracy improvement.
The ResNet50-based models, III and IV, with different network depths were then evaluated with the same four-fold cross-validation setup. Similarly, transfer learning was applied to Model III and Model IV, and the average cross-validation accuracy values for the two models were 0.7700 and 0.7835, respectively.
The above results are summarized in Table 1. In the context of transfer learning, the choice of architecture has a large impact on the classification accuracy of skin diseases. The ResNet50-based models outperform the AlexNet-based model by at least 0.05, which shows the superiority of ResNet50 over AlexNet in skin disease classification. As for transfer learning versus training from scratch, transfer learning is evidently better than training from scratch for both the AlexNet-based model and the ResNet50-based models, as compared in Table 1. For the ResNet50-based models with different depths, using the transferred weights learned from the large ImageNet database improves the accuracy by at least 0.08 compared with those trained from scratch. As for the complexity of the ResNet50-based network, the classification accuracy is observed to increase as the network depth is reduced. The two-stage Model IV with the shortest depth has the best accuracy, the complete four-stage Model II has the lowest accuracy, and the three-stage Model III ranks in the middle. All three transfer learning models had an accuracy above 0.76. This suggests that a more complex network does not necessarily give better accuracy when it comes to classification tasks on a simple skin disease dataset with only a few classes. Instead, it shows that a simplified network can achieve sufficient accuracy while also reducing the effort of training. The confusion matrix of Model IV with the best accuracy using transfer learning and the confusion matrix of Model II with the worst accuracy without transfer learning are shown in Figure 5.
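The margins discussed above can be checked directly from the accuracies quoted in the text. The short sketch below uses only those quoted values; the from-scratch accuracies of Models III and IV appear in Table 1 but are not quoted in the text, so they are left out.

```python
# Average cross-validation accuracies quoted in the text.
TRANSFER = {"Model I": 0.7100, "Model II": 0.7640, "Model III": 0.7700, "Model IV": 0.7835}
SCRATCH = {"Model I": 0.6640, "Model II": 0.6605}  # only these two are quoted


def gain(a: float, b: float) -> float:
    """Accuracy difference a - b, rounded to four decimals."""
    return round(a - b, 4)


for model, acc in SCRATCH.items():
    print(f"{model}: transfer learning gain = {gain(TRANSFER[model], acc)}")

for model in ("Model II", "Model III", "Model IV"):
    print(f"{model} vs Model I: {gain(TRANSFER[model], TRANSFER['Model I'])}")
```

The smallest margin of a ResNet50-based model over Model I is 0.054, consistent with the "at least 0.05" statement, and the transfer learning gain for Model II is 0.1035.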
Next, the effects of the types of augmentation techniques on the accuracy of skin disease classification were evaluated using Model IV. With no augmentation techniques applied at all, the model yields an accuracy of 0.7835. When a random scale with a factor from 0.6 to 1.4 is applied, an average accuracy of 0.7785 is obtained. When a random rotation from 0 to 360 degrees is applied, the model shows an accuracy of 0.7620. When random reflection in the left-right and top-bottom directions is applied, the accuracy is 0.7735. An accuracy of 0.7765 is obtained when applying random horizontal and vertical shear from 0 to 45 degrees. With all the above types of augmentation combined, the model gives an accuracy of 0.7430. The comparison of augmentation types is summarized in Table 2. It can be seen that the worst result of 0.7430 comes from using the augmentation techniques combined, that using no augmentation techniques at all yields the best accuracy of 0.7835, and that each individual augmentation technique also reduces the classification accuracy. The results suggest that for skin disease classification, using basic image manipulation techniques for data augmentation does not necessarily help in improving the classification performance of the networks and should be used cautiously. The confusion matrices of Model IV without data augmentation and with combined data augmentation are shown in Figure 6.
Experiments were carried out to compare the choices of different mini-batch sizes ranging from 5 to 80, and the results are summarized in Table 3. It can be seen that the smallest mini-batch size of 5 has the lowest accuracy of 0.7560, while a mini-batch size of 40 yields the best accuracy of 0.7905. The corresponding confusion matrices are shown in Figure 7. The results show that the accuracy can be slightly improved by choosing a proper mini-batch size.
Table 3. Comparison of different mini-batch sizes.
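The mini-batch size also determines how many weight updates the network receives. The sketch below works this out for the 1500 training images per fold, assuming that the 10-epoch setting of the earlier experiments also applies here and that the last partial mini-batch of an epoch is kept; neither is stated explicitly for this experiment.

```python
import math

TRAIN_IMAGES_PER_FOLD = 1500  # from the cross-validation setup
EPOCHS = 10                   # assumed to match the earlier experiments


def updates(batch_size: int, images: int = TRAIN_IMAGES_PER_FOLD, epochs: int = EPOCHS):
    """Return (iterations per epoch, total weight updates), keeping the last partial batch."""
    per_epoch = math.ceil(images / batch_size)
    return per_epoch, per_epoch * epochs


for size in (5, 40, 80):  # smallest, best-performing, and largest sizes mentioned
    per_epoch, total = updates(size)
    print(f"mini-batch size {size:2d}: {per_epoch} iterations/epoch, {total} updates")
```

Under these assumptions, a mini-batch size of 5 gives 3000 updates per fold, while a size of 80 gives only 190, so the batch size trades the number of updates against the noise in each one.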

Conclusions
In this paper, potential approaches to improve the classification accuracy of skin diseases are investigated based on deep learning. Transfer learning using off-the-shelf deep networks is common practice for image classification. However, complete off-the-shelf networks can be unnecessarily complex for classifying only a few skin disease classes. Multiple factors are studied regarding the classification accuracy of skin diseases. Experiments were carried out using the HAM10000 skin disease dataset. The results show that the choice of the network architecture has a large impact on accuracy. The ResNet50-based model largely outperforms the AlexNet-based model. The model trained from scratch is much less accurate than the model that uses transfer learning from weights pretrained on the massive ImageNet database. In order to reduce the complexity of the models, the depths of the networks were reduced, and the two-stage model shows the best accuracy compared with the three-stage model and the four-stage model, which suggests that reducing the network depth does not necessarily sacrifice accuracy in skin disease classification. Additionally, the data augmentation techniques tested actually lower the accuracy compared with no augmentation applied at all and should be used with caution. Furthermore, carefully choosing the mini-batch size can also help in improving accuracy.