SBXception: A Shallower and Broader Xception Architecture for Efficient Classification of Skin Lesions

Simple Summary
Skin cancer is a major concern worldwide, and accurately identifying it is crucial for effective treatment. We propose a modified deep learning model called SBXception, based on the Xception network, to improve skin cancer classification. Using the HAM10000 dataset, consisting of 10,015 skin lesion images, the model achieved an accuracy of 96.97% on a holdout test set. SBXception also required fewer parameters and less training time than the original model. This study highlights the potential of modified deep learning models in enhancing skin cancer diagnosis, benefiting society by improving treatment outcomes.

Abstract
Skin cancer is a major public health concern around the world. Skin cancer identification is critical for effective treatment and improved results. Deep learning models have shown considerable promise in assisting dermatologists in skin cancer diagnosis. This study proposes SBXception: a shallower and broader variant of the Xception network. It uses Xception as the base model for skin cancer classification and increases its performance by reducing the depth and expanding the breadth of the architecture. We used the HAM10000 dataset, which contains 10,015 dermatoscopic images of skin lesions classified into seven categories, for training and testing the proposed model. We fine-tuned the new model on this dataset and reached an accuracy of 96.97% on a holdout test set. SBXception also achieved significant performance enhancement with 54.27% fewer training parameters and reduced training time compared to the base model. Our findings show that reducing the depth and expanding the breadth of the Xception model architecture can greatly improve its performance in skin cancer categorization.


Introduction
Among cancers, skin cancer is considered one of the deadliest diseases. Around 1.2 million people died in 2020 from skin cancer alone [1]. According to the WHO [1], skin cancer was one of the most common cancers in terms of new cases in 2020, and the number of new cases is increasing dramatically [2,3]. One of the common causes of skin cancer is direct exposure of the skin to ultraviolet (UV) rays from the sun [4]. Such rays are reported to affect fair-skinned people and those with sensitive skin more than dark-skinned people [5].
Most deaths are caused by invasive melanoma, which constitutes only 1% of total skin cancer cases. Historical data show that melanoma skin cancer cases are rising rapidly. According to the most recent report from the American Cancer Society, [...] adopted the DarkNet19 model and trained it on multiple datasets such as HAM10000, ISBI2018, and ISBI2019. They fine-tuned this model and achieved 95.8%, 97.1%, and 85.35% accuracy for the HAM10000, ISBI2018, and ISBI2019 datasets, respectively. In comparison, another study [30] trained three different models (InceptionV3, ResNet, and VGG19) on a dataset containing 24,000 images retrieved between 2019 and 2020 from the ISIC archive. They concluded that InceptionV3 outperformed the rest of the models regarding accuracy. On the other hand, Khamparia et al. [31] incorporated transfer learning while training different deep learning architectures and showed that transfer learning and data augmentation helped to improve the results.
When training deep learning models, data imbalance is a common issue. Authors have improved datasets in many ways by incorporating data augmentation techniques [32]. Ahmad et al. [33] used a data augmentation technique based on generative adversarial networks (GANs), which create artificial images similar to the original images to improve the dataset. With the help of this technique, they claim that their model accuracy was enhanced from 66% to 92%. In another study, Kausar et al. [34] used some fine-tuning techniques to improve state-of-the-art deep learning image classification models. They achieved an accuracy of 72%, 91%, 91.4%, 91.7%, and 91.8% for ResNet, InceptionV3, DenseNet, InceptionResNetV2, and VGG-19, respectively. Khan et al. [35] proposed a multiclass deep learning model trained on the HAM10000, ISBI2018, and ISIC2019 datasets. They also incorporated transfer learning, and their results showed that the proposed model achieved an accuracy of 96.5%, 98%, and 89% for the HAM10000, ISBI2018, and ISIC2019 datasets, respectively. In another study, Deepa et al. [36] trained the ResNet50 model on the International Skin Imaging Collaboration (ISIC) dataset and achieved 89% accuracy. Tahir et al. [37] proposed a deep learning model called DSCC_Net, trained it on three datasets (ISIC 2020, HAM10000, and DermIS), and achieved an accuracy of 99%. They further compared their model with other state-of-the-art models and concluded that it outperformed them all. In another study, Shaheen et al. [38] proposed a multiclass model using particle swarm optimization trained on the HAM10000 dataset. They claim that their model achieved 97.82% accuracy.
As described above, there is a general trend for image processing based on deep learning to gradually adopt deeper networks. The benefit of using deeper networks is obvious, i.e., a deeper network provides stronger nonlinear representation capability. This means that, for some specific problems, a deeper network may be better able to learn more complex transformations and thus fit more complex feature inputs. However, previous research (see, e.g., [39]) has also shown ways in which network depth may negatively affect classification performance in cases where relatively simpler features are involved. Here, we first quantitatively assess the effect of network depth on classification performance and then develop a shorter and broader variant of the originally selected model (termed SBXception). The main contributions of this paper are the following:

• We analyze the characteristics of the adopted dataset (HAM10000) to show that network depth beyond an optimal level may not be suitable for classification tasks on this dataset.
• A new, shorter, broader variant of the Xception model is proposed to classify various skin lesions efficiently.
• The proposed modified model architecture is used to provide better classification performance compared to the state-of-the-art methods.

The Proposed Approach
This work proposes an approach to accurately classify skin lesions into seven classes pertaining to the most common types. The overall architecture of the approach, shown in Figure 1, involves three main stages. First, the dataset was prepared to make it more suitable for the classification task. Second, the effect of network depth on the classification performance was quantitatively explored, leading to the development of SBXception, a shorter and broader variant of the original model. Third, the proposed SBXception model was used for the classification task. In the following subsections, we provide a detailed discussion of the various stages involved in the development of the system.

Dataset and Input Images Preparation
In order to carry out the experiment, this work utilized a public dataset, the HAM10000 dataset [40]. This dataset is a widely-used collection of dermatoscopic images of skin lesions. It contains a total of 10,015 images acquired from individuals across various demographic regions. It includes images from different age groups, ethnicities, and geographical locations, providing a representative sample of skin conditions worldwide. Each image in the HAM10000 dataset includes a unique identifier (lesion_id) containing a 7-digit number corresponding to a unique patient record number. This allows each image to be linked to a single patient record. This way, the dataset enables researchers to accurately correlate an image with its respective patient, facilitating comprehensive analysis and longitudinal studies.
The dataset contains images of seven classes of skin lesions: actinic keratoses and intraepithelial carcinoma (AKIEC) (327), basal cell carcinoma (BCC) (514), benign keratosis (BKL) (1099), dermatofibroma (DF) (115), melanoma (MEL) (1113), melanocytic nevi (NV) (6705), and vascular skin lesions (VASC) (142). The original size of each image is 600 × 450; in this experiment, the images were resized to 229 × 229 for efficient processing with the modified Xception architecture used in this research. Figure 2 shows some sample images of skin lesions. The same dataset was split into training, testing, and validation sets to ensure that the results were consistent.
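To make the imbalance in these class counts concrete, the following minimal sketch computes each class's share of the dataset together with an inverse-frequency weight. The weighting formula (total divided by the number of classes times the class count) is a common heuristic assumed here purely for illustration; the paper does not state that such weights were used.

```python
# Per-class image counts for HAM10000, as listed in the text above.
CLASS_COUNTS = {
    "AKIEC": 327,
    "BCC": 514,
    "BKL": 1099,
    "DF": 115,
    "MEL": 1113,
    "NV": 6705,
    "VASC": 142,
}


def class_shares(counts):
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def inverse_frequency_weights(counts):
    """Illustrative 'balanced' weights: total / (num_classes * count).

    This heuristic is an assumption for illustration, not the paper's method.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


shares = class_shares(CLASS_COUNTS)
weights = inverse_frequency_weights(CLASS_COUNTS)
print("total images:", sum(CLASS_COUNTS.values()))
for name in CLASS_COUNTS:
    print(f"{name:6s} share={shares[name]:.2%} weight={weights[name]:.2f}")
```

The counts sum to the 10,015 images stated above, and the NV class alone accounts for roughly two-thirds of them, which is why rare classes such as DF receive much larger weights under this heuristic.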

Model Architecture
Existing image classification solutions increasingly use deeper neural networks as computing power improves and more solutions to the vanishing gradient problem become available. The benefit of this approach is self-evident, i.e., a deeper network provides stronger nonlinear representation capability. This means that, in general, a deeper network may be better able to learn more complex transformations and thus fit more complex feature inputs [41]. However, our experiments (reported in Section 3) revealed that a deeper network is not necessarily beneficial for the classification task on the HAM10000 dataset. Therefore, the current study modified the base model by decreasing its depth and increasing its width to better suit the given classification problem. The following subsections discuss this in more detail.

The Base Model
To gain insight into the dataset, the initial experiments were conducted using the Xception [42] network. The structure of the Xception network is shown in Figure 3. As shown, its architecture is based on modified depth-wise separable convolution layers. The input passes through three flows, i.e., the entry flow, the middle flow (which repeats eight times), and the exit flow. Each of the convolution and separable convolution layers is followed by batch normalization. The middle flow is the core structural part of the Xception network, comprising a nine-layer structure that repeats eight times. The nine layers in the structure form three combinations of ReLU, separable Conv2D, and batch normalization layers.
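To illustrate why depth-wise separable convolutions keep each layer comparatively light, the sketch below counts the weights of one 3 × 3 layer in both its standard and separable forms. The channel width of 728 and the absence of bias terms follow the original Xception design and are assumptions here, since this section does not state them.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution without bias."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights in a depth-wise separable convolution without bias.

    Depth-wise step: one k x k filter per input channel.
    Point-wise step: a 1 x 1 convolution that mixes the channels.
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


CHANNELS = 728  # assumed middle-flow width (from the original Xception design)

standard = standard_conv_params(CHANNELS, CHANNELS)
separable = separable_conv_params(CHANNELS, CHANNELS)
print(f"standard:  {standard:,}")
print(f"separable: {separable:,}")
print(f"ratio:     {standard / separable:.2f}x")
```

Under these assumptions, the separable form needs roughly one-ninth of the weights of a standard convolution of the same size.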

Shortening the Architecture
As shown in Figure 3, the core structure of Xception repeats eight times. Hence, it has a huge number of convolution layers. Network deepening often helps improve performance when dealing with images containing complex information, such as scenes containing human behavioral aspects, from which an exceedingly high number of features can be extracted. However, network depth does not necessarily improve performance in every situation. For example, when the dataset contains a limited number of objects with few details (see, e.g., [43]), limited features will exist in the image, and consequently, the fault tolerance will be poor. Similarly, due to the very heavy optimization of gradient backpropagation, deep learning models tend to significantly overfit when the data are insufficient [44]. On the other hand, studies have shown that lowering the number of convolution layers has a significant impact on network performance; see, for example, [45,46]. Furthermore, limiting the number of convolution layers also has a significant impact on network performance in terms of computation efficiency [47,48]. Therefore, inspired by previous works [43,47], this study explored the relationship between network depth and its performance specifically using the HAM10000 dataset.
Cancers 2023, 15, 3604
To this end, we kept the network's non-core (non-repeating) structure unchanged and experimented with repeating the core part for different numbers of times. Recall that each repetition in the core contains a combination (named RCB) of ReLU, separable Conv2D, and batch normalization layers. Therefore, by adopting a different number of RCB layers, seven modified forms of the Xception network (named Xception-mN) were created such that Xception-m1 contains one RCB, Xception-m2 has two RCBs, and so on, up to Xception-m7, which contains seven RCBs. To investigate the impact of network depth on classification accuracy, the performance of each of the modified networks was monitored.
Here, the shortest network, i.e., Xception-m1 (with only one RCB layer), achieved the highest accuracy with the fewest parameters. This showed that shortening the network improved the classification in terms of both computational efficiency (a decreased number of network parameters) and accuracy. Hence, it was concluded that classification of the HAM10000 dataset could not benefit from a deeper network. Next, we developed a network widening mechanism to further increase the classification performance.
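As a rough consistency check on how depth affects model size, the sketch below estimates the share of parameters removed when fewer repetitions of the core structure are kept. It assumes that each repetition holds three RCB combinations with 728-channel, bias-free 3 × 3 separable convolutions and four batch-normalization values per channel, and it assumes a base total of 20,875,823 parameters (an Xception backbone plus a seven-way dense layer). None of these figures is stated in this section, so the output should be treated as indicative only.

```python
CHANNELS = 728          # assumed middle-flow width
RCB_PER_REPETITION = 3  # three ReLU + separable Conv2D + batch-norm combinations
BASE_REPETITIONS = 8    # repetitions of the core structure in the base model
BASE_TOTAL_PARAMS = 20_875_823  # assumed base-model total (not taken from the paper)


def rcb_params(channels=CHANNELS, k=3):
    """Parameters in one ReLU + separable Conv2D + batch-norm combination."""
    separable = k * k * channels + channels * channels  # depth-wise + point-wise
    batch_norm = 4 * channels  # gamma, beta, moving mean, moving variance
    return separable + batch_norm  # ReLU contributes no parameters


def repetition_params():
    """Parameters in one nine-layer repetition of the core structure."""
    return RCB_PER_REPETITION * rcb_params()


def reduction_percent(repetitions_kept, base_total=BASE_TOTAL_PARAMS):
    """Percentage of parameters removed when fewer repetitions are kept."""
    removed = (BASE_REPETITIONS - repetitions_kept) * repetition_params()
    return 100.0 * removed / base_total


for kept in range(1, BASE_REPETITIONS):
    print(f"{kept} repetition(s) kept: {reduction_percent(kept):.2f}% fewer parameters")
```

Under these assumptions, keeping a single repetition removes a little over 54% of the parameters, which is in line with the reduction reported for Xception-m1.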

Broadening the Architecture
Depth and breadth are two characteristics of a convolutional neural network that have the potential to affect its performance significantly. If the network has appropriate depth and width, it can learn a large number of features and have higher nonlinear representational capabilities [49]. When optimizing the network structure, deepening the network is generally preferred over widening it, as it typically results in greater performance increases [50]. However, studies have found that once the network has reached a certain depth, adding further depth either makes the network harder to train with insignificant performance gains or, occasionally, causes its performance to degrade [39]. Similarly, several studies have shown that shallow and wide networks can achieve accuracy that is higher than, or at least equal to, that of their deep and narrow counterparts [41,51,52]. Furthermore, in our initial experiments on the HAM10000 dataset, the shallowest Xception structure emerged as more suitable than the deeper alternatives.
Thus, inspired by [43], we experimented with broadening the Xception network structure to improve its performance. The broadening mechanism essentially works by introducing a new add layer to obtain fusion of the horizontal channels by stacking the outputs of various branches. The network width can be increased in more than one way, such as by increasing the convolution layer channels or using a concatenate layer to connect the two expanded branches of the core structure. Here, we adopted a strategy similar to Shi et al. [43]. Specifically, we first expanded a branch of the core structure and then connected the output of two branches with the add layer. With this mechanism, there was no need to include a 1 × 1 convolution layer in the residual connection, because the number of channels before and after connection remained constant.
Furthermore, we deemed it important to gauge the classification performance with different numbers of layer combinations in the broader architecture. Thus, by adopting different numbers of ReLU, Conv2D, and Batch Normalization (RCB) layers, eight shorter, broadened variations of the Xception network (named Xception-sbN) were created such that Xception-sb1 contains one RCB, Xception-sb2 contains two RCBs, and so on, up to Xception-sb8, which contains eight RCBs. The performance of each of the network architectures was monitored. The architecture with three RCBs, i.e., Xception-sb3 (shown in Figure 4), yielded the highest scores considering the accuracy and number of parameters. Therefore, this structure was used for the rest of the experiments, as detailed in the next section.
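The reason an add layer avoids an extra 1 × 1 convolution in the residual connection can be seen from simple shape bookkeeping. The sketch below is a minimal illustration with hypothetical helper names and an arbitrary feature-map shape of 16 × 16 × 728; it tracks tensor shapes only and is not the authors' implementation.

```python
def add_merge(shape_a, shape_b):
    """Element-wise addition: both branches must have identical shapes."""
    if shape_a != shape_b:
        raise ValueError(f"add needs equal shapes, got {shape_a} and {shape_b}")
    return shape_a


def concat_merge(shape_a, shape_b):
    """Channel-wise concatenation: spatial sizes must match, channels add up."""
    if shape_a[:2] != shape_b[:2]:
        raise ValueError("concatenate needs equal spatial dimensions")
    return (shape_a[0], shape_a[1], shape_a[2] + shape_b[2])


def residual_needs_projection(block_input, block_output):
    """A 1 x 1 projection is only needed if shortcut and output shapes differ."""
    return block_input != block_output


block_input = (16, 16, 728)  # hypothetical (height, width, channels)
branch_a = block_input       # shape-preserving separable convolutions assumed
branch_b = block_input       # the second, parallel branch is assumed to match

added = add_merge(branch_a, branch_b)
concatenated = concat_merge(branch_a, branch_b)

print("add:        ", added, residual_needs_projection(block_input, added))
print("concatenate:", concatenated, residual_needs_projection(block_input, concatenated))
```

With addition, the merged output keeps the input's channel count, so the shortcut can be summed directly; with concatenation, the channel count doubles and the shortcut would need a projection to match.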

Fine-Tuning and Testing
For fine-tuning, the following HAM10000 splits were used together with the augmentation techniques described above. The dataset was split to use 80% for training and 20% for testing. Furthermore, 20% of the training dataset was used for validation. Table 1 shows the numbers of images of each class used for training (before and after augmentation) and testing. The batch size was set to 32. The model's performance was evaluated using diverse network configurations, employing various optimizers, learning rates, and momentum values. In cases where loss reduction was not evident for more than ten epochs, the learning rate was reduced by a factor of 1/10. The network configuration that produced the best results was adopted for testing, as well as for the remaining experiments, which will be discussed in the next section.
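The learning-rate rule described above can be sketched in a few lines of plain Python. The class name, the initial learning rate, and the exact reading of "more than ten epochs" are assumptions made for illustration. Keras offers a `ReduceLROnPlateau` callback with `factor` and `patience` arguments that serves a similar purpose, although the text does not say whether it was used.

```python
class PlateauLearningRate:
    """Reduce the learning rate when the monitored loss stops improving."""

    def __init__(self, initial_lr, factor=0.1, patience=10):
        self.lr = initial_lr
        self.factor = factor      # reduction by a factor of 1/10, as in the text
        self.patience = patience  # epochs tolerated without improvement
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:  # "more than ten epochs"
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


scheduler = PlateauLearningRate(initial_lr=1e-3)  # initial rate is hypothetical
losses = [1.0, 0.8] + [0.8] * 11  # two improving epochs, then eleven stale ones
for loss in losses:
    lr = scheduler.step(loss)
print(f"learning rate after {len(losses)} epochs: {lr:g}")
```

In this toy run, the rate drops from 1e-3 to 1e-4 only on the eleventh consecutive epoch without improvement.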

Experiments
We performed several experiments to evaluate the proposed approach in terms of its capability to correctly classify the various types of skin lesions as well as to compare it with existing methods. The classification system was implemented in Python using Keras. Three basic categories of experiments were performed. First, the ideal network configuration was selected based on several factors, such as the Xception structure and the layers in the core part. Second, various measures were used to assess the performance of the proposed model for the specific classification task. Third, the method's effectiveness was evaluated compared to state-of-the-art skin lesion classification techniques.

Performance Evaluation of the Proposed Approach
We tweaked the base Xception network architecture by reducing the depth and increasing the breadth of the middle flow of the network. We performed several experiments to analyze the effect of these changes in the depth and breadth of the neural network. As the original architecture deployed eight repetitions of the nine-layer structure of RCB layers described above, we measured the network's performance with one (Xception-m1) to seven repetitions (Xception-m7) to analyze the impact of varying network depth. We measured the accuracy of these architectures for the seven types of skin lesions mentioned above. Table 2 shows the classification accuracy for these skin lesions given different network depths. The number of parameters for each architecture is also presented in the table. It can be seen that the shallowest network (Xception-m1), i.e., the network with one repetition of the nine RCB layers, achieved the highest classification accuracy. The only exception was for NV skin lesions, for which both Xception-m1 and Xception-m2 achieved the same classification accuracy. As expected, the number of parameters was also the lowest for the shallowest architecture, thus making it the most efficient among all architectures. Xception-m1 reduced the number of parameters by 54.27% compared to the base Xception architecture.
After identifying the optimal depth of the Xception architecture, we experimented with varying breadths of the architecture by adding concatenate layers. Table 3 shows classification accuracy for the modified architectures with one (Xception-sb1) to eight concatenate layers (Xception-sb8). The experimental results showed that Xception-sb3 and Xception-sb4 achieved comparable results. While Xception-sb3 achieved higher accuracy scores for BCC, DF, NV, and VASC, Xception-sb4 outperformed it for AKIEC and MEL classification accuracy. Both architectures produced the same accuracy for BKL. We selected Xception-sb3 because of its lower number of parameters. Xception-sb3 reduced the number of parameters by 38.76% compared to the base Xception architecture. Figure 5 compares the accuracy and loss curves of the base Xception and the proposed Xception-sb architectures. The curves indicate that our proposed technique produced slightly better results compared to the base architecture.
Figure 5. Accuracy and loss curves (Xception (a,c) vs. Xception-sb (b,d)).
Figure 6 shows the confusion matrix of the proposed approach, showing predictions made for each class in terms of percentages and the number of images from the test set correctly and incorrectly classified for each lesion type. It can be seen that the proposed technique correctly classified the highest number of images (i.e., 1319 of 1341) for the NV class, which had the most images to learn from. Images of VASC were also correctly classified at a high rate of 97.33%. MEL, BKL, BCC, and AKIEC lesions were correctly classified at rates of 91.03%, 90.02%, 89.35%, and 87.02%, respectively. With the lowest value of 72.45%, the correct classification of DF proved to be the most challenging for the proposed technique.
Table 4 shows recall, precision, accuracy, F1, and MCC (Matthews Correlation Coefficient) scores for each class of skin lesion. The proposed technique achieved the highest recall for the NV class (0.9832), while the lowest recall was recorded for DF (0.7175). On the other hand, the VASC class produced the best result for precision (0.9576). The NV class achieved the lowest precision (0.8365). The VASC class was classified with the highest accuracy (0.9893), and the lowest accuracy was recorded for the DF class (0.9532). The proposed technique produced high accuracy for all lesions in general. Finally, the best F1 score was achieved for the VASC class (0.9628), while the DF class was at the other end of the spectrum with the lowest score of 0.8153. As far as MCC scores are concerned, VASC yielded the best result (0.9565). MEL, NV, BCC, and AKIEC performed well, all scoring closely. DF scored the lowest MCC value (0.7989). Overall, the proposed technique produced the best classification results for the VASC class of lesions.
Table 4 also shows macro and weighted averages of recall, precision, accuracy, F1, and MCC scores to show the overall performance of the proposed technique. The proposed technique achieved a high overall accuracy, with macro and weighted averages of 0.9689 and 0.9697, respectively. The overall recall of our approach was also high, with macro and weighted averages of 0.8915 and 0.9543, respectively. The macro and weighted averages for overall precision were 0.8946 and 0.8534, respectively. The macro and weighted average values for the F1 score were 0.8899 and 0.8996, respectively. Finally, the respective macro and weighted average scores for MCC were 0.8740 and 0.8848. Table 5 compares the base Xception network architecture with the proposed optimal architectures regarding depth and breadth.
Our depth-optimized Xception-m1 architecture outperformed the base architecture with a 0.83% improvement in accuracy, a 54.27% reduction in the number of parameters, and a 30.46% reduction in training time. Similarly, the proposed breadth-optimized Xception-sb3 architecture improved the accuracy of the base model by 2.63%, reduced the number of parameters by 38.77%, and reduced the training time by 22.12%.
We also compared the proposed technique with state-of-the-art works in the problem domain. As shown in Table 6, our proposed technique outperforms the other works regarding accuracy and recall. The overall accuracy of our proposed approach was slightly better than the best results achieved by Naeem et al. [53]. We achieved significant improvement in recall compared to the existing works. Our overall recall of 0.9543 was about 2.38% better than the best recall achieved in previous studies. However, with a precision of 0.9292, Calderon et al. [54] still outperform all the existing methods, including our proposed technique.
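The per-class and averaged scores of the kind reported in Table 4 can be derived from a confusion matrix. The sketch below uses the standard one-vs-rest definitions of recall, precision, accuracy, F1, and MCC, and it assumes (without confirmation from the text) that the reported values follow these definitions. The toy matrix contains made-up numbers and is not the paper's data.

```python
import math


def one_vs_rest_counts(matrix, k):
    """Return (tp, fp, fn, tn) for class k; rows are true labels, columns predictions."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(matrix[i][k] for i in range(n)) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def class_metrics(matrix, k):
    """Recall, precision, accuracy, F1, and MCC for class k (one-vs-rest)."""
    tp, fp, fn, tn = one_vs_rest_counts(matrix, k)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


def averaged_metrics(matrix):
    """Macro (unweighted) and support-weighted averages over all classes."""
    n = len(matrix)
    per_class = [class_metrics(matrix, k) for k in range(n)]
    support = [sum(row) for row in matrix]  # number of true samples per class
    total = sum(support)
    names = per_class[0].keys()
    macro = {m: sum(c[m] for c in per_class) / n for m in names}
    weighted = {m: sum(c[m] * s for c, s in zip(per_class, support)) / total
                for m in names}
    return macro, weighted


# Toy 3-class confusion matrix (made-up numbers, not the paper's results).
toy = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 29],
]
macro, weighted = averaged_metrics(toy)
print("macro recall:   ", round(macro["recall"], 4))
print("weighted recall:", round(weighted["recall"], 4))
```

Macro averages treat every class equally, whereas weighted averages scale each class by its number of test images, so a dominant class such as NV pulls the weighted figures toward its own scores.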

Conclusions and Future Work
Skin cancer is considered one of the most serious and widespread health concerns worldwide, with a significant impact on patients' quality of life and survival. The timely and accurate diagnosis of skin cancer is essential for effective treatment and improved outcomes. Deep learning models have shown considerable promise in assisting dermatologists with skin cancer diagnosis in recent years. In this study, we utilized a modified Xception model (called SBXception) to classify skin cancer lesions using the HAM10000 dataset. Our results demonstrated that SBXception, with its reduced and expanded architecture, had significantly improved performance in skin cancer classification, achieving an accuracy of 96.97% on a holdout test set. However, there are still some limitations to our study that need to be addressed in future research. Firstly, while our modified model achieved high accuracy on the HAM10000 dataset, its performance needs to be evaluated on other datasets to ensure its generalizability. Secondly, this study considered only the seven types of skin lesions found in the dataset. Additionally, the current work did not focus on the model's interpretability to enhance its clinical applicability or other related factors, such as demographic bias.
In terms of future directions, one possible avenue for research is to explore the potential of combining multiple deep learning models for skin cancer diagnosis to further improve accuracy. Another future direction could be the development of a mobile application that adopts deep learning models to detect skin cancers. This application could provide an easy-to-access system for initial skin cancer diagnosis for people in remote areas. Additionally, a comprehensive dataset could be developed containing images from different populations and skin colors to ensure that the deep learning models can detect skin lesions in people with different skin colors.