End-to-End Decoupled Training: A Robust Deep Learning Method for Long-Tailed Classification of Dermoscopic Images for Skin Lesion Classification

Abstract: Due to its increasing incidence, skin cancer, and especially melanoma, is considered a major public health issue. Manually detecting skin lesions (SL) from dermoscopy images is a difficult and time-consuming process. Thus, researchers have designed computer-aided diagnosis (CAD) systems to assist dermatologists in the early detection of skin cancer. Moreover, SL detection naturally exhibits a long-tailed distribution due to complex patient-level conditions and the existence of rare diseases. Very little research on handling this issue exists for SL detection. In this paper, we propose an end-to-end decoupled training scheme for the long-tailed skin lesion classification task. Specifically, we initialized the training of a network with a novel loss function $L_f$ able to guide the model to a better representation of the features. Then, we fine-tuned the pretrained network with a weighted variant of $L_f$, which helps to improve the robustness of the network to class imbalance. We evaluated our model on the ISIC 2018 public dataset against existing methods for handling class imbalance and existing approaches for SL detection. The results demonstrated the superiority of our framework, which outperformed all compared methods by a minimum margin of 2% with a single model.


Introduction
Skin cancer is an invasive disease caused by the abnormal growth of skin cells in the body. Skin cancer incidence has increased dramatically throughout the last decade [1]. Melanoma is the most dangerous type of skin cancer. Although its occurrence rate is 4%, it is responsible for about 75% of all skin-cancer-associated deaths [2]. Early diagnosis is therefore essential to reducing melanoma mortality.
The clinical diagnosis of skin cancer starts with a visual examination of the suspect areas followed by a histopathological analysis. This protocol is time-consuming, complex and subjective due to the fact that the accuracy of diagnosis is strongly related to the dermatologist's experience [3]. Therefore, it is deemed desirable to invest research efforts in the development of methods that can assist clinicians in the early detection of skin cancer.
An active strand of work has aimed to tackle the challenging task of skin lesion (SL) detection with the help of computer-aided diagnosis (CAD) systems. In particular, CAD based on deep learning models, namely convolutional neural networks (CNNs), has achieved remarkable results in the automated detection of SL, outperforming dermatologists in an experimental context [3][4][5].
Existing approaches to develop CAD for SL diagnosis can be categorized as follows: systems based on one single CNN [6][7][8], systems using multiple CNNs [9][10][11], and systems using CNNs combined with other classifiers [12][13][14]. The review articles in [2,15,16] can be referred to for detailed insights of deep learning approaches used for SL detection.
The rise of modern deep learning techniques has led to a great performance improvement on the challenging task of SL detection. However, the use of such systems in a real clinical context is still held back by the fact that SL datasets present skewed data distributions in which a few classes (head classes) contain a large number of samples, while most classes (tail classes) are under-represented [17]. The difficulty of training a model on a long-tailed dataset mainly comes from two aspects. First, deep learning methods are data-hungry, but annotations of tail classes might be insufficient for training. Second, the model tends to be biased towards head classes, since head-class samples make up the overwhelming majority of the dataset [18]. For example, in the popular public SL dataset ISIC 2018 [19,20], the ratio between the majority class and the rarest class is greater than fifty, indicating a serious class-imbalance issue. Figure 1 illustrates the long-tailed distribution of the ISIC 2018 dataset. Very limited research is available in the area of SL detection on methods for designing CAD systems that alleviate the long-tailed imbalance problem [16]. Developing methods to construct CAD systems robust to class imbalance is therefore crucial to spread the use of such systems in a real clinical context.
Figure 1. Class distribution of the ISIC 2018 dataset [19,20]. The dataset exhibits a long-tailed distribution with a ratio between the majority class and the rarest class greater than fifty. Head corresponds to lesions that are over-represented in the dataset and Tail to lesions that are under-represented.
In this work, we propose a novel deep learning framework using a single CNN to design a CAD system for SL detection which is robust to class imbalance. Existing approaches dealing with class imbalance can be subdivided into three categories [17]: data processing, cost-sensitive weighting, and decoupling methods. Decoupled training seems to achieve better performance than the reweighting methods [21]. In general, decoupled training involves a two-stage pipeline that learns representations on the imbalanced dataset in the first stage, then rebalances the classifier with a frozen representation in the second stage. However, one of the main drawbacks of this approach is that the representation could be suboptimal since it is not jointly learned with the classifier [18]. Inspired by this, we propose a two-stage end-to-end training with two novel loss functions ($L_f$ and $L_c$) able to meet the two objectives of decoupled training without disjoining the training of deep features and classifiers. The first stage uses the $L_f$ loss and guides the model to learn better representations for weight initialization. The $L_f$ loss improves the performance of the feature model in the first stage of the decoupled training and outperforms cross-entropy with an instance-balancing strategy, which is widely adopted in decoupled training. Then, the second stage focuses on dealing with the skewed distribution of the data. Specifically, the second training phase uses the $L_c$ loss, which reduces the loss contribution of easy and outlier examples while maintaining a high loss contribution for harder examples. This lets the model focus on the informative samples and makes it robust to class imbalance. We conduct several experiments to demonstrate the effectiveness of our approach on the ISIC 2018 dataset.
In summary, our key research contributions are:
• We propose two new loss functions, $L_f$ and $L_c$, able to weight samples more efficiently so as to guide the network to focus on informative samples;
• We propose an approach able to handle both the class imbalance issue and the outlier issue;
• We propose a new learning scheme for the decoupled training following an end-to-end process;
• We demonstrate the strength of our method on the ISIC 2018 long-tail benchmark dataset and show improved performance over both existing methods that deal with the class imbalance problem and prior works on the same tasks.
The remainder of the manuscript is organized as follows: some related work is discussed in Section 2. In Section 3, we formally describe the problem and present a preliminary analysis of its impact. Section 4 describes the materials and methodology applied. Then, the experimentation results and discussion are provided in Section 5. The conclusions of the research are discussed in Section 6.

Design of CAD System for Skin Lesion Detection
The current trends in designing SL diagnosis systems can be subdivided into three types of approaches [15]: those based on one CNN, those that combined multiple CNNs through an ensemble method, and those that combined CNNs with other classifiers.

CAD Based on One CNN
The first breakthrough in applying CNNs to SL came from Esteva et al. [5]. They trained a CNN using a very large dataset with 129,450 clinical images and 2032 different diseases and tested its performance against 21 board-certified dermatologists on biopsy-proven clinical images for two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi. Their results showed that the automatic system achieved a level of competence comparable to that of dermatologists. Lucius et al. [6] evaluated the performance of eight CNNs in categorizing the seven most common pigmented SL. They observed that the least accurate CNN outperformed general practitioners and that a CNN could improve a general practitioner's diagnostic accuracy in a routine clinical scenario. Zhang et al. [8] proposed an attention residual learning CNN model. Their network aimed to exploit the intrinsic self-attention ability of a CNN and generated attention maps at lower layers to improve classification performance. Yao et al. [7] combined the focal loss [22], class-balanced loss [23] and the RandAugment [24] augmentation strategy to design a CAD based on a single CNN model for the multiclass classification of SL and reached a balanced accuracy score of 0.86 on the ISIC 2018 dataset.

CAD Based on an Ensemble of CNN
Another successful technique to improve CAD systems for SL detection is by assembling a finite set of CNNs. Harangi et al. [25] fused the outputs of four CNNs by applying a weighted fusion strategy in a three-class classification task, achieving an area under the receiver operating curve (AUROC) of 0.89 which was superior to the performance of each CNN individually. Jordan Yap et al. [9] proposed a method that considered several image modalities, including patient's metadata, to improve the classification results. The ResNet50 network they used was differently applied over dermoscopic and macroscopic images, and their features were fused to perform the final classification. Their multimodal classifier outperformed the basic model using only macroscopy with an AUROC of 0.866.
Gessert et al. [10] assembled some well-known CNNs to perform a multiclass classification of SL. They first applied multiple model input resolutions and employed a cropping strategy to train their models. Then, they created a large ensemble with the optimal subset of models based on the cross-validation performance. In the same context, Foahom et al. [11] applied an ensemble and aggregation method along with a directed acyclic graph technique to develop a diagnostic system classifying SL into three classes: seborrheic keratosis, nevi, and melanoma. Their approach showed improvement in performance compared to a previous ensemble method of multiclass CNNs.

CAD Based on CNNs Combined with Other Classifiers
As mentioned earlier, some studies design CAD systems by combining CNNs with other classifiers. In this context, Mahbod et al. [13] proposed a fully automatic computerized method that was an ensemble of deep features from several well-established CNNs at different abstraction levels in combination with a support vector machine classifier to distinguish malignant melanomas from benign lesions. Similarly, Hagerty et al. [14] presented an approach that combined conventional image processing with deep learning by fusing the features from the individual techniques. Their method led to a 7% AUC improvement over the CNN model alone. Almaraz-Damien et al. [12] proposed a new CAD system based on a fusion of handcrafted features related to the medical algorithm ABCD rule (asymmetry borders-colors-dermoscopic structures) and deep learning features employing mutual information measurements. The deep features used for the fusion were obtained by transfer learning on pretrained CNNs. Abunadi et al. [26] also proposed a hybrid CAD system that combined handcrafted features such as wavelet transform, gray-level co-occurrence matrix, and local binary pattern with an artificial neural network.
As mentioned earlier, the objective of this study was to alleviate the class imbalance issue in the development of CAD for SL detection. To that end, we based our approach on the construction of a robust CAD system using a single CNN. We believe that, once the issue of class imbalance is solved, the proposed method may be easily integrated into an ensemble scheme to improve its performance.

Methods for Handling Long-Tail Distributions
Various methods have been proposed to reduce the bias of classifiers trained on long-tailed datasets. Existing methods can be divided into three categories [17,27]: data-level approaches, classifier-level approaches, and decoupled training.

Data-Level Approach
The data-level approach focuses on adjusting the class ratio in the input dataset to achieve a balanced class distribution. This approach often employs sampling techniques such as undersampling, oversampling, or a combination of both.
Oversampling consists of generating new minority-class samples from the available unbalanced data. Random oversampling is one oversampling strategy that consists of randomly replicating instances of the minority class. Another strategy, called focused oversampling, consists of resampling only instances of minority classes near the classification boundary. However, both strategies present major shortcomings. Random oversampling increases the possibility of overfitting the classifier and increases the computational cost, while focused oversampling leads to a more specific decision region of the minority class [28]. The synthetic minority oversampling technique (SMOTE) [29] is an algorithm proposed to address these issues. SMOTE attempts to create more diversity among the minority class data by generating synthetic samples. These new minority class samples are obtained by linearly interpolating the existing observations from minority classes. More recently, some strong oversampling techniques have been proposed. For example, mixup generates new images by taking a convex combination of images in the dataset [30]. Other related methods are Cutmix [31] and Cutout [32]. Cutmix blends two images by cutting a patch from one image and inserting it into another, while Cutout zeroes out some parts of the input examples. Another oversampling approach uses generative adversarial networks (GANs) to generate realistic samples from minority classes; however, GANs are difficult to train and generalize poorly on diverse datasets [33][34][35].
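The interpolation idea shared by SMOTE and mixup can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the helper names (`interpolate`, `smote_like_sample`, `mixup`) are hypothetical, the neighbor is passed in directly instead of being found by a nearest-neighbor search, and the mixup coefficient is fixed by hand, whereas mixup usually draws it from a Beta distribution.

```python
import random

def interpolate(a, b, lam):
    """Return the convex combination lam * a + (1 - lam) * b of two vectors."""
    return [lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)]

def smote_like_sample(x, neighbor, rng):
    """Draw a synthetic minority point on the segment between x and a neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    # x + gap * (neighbor - x) is the same as gap * neighbor + (1 - gap) * x
    return interpolate(neighbor, x, gap)

def mixup(x1, y1, x2, y2, lam):
    """Mix two inputs and their one-hot labels with the same coefficient."""
    return interpolate(x1, x2, lam), interpolate(y1, y2, lam)

rng = random.Random(0)
# Hypothetical 2-D minority samples; real inputs would be feature vectors or images.
synthetic = smote_like_sample([0.0, 0.0], [2.0, 4.0], rng)
mixed_x, mixed_y = mixup([1.0, 1.0], [1.0, 0.0], [3.0, 5.0], [0.0, 1.0], lam=0.25)
```

In both cases the new sample lies on the line segment between two existing ones; mixup additionally mixes the labels, so the target becomes a soft label.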
Undersampling is another common technique for handling class imbalance. In contrast to oversampling, which adds minority class data, undersampling removes data from the majority class to form a balanced dataset. The main limitation of undersampling methods is that they may remove critical information required by the model to learn. Thus, several works proposed methods for intelligently choosing the majority samples to preserve valuable information for learning. Mani et al. [36], for example, proposed several algorithms that removed majority class samples based on their distance from minority samples predicted by the K-NN algorithms.

Classifier-Level Methods
Classifier-level methods aim to adjust the learning or the decision process in a way that facilitates the learning task, specifically with respect to the minority class samples. Several disparate techniques exist in this category, including cost-sensitive learning and margin loss.
Cost-sensitive learning works by altering the loss function to make the classifier more sensitive toward minority classes [37]. Intuitively, applying different weights to training samples is similar to oversampling those data points with the appropriate frequencies.
The popular way of applying this approach consists of weighting the loss by the inverse number of samples for each class [38]. Cui et al. [23] designed a class-balanced loss, which weighted the loss by the inverse of the effective class frequencies within the neighboring region rather than the number of samples for each class. Ren et al. [39] proposed to use the label frequencies to adjust model predictions during training, so that the bias from the class imbalance could be alleviated by prior knowledge. Lin et al. [22] proposed a reformulated version of cross-entropy loss that added a weighting factor that downweighted the correctly classified sample. Similarly, Tan et al. [40] proposed a novel loss which directly downweighted the loss values of negative samples for the rare categories.
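To make the weighting schemes above concrete, the following sketch computes inverse-frequency weights, class-balanced weights based on the effective number of samples, and the focal loss for a single sample. The class counts and the values of `beta` and `gamma` are hypothetical, and the normalization (weights averaging to 1) is one common convention rather than a requirement of any of the cited methods.

```python
import math

def inverse_frequency_weights(counts):
    """Weight each class by 1 / n_k, rescaled so the weights average to 1."""
    raw = [1.0 / n for n in counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def class_balanced_weights(counts, beta=0.999):
    """Weight each class by the inverse of its effective number (1 - beta^n) / (1 - beta)."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample: -(1 - p)^gamma * log(p), p being the true-class probability."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

counts = [1000, 100, 10]  # hypothetical head, medium, and tail class sizes
inv_w = inverse_frequency_weights(counts)
cb_w = class_balanced_weights(counts)
```

With `gamma = 0` the focal loss reduces to the plain cross-entropy, and larger values of `gamma` shrink the loss of well-classified samples more aggressively.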
Other classifier-level methods include regularizers that encourage the minority classes to have larger margins. Cao et al. [41] proposed a label-distribution-aware margin loss (LDAMLoss) that minimized a margin-based generalization bound. Similarly, Menon et al. [42] proposed a modification of the softmax cross-entropy that encouraged a large relative margin between a pair of rare and dominant labels. A margin loss for imbalanced datasets was also proposed and studied in [43,44].

Decoupled Training
Decoupled training methods decouple the learning process into representation learning (first stage) and classifier training (second stage) [17,27]. The paper by Kang et al. [45] was the pioneer work on the introduction of the two-stage training scheme. They used a standard instance-balanced sampling to learn the representation stage. Then, for the second stage, they evaluated three different approaches for classifier's learning: classifier retraining, nearest-class-mean classifier, and τ-normalized classifier. Their approach established a new state-of-the-art performance on three long-tailed benchmarks. Similarly, Kang et al. [46] developed a k-positive contrastive loss to learn a more class-balanced and class-discriminative feature space, which led to better long-tailed learning performance. Other recent studies innovated on the decoupled training scheme by enhancing the classifier training stage. For example, Zhang et al. [47] applied an additional layer to calibrate the original classifier by matching the distribution of predictions with a relatively balanced distribution of classes. Wang et al. [48] proposed a unified distribution alignment strategy for long-tail visual recognition. Their approach transferred the statistics from relevant head classes to infer the distribution of tail classes in the second stage.
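As an illustration of the second-stage classifier adjustments mentioned above, the τ-normalized classifier rescales each class weight vector by a power of its norm. The sketch below assumes a linear classifier stored as one weight vector per class; the weight values are hypothetical.

```python
import math

def tau_normalize(weights, tau):
    """Rescale each class weight vector w_k to w_k / ||w_k||^tau."""
    normalized = []
    for w in weights:
        norm = math.sqrt(sum(v * v for v in w))
        normalized.append([v / (norm ** tau) for v in w])
    return normalized

def l2_norm(w):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in w))

# Hypothetical classifier weights: the head class typically ends up with a larger norm.
classifier = [[3.0, 4.0], [0.6, 0.8]]
balanced = tau_normalize(classifier, tau=1.0)   # every class gets unit norm
unchanged = tau_normalize(classifier, tau=0.0)  # tau = 0 leaves the weights as they are
```

Intermediate values of `tau` between 0 and 1 interpolate between the original classifier and a fully norm-equalized one.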
The decoupled training has been fully discussed in recent works [48][49][50], but some issues still persist and need to be resolved. First, the choice of the right loss to obtain the best features model remains insufficiently discussed. Second, the adopted resampling or reweighting methods for the second stage still have some limitations, especially focusing on head classes' learning [51], and last but not least, the two-stage learning strategy defies the expectation of end-to-end training sought in deep learning [17].
This work attempts to resolve each of the previously mentioned issues. We started by analyzing the currently used loss functions to determine the one matching the best features' representation in the first stage of the decoupled training. Then, for the second stage, we investigated whether we could design a novel loss function helping the model be more robust to class imbalance. Different from prior works, our approach followed an end-to-end training.

Problem Setting
We consider a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ training samples and $C$ classes, where $x_i$ is the training image and $y_i \in \{1, 2, \ldots, C\}$ is its label. We denote by $D_k$ the subset of $D$ containing all the samples belonging to class $k$, and by $N_k$ the number of samples in $D_k$. $D$ is considered a long-tail dataset if, after sorting the $N_k$, we have $N_1 \geq N_2 \geq \ldots \geq N_C$ and $N_1 \gg N_C$. The task of long-tail visual recognition is thus to learn a model on a long-tail training dataset that generalizes well on a test dataset.
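The long-tail condition can be checked numerically by sorting the class sizes and computing the ratio between the largest and the smallest class. The per-class counts below are not given in this text; they are the counts commonly reported for the ISIC 2018 classification training set and are used here as an assumption, with the class abbreviations used in this paper.

```python
# Assumed per-class image counts (commonly reported for ISIC 2018, not stated in this text).
counts = {
    "NEV": 6705, "MEL": 1113, "BEK": 1099, "BCC": 514,
    "ACK": 327, "VAL": 142, "DEF": 115,
}

# Sort classes so that N_1 >= N_2 >= ... >= N_C.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
sizes = [n for _, n in ordered]

total = sum(sizes)                       # N
imbalance_ratio = sizes[0] / sizes[-1]   # N_1 / N_C
```

Under these assumed counts, the total matches the 10,015 images of the dataset and the ratio between the largest and the smallest class is above fifty.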
Let $M(x_i, w)$ denote a CNN model parameterized by $w$. In its most general form, $M$ contains two components: a feature extractor $f(x_i) = \tilde{x}_i$ and a discriminative classifier $h(\tilde{x}_i) = z_i$, where $\tilde{x}_i$ denotes the deep features of input $x_i$ and $z_i$ denotes the logit output of the classifier. The prediction probability $p_i$ is generally calculated as $\mathrm{Softmax}(z_i)$. The feature extractor comprises several stacked layers of convolution, activation, and pooling that are designed to learn hierarchical feature representations of $x_i$, while the discriminative classifier is built with fully connected layers that aim to interpret the extracted features $\tilde{x}_i$ and perform the classification task.
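The mapping from logits to prediction probabilities can be sketched as follows. This is a generic, numerically stable softmax, not code from the paper; the logit values are hypothetical, and seven classes are used to match the dataset studied here.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Equal logits for seven classes give the uniform probability 1/7 for each class.
uniform = softmax([0.0] * 7)

# A larger logit for the first class shifts probability mass toward it.
biased = softmax([2.0] + [0.0] * 6)
```

When all logits are equal, every class receives probability 1/7, roughly 0.14; a classifier biased toward head classes corresponds to systematically larger logits for those classes.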
The reasons why it is challenging to train $M(x_i, w)$ on a long-tailed visual task are twofold. First, the number of tail samples is small, which makes it difficult to train, on the long-tailed training split, a feature extractor $f$ that generalizes well on tail classes. Second, the over-representation of head classes biases the classifier $h$ toward the head classes, that is, the prediction score of head classes is much higher than that of tail classes. The two training stages of our proposed method aim to tackle these challenges.

Analysis
In this section, we investigate how suitable the popular cross-entropy loss function (CE) and its weighted version (CS) are for the first stage of the decoupled training. We also analyze how the imbalanced data distribution influences the training of $M(x_i, w)$. To that end, we ran two toy experiments on ISIC 2018 with the EfficientNetB3 model. We first trained the network with CE and CS for 50 epochs to evaluate the first stage of the decoupled training. Then, we trained the network with CE for all epochs with early stopping to analyze the distribution of probabilities during a full training session (see Section 5.2 for implementation setting).
We visualize in Figure 2, with the t-distributed stochastic neighbor embedding (t-SNE) algorithm, the distribution of deep features of the validation dataset for the network trained with CS (Figure 2a) and with CE (Figure 2b). As shown in Figure 2a, the decision boundary between categories is blurry for the network trained with CS. The feature points near the decision boundary are not discriminative, leading to many false positives. On the other hand, the network trained with CE generates features that are more discriminative in the two-dimensional feature space. These observations suggest that features produced by class-balancing loss functions during the first stage of the decoupled training are worse than those produced by non-weighted losses.
For an in-depth study of the influence of the long-tailed distribution on the training of a model, we visualize in Figure 3 the probability distributions during training of the head class (Nevi) and the tail class (Dermatofibroma) on the validation split. We first observe that at the initialization of the model, all the probabilities have values in the interval [0.1, 0.3], which is expected because the neurons of the classifier are initialized such that all classes are equally probable (in our case we have seven classes), thus giving probabilities around 0.14. Then, we observe that for the head class Nevi, learning is easy: the prediction probabilities very quickly become more and more confident, with convergence approximately reached from epoch 20. On the other hand, learning the tail class is much less straightforward. We observe at the beginning of the training that the model has difficulty in discriminating this class with prediction probabilities up to about epoch 22

Theoretical Motivation
Our efforts here are focused on SL classification, which presents a skewed distribution between classes. Specifically, we wish to design a learning framework for constructing a CAD system robust to class imbalance. To that end, inspired by decoupled training works [45] and the analysis presented in Section 3.2, we define a two-stage training based on two novel loss functions, $L_f$ and $L_c$. Figure 4 illustrates both functions alongside the cross-entropy criterion. The $L_f$ loss function is used during the first training phase and guides the model to learn a better feature representation of the task. The $L_c$ loss function is used during the second training phase, and its objective is to deal with class imbalance issues. Both stages work in an end-to-end manner. For a sample image $x_i$, let $p_i$ be the probability derived from the softmax function applied to the logit $z_i$ output by the model, and let $y_i$ be the ground-truth label of $x_i$. We denote by $p_c \in [0, 1]$ the probability predicted by the model that $x_i$ belongs to its ground-truth class $c$.
Revisiting cross-entropy loss formulation: The softmax cross-entropy loss is defined by:
$$L_{CE}(p, y) = -\log(p_c) \quad (1)$$
Revisiting mining samples definition: An easy sample $x_i$ is a sample that the model predicts with a high probability ($p_c > 1 - \exp(-\eta)$, $\eta > 0$ with $\eta$ large). Otherwise, when the predicted probability is low ($p_c$ around 0.5), the sample is considered a hard sample. Prior works on deep learning [52][53][54] have demonstrated that hard samples carry more discriminative information than easy samples. On the other hand, Li et al. [55] defined as outliers the samples with very large gradients (p c < 1 − exp(−η), η > 0 with η large). They observed that these samples persisted even when the model converged. This is similar to the observation we made in the analysis section. We believe that outliers can also be assimilated to mislabeled data.
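The easy-sample criterion can be made concrete with a short sketch that evaluates the standard cross-entropy of the true class, −log(p_c), and the threshold 1 − exp(−η). The value η = 3 is a hypothetical choice for illustration only; the text does not state which value was used.

```python
import math

def cross_entropy(p_c):
    """Cross-entropy of one sample, given the probability assigned to its true class."""
    return -math.log(p_c)

def easy_threshold(eta):
    """Probability above which a sample counts as easy: 1 - exp(-eta)."""
    return 1.0 - math.exp(-eta)

def is_easy(p_c, eta):
    """True when the true-class probability exceeds the easy-sample threshold."""
    return p_c > easy_threshold(eta)

eta = 3.0  # hypothetical value; larger eta pushes the threshold closer to 1
threshold = easy_threshold(eta)
confident = is_easy(0.99, eta)   # a confidently classified sample
ambiguous = is_easy(0.50, eta)   # a sample near p_c = 0.5, i.e., a hard sample
```

With η = 3 the threshold is about 0.95, so a sample predicted at 0.99 is easy and contributes a cross-entropy of only about 0.01, while a sample at 0.5 is not easy and contributes about 0.69.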

Definition of Loss Functions
Based on the observations made in our analysis, the initialization stage needed to downweight the loss contribution of easy samples, to prevent head classes from overwhelming the total loss during training, while maintaining a higher loss contribution for harder samples to help the model better discriminate tail classes. Moreover, non-class-balancing losses are suitable for improving the representation learning of the feature model. To define a loss function meeting these criteria while being continuous and differentiable, we took inspiration from signal theory and borrowed the cardinal sine function. We thus introduced the function $L_f$ defined as:
In Equation (2), the cardinal sine factor allowed us to define a distribution of costs following the same dynamics as the cross-entropy while maintaining a very low contribution for easy samples. The gradients were computed by differentiating $L_f$ with respect to the input $p_i$ with the following formulation:
Once the model had learned a good representation of the features, we needed to guide its learning toward discriminative samples to make it robust to class imbalance. To that end, we wanted to mitigate the contribution of the very large gradients, preventing them from affecting the convergence of the model and leading it to focus on discriminative samples. Thus, we modified the $L_f$ function by subtracting from the cardinal sine term another cardinal sine of a higher frequency, and added a damping factor through the exponential function to smooth the oscillation induced by the subtraction of both terms. The resulting loss function $L_c$ could thus be defined by:
The $L_c$ loss satisfied the following mathematical properties:
• When the gradient of a sample was very large, corresponding to $p_i$ near 0, the loss went to 0, and the model was less affected by outliers:
$$\lim_{p \to 0} L_c(p, y) = 0 \quad (5)$$
• When the gradient of a sample was very low, corresponding to $p_i$ near 1, the loss went to 0, which prevented the model from being overwhelmed by easy samples:
$$\lim_{p \to 1} L_c(p, y) = 0 \quad (6)$$
In practice, for the second stage, we used a version of the $L_c$ loss weighted by a classical weighting method such as the inverse of the class frequencies. From our experiments, we observed that the models generally achieved the best performance for δ ≥ 1000. The value of δ was set by a grid search.
It can be seen from Figure 4 that the $L_f$ loss (green curve) considerably reduces the loss of the well-classified samples ($p_c > 1 - \exp(-\eta)$, $\eta > 0$ with $\eta$ large) compared to the CE loss (blue curve), which helps to prevent easily classified samples from dominating the gradient while maintaining a contribution of harder samples similar to that of the CE loss. The $L_c$ loss (red curve) follows the same distribution as $L_f$ except that it also downweights the loss of very hard samples, preventing the model from being affected by outliers.

Description of the Proposed Learning Framework
The overall steps of the proposed framework are shown in Figure 5. This framework is composed of two main phases: training and testing. In both phases, a preprocessing step is performed on the input images. The training phase is subdivided into two stages. In the first stage, a CNN is fine-tuned with the $L_f$ loss. This stage aims to guide the feature extraction of the CNN to learn a good representation of features for the given task. The second stage begins when the number of training epochs reaches a threshold T. In this stage, the CNN is fine-tuned with a weighted version of the $L_c$ loss. This stage aims to guide the learning of the classifier to balance the head and tail classes. It is advantageous to set T as the epoch at which the model has begun to converge to a local minimum. In our study, T was automatically defined as the epoch at which the model performance on the validation split had not improved in terms of balanced accuracy for 10 epochs. The testing phase of the proposed framework performs the evaluations. The code and models used in this paper are available in open source via the link provided in the supplementary material.
Figure 5. Overview of the proposed framework. In the first stage, we train the pretrained CNN with the $L_f$ loss to guide the model to learn a more discriminative representation of features. In the second stage, when the number of epochs is greater than a threshold T, we continue the training of the model with the weighted version of $L_c$ to perform the final classification task.
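The rule for choosing the switching epoch T can be sketched as a patience check on the validation balanced accuracy. This is a minimal illustration, not the released code: the function names and the validation history are hypothetical, and the two losses are represented only by labels.

```python
def find_switch_epoch(val_bacc, patience=10):
    """Return the first epoch at which validation BACC has not improved for
    `patience` epochs, or None if that never happens."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_bacc):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None

def loss_for_epoch(epoch, switch_epoch):
    """Stage 1 (Lf) before the switch epoch, stage 2 (weighted Lc) from it onward."""
    if switch_epoch is None or epoch < switch_epoch:
        return "Lf"
    return "weighted Lc"

# Hypothetical validation history: improvement until epoch 4, then a plateau.
history = [0.50, 0.60, 0.70, 0.75, 0.80] + [0.79] * 15
T = find_switch_epoch(history)
```

Because the same network and optimizer continue across the switch, only the loss changes at epoch T, which is what keeps the two stages within a single end-to-end run.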

Dataset Description and Preparation
The evaluation of our approach was conducted on the ISIC 2018 dataset. The ISIC 2018 dataset includes 10,015 dermoscopic images across seven different categories: melanoma (MEL), melanocytic nevus (NEV), basal cell carcinoma (BCC), actinic keratosis (ACK), benign keratosis (BEK), dermatofibroma (DEF), and vascular lesion (VAL). Samples of each of the seven categories present in the dataset are illustrated in Figure 6. As shown in Figure 1, the ISIC 2018 dataset presents a long-tailed distribution that can be subdivided into head classes (MEL, NEV, and BEK), medium classes (BCC and ACK), and tail classes (DEF and VAL) for a more in-depth study of class imbalance robustness. The images are in high resolution. We used 80% of the images as training data, 10% as validation data, and 10% as testing data. We also performed standard preprocessing techniques for SL images [11]. Specifically, we center-cropped each image to preserve the aspect ratio, resized it to 300 × 300 using bicubic interpolation, and performed a color standardization using the gray-world color constancy algorithm [56]. We also applied standard data augmentation techniques, namely horizontal flipping, vertical flipping, and random rotation.
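Two of the preprocessing steps, the center crop and the gray-world color standardization, can be sketched with NumPy. This is the basic form of gray-world (each channel is scaled so that its mean equals the global mean), which may differ in detail from the variant in [56]; the function names are hypothetical, and the bicubic resizing to 300 × 300 is omitted because it needs an image library.

```python
import numpy as np

def center_crop_square(image):
    """Crop the largest centered square from an H x W x 3 image."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return image[top:top + side, left:left + side]

def gray_world(image):
    """Scale each RGB channel so that its mean matches the global mean intensity."""
    img = image.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    gray = channel_means.mean()
    gains = gray / np.maximum(channel_means, 1e-8)  # guard against empty channels
    balanced = np.rint(img * gains)
    return np.clip(balanced, 0, 255).astype(np.uint8)

# Hypothetical image with a strong red cast: channel means 200, 100, 60.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[..., 0], image[..., 1], image[..., 2] = 200, 100, 60

cropped = center_crop_square(image)
corrected = gray_world(cropped)
```

After correction, the three channel means coincide (120 in this toy case), which removes the global color cast caused by the acquisition device or lighting.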

Training of the Convolutional Neural Network
We used a pretrained EfficientNetB3 [57] as the backbone to conduct all our experiments. Only the classification layer was modified to adapt the model to a multiclass task with seven classes. The number of blocks, the name and kernel size of the convolution layers in a corresponding block, the size of each filter, and the number of layers are described in Table 1. We used the Adam optimizer with the following settings: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-7}$, and amsgrad = false. Models were trained with a batch size of 128 for 100 epochs. Similar to [58], we used the cyclical learning rate (CLR) proposed by [59] to schedule the learning rate during training in the range from 0.001 to 0.00001. We also applied regularization to avoid overfitting by stopping the training early when the balanced accuracy on the validation set had not improved for 20 epochs, and selected the saved model with the highest balanced accuracy score. The best value obtained for the hyperparameter δ was $10^7$.
Table 1.
Conv; 10 × 10; 1 × 1; 1
10; Global pooling; 10 × 10; 1
11; Dense layer; 10 × 10; 1
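A cyclical learning rate between the two bounds given above can be sketched with the triangular policy. The choice of the triangular policy and the half-cycle length `step_size` are assumptions made for illustration; the text specifies only the range of the learning rate.

```python
import math

def triangular_clr(iteration, base_lr=1e-5, max_lr=1e-3, step_size=200):
    """Triangular cyclical learning rate: rises linearly from base_lr to max_lr
    over step_size iterations, then falls back over the next step_size."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Learning rate at the start, quarter, middle, and end of the first cycle.
lr_start = triangular_clr(0)
lr_quarter = triangular_clr(100)
lr_peak = triangular_clr(200)
lr_end = triangular_clr(400)
```

The schedule starts at the lower bound, peaks at the upper bound after `step_size` iterations, and returns to the lower bound at the end of each cycle.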

Evaluation Metrics
Plain accuracy favors the correct classification of overrepresented classes, which is problematic for an imbalanced dataset. Therefore, we opted for the balanced accuracy (BACC) to rank the approaches, which is defined as:

$$\mathrm{BACC} = \frac{1}{C} \sum_{i=1}^{C} S_i$$

where $S_i$ denotes the sensitivity of class $i$ and $C$ the number of classes. Another well-used metric for medical analysis that we used is the area under the receiver operating characteristic curve (AUROC), which reflects the level of separability between classes.
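The balanced accuracy described above, the mean of the per-class sensitivities, can be sketched as follows. The function name `balanced_accuracy` is illustrative, and skipping classes that are absent from the ground truth is an assumption made here to avoid a division by zero.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred, num_classes: int) -> float:
    """Mean of per-class sensitivities (recall) over the classes present."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivities = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():  # assumption: classes without samples are skipped
            sensitivities.append(np.mean(y_pred[mask] == c))
    return float(np.mean(sensitivities))


# A classifier that always predicts the majority class: plain accuracy is
# 0.8, but the balanced accuracy is only 0.5.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred, num_classes=2))
```

The toy example shows why plain accuracy is misleading on a long-tailed dataset: ignoring the minority class entirely still yields 80% accuracy.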

Results
This section presents and discusses the results of our experiments. As the training of a neural network is a stochastic process, we ran each of the involved methods ten times with different random seeds in all our experiments and report the average performance together with its standard deviation.
To validate our approach, we performed the following experiments:
• We conducted an ablation study to analyze which of the two loss functions, the commonly used CE or $L_f$, was more appropriate for stage one;
• We compared our full pipeline with common methods proposed in the literature for handling class imbalance, namely cost-sensitive loss (CS) [38], class-balanced loss based on the effective number of samples (CB) [23], focal loss (FL) [22], label-distribution-aware margin loss (LDAM) [41], influence-balanced loss (IB) [60], bag of tricks (BAGs) [50], and decoupled training [21];
• We compared our approach with prior works developing CAD systems for SL classification;
• We analyzed the best performance achieved with our pipelines.

Table 2 summarizes the test-set performance of state-of-the-art approaches for handling class imbalance compared to our pipelines. By analyzing the performance obtained for each group of classes according to its level of imbalance (head, medium, and tail), we observed that our approach improved the performance for the classes belonging to the medium group (2% improvement) and the tail group (3% improvement) while maintaining a good performance for the head group. Moreover, our method achieved the best overall performance, reaching an average BACC of 87% with a minimum margin of 2% over the other methods. This result suggests that our approach allowed us to build a system robust to class imbalance.

Table 2. Balanced accuracy rate of EfficientNetB3 trained with various methods for handling the long-tailed distribution and our pipelines on the testing split. Our approach achieves the best overall performance. † indicates our reimplementation.
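The group-level analysis mentioned above can be illustrated with a short sketch that averages per-class sensitivities within the head, medium, and tail groups defined for ISIC 2018. Averaging sensitivities within a group is an assumption about how the group scores were obtained, and the numbers in the example are placeholders, not results from the paper.

```python
from statistics import mean

# Class groups as defined for the ISIC 2018 long-tailed split.
GROUPS = {
    "head": ("MEL", "NEV", "BEK"),
    "medium": ("BCC", "ACK"),
    "tail": ("DEF", "VAL"),
}


def group_scores(per_class_sensitivity: dict) -> dict:
    """Average the per-class sensitivities inside each imbalance group."""
    return {
        group: mean(per_class_sensitivity[name] for name in classes)
        for group, classes in GROUPS.items()
    }


# Placeholder sensitivities, for illustration only.
example = {"MEL": 0.5, "NEV": 1.0, "BEK": 0.75,
           "BCC": 0.5, "ACK": 1.0, "DEF": 0.25, "VAL": 0.75}
print(group_scores(example))
```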

Performance of the Best Model with Our Approach
To further investigate the performance of our approach, we generated the receiver operating characteristic curves for each lesion class using the best model obtained with our pipelines (see Figure 7). Our model performed well, with an AUROC above 95% on all classes. Interestingly, both tail classes obtained an AUROC of 100%, which supports the previous conclusion on the robustness of our approach to class imbalance.
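A per-class AUROC of the kind reported above can be computed in a one-vs-rest fashion, as sketched below. The one-vs-rest setup is an assumption about how the per-lesion curves were produced. The sketch relies on the fact that the AUROC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names are illustrative.

```python
import numpy as np


def binary_auroc(labels, scores) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels]
    negatives = scores[~labels]
    # Compare every positive with every negative; ties count as half a win.
    wins = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return float((wins + 0.5 * ties) / (positives.size * negatives.size))


def one_vs_rest_auroc(y_true, probabilities) -> list:
    """Per-class AUROC, treating each class in turn as the positive class."""
    y_true = np.asarray(y_true)
    probabilities = np.asarray(probabilities)
    return [
        binary_auroc(y_true == c, probabilities[:, c])
        for c in range(probabilities.shape[1])
    ]
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a test split of about a thousand images but would be replaced by a rank-based computation on larger sets.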

Comparative Study with Other CAD Systems for Skin Lesion Detection
We report in Table 3 the performance of several CAD systems for SL detection. The reported values are taken from the original papers. Our approach obtained the best performance. Moreover, although we used a single CNN model, we still managed to outperform some works that used an ensemble of several CNNs, including the works of Gessert et al. [10] and Barata et al. [61].

To evaluate the effectiveness of the first stage of our decoupled method, we compared the performance of our cost function $L_f$ with the commonly used CE for this stage. As presented in Table 4, the $L_f$ loss function obtained a better performance, with a BACC of 83% compared to 82% for the cross-entropy criterion. This shows that our cost function was more suitable than CE for training the feature model in stage one of the decoupled training.

To analyze our approach further, we plotted the learning curves on the training and validation datasets of the EfficientNetB3 model trained with the CS cost function (Figure 8a) and with our approach (Figure 8b). In Figure 8a, the model trained with the CS function converged around epoch 40, after which its performance plateaued. In contrast, for the model trained with our approach (Figure 8b), the second training phase prolonged the convergence of the model. Indeed, when the model started to stagnate around epoch 40, the switch from the $L_f$ cost function to the $L_c$ cost function yielded a significant performance gain, visible as a gap in the loss around epoch 60. As a reminder, we set a delay of 10 epochs before the cost function could be switched automatically. This result highlights the contribution of our approach compared to a classical learning procedure.
Figure 8. Learning curves of EfficientNetB3 trained with (a) cost-sensitive cross-entropy loss (CS) and (b) our learning scheme. We observe that the second training phase of our approach has prolonged the convergence of the model with a significant performance gain that can be observed through the gap obtained around epoch 60. This highlights the contribution of our approach compared to a classical learning procedure.
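The automatic switch between the two training stages discussed in this section can be sketched as a small plateau detector. This is only one plausible reading of the "delay of 10 epochs": here the switch is triggered once the validation score has failed to improve for 10 consecutive epochs. The class name `LossSwitcher` and the use of the validation balanced accuracy as the monitored score are assumptions, not details confirmed by the text.

```python
class LossSwitcher:
    """Move from the stage-one loss to the stage-two loss on a plateau.

    Assumption: a plateau means `delay` consecutive epochs without any
    improvement of the monitored validation score.
    """

    def __init__(self, delay: int = 10):
        self.delay = delay
        self.best = float("-inf")
        self.epochs_since_best = 0
        self.stage = 1  # 1: stage-one loss, 2: weighted stage-two loss

    def update(self, val_score: float) -> int:
        """Record one epoch's score and return the stage for the next epoch."""
        if val_score > self.best:
            self.best = val_score
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        if self.stage == 1 and self.epochs_since_best >= self.delay:
            self.stage = 2
            self.epochs_since_best = 0  # restart the count for stage two
        return self.stage
```

A training loop would call `update` at the end of every epoch and select the loss for the following epoch from the returned stage, so that both stages run in a single end-to-end session.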

Conclusions
In this work, we presented an end-to-end decoupled training framework for developing computer-aided diagnosis (CAD) systems for skin lesions. The proposed approach aims to tackle the issue of CAD systems trained on a long-tailed skin lesion dataset and thus to construct a CAD system that is robust to class imbalance. We conducted comprehensive ablation studies and experiments to demonstrate the effectiveness of our method. With a single CNN, our approach outperformed all the CAD systems with which we compared it by a margin of at least 2%, achieving a BACC of 88% on the classification of the seven skin lesion types in our task. Moreover, our approach outperformed existing approaches proposed to handle class imbalance. In future work, we plan to integrate our method into an ensemble scheme, which we believe will allow us to greatly improve the detection accuracy of our CAD system. Moreover, adding recent deep learning techniques such as test-time augmentation could also help our method reach better performance.

Conflicts of Interest:
The authors declare no conflict of interest.