Towards Accurate Skin Lesion Classification across All Skin Categories Using a PCNN Fusion-Based Data Augmentation Approach

Deep learning models yield remarkable results in skin lesion analysis. However, these models require considerable amounts of data, while access to images with annotated skin lesions is often limited and the classes are often imbalanced. Data augmentation is one way to alleviate the lack of labeled data and class imbalance. This paper proposes a new data augmentation method based on an image fusion technique to construct a large dataset covering all existing skin tones. The fusion method applies a pulse-coupled neural network fusion strategy in a non-subsampled shearlet transform domain and consists of three steps: decomposition, fusion, and reconstruction. The dermoscopic dataset is obtained by combining the ISIC2019 and ISIC2020 Challenge datasets. A comparative study with current algorithms was performed to assess the effectiveness of the proposed one. The results of the first experiment indicate that the proposed algorithm best preserves the dermoscopic structure of the lesion and the skin tone features. The second experiment, which consisted of training a convolutional neural network model with the augmented dataset, indicates the largest increases in accuracy, 15.69% and 15.38%, for the tanned and brown skin categories, respectively. The model precision, recall, and F1-score also increased. The obtained results indicate that the proposed augmentation method is suitable for dermoscopic images and can be used as a solution to the lack of dark skin images in the dataset.


Introduction
Advances in neural network architecture, computation power, and access to big data have favored the application of computer vision to many tasks. Esteva et al. in [1] have demonstrated the effectiveness of convolutional neural networks (CNN) in computer vision tasks such as skin lesion classification. CNNs identify and extract the most useful features to classify images. Research has shown that training deep models with millions of parameters requires relatively large-scale datasets to reach high accuracy. According to [2], it is therefore a generally accepted notion that larger datasets improve classification performance.
However, assembling huge datasets can quickly become tricky due to the manual work required to collect and label the data. Building big medical image datasets is especially tricky due to the rarity of diseases, patient privacy [3], the need for medical experts for labeling, and the high cost of medical imaging acquisition systems. These obstacles have led to the creation of several data augmentation methods such as color space transformations, geometric transformations, kernel filters, mixing images, and random erasing [4]. More complex augmentation methods based on generative models and image fusion strategies have recently been developed for medical image classification, but not all of these methods are suited to every task [5].
This paper proposes a novel image augmentation algorithm that combines the structure of one dermoscopic image with the color appearance of another to construct augmented images that cover all the existing skin tones. The algorithm is based on a pulse-coupled neural network fusion strategy in a non-subsampled shearlet transform domain.
The remainder of the paper is structured as follows. Section 2 is a comprehensive review of deep-learning-based data augmentation techniques and image fusion methods. Section 3 presents the proposed approach, followed by experimental results and discussion in Section 4. The article ends with the conclusion and future perspectives.

Data Augmentation Methods
In the literature, there are two categories of data augmentation. The first category is based on basic image manipulations, and the other category is based on deep learning.
Data augmentations from basic image manipulations commonly consist of image rotation, reflection, scaling (zoom in/out), shearing, histogram equalization, enhancing contrast or brightness, white balancing, sharpening, and blurring [6]. These easy-to-understand methods have been proven to be fast, reproducible, and reliable. Their implementation is relatively simple and available to download for the best-known deep learning frameworks, which makes them more popular [7].
The literature distinguishes two forms of deep-learning-based data augmentation: convolutional-layer-based methods and generative adversarial network (GAN) based methods. Hui et al. in [8] proposed a deep-learning-based method named Densefuse. The method combines convolutional layers and a dense block as an encoder to extract deep features, and convolutional layers as a decoder to reconstruct the final fused image. An addition strategy and an l1-norm strategy are used to combine features. The results indicate the effectiveness of the proposed architecture for infrared and visible image fusion tasks. Zhang et al. in [9] proposed a method named IFCNN. The framework of this method consists of feature selection with convolutional layers, a fusion rule, and feature reconstruction with convolutional layers. The results demonstrate good generalization potential. Subbiah Parvathy et al. in [10] proposed a deep-learning-based method that optimizes the threshold of fusion rules in the shearlet transform. The proposed method has high efficiency for different input images.
GAN-based data augmentation, originally proposed by Goodfellow et al., brought a breakthrough in the synthetic data generation research field. A GAN framework consists of two separate networks, called the discriminator and the generator, that are trained competitively. The intended task of the discriminator is to distinguish synthesized samples from original ones, whereas the generator is tasked with generating realistic images that can fool the discriminator. According to Bowles et al. in [11], GANs generate additional information from a dataset. GANs were introduced in 2014 [12]; since then, various GAN extensions such as DCGANs, CycleGANs, and progressively growing GANs [13] were published in 2015, 2017, and 2018, respectively. In medical image analysis, GANs are widely used for image reconstruction [7,14,15], segmentation [16], classification [17,18], detection [19], registration, and image synthesis such as brain MRI image [20,21], liver lesion [22], and skin lesion synthesis [23,24].
Zhiwei Qin et al. [25] proposed a style-based GAN model. The model modifies the structure of the style control and noise input in the original generator, and adjusts both the generator and the discriminator to efficiently synthesize 256 × 256 skin lesion images. By adding 800 synthesized melanoma images to the training set, the accuracy, sensitivity, specificity, average precision, and balanced multiclass accuracy of the classifier were improved by 1.6%, 24.4%, 3.6%, 23.2%, and 5.6%, respectively. Alceu Bissoto et al. in [24] proposed an image-to-image translation model named pix2pixHD. Instead of generating the image from noise (the usual procedure with GANs), the model synthesizes new images from a semantic label map (segmentation mask) and an instance map (an image where each pixel belongs to a class). The generated synthetic images contain features that characterize a lesion as malignant or benign. Moreover, the synthetic images contain relevant features that improve the classification network by an average of 1.3 percentage points and keep the network more stable. Kora Venu et al. in [26] generated X-ray images for the underrepresented class using a deep convolutional generative adversarial network (DCGAN). Experimental results show an improvement of a CNN classifier trained with the augmented data.
In conclusion, Table 1 presents the strengths and the limitations of deep-learning-based methods. As presented, several works applied deep-learning-based data augmentation to correct class imbalance by generating realistic images. However, although the images generated by GANs are realistic, there is a problem of partial mode collapse [27,28]. Mode collapse refers to scenarios in which the generator produces multiple images containing the same color or texture themes, which favors duplicates in generated images.
Table 1. Summary of medical fusion methods and deep-learning-based data-augmentation methods.

Image Fusion Methods
Image fusion generates an informative image by integrating multiple source images of the same scene. The input source images in an image fusion system can be acquired either from various kinds of imaging sensors or from one sensor with different optical parameter settings. An efficient image fusion method preserves relevant features by extracting all important information from the source images without producing any inconsistencies in the output image [37]. Image fusion techniques have been widely used in computer vision, surveillance, medical imaging, and remote sensing. According to the literature, there are two main branches of image fusion, namely spatial domain methods and transform domain methods [5]. Spatial domain methods merge the source images without transformation by choosing pixels, regions, or blocks. Transform domain methods first decompose the source images with a transform, fuse the resulting coefficients, and then apply the inverse transform to obtain the fused image. A variety of transforms have been used for image fusion, such as those based on sparse representation [38], discrete wavelet transform [29,36], curvelet transform, contourlet transform, dual-tree complex wavelet transform, non-subsampled contourlet transform [30], and shearlet transform [31]. A variety of fusion strategies have also been used, such as sparse representation (SR) [30,35], enhanced sparse representation [37], modified sum-modified Laplacian (SML) [30], coupled neural P (CNP) systems [32], pulse coupled neural network (PCNN) [33], and PCNN variants [29].
The literature on medical image fusion is growing, and methods combining decomposition methods and fusion strategies have been proposed. Sarmad Maqsood et al. in [37] proposed a method for fusing computed tomography and magnetic resonance images. They used a spatial gradient-based edge detection technique to split each source image into a detail layer and a base layer, applied an enhanced sparse representation approach as the fusion strategy, and formed the fused image by linear integration of the final fused detail layer and fused base layer. Five metrics (entropy, spatial structural similarity, mutual information, feature mutual information, and visual information fidelity) were used to confirm the superiority of the proposed method over other methods. Yuanyuan Li et al. in [30] proposed a fusion technique based on the non-subsampled contourlet transform (NSCT) and sparse representation (SR). NSCT is applied to decompose the source images into the corresponding low-pass and high-pass coefficients. The low-pass and high-pass coefficients are fused using SR and the sum-modified Laplacian (SML), respectively. The final fused image is obtained by applying the inverse transform to the fused coefficients. Experiments show that the proposed solution achieves better performance on structural similarity and detail preservation in fused images. Similarly, Li Liangliang et al. in [34] applied NSCT for image decomposition and refined the fused image based on the energy of the gradient (EOG). Visual results and fusion evaluation metrics show a significant performance of the proposed technique. Xiaosong et al. in [35] proposed an image fusion and denoising method that decomposes images into a high-frequency layer, a low-frequency structure layer, and a low-frequency texture layer. They applied sparse representation, absolute maximum, and neighborhood spatial frequency as fusion rules on the respective layers to generate the fused layers. The fusion result is finally obtained by reconstructing the three fused layers. The results show that the method responds well to noisy image fusion problems. Bo Li et al. in [32] proposed a method based on coupled neural P (CNP) systems in the NSST domain. They first compared the method to other fusion methods and then compared it to deep-learning-based fusion methods. Experimental results demonstrated the advantages of the proposed fusion method for multimodality medical image fusion. Shehanaz, S. et al. in [36] proposed a multimodality fusion method that uses the discrete wavelet transform (DWT) for image decomposition and particle swarm optimization for optimal fusion of the coefficients. Wang et al. [29] proposed an image fusion method based on the wavelet transform. The method also uses the DWT to decompose the source images, then fuses the coefficients with a dual-channel pulse coupled neural network (PCNN) and applies the inverse DWT to reconstruct the fused image. The effectiveness of the proposed method was demonstrated by experimental comparisons with different fusion methods. Li et al. in [33] combined PCNN and the weighted sum of eight neighborhood-based modified Laplacian (WSEML), integrating guided image filtering (GIF) fusion rules in the non-subsampled contourlet transform (NSCT) domain. The proposed method fused multimodal medical images well.
In conclusion, according to the literature, there are several fusion methods, and their efficiency depends on the decomposition method and the fusion strategy. Table 1 summarizes the strengths and limitations of the presented image fusion methods. The main advantages of fusion methods are that they preserve features, are easy to implement and fast.

Proposed Method
This paper proposes a solution to correct the skin tone imbalance observed across skin lesion datasets. The proposed method, illustrated in Figure 1, is composed of two main parts. The first part consists of source image decomposition and fused coefficient reconstruction using the non-subsampled shearlet transform (NSST), because NSST-based algorithms are shift invariant and can eliminate edge effects efficiently. The second part performs coefficient fusion using an updated pulse-coupled neural network (PCNN), chosen for its simplicity, speed, and efficiency.

Non-Subsampled Shearlet Transform (NSST)
NSST, as shown in Figure 2, is a multiscale decomposition used to efficiently represent the high- and low-frequency information of a source image [39]. First, the source image is decomposed into low-pass and high-pass bands using the non-subsampled Laplacian pyramid (NSLP) transform. At each level of decomposition, the high-pass bands are passed through translation-invariant shearlet filters, and the low-pass bands are further decomposed into low-pass and high-pass bands for the following level. The inverse NSST is then applied by taking the sum of all shift-invariant shearlet filter responses at the respective levels of decomposition, and the inverse non-subsampled Laplacian pyramid transform is finally applied to obtain the reconstructed image.
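To make the multiscale idea concrete, the following is a minimal sketch of the non-subsampled pyramid stage only. It is not an NSST implementation: the directional shearlet filtering of the high-pass bands is omitted, and the 3 × 3 mean filter with dilated taps is an assumption chosen for brevity. The function names `box_blur`, `decompose`, and `reconstruct` are hypothetical. The sketch shows that, because no band is subsampled, every band keeps the image size and the image is recovered by summing the bands.

```python
import numpy as np

def box_blur(img, step):
    """3x3 mean filter whose taps are `step` pixels apart (assumed low-pass filter)."""
    padded = np.pad(img, step, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in (0, step, 2 * step):
        for dx in (0, step, 2 * step):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def decompose(img, levels):
    """Split an image into one low-pass band and `levels` high-pass bands."""
    low = img.astype(float)
    highs = []
    for k in range(levels):
        smooth = box_blur(low, 2 ** k)   # dilate the filter instead of subsampling
        highs.append(low - smooth)       # detail lost by smoothing at this level
        low = smooth                     # the low-pass band feeds the next level
    return low, highs

def reconstruct(low, highs):
    """Inverse of `decompose`: sum the low-pass band and all high-pass bands."""
    return low + sum(highs)

if __name__ == "__main__":
    image = np.arange(64, dtype=float).reshape(8, 8)
    low_band, high_bands = decompose(image, levels=3)
    print(np.allclose(reconstruct(low_band, high_bands), image))
```

A full NSST would additionally split each high-pass band into directional sub-bands before fusion.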

Pulse Coupled Neural Network (PCNN)
The PCNN introduced by Johnson, J.L. et al. [40] is a neuron model inspired by the visual cortex of small mammals such as cats and is composed of three modules: the receptive field, the modulation field, and the pulse generator [41].
PCNN is a two-dimensional $M \times N$ network in which each neuron corresponds to a specific pixel of the image. Figure 3 presents an original PCNN neuron. The neuron is activated through an iterative calculation that combines these modules according to Equations (1) to (7). The index $(i, j)$ refers to a pixel location in the image, $(k, l)$ refers to the neighboring pixels around that pixel, and $n$ denotes the current iteration. The receptive field, described by Equations (1) and (3), consists of the $F$ and $L$ channels. The linking channel $L$ receives local stimulus from surrounding neurons. In the simplified PCNN neuron, the feeding channel $F$ receives external stimulus from the input signal $I$. In the original PCNN, $F$ is described by Equation (2).
The modulation field consists of the internal activation $U$, presented in Equation (4), which modulates the information of the above module with the linking strength $\beta$.
The pulse generator is described in Equations (5) and (6). The output module compares $U$ with the dynamic threshold $\theta$. If $U_{ij}$ is larger than $\theta_{ij}$, the neuron is activated and generates a pulse, characterized by $Y_{ij} = 1$; otherwise $Y_{ij} = 0$. The excitation time of each neuron, denoted $T$, is given by Equation (7).
The weight matrices $W$ and $M$ are local interconnections, $V_\theta$ is a large impulse, and $V_F$ and $V_L$ are magnitude scaling terms. $\alpha_F$, $\alpha_L$, and $\alpha_\theta$ are the time decay constants associated with $F$, $L$, and $\theta$, respectively.
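Equations (1) to (7) are referenced above but not reproduced in this text. As a reading aid, the standard PCNN formulation that matches the symbols and the equation roles described above is commonly written as follows. The exact form, the iteration index used for $\theta$ in Equation (5), and the numbering are assumptions based on the usual PCNN literature, not confirmed by the source.

```latex
\begin{align}
F_{ij}[n] &= I_{ij} \tag{1}\\
F_{ij}[n] &= e^{-\alpha_F} F_{ij}[n-1] + V_F \sum_{k,l} M_{ijkl}\, Y_{kl}[n-1] + I_{ij} \tag{2}\\
L_{ij}[n] &= e^{-\alpha_L} L_{ij}[n-1] + V_L \sum_{k,l} W_{ijkl}\, Y_{kl}[n-1] \tag{3}\\
U_{ij}[n] &= F_{ij}[n]\,\bigl(1 + \beta\, L_{ij}[n]\bigr) \tag{4}\\
Y_{ij}[n] &= \begin{cases} 1, & U_{ij}[n] > \theta_{ij}[n-1] \\ 0, & \text{otherwise} \end{cases} \tag{5}\\
\theta_{ij}[n] &= e^{-\alpha_\theta}\, \theta_{ij}[n-1] + V_\theta\, Y_{ij}[n] \tag{6}\\
T_{ij}[n] &= T_{ij}[n-1] + Y_{ij}[n] \tag{7}
\end{align}
```

In this reading, Equation (1) is the simplified feeding input, Equation (2) is the original feeding input, and $T_{ij}$ counts how many times the neuron at $(i, j)$ has fired after $n$ iterations.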
According to Equation (4), the parameter $\beta$ strongly influences the internal activation of the neuron. The PCNN neuron has been improved by making $\beta$ an adaptive local value instead of a global value. As shown in Equation (8), a sigmoid function is used to normalize, between 0 and 1, the gradient magnitude $G$ of the 3 × 3 local region of the source images.
The linking strength is therefore dynamically adjusted according to the gradient magnitude. This modification allows the PCNN model to better preserve image details in the final image.
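The adaptive linking strength can be illustrated with a short sketch of a simplified PCNN. It assumes one plausible reading of Equation (8), $\beta_{ij} = 1 / (1 + e^{-G_{ij}})$, with $G$ approximated by central differences. The decay constants, magnitudes, and initial threshold are hypothetical values chosen for illustration; only the weight matrix `W` and the 100 iterations come from the experimental setup. Taking the absolute value of the stimulus is also an assumption, made because transform coefficients can be negative.

```python
import numpy as np

# Linking weights from the experimental setup (3x3 neighborhood).
W = np.array([[0.707, 1.0, 0.707],
              [1.0,   0.0, 1.0],
              [0.707, 1.0, 0.707]])

def neighbor_sum(y, kernel):
    """Weighted sum of the 3x3 neighborhood of every pixel (zero padding)."""
    padded = np.pad(y, 1, mode="constant")
    h, w = y.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def adaptive_beta(img):
    """Per-pixel linking strength: sigmoid of the local gradient magnitude (assumed form)."""
    gy, gx = np.gradient(img.astype(float))
    return 1.0 / (1.0 + np.exp(-np.hypot(gx, gy)))

def pcnn_firing_times(stimulus, iterations=100, alpha_l=1.0,
                      alpha_theta=0.2, v_l=1.0, v_theta=20.0):
    """Run a simplified PCNN and return how many times each neuron fired."""
    s = np.abs(stimulus).astype(float)      # feeding input F = |I| (assumption)
    beta = adaptive_beta(s)
    link = np.zeros_like(s)
    theta = np.ones_like(s)                 # hypothetical initial threshold
    y = np.zeros_like(s)
    t = np.zeros_like(s)
    for _ in range(iterations):
        link = np.exp(-alpha_l) * link + v_l * neighbor_sum(y, W)
        u = s * (1.0 + beta * link)         # internal activation
        y = (u > theta).astype(float)       # pulse output
        theta = np.exp(-alpha_theta) * theta + v_theta * y
        t += y                              # accumulated firing times
    return t

if __name__ == "__main__":
    step_edge = np.zeros((6, 6))
    step_edge[:, 3:] = 4.0
    print(pcnn_firing_times(step_edge, iterations=30))
```

Because the sigmoid is applied to a non-negative gradient magnitude, this particular form yields values between 0.5 and 1, with larger linking strength near edges.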

Detailed Algorithm
The schematic diagram of the proposed data augmentation method is shown in Figure 1. The steps of the proposed algorithm are summarized in Algorithm 1.

Input: Dermoscopic images in dataset A and unaffected darker-tone images in dataset B
Output: Fused image
Step 1: Image decomposition with NSST
Randomly select source images and decompose each source image into five levels with NSST to obtain $(A_{Low}, A^{k,l}_{High})$ and $(B_{Low}, B^{k,l}_{High})$. $A_{Low}$ and $B_{Low}$ are the low-frequency coefficients of A and B. $A^{k,l}_{High}$ and $B^{k,l}_{High}$ represent the $l$-th high-frequency sub-band coefficients in the $k$-th decomposition layer of A and B.
Step 2: Fusion strategy
The low-frequency sub-band contains the texture structure and background of the source images. The fused low-frequency coefficients are obtained as follows:
$$L(x, y) = a\,A_{Low}(x, y) + b\,B_{Low}(x, y) \qquad (9)$$
where $a$ and $b$ denote weighting coefficients. High-frequency sub-bands contain information about details in the images. The fused high-frequency sub-bands are obtained by applying the following operation to each pixel of each high-frequency sub-band:
$$F^{k,l}_{ij} = \begin{cases} A^{k,l}_{High}(i, j), & T_{ij,A} > T_{ij,B} \\ B^{k,l}_{High}(i, j), & \text{otherwise} \end{cases} \qquad (10)$$
where $F$ denotes the fused sub-band coefficients. If $T_{ij,A}$ is larger than $T_{ij,B}$, then the pixel located at $(i, j)$ in the sub-image from A has more remarkable characteristics than the corresponding pixel in the sub-image from B, and the former is chosen as the pixel in the fused sub-band. Otherwise, the latter is selected.
Step 3: Image reconstruction with inverse NSST
Apply the inverse NSST to the fused low-frequency and high-frequency coefficients to obtain the fused image.
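The fusion rules of Step 2 can be sketched as follows, assuming the sub-band coefficients and the PCNN excitation-time maps are already available as arrays of the same shape. The function names `fuse_low` and `fuse_high` and the default weights of 0.5 are hypothetical; the source does not state the values of $a$ and $b$.

```python
import numpy as np

def fuse_low(a_low, b_low, a=0.5, b=0.5):
    """Weighted sum of the low-frequency coefficients (weights are assumed values)."""
    return a * np.asarray(a_low, dtype=float) + b * np.asarray(b_low, dtype=float)

def fuse_high(a_high, b_high, t_a, t_b):
    """Per pixel, keep the coefficient whose neuron has the larger excitation time.

    Ties go to source B, following the "otherwise" branch of the selection rule.
    """
    return np.where(np.asarray(t_a) > np.asarray(t_b), a_high, b_high)

if __name__ == "__main__":
    a_high = np.array([[1.0, 2.0], [3.0, 4.0]])
    b_high = np.array([[10.0, 20.0], [30.0, 40.0]])
    t_a = np.array([[5, 1], [2, 2]])
    t_b = np.array([[3, 4], [2, 1]])
    print(fuse_high(a_high, b_high, t_a, t_b))
    print(fuse_low([[2.0, 4.0]], [[4.0, 8.0]]))
```

In a full pipeline, `fuse_high` would be applied to every directional sub-band at every decomposition level before the inverse transform.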

Dataset
The dataset contained two image sources: source A and source B. Source A images were dermoscopic images of melanomas and nevi obtained by combining the ISIC2019 and ISIC2020 Challenge datasets [42][43][44]. Source B images were RGB images of darker skin tones. Some of the source images used in the experiments are shown in Figure 4. To quantify the skin tone categories present in the source A dataset, skin images were segmented to extract non-diseased regions, and the individual typology angle (ITA) metric [45] was used to characterize the skin tone of that region. ITA is an objective classification tool computed from images. The pixels from the non-affected part are converted to CIELab space to obtain the luminance L of each pixel and b, the amount of yellow in each pixel. The mean ITA value is expressed in degrees and is calculated using Equation (11) [45]. As presented in Table 2, ITA values classify skin tones into six categories: very light, light, intermediate, tanned, brown, and dark. The authors of [46] also reveal the imbalance of skin types in the Fitzpatrick 17k dataset and in datasets in general.
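As an illustration of the ITA computation, the sketch below uses the commonly cited definition $ITA = \arctan\bigl((L - 50)/b\bigr) \times 180/\pi$ and the widely used category thresholds (55°, 41°, 28°, 10°, and −30°). Both are assumed to correspond to Equation (11) and Table 2, which are not reproduced in this text. The function names are hypothetical, and the inputs are taken to be the mean L and b values of the non-diseased region.

```python
import math

def ita_degrees(l_star, b_star):
    """Individual typology angle in degrees from CIELab L* and b* values.

    atan2 is used to avoid division by zero; it matches arctan((L - 50) / b)
    whenever b is positive, which is the usual case for skin.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))

def skin_tone_category(ita):
    """Map an ITA value to one of six categories (commonly used thresholds, assumed)."""
    if ita > 55:
        return "very light"
    if ita > 41:
        return "light"
    if ita > 28:
        return "intermediate"
    if ita > 10:
        return "tanned"
    if ita > -30:
        return "brown"
    return "dark"

if __name__ == "__main__":
    for l_mean, b_mean in [(70.0, 20.0), (50.0, 15.0), (30.0, 5.0)]:
        angle = ita_degrees(l_mean, b_mean)
        print(round(angle, 2), skin_tone_category(angle))
```

Higher ITA values correspond to lighter skin, so images in the tanned, brown, and dark categories are those with the lowest angles.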

Experimental Setup
The proposed method was compared to other fusion approaches to assess its efficiency. The comparative study covered the color transfer (CT) method [47], the wavelet and sparse representation-based method (DWT_SR) [37], the wavelet and color transfer-based method (DWT_CT), the sparse representation and sum-modified Laplacian method in the NSCT domain (NSCT_SR_SML) [30], and the proposed method (NSST_PCNN).
Five objective image evaluation metrics were adopted for quantitative evaluation: the gradient-based fusion performance $Q_G$ [48], which evaluates the amount of edge information transferred from the source images to the fused image; $Q_S$ [34], $Q_C$ [48], and $Q_Y$ [38], which evaluate similarities between the saliency maps and structural information of the fused image and the source images; and the Chen-Blum metric $Q_{CB}$ [38], a fusion metric inspired by human perception that evaluates the human visualization performance of fused images.
Additionally, the augmented dataset and the real dataset were used separately to train two models inspired by a Gabor-based convolutional neural network [49]. A comparative study was performed on the accuracy, precision, recall, and F1 score of each model for the different skin tones to assess the effectiveness of the proposed data augmentation method on skin lesion classification for underrepresented skin tones.
Experiments were conducted in MATLAB R2020b on an Apple M1 chip with eight cores and 16 GB of memory. A five-level NSST decomposition was performed on the source images. For PCNN, the number of iterations was set to 100 and the weight matrix $W$ was set as
$$W = \begin{bmatrix} 0.707 & 1 & 0.707 \\ 1 & 0 & 1 \\ 0.707 & 1 & 0.707 \end{bmatrix}.$$

Visual and Qualitative Evaluation
A visual quality comparison of three fused images using five different methods is displayed in Figure 5. The result obtained by NSCT_SR_SML was unnatural. Compared with DWT_SR, the proposed method combined the input images effectively and distinctly preserved dermoscopic structures such as pigment networks, amorphous structureless areas (blotches), and dots and globules, as seen by comparing the dermoscopic images in Figure 5(a1)-(a3) with the generated images in Figure 5(f1)-(f3). Compared to CT and DWT_CT, the proposed method achieved better performance on skin tone preservation. Visually, the pigmentation of the lesions in Figure 5(a1)-(a3) differs from that of the fused images in Figure 5(f1)-(f3) but, according to [50], the result is realistic, as skin lesions on dark skin are characterized by central hyperpigmentation and a dark brown peripheral network.
Although the visual evaluation results show that the dermoscopic structures were preserved, it should be noted that visual evaluation is a subjective method. Table 3 lists the results of the five objective evaluation metrics applied to the different methods. The $Q_G$, $Q_S$, $Q_C$, and $Q_Y$ metrics assess the amount of transferred edge information and the similarities between the fused and source images; the highest possible value of these metrics is 1. Compared to the other methods, the proposed method achieved the highest average values for the $Q_G$, $Q_S$, and $Q_{CB}$ metrics. It was followed by the DWT_CT and DWT_SR methods, which showed better results for the $Q_C$ and $Q_Y$ metrics, respectively. These results indicate that the images obtained by the proposed method better preserved details and similarity with the source images. Furthermore, the values of the human visualization performance metric $Q_{CB}$ support the pigmentation difference observed in the visual evaluation. The fused images were not just duplicates of the skin lesion images.

Figure 5(a1)-(a3) shows dermoscopic features in very light and light skin tones. Images obtained by CT, DWT_SR, DWT_CT, NSCT_SR_SML, and the proposed method NSST_PCNN, respectively, are displayed in Figure 5(b1)-(f3).
Figure 5. (a1-a3) Dermoscopic source images [42][43][44]. (b1-f3) Results obtained using, respectively, CT, DWT_SR, DWT_CT, NSCT_SR_SML, and the proposed method.
Table 4 shows the average time taken by each algorithm to generate an image. The results indicate that the proposed method had a longer execution time than most methods. This high value can be explained by the fact that the linking strength is adaptive, which increases the computation time.
First, a convolutional neural network inspired by the model proposed in [49] was trained with 80% of the dataset and tested with 20% of the dataset. The accuracy, precision, recall, and F1-score of Model 1 are reported in Table 5. Second, the training set was reinforced with the augmented images, and the resulting model was tested with the same 20% of the dataset. To verify the generalization of the neural network, the models were tested with only real images. The accuracy, precision, recall, and F1-score of Model 2 are reported in Table 6. Figure 6 shows the accuracy, precision, recall, and F1-score of the models based on these experiments. Compared to Model 1, the accuracy of Model 2 increased by 2.22%, 0.73%, 1.67%, 15.69%, and 15.38% for the very light, light, intermediate, tanned, and brown categories, respectively. The model precision also increased by 1.36%, 0.68%, 4.00%, 13.04%, and 16.67%, respectively, on the different categories. Similarly, for recall and F1-score, the results are more significant, with a larger increase observed in medium-brown and dark-brown tones.
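For reference, the four reported metrics can be computed from the confusion counts of a binary classifier as in the sketch below. The function name and the example counts are hypothetical and are not taken from Tables 5 or 6; they only illustrate the standard definitions, which would be applied separately to each skin tone category.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

if __name__ == "__main__":
    # Made-up counts for one skin tone category, for illustration only.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

Reporting the metrics per category, rather than over the pooled test set, is what makes the gains on underrepresented tones visible.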
As with the GAN-based models proposed by Zhiwei Qin et al. in [25] and Alceu Bissoto et al. in [24], data augmentation improved the model, with the particularity that the proposed data augmentation also corrected skin tone imbalance. Data augmentation therefore favored the reinforcement and generalization of the classifier: introducing new images that are inspired by real images but also contain other features helped the classifier generalize.

Conclusions and Future Works
In this paper, a data augmentation method based on multiscale image decomposition and a PCNN fusion strategy is proposed as a solution to alleviate the lack of labeled dermoscopic data in all existing skin tones. Compared to the existing methods, the proposed method produced images with more informative dermoscopic structure and detail. In particular, the proposed method has the advantage of being suitable for dermoscopic images. Experimental results also indicate that this method improved the performance and accuracy of a convolutional neural network-based skin lesion classifier, even for under-represented skin tones. To conclude, this work is innovative because the proposed method is simple, does not require training, and effectively augments dermoscopic images.
A limitation of this method is the computation time. The high value can be explained by the fact that the linking strength is adaptive, which increases the computation time. Therefore, reducing the processing time of the algorithm is an improvement that can be made in the future. Future work will also focus on applying the proposed algorithm in other areas of medical imaging to test and improve the efficiency and generalization of the algorithm. Finally, further development of these results should also focus on increasing the skin lesion dataset and strengthening skin lesion classifiers on the darker tone categories.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://challenge2020.isic-archive.com (accessed on 9 February 2022), https://challenge2019.isic-archive.com/data.htm (accessed on 9 February 2022).

Conflicts of Interest:
The authors declare no conflict of interest.