Improved U-Net: Fully Convolutional Network Model for Skin-Lesion Segmentation

Abstract: The early and accurate diagnosis of skin cancer is crucial for providing patients with advanced treatment by focusing medical personnel on specific parts of the skin. Networks based on encoder–decoder architectures have been effectively implemented for numerous computer-vision applications. U-Net, one of the CNN architectures based on the encoder–decoder network, has achieved successful performance for skin-lesion segmentation. However, this network has several drawbacks caused by its upsampling method and activation function. In this paper, a fully convolutional network is proposed as a modified U-Net, in which bilinear interpolation is used for upsampling, followed by a block of convolution layers with parametric rectified linear-unit (PReLU) non-linearity. To avoid overfitting, dropout is applied after each convolution block. The results demonstrate that the proposed technique achieves state-of-the-art performance for skin-lesion segmentation, with 94% pixel accuracy and an 88% dice coefficient.


Introduction
Melanoma is one of the most serious skin ailments worldwide, accounting for 287,723 new cases and over 60,000 estimated fatalities in 2018 [1]. Skin cancer is one of the most significant public health issues, with over 2000 new diagnoses in South Korea in the past 5 years alone [2]. New melanoma arising on the skin surface can be easily identified via visual inspection. Unfortunately, most melanomas are not noticed in a timely manner by sufferers [3]. Moreover, visual inspection by professional dermatologists provides a diagnostic accuracy of ~60%, which means many potentially treatable melanomas are not detected until they are quite advanced [4]. The accurate and early segmentation of skin lesions is critical to the detection and localization of visual dermoscopic features of the skin and the classification of skin-based diseases. Dermoscopy is an imaging technique that decreases the surface reflection of skin, allowing deeper layers to be visually inspected. It is used to improve diagnostic performance and minimize melanoma deaths. Figure 1 displays a few dermoscopic images of melanoma skin lesions. The deep-learning community has applied various techniques to boost traditional computer-vision tasks by utilizing neural networks. In particular, convolutional neural networks (CNNs) have revolutionized image classification, scene identification, target recognition, and other capabilities because of their ability to build internal representations of images. CNN-based approaches are much better than other technologies at capturing location and scale in various forms. CNNs have led to vast improvements in recognition tasks. Apart from enhancing image classification [5][6][7], they have also gained ground in local tasks requiring structured outputs. Such advancements have been made in object detection [8][9][10], part and key-point prediction [11,12], and local correspondence [11,13].
CNNs are used for current semantic segmentation techniques [14][15][16][17][18][19], in which every pixel is classified according to its nearby object or region. Indeed, this is essential when dealing with medical images. Consistent image segmentation is a very important task, and successfully enabling diagnostic capability is the main objective of medical image segmentation, which is principally a pixel-level classification problem.
Fully convolutional networks (FCN) are one of the successful methods among the initially proposed neural networks for image segmentation [5]. By applying FCNs to CNN technology, feature to feature mapping can be achieved without first obtaining spatial information [20][21][22][23][24]. The FCN architecture is a broadened type of CNN and contains only convolutional and pooling layers, which provide the ability to make predictions about inputs. These are generally applied to local tasks instead of global ones [25,26]. A variety of FCN-based methods have been suggested recently to mitigate this problem. For example, the author of [27] suggested a multi-scale CNN that contained sub-networks having different resolution results to gradually expand coarse estimation.
In order to reconstruct accurate forms of target borderlines, sort pixel-wise class labels and calculate segmentation masks, the straightforward deconvolutional step has been replaced with a deep up-convolution network in [28]. Several studies tried to achieve better segmentation accuracy by using spatial information. U-Net has achieved great performance through skip connections by joining features of low-level layers and high-level layers [29].
The foremost weakness of the U-Net architecture is that training may slow down in the middle layers of deeper networks, so there is a risk that these layers are effectively ignored. The main reason for this phenomenon is that gradients weaken with distance from the output layer of the network, where the training loss is computed. Additionally, the rectified linear unit (ReLU) activation function was used in the original U-Net paper. The dead neuron problem of the ReLU was identified by Lu et al. [30]. To lessen the effect of this problem, we replaced the ReLU activation function with the PReLU nonlinearity when training the network.
The foremost responsibility of activation functions is to convert an input signal of a layer in the neural network to an output signal. A rectified linear unit (ReLU) is the key activation function for this activity in deep-neural networks [31]. It expedites the convergence of the training procedure and produces better results compared with standard sigmoid-like activation functions. Apart from the numerous benefits of a ReLU rectifier, recent studies have shown that there are shortcomings [30,32].
The main downside of U-Net is that training can decelerate in the intermediate layers of deeper networks. Thus, there is a risk of ignoring the layers in which abstract features are represented. This is caused by gradients weakening farther away from the last layer of the network, where the difference between predicted and actual values is calculated, which slows the updates of far-removed weights. Another limitation is that the classic deconvolutional method of generating images, its achievements notwithstanding, can introduce artifacts into the generated images. In our approach, we apply an interpolation method for upsampling to overcome these problems. We achieve adequate accuracy for skin-lesion segmentation by combining bilinear interpolation with a parametric ReLU (PReLU) nonlinearity, which matches the chosen interpolation method well.
Deep neural networks include multiple nonlinear hidden layers, making them expressive models that can learn very complex relations between inputs and outputs. With limited training data, many of these complex relationships will reflect noise, which leads to overfitting. Several approaches have been applied to avoid this. Srivastava et al. proposed the dropout method to mitigate this problem. The technique randomly drops units from the neural network during training [33], which prevents excessive co-adaptation. More technically, individual nodes are either dropped from the net with probability 1 − p or kept with probability p. The reduced network is then trained while the inputs and outputs of the dropped-out nodes are removed, as illustrated in Figure 2. The main purpose of this paper is to achieve adequate accuracy for skin-lesion segmentation by combining bilinear interpolation with a parametric ReLU (PReLU) nonlinearity, which matches the chosen interpolation method well. To avoid overfitting and speed up training, we apply dropout after the convolutional layers.
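The dropout step described above can be illustrated with a minimal sketch in plain Python. It assumes the "inverted" variant, in which kept activations are rescaled by 1/p during training so that nothing needs to change at test time; the function name `dropout` and the parameter `keep_prob` are hypothetical and are not taken from the authors' implementation.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout (assumed variant): keep each unit with probability keep_prob."""
    if not training:
        # At test time every unit is kept and no rescaling is needed.
        return list(activations)
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # rescale so the expected value is unchanged
        else:
            out.append(0.0)            # unit dropped for this training step
    return out

rng = random.Random(0)                 # fixed seed only for reproducibility
layer = [1.0] * 10
dropped = dropout(layer, keep_prob=0.8, rng=rng)
```

Each entry of `dropped` is either zero or the original activation divided by the keep probability, so the layer's expected output stays the same across training and inference.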

Overview
We propose a novel neural-network architecture that can overcome uneven overlapping and fading gradients in the intermediate layers.
The key advantage of this recommended architecture over the classical U-Net framework lies in its upsampling stage. The attained segmentation accuracy is comparable to or slightly better than that of the standard U-Net architecture, and segmentation artifacts are reduced. Our modified architecture is also more computationally efficient than the standard transposed-convolution U-Net. The proposed method was trained on the following system configuration: an Intel(R) Core(TM) i7-9700K processor, 32 GB of installed memory (RAM), and an 8 GB NVIDIA GeForce RTX 2060 SUPER graphics card.

Dataset
We use the dataset reported in [34], which contains both training and testing data. The training dataset includes 2594 dermoscopic skin-lesion images and corresponding ground-truth response masks. The testing dataset consists of 1000 images. Originally, the images in both the training and testing sets are of different sizes. First, images should be equal in size and grayscale. We thus resized all the input images to 256 × 256 and converted them to binary prior to training. Second, normalization must be applied before training. For image normalization, the matrices representing the training and testing images were divided by 255 to place them in the range [0, 1]. We then split the training dataset into training and validation subsets at 90% and 10%, respectively.
Figure 3 provides a general flowchart of our image segmentation model, through which the input images pass. The initial training step is executed automatically by default. The binary cross-entropy loss is computed by comparing the ground truth with the resulting segmentation. This procedure is repeated until the conditions of the callback function are satisfied during training. At each iteration, the weights and biases are updated via backpropagation. When training is complete, the model checkpoints save the final parameter values. These weights and biases are then used for testing on unseen images.
In Figure 4, the model architecture consists of two paths: encoder (left side) and decoder (right side). The encoder path operates like a classic CNN. It includes repeated 3 × 3 convolutions with batch normalization, each followed by a PReLU activation function. Every down-sampling step has two convolution layers. To avoid overfitting, dropout is applied before the max-pooling operation. The number of channels is doubled at each down-sampling step, while the size of the feature map is halved. The most important path in CNN-based image segmentation is that of the decoder.
At each step of the decoder path, a 2 × 2 bilinear interpolation is applied to upsample the feature map. To restore the spatial information, the output of the upsampling step is merged with the correspondingly cropped feature map from the encoder, followed again by same-size convolution layers, batch normalization, and PReLU nonlinearity. A 1 × 1 convolution is used in the final layer to map each feature vector to the output, followed by a sigmoid activation function that produces the binary image. In the next sub-sections, we discuss the techniques used in the network architecture.
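As a rough illustration of the preprocessing and loss described in this section, the following sketch normalizes 8-bit pixel values by 255 and evaluates a pixel-averaged binary cross-entropy. The helper names and the clipping constant `eps` are assumptions made for the example; the paper does not specify them.

```python
import math

def normalize(pixels):
    """Scale 8-bit intensities into the range [0, 1] by dividing by 255."""
    return [v / 255.0 for v in pixels]

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over pixels; eps (assumed) avoids log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip the prediction away from 0 and 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

mask = [1, 0, 1, 1]                # flattened ground-truth mask
probs = [0.9, 0.2, 0.7, 0.6]       # flattened sigmoid outputs
loss = binary_cross_entropy(mask, probs)
```

A prediction of 0.5 for every pixel yields a loss of ln 2, which is a convenient reference point when judging whether training has started to make progress.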

Upsampling Method
U-Net normally includes two sections: downsampling and upsampling. The authors of [29] attained accurate pixel-level localization by concatenating the feature maps of the downsampling and upsampling layers. Upsampling is the most important part of the U-Net architecture, because the spatial information of the feature map is generated here. Transpose convolution (or deconvolution), which has learnable parameters, is used in the upsampling section. Because its kernels require additional weights to be trained, this method is very time-consuming. Moreover, it can lead to uneven overlap [35], as depicted by the checkerboard pattern of Figure 5, which causes artifacts at a variety of scales.
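The uneven overlap mentioned above can be made concrete with a one-dimensional sketch that counts how many kernel placements of a transpose convolution contribute to each output position. The function name and the example sizes are hypothetical choices for illustration.

```python
def overlap_counts(n_inputs, kernel, stride):
    """Count how many kernel placements touch each output position (1D)."""
    length = (n_inputs - 1) * stride + kernel
    counts = [0] * length
    for i in range(n_inputs):          # each input pixel stamps one kernel copy
        for j in range(kernel):
            counts[i * stride + j] += 1
    return counts

uneven = overlap_counts(n_inputs=4, kernel=3, stride=2)
even = overlap_counts(n_inputs=4, kernel=4, stride=2)
```

With a kernel size of 3 and a stride of 2, the interior counts alternate between 1 and 2, which is the one-dimensional analogue of the checkerboard pattern. When the kernel size is divisible by the stride (here 4 and 2), the interior counts are uniform.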
Equation (1) represents a bilinear function of the coordinates $(x, y)$:
$$\hat{f}(x, y) = a_{00} + a_{10}\,x + a_{01}\,y + a_{11}\,xy \qquad (1)$$
Here, we refer to $f$ as the intensity value at the given $(x_1, y_1)$, $(x_1, y_2)$, $(x_2, y_1)$, and $(x_2, y_2)$ pixel locations before applying interpolation and to $\hat{f}$ as the intensity value of the interpolated image matrix at the $(x, y)$ coordinates, and the $a$s are the bilinear weights. The bilinear weights, $a_{00}$, $a_{10}$, $a_{01}$, and $a_{11}$, are computed by solving the matrix equation in Equation (2):
$$\begin{bmatrix} 1 & x_1 & y_1 & x_1 y_1 \\ 1 & x_1 & y_2 & x_1 y_2 \\ 1 & x_2 & y_1 & x_2 y_1 \\ 1 & x_2 & y_2 & x_2 y_2 \end{bmatrix} \begin{bmatrix} a_{00} \\ a_{10} \\ a_{01} \\ a_{11} \end{bmatrix} = \begin{bmatrix} f(x_1, y_1) \\ f(x_1, y_2) \\ f(x_2, y_1) \\ f(x_2, y_2) \end{bmatrix} \qquad (2)$$
As a result, $\hat{f}(x, y)$ is defined as a linear combination of the gray levels of its four nearest neighbors. The linear combination defined by Equation (1) is, in fact, the value assigned to $\hat{f}(x, y)$ when the perfect least-squares planar fit is made to these four neighbors. This process of optimal averaging produces smoother results. In Figure 6, we visually describe bilinear interpolation. In this example, we increase the size of the 2 × 2 matrix into a 4 × 4 one. According to Equation (2), the weight values of specific elements in the up-sampled matrix depend on how near each element is to the known elements.
Transpose convolution has learnable parameters for training. The number of parameters exceeds 10 million in deep neural networks such as U-Net because of the trainable weights and biases in transpose convolution. Consequently, this contributes to the vanishing of gradients in the middle layers of U-Net models. We reduce this detrimental impact by applying bilinear interpolation in the decoder part.
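To illustrate the 2 × 2 to 4 × 4 example, the sketch below fits the four bilinear coefficients on a unit cell and evaluates them on a finer grid. It assumes a corner-aligned sampling grid (the output corners coincide with the input corners); deep-learning frameworks may use a different pixel-center convention, so the exact values in Figure 6 could differ. All function names are hypothetical.

```python
def bilinear_weights(f00, f10, f01, f11):
    """Coefficients of a00 + a10*x + a01*y + a11*x*y on the unit square.

    f10 is the value at (x=1, y=0) and f01 the value at (x=0, y=1).
    """
    a00 = f00
    a10 = f10 - f00
    a01 = f01 - f00
    a11 = f00 - f10 - f01 + f11
    return a00, a10, a01, a11

def upsample_bilinear(img, out_h, out_w):
    """Corner-aligned bilinear upsampling; needs at least 2 x 2 input and output."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1)   # position in input coordinates
        y0 = min(int(y), in_h - 2)         # top row of the enclosing cell
        ty = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1)
            x0 = min(int(x), in_w - 2)     # left column of the enclosing cell
            tx = x - x0
            a00, a10, a01, a11 = bilinear_weights(
                img[y0][x0], img[y0][x0 + 1],
                img[y0 + 1][x0], img[y0 + 1][x0 + 1])
            row.append(a00 + a10 * tx + a01 * ty + a11 * tx * ty)
        out.append(row)
    return out

small = [[0.0, 3.0],
         [6.0, 9.0]]
large = upsample_bilinear(small, 4, 4)
```

In this example the four corner values of `large` equal the original entries of `small`, and every interior value is a distance-weighted blend of the four known neighbors, with no trainable parameters involved.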
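The parameter cost of learned upsampling can be estimated with simple arithmetic. The sketch below counts the weights and biases of transpose-convolution layers under assumed settings: a 2 × 2 kernel and decoder channel widths of 256, 128, 64, 32, and 16. These values are illustrative assumptions rather than the configuration of the original U-Net, and the count covers only the upsampling layers, not the whole network.

```python
def transposed_conv_params(c_in, c_out, kernel=2):
    """Weights plus biases of one 2D transpose convolution (assumed square kernel)."""
    return kernel * kernel * c_in * c_out + c_out

# Hypothetical decoder stages: (input channels, output channels).
decoder_channels = [(256, 128), (128, 64), (64, 32), (32, 16)]

upsampling_params = sum(transposed_conv_params(i, o) for i, o in decoder_channels)
bilinear_params = 0  # bilinear interpolation has no trainable parameters
```

Under these assumptions the learned upsampling layers alone carry 174,320 trainable values, whereas bilinear interpolation adds none.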

ReLU
Activation functions are key to artificial neural networks discovering and making sense of complex and nonlinear relationships between input and output, because they bring nonlinearity to the network. Their objective is to convert the input signal of a layer of the neural network into an output signal. The ReLU is the most common activation function in deep networks. It expedites the convergence of the training procedure and provides better results compared with standard sigmoid-like activation functions [36][37][38]. The function and derivative of this nonlinearity are represented as follows:
$$f(x) = \max(0, x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \qquad f'(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$
Geometrically, the ReLU function and its derivative are illustrated in Figure 7. The ReLU is linear for all positive inputs and zero otherwise. The derivative, an important factor for updating gradients, is zero for negative inputs and one for positive inputs. Apart from the numerous benefits of the ReLU rectifier, recent studies show that there are some shortcomings [39,40]. Some units can be fragile during training and can "die" when a weight update causes them never to activate on any data point again. Thus, the ReLU can result in "dead neurons". For activations in the region (x < 0) of the ReLU, the gradient will be zero, and the weights will not be adjusted during gradient descent. Thus, the neurons that go into that state will stop responding to variations in the input. Simply because the gradient is zero, nothing will change. This is called the "dying ReLU" problem [30,32].
Systems using these units sometimes manifest the dying ReLU issue during training, in which the state (x < 0) is very likely for most training examples at some nodes within the network. The error backpropagation algorithm relies on the derivative f'(x), which for the ReLU nonlinearity is zero for x < 0 (see Figure 7). Thus, patterns that cause x < 0 do not change the unit's parameters. This condition can cause problems in practice, because units that are not active will probably not be trained. The dying ReLU problem is likely to occur when:


- The learning rate is too high.
- There is a large negative bias.
Consider the following equation used to update the weight values during backpropagation:
$$w_{\text{new}} = w_{\text{old}} - \alpha \frac{\partial L}{\partial w}$$
Here, α is the learning rate and ∂L/∂w is the derivative of the loss function with respect to the weight value (w); b denotes the bias of the neuron. Thus, if the learning rate is too high or the bias is negative, the backpropagation step may produce negative new weight values. After the weight becomes negative, the ReLU activation function of that neuron will never be activated, which leads to the death of the neuron.
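The effect of a large negative bias can be reproduced with a single-neuron sketch. It assumes a squared-error loss, scalar inputs in [0, 1], and plain gradient descent; these choices, like the function names, are illustrative assumptions rather than the training setup of the paper.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def sgd_step(w, b, x, target, lr):
    """One gradient-descent step for out = relu(w*x + b) with squared-error loss."""
    z = w * x + b
    out = relu(z)
    dloss_dout = 2.0 * (out - target)
    dz = dloss_dout * relu_grad(z)     # zero whenever the unit is inactive
    return w - lr * dz * x, b - lr * dz

inputs = [0.0, 0.25, 0.5, 0.75, 1.0]

# Neuron with a large negative bias: pre-activation is negative for every input.
w_dead, b_dead = 0.5, -10.0
for x in inputs:
    w_dead, b_dead = sgd_step(w_dead, b_dead, x, target=1.0, lr=0.1)

# Neuron with zero bias: it is active for positive inputs and keeps learning.
w_live, b_live = 0.5, 0.0
for x in inputs:
    w_live, b_live = sgd_step(w_live, b_live, x, target=1.0, lr=0.1)
```

The first neuron ends the loop with exactly its initial weight and bias, because every gradient was multiplied by a zero derivative, while the second neuron's parameters move toward the target.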

PReLU
To overcome the downsides of ReLU nonlinearity, the PReLU has been introduced. This is a generalization of the ReLU in which the activation function adaptively learns the parameters of the rectifiers and increases accuracy at little additional computational cost. The PReLU attempts to fix the dying ReLU problem by providing a slight negative slope when the input values are negative. This activation function allows the neurons to learn the most suitable slope for the negative region. The formal definition of the PReLU and its derivative are given below:
$$f(y_i) = \begin{cases} y_i, & y_i > 0 \\ a_i y_i, & y_i \le 0 \end{cases} \qquad (5)$$
$$\frac{\partial f(y_i)}{\partial y_i} = \begin{cases} 1, & y_i > 0 \\ a_i, & y_i \le 0 \end{cases}$$
In Equation (5), $y_i$ is the input of the activation function, $f$, in the $i$th node, and $a_i$ is a coefficient controlling the slope of the negative section, which allows the nonlinearity to differ at each node. When $a_i = 0$, it becomes a ReLU. If $a_i$ is small and fixed, the PReLU becomes a "leaky" ReLU, which has an insignificant impact on precision. If $a_i$ is a learnable parameter of the network, the equation becomes a PReLU. Figure 8 shows the PReLU activation function. In the PReLU, the alpha parameter (i.e., the "leak") is included to prevent the gradient from becoming zero. This makes the gradient more robust for optimization, because the weights are still adapted for those neurons that would be inactive with the ReLU. The PReLU has non-constant coefficients in its negative part, and these are adaptively learned by the model. Taking all benefits of the PReLU into consideration, we use this activation function for our proposed neural network.
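A scalar sketch of the PReLU makes the role of the learnable slope explicit. The gradient with respect to the slope follows the standard PReLU formulation (zero for positive inputs, the input itself otherwise); the function names and the example slope of 0.25 are assumptions for illustration.

```python
def prelu(y, a):
    """PReLU output: identity for positive inputs, slope a for the rest."""
    return y if y > 0 else a * y

def prelu_grad_input(y, a):
    """Derivative with respect to the input y."""
    return 1.0 if y > 0 else a

def prelu_grad_slope(y):
    """Derivative with respect to the learnable slope a."""
    return 0.0 if y > 0 else y

a = 0.25                               # assumed initial slope
negative_out = prelu(-2.0, a)          # small negative output instead of zero
negative_grad = prelu_grad_input(-2.0, a)
slope_grad = prelu_grad_slope(-2.0)
```

Because the derivative for negative inputs equals the slope rather than zero, a unit that is inactive under the ReLU still receives a gradient and can recover during training. Setting the slope to zero reduces the function to the ReLU.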

Training Network
Convolutions of 3 × 3 were applied in each convolution layer, followed by the PReLU and max pooling with a size of 2 × 2 and a stride of 2. Sixteen filters were used for the first hidden layer. This number was doubled for each consecutive hidden layer until the end of the encoder path. This caused an increase in the number of channels and a decrease in the size of the feature map. The size of the feature map in the bottleneck (the part of the network between the encoder and decoder paths) was 16 × 16 with 256 channels. The decoder path, built on 2 × 2 interpolation, enlarged the size of the feature map while decreasing the number of channels. After each upsampling process, the convolution layer and the PReLU nonlinearity were applied again. The predicted output size was the same as the input size: 256 × 256. Overall, the network contained more than 7 million trainable parameters. Batch normalization was used to normalize layer inputs by adjusting and scaling the activations. Additionally, this permitted each layer of the network to learn more independently of the other layers.
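The feature-map sizes stated above can be checked with a short trace of the encoder. The sketch assumes four 2 × 2 pooling steps, a number inferred from the 256 × 256 input and the 16 × 16 bottleneck rather than stated explicitly in the text; the function name is hypothetical.

```python
def encoder_shapes(input_size=256, first_filters=16, num_pools=4):
    """Return (height, width, channels) at each encoder level and the bottleneck."""
    shapes = []
    size, channels = input_size, first_filters
    for _ in range(num_pools):
        shapes.append((size, size, channels))
        size //= 2        # 2 x 2 max pooling with stride 2 halves the map
        channels *= 2     # the number of filters doubles at the next level
    shapes.append((size, size, channels))  # bottleneck
    return shapes

shapes = encoder_shapes()
```

Under this assumption the trace ends at 16 × 16 with 256 channels, which agrees with the bottleneck described in the text.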

Data Augmentation
The demand for biomedical image data is always high. Data augmentation is a tactic that allows the deep-learning community to significantly increase the diversity of data available for training without collecting new images. The size of the training dataset used in our work was insufficient for satisfactory training. Thus, we implemented data augmentation for the training set with horizontal flipping and a 20% rotation range. This way, we prevented our network from learning irrelevant features and essentially boosted the overall performance.
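For segmentation, any geometric augmentation has to be applied identically to the image and its mask. The sketch below shows this for horizontal flipping only; rotation is omitted, and the function names and the `flip_prob` parameter are hypothetical.

```python
import random

def hflip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def augment_pair(image, mask, rng, flip_prob=0.5):
    """Flip image and mask together so the labels stay aligned with the pixels."""
    if rng.random() < flip_prob:
        return hflip(image), hflip(mask)
    return image, mask

rng = random.Random(0)
image = [[10, 20, 30],
         [40, 50, 60]]
mask = [[0, 1, 1],
        [0, 0, 1]]
aug_image, aug_mask = augment_pair(image, mask, rng)
```

Drawing one random decision per pair, rather than one per array, is what keeps the lesion in the augmented mask at the same position as in the augmented image.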

Evaluation Metrics
Ground truth is represented by the segmentation region defined by individual experts. The deep neural network output is the segmentation result for the same image generated by the deep-learning algorithm. The area of overlap shows how well the two regions match, and this area represents the true positives (TP). False negatives (FN) are the part of the ground truth not covered by TP. The remaining part of the union comprises the false positives (FP). The part of the image that lies outside both the ground truth and the predicted segmentation contains the true negatives (TN). Figure 9 illustrates the abovementioned terms geometrically. To measure our method's performance, we used three metrics: the dice coefficient (DC), intersection-over-union (IoU), and pixel accuracy (PA). The IoU metric (i.e., the Jaccard index) measures the percentage overlap between the actual mask and our output prediction. Using the terms in Figure 9, we define IoU as
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (6)$$
Similarly, the DC can be written as
$$\mathrm{DC} = \frac{2\,TP}{2\,TP + FP + FN}$$
An alternative metric used to assess semantic segmentation calculates the percentage of pixels in the image that were correctly classified. This pixel accuracy (PA) is usually reported independently for each class as well as globally for all classes:
$$\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN}$$
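The three metrics can be computed directly from the pixel-wise confusion counts. The following sketch works on flattened binary masks; returning 1.0 for IoU and DC when both masks are empty is an assumed convention, and the function names are hypothetical.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, FN, TN over two flattened binary masks of equal length."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, pred):
        if t and p:
            tp += 1
        elif p:
            fp += 1   # predicted lesion, ground truth background
        elif t:
            fn += 1   # missed lesion pixel
        else:
            tn += 1
    return tp, fp, fn, tn

def segmentation_metrics(truth, pred):
    """Return (IoU, dice coefficient, pixel accuracy) for non-empty inputs."""
    tp, fp, fn, tn = confusion_counts(truth, pred)
    union = tp + fp + fn
    iou = tp / union if union else 1.0                   # assumed convention
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    pa = (tp + tn) / (tp + fp + fn + tn)
    return iou, dice, pa

truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
iou, dice, pa = segmentation_metrics(truth, pred)
```

For this toy pair the counts are TP = 2, FP = 1, FN = 1, and TN = 4, giving an IoU of 0.5, a dice coefficient of about 0.667, and a pixel accuracy of 0.75. The gap between pixel accuracy and the overlap metrics reflects how true negatives from the background inflate PA.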

Results and Discussion
When we compared our recorded result with different U-Net methods such as the original U-Net, FCN-1 (U-Net with the PReLU and transpose convolution), and FCN-2 (U-Net with the PReLU and bilinear interpolation), we achieved acceptable accuracy on the test dataset. Table 1 shows the average accuracies of the different models. Via experiments, we identified that using bilinear interpolation jointly with the PReLU was an effective method for skin-lesion segmentation. The training time per epoch is around 6 min for the original U-Net, whereas the proposed method takes less than 5 min per epoch. Figure 10 depicts the differences in the output images of the above-mentioned versions of U-Net. The transpose convolution method for upsampling produced more false negatives in the validation set. The experiments showed slight underfitting when transpose convolution was used (see Figure 10c,d). When the interpolation method was used with PReLU nonlinearity, the model detected more outliers in the validation set images. Figure 10e shows an image of high variance (overfitting). After applying the dropout technique, we overcame this problem in the interpolation method. The final visual result of the model is given in Figure 10f. We compared the validation set accuracies of each model during training (see Figure 11). After 20 training epochs, the validation accuracy of each method remained nearly the same. In Figure 11, U-Net-1 and U-Net-2 denote FCN-1 and FCN-2, respectively. The illustration in Figure 11 helps us reach a conclusion about the learning procedure of each method. Transpose convolution-based models tend to learn faster than interpolation-based methods. They almost always achieve their highest accuracy after 12 epochs. In contrast, interpolation-based methods take much more time to achieve the desired accuracies. One reason for this phenomenon is that dropout delays reaching the desired accuracy.
Another explanation for why bilinear interpolation-based approaches (FCN-2 and the proposed method) tend to learn more slowly than transpose convolution-based methods (U-Net and FCN-1) is that transpose convolution uses more learnable parameters compared with interpolation methods. In the initial steps of learning, transpose convolution-based methods try to learn with higher accuracy because of their default parameters. On the other hand, interpolation-based methods do not use learnable weights and biases. Therefore, they start to learn with lower accuracy and slowly reach the desired accuracy.
The graphs in Figure 12 show the training and validation accuracies of our technique. There is not much difference between training and validation accuracy, owing to the dropout regularization applied to the network. The overall dice coefficient was determined by averaging the individual dice coefficients of all images. Some DC values were relatively small because of noise in the input image. Another reason for the lower DC values is that the skin and lesion colors are similar to each other in some test images, so the pixel intensities of the lesion and the surrounding skin are hard to distinguish. Figure 13 illustrates several examples of high and low dice coefficients in the validation set.

Conclusion and Future Work
In this paper, we used a bilinear interpolation method for the upsampling part of our FCN with a block of convolution layers. Our proposed approach provided good results for skin-lesion segmentation by jointly applying the PReLU and bilinear interpolation methods. We eliminated the artifacts caused by transpose convolution and reduced the detrimental impact of the large number of U-Net parameters by using the interpolation method in the decoder part. To avoid overfitting, dropout was applied after each convolution block. We achieved ~94% PA and an 88.33% dice coefficient for skin-lesion segmentation. In future work, we plan to determine the reasons for low dice coefficients in some of our experimental cases by applying advanced image-processing techniques.