Attention-Guided Network with Densely Connected Convolution for Skin Lesion Segmentation

Automatic segmentation of skin lesions is a key step in the diagnosis and treatment of skin disease and is essential for improving patient survival. However, because lesion contrast is low, texture and boundaries are difficult to distinguish, which makes accurate segmentation challenging. To address these challenges, this paper proposes an attention-guided network with densely connected convolution for skin lesion segmentation, called CSAG and DCCNet. In the last step of the encoding path, the model replaces the ordinary convolutional layer with densely connected convolution. A novel attention-guided filter module, the Channel Spatial Fast Attention-guided Filter (CSFAG), is designed and embedded in the skip connections of CSAG and DCCNet. Extensive ablation experiments on the ISIC-2017 data set verify the superiority and robustness of the CSFAG module and of densely connected convolution. The segmentation performance of CSAG and DCCNet is compared with other recent algorithms, and it achieves very competitive results on all metrics. The robustness and cross-data-set performance of our method were also tested on another publicly available data set, PH2, further verifying the effectiveness of the model.


Introduction
One third of cancers worldwide are skin cancers [1]. Currently, 2 to 3 million cases of non-melanoma skin cancer and 132,000 cases of melanoma skin cancer occur globally each year. It is estimated that in the United States there were 96,480 new cases of melanoma and 7230 deaths from it in 2019. Melanoma accounts for less than 5% of all skin cancers, but 75% of skin cancer deaths are related to melanoma. Studies have shown that the 5-year survival rate of patients with advanced malignant melanoma is only 15%, while the cure rate of patients diagnosed early is as high as 95% [2]. Patients with benign melanoma only need to have the lesion found early and removed to prevent the disease from becoming life-threatening [3]. Therefore, distinguishing benign from malignant melanoma, and early from late stages, plays an extremely important role in the survival of melanoma patients.
Dermoscopy is an imaging technique that removes the reflection of the skin surface and visually enhances the deeper skin layers [4]. Compared with visual inspection, dermoscopy can improve diagnostic accuracy by 20% [5]. Skin lesion segmentation is one of the important steps in the computer-aided diagnosis of various skin diseases. As shown in Figure 1, the automatic segmentation of lesions in dermoscopic images is very challenging because lesions differ greatly in size, location, shape, and color between patients, and because the images contain many artifacts, including inherent skin features (such as hair and blood vessels) and artificial artifacts (such as bubbles, ruler marks, uneven lighting, and incomplete lesion areas). In addition, the low contrast between the lesion and the surrounding texture also hinders automatic segmentation of the lesion area. With computer-aided diagnosis, segmenting skin lesion images helps doctors quickly locate the lesion area, which can improve the diagnosis rate and enable patients to be treated early. Early skin lesion segmentation methods used edge detection, threshold segmentation, active contours (expectation maximization, level sets, clustering, etc.), region-based techniques (region growing, iterative stochastic region merging, etc.), or hybrids of these [6,7]. Although these algorithms achieve a certain segmentation quality, they rely too heavily on the quality of manual feature selection and on prior information. Moreover, recognition models based on hand-crafted features generalize poorly to skin lesion images whose clinical appearance varies widely.
Recently, supervised methods have achieved promising results in computer vision, but they rely on annotated training data sets, which require human expertise and related background knowledge. In contrast, unsupervised learning makes data-driven decisions by obtaining insights directly from the data itself, and it has been applied to many aspects of image processing. Ahmed et al. [8] proposed a low-rank tensor with a sparse mixture of Gaussian (LRTSMoG) decomposition algorithm for natural crack detection. The proposed algorithm jointly models the LRST pattern using a tensor decomposition framework; in particular, weak natural crack information can be extracted from strong noise. Gupta et al. [9] investigated the utility of unsupervised machine learning and data visualisation for tracking changes in user activity over time. The goal of semi-supervised and unsupervised image segmentation is to greatly reduce or even eliminate the need for labeled training data, thereby minimizing the burden on clinicians when training the segmentation model. Li et al. [10] presented a novel semi-supervised method for skin lesion segmentation, in which the network is optimized by the weighted combination of a common supervised loss for labeled inputs only and a regularization loss for both labeled and unlabeled data. Pathan et al. [11] proposed a deep clustering architecture and formal image analysis for image segmentation. The main idea is based on unsupervised learning: the images are clustered by disease severity within the subject sample, and each image is then segmented to highlight and outline the area of interest. Feyjie et al. [12] proposed a novel few-shot learning framework for semantic segmentation, in which unlabeled images can be used in each episode.
Recently, Deep Convolutional Neural Network (DCNN) models have cast skin lesion segmentation as a pixel-level classification problem and achieved remarkable success [13][14][15][16]. From the fully convolutional network (FCN), the earliest deep network used for image segmentation [17], to U-Net [18], which extends the FCN architecture, and the various extensions of U-Net, all of these models face a tension between semantic information and spatial location information. Features in the deep layers of the network carry richer semantic information, but step-by-step pooling continuously reduces their spatial resolution, so spatial position information is progressively lost. If the output is directly upsampled to the original input resolution, the segmentation result is coarse [19]. For dermoscopy images, whose grayscale changes are small and whose boundaries are relatively blurred, the low-level features with rich spatial location information are particularly important for restoring the spatial resolution of the features.
In recent years, the attention mechanism has been widely used in deep neural networks across many tasks [20][21][22]. Most applications in computer vision and computer graphics involve image filtering to reduce noise or extract useful image structure. The guided filter [23] is an edge-preserving image filter; here it serves as a special expanding path that transfers the structural information extracted from the low-level feature map to the high-level feature map. It is effective and efficient in a variety of computer vision and computer graphics applications, including noise reduction, detail smoothing/enhancement, HDR compression, image matting/feathering, haze removal, and joint upsampling.
In this article, we propose a Channel Spatial Attention-guided network with Densely Connected Convolution (CSAG and DCCNet) based on a deep CNN. The model is equipped with a novel and effective Channel Spatial Fast Attention-guided Filter (CSFAG) module for dermoscopic image segmentation. Compared with the existing attention methods widely used in CNNs, the CSFAG module has three advantages. First, we directly impose constraints on the unknown output by considering the guide image, and we use spatial attention and channel attention to collect information around the feature map. The calculated attention weight is applied to the guide feature map, so that the model can further capture the correlation of dimensional features, focus on the lesion area, and reduce the influence of noise on segmentation performance. Second, the CSFAG module supports the fusion of multi-resolution features. It can recover spatial information by filtering low-resolution and high-resolution feature maps, and it merges structural information from various resolution levels to better retain spatial location information. Third, adding the proposed CSFAG module to a network can greatly improve lesion segmentation performance, and the module can be seamlessly integrated into multiple basic segmentation network architectures with good robustness.
Our specific contributions are as follows: (1) We design a novel Channel Spatial Fast Attention-guided Filter (CSFAG). Extensive ablation experiments verify that the CSFAG module is superior to other mainstream attention modules and can be combined with multiple basic segmentation networks, effectively improving the segmentation performance of the model. (2) We embed the proposed CSFAG module into the skip connections of the M-Net segmentation network to form CSAG and DCCNet. In the last step of the CSAG and DCCNet encoding path, densely connected convolutions replace ordinary convolutional layers; through the idea of "collective knowledge", the gap between low-level features is bridged and features are effectively aggregated. (3) On the ISIC-2017 data set, the segmentation performance of CSAG and DCCNet was compared with other recent algorithms. Very competitive results were achieved on all six indicators: accuracy, sensitivity, specificity, Dice coefficient, Jaccard coefficient, and Matthews correlation coefficient. CSAG and DCCNet trained on ISIC-2017 was also tested on another publicly available data set, PH2, to verify the robustness and cross-data-set performance of our method.
The rest of this paper is arranged as follows: related work is introduced in Section 2. Section 3 introduces the proposed network architecture in detail; experiments and comparison results are explained in Section 4; finally, Section 5 gives our conclusion.

Semantic Segmentation Model
Methods based on the fully convolutional network (FCN) have made great progress in semantic segmentation. An early popular deep learning approach to semantic segmentation was patch classification, which classifies each pixel independently using the image patch around it. Long et al. [17] first applied the FCN to end-to-end training for image segmentation, enabling a convolutional neural network to perform dense pixel prediction without fully connected layers. This allows segmentation maps to be produced for images of any size. Since then, almost all advanced methods in semantic segmentation have adopted this model. Besides the fully connected layer, another problem with convolutional neural networks for semantic segmentation is the use of pooling layers. Although pooling expands the receptive field and aggregates context, it causes a loss of location information.
Two different structures address this problem. The first is the encoder-decoder structure. The encoder gradually reduces the spatial dimension through pooling layers, and the decoder gradually recovers the details and spatial dimensions of the object through methods such as bilinear interpolation. The encoder and decoder are usually linked by skip connections, which help the decoder recover the details of the target. U-Net [18] is the most commonly used structure of this kind, and it has given rise to multiple segmentation networks, such as V-Net [24], UNet++ [25], MultiResUNet [26], and UNet 3+ [27]. The second structure uses atrous (dilated) convolution and removes the pooling layers. Without reducing the spatial dimension, the atrous convolutional layer enlarges the receptive field, captures multi-scale context information, and preserves the relative spatial positions in the feature map, whereas pooling introduces translation invariance. Classical semantic segmentation networks from the DeepLab series [28][29][30] to PSPNet [31] all use atrous convolution.
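To make the receptive-field argument for atrous convolution concrete, the following is a minimal one-dimensional sketch in plain Python. The function names, the zero padding, and the toy kernel are illustrative assumptions and are not taken from any of the cited networks; the operation is the cross-correlation used by deep learning frameworks.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def atrous_conv1d(signal, kernel, dilation=1):
    """Cross-correlate `signal` with `kernel`, sampling inputs `dilation` apart.

    Zero padding keeps the output as long as the input, so no spatial
    resolution is lost (this sketch assumes an odd kernel size).
    """
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            pos = i + (j - half) * dilation
            if 0 <= pos < n:  # positions outside the signal count as zero
                acc += weight * signal[pos]
        out.append(acc)
    return out


if __name__ == "__main__":
    # A 3-tap kernel with dilation 2 spans 5 input positions.
    print(effective_kernel_size(3, 2))
    print(atrous_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))
```

With dilation 2, the three-tap kernel covers five input positions while the output keeps the input length, which is the property the text contrasts with pooling.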

Attention Mechanism
Attention plays an important role in human perception [32,33], and the attention mechanism has been widely used in many tasks. The Google DeepMind team first used the attention mechanism in an RNN model for image description [34]. Subsequently, they proposed an attention-based model for recognizing multiple objects in an image [35]. Typical attention models on a single image are the residual attention network [36] and squeeze-and-excitation networks (SENet) [37]. The residual attention network includes two attention components: a stacked network structure composed of multiple attention modules, and residual attention learning that combines the residual structure with the attention mechanism. SENet includes a squeeze-and-excitation block that introduces channel attention into each residual block. PSANet [38] aggregates the contextual information of each location through a predicted attention map. A²-Net [39] proposes a dual attention block, which can collect informative global features from the entire spatio-temporal extent of the input. DANet [40] uses the self-attention mechanism to capture rich contextual dependencies for scene segmentation, applying spatial and channel attention to collect information around the feature map. Woo et al. [41] proposed an attention module called CBAM, which can be embedded into classic deep networks to improve model performance. Wu et al. [42] proposed a deep learning model equipped with a new and efficient adaptive dual attention module (ADAM) to automatically segment skin lesions from dermoscopic images. Most of the above methods focus only on feature information from the spatial and channel dimensions. However, some recent studies use both the spatial and temporal domains for object segmentation with self-attention. Hamad et al. [43] proposed dilated causal convolution with multi-head self-attention for sensor-based human activity recognition. The multi-head self-attention enables the model to focus on important and relevant time steps rather than insignificant ones in the sequential feature maps during recognition. Hu et al. [44] proposed a hybrid multi-dimensional feature fusion structure that combines spatial and temporal segmentation models for automated thermography defect detection. They designed a new attention block that provides spatiotemporal attention to focus on semantically meaningful regions of the volumetric data and adaptively recalibrates the feature maps based on the weighted channels.
However, all of the above methods emphasize either the local relationships between pixels or the global context of the entire image. Because lesions differ in shape, size, and color, and because skin types and inherent skin characteristics vary, a single attention mechanism or an existing two-dimensional attention module with dimensional integration still cannot cope with the challenges of skin lesion segmentation. In this work, we combine image filtering with attention and design a Channel Spatial Fast Attention-guided Filter (CSFAG) module that preserves the smoothness of skin lesion edges without suffering from gradient-reversal artifacts. The CSFAG module performs well in terms of both quality and efficiency.

Dermoscopic Image Segmentation
Kawahara et al. proposed a fully convolutional network based on AlexNet to extract the surface features of melanoma [45]. Subsequently, Long et al. [17] proposed the fully convolutional network (FCN), which replaces fully connected layers with convolutional layers to convert a classification network into a segmentation network. Yu et al. used ResNet within the FCN structure to segment the lesion area on the ISIC dermoscopic image data set [46]. U-Net, proposed by Ronneberger et al. [18], is one of the most popular FCN structures for medical image segmentation. Tschandl et al. applied transfer learning with the LinkNet structure, using a ResNet classification network pre-trained on the ImageNet data set as the encoder of the segmentation network, and achieved a significant performance improvement on skin lesion segmentation [47]. The DeepLab series introduced dilated convolution to reduce the loss of resolution in the encoder and to increase the receptive field. Goyal et al. applied DeepLabV3+ to skin lesion segmentation [48]. To reduce the number of parameters and make the network lightweight, Hasan et al. used depthwise separable convolution instead of standard convolution and projected the learned discriminative features onto the pixel space at different stages of the encoder [49]. However, in clinical practice, because of the complexity of lesions and the significant increase in the number of dermoscopic images, it is increasingly important that a segmentation model be lightweight, robust, and easy to combine with multiple basic networks. Our goal in this work is to develop a segmentation model that is easy to port and has good robustness.

Proposed CSAG and DCCNet Model
The CSAG and DCCNet proposed in this paper is an end-to-end multi-label deep network that consists of multiple novel Channel Spatial Fast Attention-guided Filter (CSFAG) modules, densely connected convolution, a multi-scale input layer, U-Net, and side output layers. In the CSFAG module, the guided filter is combined with the attention module to form a novel attention-guided filter, which is embedded in the skip connections of CSAG and DCCNet. In the last step of the CSAG and DCCNet encoding path, densely connected convolution replaces the ordinary convolutional layer. Using densely connected convolution at this point helps recover spatial information with the help of rich context information, further enriches the context information, and reduces the difficulty of training. CSAG and DCCNet uses U-Net as the basic network structure. On the encoder side, it builds an image pyramid input to form a multi-scale layer that fuses receptive fields at multiple levels; on the decoder side, it introduces side output layers and averages all side output maps to form the final prediction map. A multi-label loss function is used to update the parameters during training. The architecture of this model is shown in Figure 2.

CSFAG Module
Image filtering is widely used in computer vision and computer graphics. Simple linear translation-invariant (LTI) filters, such as Gaussian, Laplacian, and Sobel filters, are widely used for image blurring/sharpening, edge detection, and feature extraction. The kernel of an LTI filter is spatially invariant and independent of the image content. We therefore want to incorporate additional information from a given guide image during filtering, so that the filter is tied to the structural information in the dermoscopic image and improves the feature extraction ability of the model. We use the guide image to directly impose constraints on the output, and we use spatial attention and channel attention to collect information around the feature map. Inspired by these ideas, we design the Channel Spatial Fast Attention-guided Filter (CSFAG) module. The module is mainly composed of two parts, the fast guided filter and the attention module, and its architecture is shown in Figure 3. The input of the CSFAG module includes the guiding feature map $G$ and the filtered feature map $F$, and the output is a high-resolution feature map $O$. First, we use bilinear interpolation to rescale the guiding feature map $G$ into a low-resolution feature map $G_l$. $G_l$ and the filtered feature map $F$ pass through the attention module to generate the attention feature map $A$. We minimize the reconstruction error [50] between $G_l$ and $F$ to obtain the coefficients $W_l$ and $B_l$ of the CSFAG module. By bilinearly up-sampling $W_l$ and $B_l$, we obtain the coefficients $W_h$ and $B_h$, which match the size of the guiding feature map $G$. Finally, the output $O$ of the CSFAG module is obtained through a linear transformation.
Specifically, the CSFAG module constructs a local square window $\omega_k$ with radius $r$ centered at pixel $k$ to realize a local linear model. Assuming that $G_{l,i}$ is a pixel of $G_l$, the output $O_{k,i}$ in $\omega_k$ is obtained by the linear transformation in Equation (1). To determine the coefficients $W_k$ and $B_k$, the reconstruction error between the outputs $O_{k,i}$ and the filtered features $F_i$ over all pixels in the window $\omega_k$ is minimized, as shown in Equation (2), where $\lambda$ is a regularization parameter that controls the smoothness and $A_i$ is the attention weight at position $i$, obtained by the attention module. In Equations (3) and (4), $\mu_k$ and $\sigma_k$ are the mean and variance of $G_l$ in the window $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, and $\bar{F}_k = \frac{1}{|\omega|}\sum_{i \in \omega_k} F_i$ is the average of $F$ in $\omega_k$. After calculating the coefficients $W_k$ and $B_k$ for each window $\omega_k$, we obtain the output $O_{k,i}$ corresponding to each window. Because position $i$ is covered by several overlapping windows, the outputs $O_{k,i}$ from all of them are averaged to generate $O_i$, which is equivalent to applying the average coefficients of all windows overlapping $i$, as shown in Equation (5), where $\Omega_i$ is the set of all windows that include position $i$ and $*$ denotes element-wise multiplication.
After up-sampling $W_l$ and $B_l$ to obtain $W_h$ and $B_h$, respectively, the final output is calculated by applying these coefficients to the guiding feature map $G$. The calculation process of the CSFAG module is described in detail in Algorithm 1.
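The displayed forms of the equations referenced above are not reproduced in this text. As a reading aid only, the block below writes out one form that is consistent with the symbol definitions given here. It is an assumption, not the authors' exact formulation: in particular, the squared attention weight on the data term and the notation $\sigma_k^2$ for the window variance are choices made for this sketch, and the closed-form coefficients are given only for the special case $A_i = 1$, where they reduce to the classical guided filter [23].

```latex
% Assumed local linear model in window \omega_k
\begin{align*}
O_{k,i} &= W_k\, G_{l,i} + B_k, \qquad \forall i \in \omega_k \\[4pt]
% Assumed attention-weighted, regularized reconstruction error
\min_{W_k,\,B_k} E(W_k, B_k) &= \sum_{i \in \omega_k}
  \Bigl( A_i^{2} \bigl( W_k\, G_{l,i} + B_k - F_i \bigr)^{2} + \lambda W_k^{2} \Bigr) \\[4pt]
% Closed form for the special case A_i = 1 (classical guided filter)
W_k &= \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} G_{l,i}\, F_i - \mu_k \bar{F}_k}
            {\sigma_k^{2} + \lambda},
\qquad
B_k = \bar{F}_k - W_k\, \mu_k \\[4pt]
% Averaging over all windows that contain position i
O_i &= \frac{1}{|\omega|} \sum_{k \in \Omega_i} \bigl( W_k\, G_{l,i} + B_k \bigr)
     = W_l * G_l + B_l \\[4pt]
% High-resolution output after up-sampling the coefficients
O &= W_h * G + B_h
\end{align*}
```

In this reading, a larger $\lambda$ shrinks $W_k$ toward zero and therefore smooths the output, which matches the role of $\lambda$ described in the text.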
To further demonstrate that the CSFAG module helps the model generate clearer boundaries, highlight the lesion area, and reduce the influence of the background, we randomly selected a melanoma dermoscopy image from the test set of the ISIC2017 data set for visual analysis. We chose the CSFAG module on the skip connection in the fourth layer of the CSAG and DCCNet model as the test module. To show the role of the CSFAG module more clearly, we applied three convolution operations to the input of the CSFAG module, to the output of the intermediate attention module, and to the final output of the CSFAG module to reduce their dimensionality, and then used bilinear interpolation to restore the feature size to 256 × 256. We drew CAM heat maps to show the changes produced by the CSFAG module in the feature maps. As the CAM heat maps in Figure 4 show, after the attention module in the CSFAG module the features are more focused on the lesion area. In the final output generated after the fast guided filter, the model further aggregates spatial and channel features, which enhances the discriminative ability of the network and lets it make good use of the given features.

Algorithm 1 Channel Spatial Fast Attention-guided Filter
Input: the guided feature map G and the filtered feature map F; parameters r and λ
Output: high-resolution feature map O
... square window with a radius of r and a pixel k as the center ...
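The processing steps of the CSFAG module described in the text can be sketched on one-dimensional signals in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, the guide is exactly twice as long as the filtered signal, pair averaging and nearest-neighbour repetition stand in for bilinear down- and up-sampling, and the attention weights enter the data term squared.

```python
def attention_guided_filter_1d(G, F, A, r=1, lam=1e-3):
    """Filter the low-resolution signal F under the high-resolution guide G.

    G: guide of length 2n. F, A: filtered signal and attention weights of length n.
    Assumes A > 0 in every window and either lam > 0 or a non-constant guide.
    """
    n = len(F)
    if len(G) != 2 * n or len(A) != n:
        raise ValueError("expected len(G) == 2 * len(F) == 2 * len(A)")

    # Step 1: low-resolution guide G_l (pair averaging replaces bilinear resizing).
    G_l = [(G[2 * i] + G[2 * i + 1]) / 2.0 for i in range(n)]

    # Step 2: window of radius r around every position k, clipped at the borders.
    windows = [range(max(0, k - r), min(n, k + r + 1)) for k in range(n)]

    # Step 3: per-window coefficients from attention-weighted, regularized
    # least squares: minimize sum_i (A_i^2 (W g_i + B - f_i)^2 + lam W^2).
    W_k, B_k = [], []
    for win in windows:
        w = [A[i] ** 2 for i in win]
        sw = sum(w)
        g_bar = sum(wi * G_l[i] for wi, i in zip(w, win)) / sw
        f_bar = sum(wi * F[i] for wi, i in zip(w, win)) / sw
        cov = sum(wi * (G_l[i] - g_bar) * (F[i] - f_bar) for wi, i in zip(w, win))
        var = sum(wi * (G_l[i] - g_bar) ** 2 for wi, i in zip(w, win))
        W = cov / (var + lam * len(win))
        W_k.append(W)
        B_k.append(f_bar - W * g_bar)

    # Step 4: average the coefficients of all windows that contain position i.
    # With symmetric clipped windows these are the windows centred inside window i.
    W_l = [sum(W_k[k] for k in windows[i]) / len(windows[i]) for i in range(n)]
    B_l = [sum(B_k[k] for k in windows[i]) / len(windows[i]) for i in range(n)]

    # Step 5: up-sample the coefficients (nearest neighbour) and apply them to G.
    return [W_l[j // 2] * G[j] + B_l[j // 2] for j in range(2 * n)]


if __name__ == "__main__":
    G = [float(v) for v in range(8)]
    F = [2.0, 6.0, 10.0, 14.0]          # equals 2 * G_l + 1
    A = [0.5, 0.9, 0.2, 0.7]
    print(attention_guided_filter_1d(G, F, A, r=1, lam=0.0))
```

When F is an exact linear function of the low-resolution guide and the regularization is switched off, every window recovers the same slope and offset, so the output is that linear function evaluated on the full-resolution guide. This is the sense in which structure is transferred from G to the up-sampled result.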

Channel and Spatial Attention Learning
The channel attention mechanism was proposed in SENet by Hu et al. [37] and has been shown to improve network performance. The channel attention module and the spatial attention module [41] are shown in Figure 5.
The channel attention module compresses the feature map along the spatial dimensions to obtain a one-dimensional vector before further operations. As shown in Figure 5a, the input feature map $F^C \in \mathbb{R}^{C \times H \times W}$ is passed through global max pooling and global average pooling over width and height, and each pooled vector is then passed through a multilayer perceptron (MLP). The MLP outputs $F^C_{Avg} \in \mathbb{R}^{C \times 1 \times 1}$ and $F^C_{Max} \in \mathbb{R}^{C \times 1 \times 1}$ are summed element by element to generate the channel attention map $X_c$, as shown in Equation (6), where ⊕ denotes element-wise addition. After the sigmoid activation, the final channel attention map is generated, as shown in Equation (7), where $C$ is the number of channels, $X_{i,1,1}$ is the element at coordinate $(i, 1, 1)$, and ⊗ denotes element-wise multiplication. The channel attention map and the input feature map are multiplied element-wise to generate the input features required by the spatial attention module.
The spatial attention module compresses the channel dimension by performing average pooling and max pooling along the channel axis. As shown in Figure 5b, the feature map $A_c \in \mathbb{R}^{C \times H \times W}$ output by the channel attention module is used as the input of this module. First, average pooling and max pooling along the channel axis produce two two-dimensional feature maps $F^S_{Avg}$ and $F^S_{Max}$ (see Equations (9) and (10)), which are then concatenated along the channel axis. A convolution reduces the merged two-channel feature map to one channel to generate the spatial attention map $X_s$, as shown in Equation (8), where $W$ and $b$ denote the weight and bias, Cat(·) denotes concatenation, and Conv(·) denotes the convolution operation.
The spatial attention feature $A_S$ is then generated through the sigmoid function, as shown in Equation (11), where $m$ and $n$ index the spatial position and ⊗ denotes element-wise multiplication. Finally, this feature is multiplied with the input feature of the module to obtain the final output feature.
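The two attention steps can be illustrated with a small plain-Python sketch on nested lists of shape [C][H][W]. It follows the description above (pooling, shared MLP, sigmoid gate, element-wise multiplication), but all names are hypothetical, the MLP weights are toy values rather than learned parameters, and a 1 × 1 convolution over the two pooled maps is assumed because the kernel size is not stated in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_mlp(w1, w2):
    """Two-layer perceptron with ReLU; w1 is [hidden][C], w2 is [C][hidden]."""
    def mlp(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return mlp


def channel_attention(F, mlp):
    """Gate every channel of F by sigmoid(MLP(avg pool) + MLP(max pool))."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]  # global average pooling
    mx = [max(map(max, ch)) for ch in F]                            # global max pooling
    x_c = [a + m for a, m in zip(mlp(avg), mlp(mx))]                # element-wise addition
    gate = [sigmoid(v) for v in x_c]
    return [[[gate[c] * v for v in row] for row in F[c]] for c in range(len(F))]


def spatial_attention(F, w_avg, w_max, bias):
    """Gate every position of F using channel-wise average and max pooling.

    The 1x1 convolution over the two pooled maps is an assumption of this sketch.
    """
    C, H, W = len(F), len(F[0]), len(F[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for m in range(H):
        for n in range(W):
            column = [F[c][m][n] for c in range(C)]
            x_s = w_avg * (sum(column) / C) + w_max * max(column) + bias
            gate = sigmoid(x_s)
            for c in range(C):
                out[c][m][n] = gate * F[c][m][n]
    return out


if __name__ == "__main__":
    F = [[[1.0, 3.0]], [[0.0, 0.0]]]                 # C=2, H=1, W=2
    mlp = make_mlp(w1=[[1.0, 0.0]], w2=[[1.0], [0.0]])
    A_c = channel_attention(F, mlp)
    print(spatial_attention(A_c, w_avg=1.0, w_max=1.0, bias=0.0))
```

Applying the channel step first and feeding its output to the spatial step mirrors the order described in the text; both steps keep the C × H × W shape of their input.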

Densely Connected Convolution Module
The densely connected convolution module [51] is composed of multiple dense blocks, and each dense block performs two convolution operations. The structure of densely connected convolution is shown in Figure 6. Its distinctive feature is that the input of each block is the concatenation of the feature maps generated by all previous blocks, and each layer performs a series of consecutive transformations. Densely connected convolution has several advantages over conventional convolution. First, it helps the network learn diverse features rather than redundant ones. In addition, it allows information to flow through the network and features to be reused, which increases the representational power of the network. Dense connectivity ensures the maximum information path between layers by connecting all layers. The output features of all convolutional layers in the dense block are concatenated along the channel axis. Let $X_{i-1}$ and $X_i$ be the input and output of the $i$-th dense block, respectively. The output $X_i$ of the $i$-th dense block $F_i$ is obtained as shown in Equation (12), where $[\cdot]$ denotes concatenation along the channel axis and the input $X_0$ is the initial feature $f_1$. Because the input of each convolutional layer is a repeated concatenation, the number of input channels of the next convolutional layer increases with the growth rate $n$.

The data used in this article come from the ISIC2017 [52] challenge data set released by the International Skin Imaging Collaboration (ISIC) and the PH2 [53] data set provided by the Dermatology Service of Hospital Pedro Hispano (Matosinhos, Portugal) together with research groups of the Universidade do Porto and Técnico Lisboa. The ISIC2017 data set provides 2750 dermoscopic images in RGB format, of which 2000 images were used in the training phase, 600 in the test phase, and 150 in the validation phase. The PH2 data set contains 200 8-bit RGB dermoscopic images with a fixed size of 768 × 560 pixels. All images come with lesion boundaries drawn by professional clinicians. Table 1 summarizes the distribution of the two data sets.

Processing
We combined the 150 validation images and the 2000 training images as training data for the network. To train the proposed network more conveniently, we scaled images of different sizes to a width of 256 px while keeping the aspect ratio, and then padded black borders at the top and bottom of the image to bring the height to 256 px. To let the network better learn the brightness, hue, and saturation of the dermoscopic images, we also converted the images from RGB format to HSV format. As shown in Figure 7, the R, G, and B components of an RGB image all depend on the amount of light falling on the object, so an image description based on these components makes it difficult to distinguish the object. Unlike RGB, HSV separates brightness (image intensity) from chromaticity (color information) and is more stable under changes in external lighting. HSV images make it easier to detect objects with specific colors and reduce the influence of external light intensity. We therefore used HSV images as an effective supplement to the training data. In addition, the images in both formats were flipped horizontally, vertically, and both horizontally and vertically to expand the training set. Together with the unflipped images this gives 2150 × 2 × 4 = 17,200 training images.
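The preprocessing steps can be sketched with the Python standard library alone. The helper names below are hypothetical, rounding the scaled height to the nearest pixel and splitting the padding as evenly as possible are assumptions, and reading the three extra views as flips (rather than rotations) is an interpretation that is consistent with the stated total of 17,200 images.

```python
import colorsys


def letterbox_plan(width, height, target=256):
    """Return (scaled_height, pad_top, pad_bottom) when the width is scaled to `target`."""
    scaled_height = round(height * target / width)
    if scaled_height > target:
        raise ValueError("portrait images are not covered by the described procedure")
    pad = target - scaled_height
    return scaled_height, pad // 2, pad - pad // 2


def rgb_to_hsv_image(image):
    """image: rows of (r, g, b) tuples in 0..255; returns rows of (h, s, v) in 0..1."""
    return [[colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for r, g, b in row]
            for row in image]


def flips(image):
    """Original image plus its horizontal, vertical, and combined flips."""
    horizontal = [row[::-1] for row in image]
    vertical = image[::-1]
    both = [row[::-1] for row in vertical]
    return [image, horizontal, vertical, both]


def augmented_count(n_images, n_formats=2, n_views=4):
    """Number of training images after adding HSV copies and flipped views."""
    return n_images * n_formats * n_views


if __name__ == "__main__":
    print(letterbox_plan(768, 560))            # a PH2-sized image
    print(rgb_to_hsv_image([[(255, 0, 0)]]))   # pure red
    print(augmented_count(2000 + 150))
```

The count function makes the arithmetic behind the training-set size explicit: 2150 source images, two color formats, and four views per image.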

Experimental Setup

Evaluation Metrics
We used Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), Jaccard coefficient (JAC), Dice coefficient (DIC), and Matthews Correlation Coefficient (MCC) to assess the segmentation performance of the proposed CSAG and DCCNet. SEN, SPE, and ACC are common statistical measures of binary classification performance. JAC and DIC evaluate the similarity between the segmentation result and the ground truth. MCC is the correlation coefficient between the prediction and the ground truth. All of these indicators are calculated directly from the confusion matrix, as given in Equations (13)-(18), where TP is the number of lesion pixels correctly segmented as lesion and FN is the number of lesion pixels wrongly segmented as non-lesion. Non-lesion pixels correctly classified as non-lesion are counted as TN; otherwise, they are counted as FP.
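As a concrete illustration, the six indicators can be computed from the confusion-matrix counts as follows. The sketch uses the standard textbook definitions of these measures, which are assumed here to match Equations (13)-(18); the function names are hypothetical, and empty classes (zero denominators) are not handled.

```python
import math


def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for flat binary masks (1 = lesion, 0 = non-lesion)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def segmentation_metrics(tp, fp, tn, fn):
    """Standard definitions of SEN, SPE, ACC, JAC, DIC and MCC."""
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "JAC": tp / (tp + fp + fn),
        "DIC": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    truth = [1, 0, 0, 0, 1, 1]
    print(segmentation_metrics(*confusion_counts(pred, truth)))
```

Note that DIC and JAC are monotonically related by DIC = 2·JAC / (1 + JAC), so they always rank two segmentations of the same image in the same order.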

Implementation Details
All experiments in this article were run on a workstation with Ubuntu 16.04, an Intel(R) Xeon(R) Gold 5218 CPU at 2.30 GHz, and an NVIDIA Quadro RTX 6000 (24 GB). The code was written in Python 3.6 with the PyTorch 1.0.0 deep learning framework. CSAG and DCCNet was trained using stochastic gradient descent with momentum: the momentum parameter was 0.9, the weight decay coefficient was $5 \times 10^{-4}$, the initial learning rate was 0.1 with an exponential decay of 10% every 50 epochs, and the mini-batch size was 16. We used the Softmax function for the final classification. The radius $r$ of the local square window constructed in the CSFAG module was 4, and the regularization parameter $\lambda$ was set to 0.001.
In the last step of the CSAG and DCCNet encoding path, densely connected convolution replaces the ordinary convolutional layer. How many densely connected blocks are needed to achieve the best segmentation results is a question worth investigating. We therefore tested on the 600 dermoscopy images of the ISIC-2017 test set; the distribution of the 600 test images is shown in Table 2. We used the control variable method: the CSFAG module remained embedded in the skip connections of CSAG and DCCNet, and only the number of densely connected blocks D used in the last step of the encoding path was changed. We set D = 1, 2, 3, and 4 in turn. When D = 1, there was no densely connected convolution block in this layer, which made it similar to the last encoding layer of the standard U-Net. The specific experimental results are shown in Table 2. When D = 3, the best results were achieved on ACC, SPE, DIC, JAC, and MCC. When D = 4, all indicators fell. Three densely connected blocks were therefore used in subsequent experiments.
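To make the role of D concrete, the sketch below counts the input channels seen by each densely connected block when every block receives the concatenation of the initial feature map and the outputs of all earlier blocks. The function name is hypothetical, and the channel numbers in the example (512 initial channels, growth rate 512) are illustrative values, not figures reported in this paper.

```python
def dense_input_channels(initial_channels, growth_rate, num_blocks):
    """Input channel count of each of `num_blocks` densely connected blocks.

    Block i receives the channel-wise concatenation of X_0 and the outputs of
    blocks 1..i-1, each of which contributes `growth_rate` channels.
    """
    channels = []
    features = [initial_channels]        # channel counts of X_0 and earlier outputs
    for _ in range(num_blocks):
        channels.append(sum(features))   # concatenation along the channel axis
        features.append(growth_rate)     # this block's output joins later inputs
    return channels


if __name__ == "__main__":
    for d in (1, 2, 3, 4):
        print(d, dense_input_channels(512, 512, d))
```

With D = 1 the single block sees only the initial feature map, which corresponds to the case described above as similar to a standard U-Net encoding layer. Each additional block widens the input by one growth rate, so the parameter count grows with D.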

Structure Ablation
The CSAG and DCCNet proposed in this paper uses U-Net as the basic network structure. On the left side of the encoder, it builds an image pyramid input to achieve multi-scale, multi-level receptive field fusion; on the right side of the decoder, it introduces side output layers and averages all side output maps to obtain the final prediction map. This structure is that of M-Net [54]. In addition, the densely connected convolution and the CSFAG module are key factors in the excellent performance of CSAG and DCCNet in skin lesion segmentation. We therefore designed a set of ablation experiments. Taking the U-Net structure as the baseline model, we added the image pyramid input, the side output layers and the multi-label loss function to this basic structure to verify that these methods improve the segmentation of dermoscopic images. In the last step of the M-Net encoding path, we then replaced the ordinary convolutional layers with densely connected convolutions and named the resulting model M-Net+Dense Convolutions. Separately, we embedded the CSFAG module in the M-Net skip connections and named the resulting model M-Net+CSFAG. The experiments further showed that both methods are effective. The specific segmentation performance indicators are shown in Tables 3 and 4.
The data in all tables were obtained by testing the different models on the 600 test images of the ISIC-2017 data set: each indicator was calculated for each test image and then averaged over the 600 test images. The results for Nevus cases, Melanoma cases, SK cases, and overall all come from models trained for exactly 200 epochs. From the results in Tables 3 and 4, we can see that for benign nevus, melanoma and seborrheic keratosis lesions alike, M-Net improved on U-Net in the various indicators, which verified that the pyramid input, the side output and the multi-label loss function help to improve the performance of dermoscopy image segmentation. Compared with M-Net, both M-Net+Dense Convolutions and M-Net+CSFAG achieved improvements in the four key indicators ACC, DIC, MCC, and JAC. This further shows that these two methods are effective. We applied Dense Convolutions and CSFAG to M-Net together to design the CSAG and DCCNet proposed in this article. The specific segmentation results show that it obtains very competitive results, especially in the segmentation of melanoma and seborrheic keratosis lesions.
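The evaluation protocol described above (compute an indicator per test image, then average over the test set) can be sketched as follows, using the Dice coefficient as the example. Masks are represented as nested lists of 0/1 values purely for illustration; the helper names `dice` and `mean_per_image`, and the convention of scoring two empty masks as 1.0, are assumptions rather than details given in the text.

```python
def dice(pred, truth):
    """Dice coefficient between two binary masks given as nested 0/1 lists."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # lesion pixel predicted as lesion
            elif p:
                fp += 1      # background pixel predicted as lesion
            elif t:
                fn += 1      # lesion pixel missed
    den = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * tp / den if den else 1.0


def mean_per_image(metric, preds, truths):
    """Average a per-image metric over a test set."""
    scores = [metric(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two hypothetical 2x2 test images.
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    truths = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(mean_per_image(dice, preds, truths))
```

Averaging per image gives every image equal weight regardless of lesion size, which generally differs from pooling all pixels into one confusion matrix before computing the indicator.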
Figure 8 visually displays the segmentation results of U-Net, M-Net, M-Net+Dense Convolutions, M-Net+CSFAG, and the CSAG and DCCNet proposed in this paper. In Figure 8f, we zoomed in on the lesion area of each image to compare the five models of the structure ablation experiment with the ground truth. The figure clearly shows that, compared with U-Net, the segmentation result of M-Net had clearer edges and a more concentrated lesion area. For nevus and melanoma lesions, the results of M-Net+Dense Convolutions, M-Net+CSFAG and our model differed little from those of M-Net. We believe that for dermoscopic images of nevus and melanoma lesions, where the foreground differs strongly from the background and the lesion area is relatively concentrated, a simpler model can already obtain good segmentation results. However, when the foreground differs little from the background and the boundary of the lesion area is blurred, as in the dermoscopic images in the last two rows of Figure 8, it is difficult to obtain an ideal segmentation result with a more basic network structure such as U-Net or M-Net. CSAG and DCCNet was designed for this reason. As the last two rows of Figure 8 show, for such images CSAG and DCCNet identified larger lesion areas, and its segmentation result was closer to the ground truth.

Attention Module Ablation
The Channel Spatial Fast Attention-guided Filter (CSFAG) module designed in this paper is the main reason for the improvement in segmentation performance. To verify the effectiveness of the CSFAG module and to show that it aggregates features better than other attention modules, we designed a set of ablation experiments. We used M-Net+Dense Convolutions from the structure ablation as the baseline and changed only the attention module embedded in the skip connections. Three classic, lightweight, general-purpose attention modules were selected for comparison with the CSFAG module. The first is the Squeeze-and-Excitation (SE) module of SE-Net [37]. The SE module performs attention, or gating, along the channel dimension, driving the model to pay more attention to the most informative channel features while suppressing unimportant ones. The second is the Feature Pyramid Attention (FPA) module proposed by Li and Xiong et al. [55]. This module applies attention at the pixel level, adopts the global pooling idea of PSPNet [31], and adds the pooled result to the attention-weighted convolution result. The last is the CBAM module proposed by Woo et al. [41]. This module combines spatial and channel attention and learns how to emphasize or suppress intermediate features effectively, so that the model pays more attention to the target object itself.
Tables 5 and 6 show the segmentation performance on the three types of skin lesions of the models formed by combining SE Block, FPA and CBAM with the baseline, together with CSAG and DCCNet (ours). Across the indicators, the network structures with an added attention module improved considerably on the baseline. CSAG and DCCNet achieved good results, especially in the three key indicators DIC, JAC, and MCC. We attribute its higher results to the following reasons. SE Block models only the channel attention of the feature maps and ignores the information in the spatial dimension. Although the FPA module uses a pyramid structure to expand the receptive field, it loses pixel-level positioning information, which has a considerable impact on dermoscopic image segmentation. CBAM can learn what to attend to and where to attend in the channel and spatial dimensions, but it has difficulty integrating the structural information contained in low-resolution and high-resolution feature maps, and it retains less edge information. As for the parameters of each model, SE Block attends only to channel-level features, which keeps the number of model parameters small. The FPA module uses large 5 × 5 and 7 × 7 convolution kernels, resulting in a large number of model parameters. CBAM applies attention to both the channel and the spatial dimension, so its parameter count is higher than that of SE Block. The CSFAG module proposed in this paper uses the CBAM module as its attention module and combines it with the guided filter, so its parameter count is slightly higher than that of the CBAM module.
Table 5. Performance evaluation of sensitivity, specificity, and accuracy of different attention modules and the CSFAG module on the ISIC-2017 data set, using the baseline as the basic structure. Bold data indicates that the value is the maximum value in this indicator.
To address the problems of the attention modules described above and to further improve the segmentation performance of the model, we designed the CSFAG module. Figure 9 shows the segmentation results of the four network structures used in the ablation experiments of this section. Compared with the other three attention modules, the segmentation results obtained by CSAG and DCCNet (ours) were closer to the ground truth. Especially for dermoscopic images with relatively blurry lesions, CSAG and DCCNet (ours) segmented larger lesion areas and obtained better segmentation results. To compare the segmentation results intuitively, we marked the segmented edge contours with lines of different colors, superimposed them on the original image, and enlarged the lesion area, as shown in Figure 9g. Obviously, the blue line representing CSAG and DCCNet [...]
To further illustrate the superiority of the CSFAG module, we drew the ROC curves and PR curves of the baseline-based SE Block, FPA and CBAM models and of CSAG and DCCNet (ours) on the overall test set, Nevus cases, Melanoma cases, and SK cases. The ROC curve relates the false positive pixels to the true positive pixels as the threshold on the probability map varies. When the positive and negative samples are strongly imbalanced, the PR curve better reflects the true performance of the classification. Whereas a good ROC curve bulges toward the upper left, a good PR curve bulges toward the upper right. As shown in Figure 10, the areas under our ROC curve and PR curve were larger than those of the other attention modules. This further showed that the CSFAG module performed better than the other attention modules and separated background information from skin lesion information more effectively.
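The construction of ROC and PR curves by sweeping a threshold over the probability map can be sketched as below. This is a minimal pure-Python illustration, not the authors' evaluation code: the function names `roc_pr_points` and `trapezoid_area`, the flattened-list input format, and the convention that precision is 1.0 when no pixel is predicted positive are all assumptions.

```python
def roc_pr_points(probs, labels, thresholds):
    """Return one ROC/PR operating point per threshold.

    probs:  flattened lesion probabilities, one per pixel
    labels: flattened ground-truth labels (1 = lesion, 0 = background)
    """
    points = []
    for thr in thresholds:
        tp = fp = tn = fn = 0
        for p, y in zip(probs, labels):
            predicted = p >= thr
            if predicted and y:
                tp += 1
            elif predicted:
                fp += 1
            elif y:
                fn += 1
            else:
                tn += 1
        recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
        fpr = fp / (fp + tn) if fp + tn else 0.0           # false positive rate
        precision = tp / (tp + fp) if tp + fp else 1.0     # assumed convention
        points.append({"threshold": thr, "fpr": fpr, "tpr": recall,
                       "precision": precision, "recall": recall})
    return points


def trapezoid_area(xs, ys):
    """Area under a curve given as unordered (x, y) samples."""
    pairs = sorted(zip(xs, ys))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]))


if __name__ == "__main__":
    # Four hypothetical pixels that a perfect classifier separates.
    pts = roc_pr_points([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0], [0.0, 0.5, 1.01])
    auc = trapezoid_area([p["fpr"] for p in pts], [p["tpr"] for p in pts])
    print(auc)
```

A larger area under either curve corresponds to the comparison made with Figure 10; in practice many more thresholds than the three used here would be swept.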

Robustness Test of CSFAG Module
To verify that the CSFAG module proposed in this paper is robust and can be combined with multiple basic segmentation networks, we designed the following set of experiments. We used three basic segmentation networks, U-Net [18], SegNet [56] and M-Net [54], as backbone networks and embedded the CSFAG module into their skip connections, naming the resulting models U-Net+CSFAG, SegNet+CSFAG and M-Net+CSFAG.
Tables 7 and 8 compare the segmentation performance of the three basic architectures, the new models formed by combining them with the CSFAG module, and the proposed CSAG and DCCNet (ours). Although each new network formed by adding the CSFAG module had considerably more model parameters, it outperformed its basic network in all six indicators. This showed that the CSFAG module can significantly improve the segmentation performance of a network, is easy to transplant, and is robust.
The results of U-Net and SegNet in Tables 7 and 8 were slightly lower than those reported by Al-Masni et al. [57] in Tables 9 and 10. We attribute the difference to different parameter settings and training methods. Al-Masni et al. initialized SegNet with the weights of a VGG-16 model trained on the ImageNet dataset and only fine-tuned the weights of the segmentation network. They used Theano and Keras for programming, selected the Adam optimizer, set the batch size to 20, and trained on an NVIDIA GeForce GTX 1080 (16 G) GPU. In contrast, when we reproduced U-Net and SegNet, we used random initial weights and ran on a workstation equipped with an Intel(R) Xeon(R) Gold 5218 CPU at 2.30 GHz and an NVIDIA Quadro RTX 6000 (24 G). We used Python 3.6 and the PyTorch 1.0.0 deep learning framework for programming, and trained with stochastic gradient descent with momentum, with a momentum parameter of 0.9 and a weight decay coefficient of 5 × 10⁻⁴.
To show the improvement in segmentation performance brought by the CSFAG module, Figure 11 visually displays the segmentation results of U-Net, SegNet and M-Net, of the new attention-guided networks derived from them, and of CSAG and DCCNet. In (c3), (d3) and (e3) of Figure 11, the red line represents the ground truth and the blue line represents the segmentation result of the corresponding model. We zoomed in on the lesion area of each picture to compare the three models containing the CSFAG module with the ground truth.

Comparing with Existing Technology by Lesion Type
Tables 9 and 10 compare the CSAG and DCCNet proposed in this paper with the latest methods proposed by Al-Masni et al. [57] and Goyal et al. [58]. Compared with the other models, CSAG and DCCNet achieved higher results in all indicators in the segmentation of the three types of skin lesions, and very competitive results especially in the four key indicators ACC, DIC, JAC and MCC. This indicates that the proposed CSAG and DCCNet can produce more accurate segmentation boundaries, especially for melanoma and seborrheic keratosis images. Therefore, we believe that the CSAG and DCCNet proposed in this paper is an effective supplement to existing dermoscopic image segmentation methods.
Few papers report segmentation performance by type of skin lesion, so comparable data are scarce. Most studies do not divide the test set by lesion type, but evaluate the segmentation performance on all dermoscopic images in the test set together. To compare further with existing dermoscopic image segmentation methods, we compiled the results in Table 11. CSAG and DCCNet achieved high results in ACC and SPE, and also competitive results in the two key indicators DIC and JAC. This further showed that the CSAG and DCCNet proposed in this paper is an effective supplement to existing dermoscopic image segmentation methods.
We used the independent dermoscopic image data set PH2 to verify the cross-data-set performance of our proposed network: the model trained on the ISIC-2017 data set was tested directly on the PH2 data set. As shown in Tables 12 and 13, our proposed model obtained competitive results in the segmentation of benign nevus and melanoma. This showed that our proposed model is robust and performs well across data sets. Figure 12 shows the visualization results for the PH2 data set.

Conclusions
In this article, we propose and implement a novel and robust deep model for skin lesion segmentation called CSAG and DCCNet. In the last step of the encoding path, the model uses densely connected convolution instead of ordinary convolutional layers. To achieve better information fusion, highlight the foreground and reduce the impact of the background, we designed a novel attention-guided filter module, the Channel Spatial Fast Attention-guided Filter (CSFAG for short), and embedded it in the skip connections of the CSAG and DCCNet segmentation network. In addition, the model uses U-Net as the basic network structure and builds an image pyramid input layer on the left side of the encoder; on the right side of the decoder, it introduces side output layers and averages all side output maps to obtain the final prediction map. To verify its effect, we evaluated the model on two publicly available data sets (the ISIC-2017 Challenge and PH2 data sets). The results show that both the densely connected convolution and the CSFAG module improve the segmentation performance of the network, and that their combination, CSAG and DCCNet, is better than some of the latest algorithms for skin lesion segmentation. Through a large number of ablation experiments, we verified that the CSFAG module is superior to other mainstream attention modules and can be combined with multiple basic segmentation networks, effectively improving their segmentation performance. CSAG and DCCNet trained on the ISIC-2017 data set was tested on another publicly available data set, PH2, to verify the robustness and cross-data-set performance of our method. We believe that interference in dermoscopic images is a key factor affecting segmentation performance, and that cleaning the data to remove it is promising future work. We look forward to combining new dermoscopic image preprocessing methods with the proposed model to obtain a better segmentation model, and to applying the model to other medical images to demonstrate the robustness of the method.

Figure 1. Typical skin lesion images that pose different challenges for segmentation.

Figure 4. Visual display of the CSFAG module input and output (a dermoscopy image of melanoma is used as the example; the red line is the ground truth).

Figure 7. Visual description of the input dermoscopy image in different color spaces.

4.3. Ablation Analysis
4.3.1. Discussion on the Number of Densely Connected Blocks at the Bottom

Figure 9. Visualization results of the attention module ablation. The first two rows are Benign Nevus lesions, the middle two rows are Melanoma lesions, and the last two rows are Seborrheic Keratosis lesions. (a) Original skin lesion image; (b) ground truth corresponding to the lesion; (c) SE Block+Baseline model lesion segmentation results; (d) FPA+Baseline model lesion segmentation results; (e) CBAM+Baseline model lesion segmentation results; (f) CSAG and DCCNet (ours) model lesion segmentation results; (g) comparison of the ground truth (red) with the SE Block+Baseline (purple), FPA+Baseline (green), CBAM+Baseline (yellow) and ours (blue) segmentation results. All pictures are preprocessed.

Figure 10. Visualized results of the ROC curve and PR curve.

Figure 12. Visualization results for the PH2 data set. (a) Original skin lesion image; (b) ground truth corresponding to the lesion; (c) CSARM-CNN model lesion segmentation results; (d) comparison of the ground truth (red) and the segmentation results (blue).

Table 1. Distribution of the ISIC Challenge 2017 and PH2 data sets.

Table 2. Determination of the number D of densely connected blocks on the ISIC-2017 data set. Bold data indicates that the value is the maximum value in this indicator.

Table 3. On the ISIC-2017 data set, performance comparison of U-Net, M-Net, M-Net+DC (Dense Convolutions), M-Net+CSFAG and CSAG and DCCNet (ours) in sensitivity, specificity and accuracy. Bold data indicates that the value is the maximum value in this indicator.

Table 4. On the ISIC-2017 data set, performance comparison of U-Net, M-Net, M-Net+DC (Dense Convolutions), M-Net+CSFAG and CSAG and DCCNet (ours) in Jaccard coefficient, Dice coefficient and Matthews correlation coefficient. Bold data indicates that the value is the maximum value in this indicator.

Table 6. Performance evaluation of Jaccard coefficient, Dice coefficient and Matthews correlation coefficient of different attention modules and the CSFAG module on the ISIC-2017 data set, using the baseline as the basic structure. Bold data indicates that the value is the maximum value in this indicator.

Table 7. Performance comparison in sensitivity, specificity and accuracy of the three basic architecture networks, their new network structures, and the CSAG and DCCNet (ours) model on the ISIC-2017 data set. Bold data indicates that the value is the maximum value in this indicator.

Table 8. Performance comparison in Jaccard coefficient, Dice coefficient and Matthews correlation coefficient of the three basic architecture networks, their new network structures, and the CSAG and DCCNet (ours) model on the ISIC-2017 data set. Bold data indicates that the value is the maximum value in this indicator.

Table 9. Performance evaluation of sensitivity, specificity, and accuracy of different attention modules and CSFAG modules on the ISIC-2017 data set using baseline as the basic structure. Bold data indicates that the value is the maximum value in this indicator.

Table 10. Performance evaluation of Jaccard coefficient, Dice coefficient and Matthews correlation coefficient of different modules and CSFAG modules on the ISIC-2017 data set. Bold data indicates that the value is the maximum value in this indicator.

Table 11. Performance evaluation of accuracy, Dice coefficient, Jaccard coefficient, sensitivity, and specificity of different modules and CSFAG modules on the ISIC-2017 data set. Bold data indicates that the value is the maximum value in this indicator.

Table 13. Performance evaluation of Jaccard coefficient, Dice coefficient and Matthews correlation coefficient of different attention modules and CSFAG modules on the ISIC-2017 data set using baseline as the basic structure. Bold data indicates that the value is the maximum value in this indicator.