A Self-Spatial Adaptive Weighting Based U-Net for Image Segmentation

Semantic image segmentation has a wide range of applications. In medical image segmentation, accuracy is even more important than in other areas because the result provides information directly applicable to disease diagnosis, surgical planning, and history monitoring. The state-of-the-art models in medical image segmentation are variants of the encoder-decoder architecture called U-Net. To effectively reflect the spatial features in the feature maps of an encoder-decoder architecture, we propose a spatially adaptive weighting scheme for medical image segmentation. Specifically, because segmentation results are predicted from the feature map through a convolutional layer, the spatial feature is estimated from the feature maps, and the learned weighting parameters are obtained from the computed map. In the proposed networks, the convolutional block for extracting the feature map is replaced with widely used convolutional frameworks: the VGG, ResNet, and Bottleneck ResNet structures. In addition, a bilinear up-sampling method replaces the up-convolutional layer to increase the resolution of the feature map. For the performance evaluation of the proposed architecture, we used three data sets covering different medical imaging modalities. Experimental results show that the network with the proposed self-spatial adaptive weighting block based on the ResNet framework gave the highest IoU and Dice scores in the three tasks compared to other methods. In particular, the segmentation network combining the proposed self-spatially adaptive block and ResNet framework recorded the highest improvements, 3.01% and 2.89% in IoU and Dice scores, respectively, on the nerve data set. Therefore, we believe that the proposed scheme can be a useful tool for image segmentation tasks based on the encoder-decoder architecture.

Semantic image segmentation has a wide range of applications in computer vision, robotics, medicine, and computer graphics. Image segmentation in natural images is used to parse the scene, and its performance has improved to the point where it is applicable to automatic driving and robot sensing, to name a few [6,11]. In medical image segmentation, accuracy is even more important than in other areas because the result gives important information for disease diagnosis, surgical planning, and history monitoring [12].
State-of-the-art scene segmentation frameworks for natural images are based on the fully convolutional network (FCN) [13], and the state-of-the-art models for medical image segmentation are variants of the encoder-decoder architecture called U-Net [14,15]. Encoder-decoder networks for segmentation share a similar structure: skip connections and coarse-grained feature maps. The skip connection-based scheme has been used in many successful image segmentation [14,16] and classification [17] methods. Attention frameworks have been used to highlight salient features in many computer vision tasks, including segmentation, so that the feature maps take the nature of the task into account [18,19,20]. The goal of segmentation is to assign a category label to each pixel in the image, and the segmentation result is obtained from the last feature map via a convolutional layer, so the feature maps in the encoder-decoder architecture should reflect the spatial characteristics of the task.
We propose a spatial adaptive weighting method for the encoder-decoder architecture to reflect the spatial characteristics of feature maps. Since the segmentation result is predicted from the feature map through the convolutional layer [11,14,18], we estimate the spatial characteristics from the feature map and obtain the learned weighting parameters from the computed map. The weighting parameters are multiplied with and added to the feature maps of the architecture.
We propose a self-spatial adaptive weighting scheme in a U-Net architecture (SS-U-Net) and apply it to medical images. The convolution block for extracting feature maps in the proposed network is replaced by the widely used convolution frameworks: the VGG, ResNet, and Bottleneck ResNet structures. The up-convolution layer that increases the feature map resolution is replaced by the bilinear up-sampling method. To evaluate the proposed scheme, we use three medical imaging data sets that cover different imaging modalities: microscopy and ultrasound. Our experiments show that the proposed method has a smaller model size than the standard U-Net while improving performance on the three data sets. In particular, the model with the bottleneck structure, U-Net(B), has the smallest size among the compared methods and is only about 60% of the size of a standard U-Net.

Related Work
For natural image segmentation, the fully convolutional network (FCN) was first introduced by Long et al. [13]. This approach estimates a coarse segmentation map for each fully connected layer and improves the map by combining the fine segmentation score maps. The pyramid scene parsing network (PSPNet) based on FCN was proposed by Zhao et al. [5,6]. Since the receptive field of the FCN method is not sufficient for complex scene images, the fusion information of these receptive fields and other sub-areas is calculated by the pyramid pooling structure and used as a global prior for segmentation. To address the context aggregation problem in semantic segmentation, an object-contextual representation method that characterizes each pixel by representing the corresponding object class was proposed by Yuan et al. [21]. A method to maintain high-resolution representations throughout the entire process was proposed by Wang et al. [22]. To maintain the high-resolution representation, that work introduced high-to-low resolution convolution streams and fused the representations from the multi-resolution streams. A hierarchical attention mechanism for image segmentation was proposed to predict relative weights between adjacent scales and combine multiscale predictions at the pixel level [23].
U-Net [14], an encoder-decoder architecture based on the FCN, has been used in state-of-the-art models for medical image segmentation. It has a symmetric architecture, and the feature map of the encoder is transmitted to the decoder side through a skip connection. That feature map is then concatenated with the up-sampled feature map in the decoder path and passed to the next convolutional layer. To highlight salient features of the network, an attention architecture was applied to the U-Net structure in [18]. For 3D medical images, H-dense U-Net, based on the architecture of DenseNet [1], was proposed for liver and liver tumor segmentation by Li et al. [15]. To reduce the semantic difference between the feature maps of the encoder and decoder sub-networks, a skip pathway method was proposed in U-Net++ [11].

Image Segmentation Problem
Image segmentation can be interpreted as an optimization problem to find a segmented image $U$ in a given image $V$ [12]. Thus, the given image is categorized into a set of optimized classes, and the set of classes, $G$, is defined by
$$G = \{\, g_i \in \mathbb{R} \mid i = 1, \dots, N_c \,\},$$
where $N_c$ is the number of predefined classes, and $\mathbb{R}$ and $g_i$ denote the real numbers and the $i$-th class value, respectively.
Before deep-learning methods emerged, the optimization problem was solved by minimizing a cost function for image segmentation based on the Mumford-Shah functional [12,24,25]. After the major advances in computer vision based on deep convolutional networks, many studies have solved the problem using deep learning and large labeled data sets [13,14].

Self-Spatial Adaptive Weighting
In image generation based on Generative Adversarial Nets (GAN), a semantic segmentation mask has been used as a condition for adjusting the appearance of the generated images [26,27,28,29]. The given semantic segmentation mask is also used as a conditional guide for normalization, which improved the performance of image generation in previous research [29].
To consider a spatial weighting method in the image segmentation task without a given mask, we propose a spatial weighting scheme for image segmentation called self-spatial (SS) adaptive weighting, as shown in Figure 1. Let $m^i \in \mathbb{R}^{C^i \times H^i \times W^i}$ be the output feature map of the $i$-th convolution block in a segmentation network. Here, $C^i$ is the number of channels in the convolution block, and $H^i$ and $W^i$ represent the height and width of the feature map, respectively. In the $i$-th block, the spatial characteristics are estimated from the feature map of each convolution block as
$$f^i = \nu(m^i),$$
where the function $\nu(\cdot)$ is implemented by a single convolutional layer that converts $m^i$ to $f^i$. That is, this function turns the feature map $m^i$ into a spatial feature $f^i$ whose number of channels equals the number of segmentation classes.
To consider the spatial characteristics for the segmentation, the spatial weighting parameters are obtained from the computed map according to Equation (3):
$$\gamma^i_{c,h,w} = \mu(f^i)_{c,h,w}, \qquad \beta^i_{c,h,w} = \sigma(f^i)_{c,h,w}, \tag{3}$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ represent functions that convert $f^i$ into the learned adaptive weighting parameters $\gamma^i_{c,h,w}$ and $\beta^i_{c,h,w}$, respectively. The spatial weighting parameters, $\gamma$ and $\beta$, are multiplied with and added to the feature map of the $i$-th convolution block element by element, giving the weighted feature map $\tilde{m}^i$. The variables $\gamma^i_{c,h,w}$ and $\beta^i_{c,h,w}$ are the learned weighting parameters that depend on the spatial map at the site $c \in C^i$, $h \in H^i$, $w \in W^i$, and $\kappa$ is a predefined constant value.
The learned weighting parameters, $\gamma^i_{c,h,w}$ and $\beta^i_{c,h,w}$, are calculated from the same spatial feature $f^i$, and the learning-based parameter computation is implemented using a two-layer convolutional network, as shown in Figure 2.
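The weighting step can be illustrated with a minimal, framework-free sketch. Several details are assumptions rather than the authors' implementation: every convolution is reduced to a 1 × 1 kernel (a per-pixel linear map over channels), the two-layer network shares its first layer between the γ and β outputs with a ReLU in between, and the weighted map is assumed to be (κ + γ) · m + β, a form chosen so that κ = 1 with zero γ and β leaves the feature map unchanged. The names `conv1x1`, `relu`, and `ss_weighting` are hypothetical.

```python
def conv1x1(x, weight, bias):
    """Per-pixel linear map over channels (a 1x1 convolution).

    x: nested list [C_in][H][W]; weight: [C_out][C_in]; bias: [C_out].
    Returns a nested list [C_out][H][W].
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][c] * x[c][h][w] for c in range(c_in))
                for w in range(width)
            ]
            for h in range(height)
        ]
        for o in range(len(weight))
    ]


def relu(x):
    """Element-wise max(0, v) on a [C][H][W] nested list."""
    return [[[max(0.0, v) for v in row] for row in channel] for channel in x]


def ss_weighting(m, params, kappa=1.0):
    """Self-spatial adaptive weighting of one feature map m ([C][H][W]).

    params is a dict of hypothetical (weight, bias) pairs:
      'nu'     maps C channels to N_c class channels (spatial feature f)
      'shared' assumed first layer of the two-layer network
      'gamma'  second layer producing the scale, back to C channels
      'beta'   second layer producing the shift, back to C channels
    """
    f = conv1x1(m, *params["nu"])                 # f = nu(m)
    hidden = relu(conv1x1(f, *params["shared"]))  # assumed shared layer
    gamma = conv1x1(hidden, *params["gamma"])     # gamma = mu(f)
    beta = conv1x1(hidden, *params["beta"])       # beta = sigma(f)
    # Assumed combination, applied element by element.
    return [
        [
            [(kappa + g) * v + b for v, g, b in zip(m_row, g_row, b_row)]
            for m_row, g_row, b_row in zip(m_ch, g_ch, b_ch)
        ]
        for m_ch, g_ch, b_ch in zip(m, gamma, beta)
    ]
```

Under these assumptions, γ and β have the same shape as the feature map, so each site (c, h, w) receives its own scale and shift derived from the class-sized spatial feature.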

Self-Spatial Adaptive Weighting-Based U-Net Structure for Image Segmentation
In medical image segmentation, the U-Net architecture, consisting of convolution blocks, skip connection paths, and up-convolution steps, has been widely used.
The proposed weighting scheme is integrated into the standard U-Net architecture to apply adaptive weights based on spatial characteristics to the feature map that is passed to the next convolutional block via down-sampling and up-sampling. To prevent the proposed technique from increasing the model size, the up-convolution layer of the standard U-Net is replaced by a bilinear up-sampling method, which reduces the model size of the U-Net, as shown in Figure 3. The proposed self-spatial adaptive weighting-based U-Net, SS-U-Net, is composed of three main blocks: a convolution block, a self-spatial adaptive weighting block, and an up/down-sampling block. The feature map of the segmentation network is extracted by the convolutional block and weighted in the SS block by the learned adaptive scales and biases calculated from the spatial features of the map. In the encoding path of SS-U-Net, the weighted features are transmitted through the skip connections, and their resolutions are reduced in the down-sampling block, which is implemented by max-pooling operations. In the decoding path, the resolution of the weighted features is increased by the bilinear up-sampling method. The up-sampled features and the ones passed through the skip connections are concatenated and propagated to the next convolutional block in the decoding path of the proposed network.
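The saving from replacing the up-convolution can be estimated with simple parameter arithmetic. The sketch below assumes 2 × 2 transposed convolutions with bias and the decoder channel widths of the original U-Net (1024 → 512 → 256 → 128 → 64), which may differ from the configuration used here; bilinear up-sampling itself has no learnable parameters. The helper name `upconv_params` is hypothetical.

```python
def upconv_params(c_in, c_out, kernel=2):
    """Learnable parameters of a transposed convolution with bias."""
    return c_in * c_out * kernel * kernel + c_out


# Assumed (input, output) channels of the four up-convolutions in the decoder.
decoder_channels = [(1024, 512), (512, 256), (256, 128), (128, 64)]

removed = sum(upconv_params(c_in, c_out) for c_in, c_out in decoder_channels)
print(f"parameters removed: {removed:,}")
print(f"approx. size at 4 bytes per parameter: {removed * 4 / 2**20:.2f} MiB")
```

This counts only the removed layers. Because bilinear up-sampling does not change the number of channels, the convolution that follows may need more input channels than in the standard design, so the net saving depends on the full configuration.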
Deep convolutional neural networks have led to breakthroughs in image classification [17,30]. The VGG structure provides better performance with low complexity by using 3 × 3 convolutional kernels instead of larger ones, such as 5 × 5 or 7 × 7; its block is similar in structure to that of a standard U-Net but maintains the same spatial resolution at the input and output. ResNet introduced skip connections across layers so that the network can stack more layers than other networks. In the Bottleneck structure, 1 × 1 convolutional layers are used to reduce the number of channels in the convolution block, which avoids increasing the complexity of the ResNet framework. These structures are shown in Figure 4. These convolutional frameworks serve a purpose similar to that of the convolutional block in a segmentation network. In the proposed scheme, these widely used frameworks are adopted as the convolutional block, and in particular, the bottleneck structure is used to build a small model for segmentation. In this paper, a modified U-Net with bilinear up-sampling is denoted U-Net(·), and the framework used is indicated in parentheses.
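To see why the Bottleneck structure yields a smaller model, the weight counts of the three block types can be compared. The layouts below are assumptions based on the usual definitions (two 3 × 3 convolutions for the VGG and residual blocks, and 1 × 1 → 3 × 3 → 1 × 1 with a fourfold channel reduction for the bottleneck); biases, normalization layers, and any projection shortcut are ignored, and all function names are hypothetical.

```python
def conv_weights(c_in, c_out, kernel):
    """Number of weights in a kernel x kernel convolution, without bias."""
    return c_in * c_out * kernel * kernel


def vgg_block(c_in, c_out):
    # Two stacked 3x3 convolutions.
    return conv_weights(c_in, c_out, 3) + conv_weights(c_out, c_out, 3)


def residual_block(c_in, c_out):
    # Same convolutions as the VGG block; an identity shortcut adds no weights.
    return vgg_block(c_in, c_out)


def bottleneck_block(c_in, c_out, reduction=4):
    # 1x1 reduce, 3x3 at the reduced width, 1x1 expand.
    mid = c_out // reduction
    return (conv_weights(c_in, mid, 1)
            + conv_weights(mid, mid, 3)
            + conv_weights(mid, c_out, 1))


for name, fn in [("VGG", vgg_block), ("ResNet", residual_block),
                 ("Bottleneck", bottleneck_block)]:
    print(f"{name:<10} {fn(256, 256):>9,} weights")
```

These per-block counts are only indicative; the overall model sizes reported in Table 2 depend on the complete network configuration.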

Datasets
As can be seen in Table 1, we use three medical imaging data sets covering a variety of imaging modalities for model evaluation. The first data set, from the Data Science Bowl 2018 segmentation challenge, was acquired under a microscope for segmenting the cell area and consists of nuclei images from different modalities (brightfield and fluorescence) [11]. The other two data sets were acquired with ultrasound imaging equipment: the fetal head data set consists of 999 samples with no growth abnormalities [31], and the nerve data set is used to segment nerve regions. Given the resolution of the smallest image in the evaluation data sets, each image was scaled to 256 × 256 for our implementation. Each data set was split into training (80%), validation (10%), and test (10%) sets.
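An 80/10/10 split of this kind could be produced as in the sketch below. The random shuffle, the fixed seed, and the rule that the remainder goes to the test set are assumptions, since how the samples were assigned is not stated; `split_indices` is a hypothetical helper, and 999 is the number of fetal head samples.

```python
import random


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 80/10/10."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = n_samples * 80 // 100
    n_val = n_samples * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(999)
print(len(train), len(val), len(test))
```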
To evaluate the performance of our segmentation model, we calculate the intersection over union (IoU), also known as the Jaccard index, which is the area of intersection between the predicted segmentation and the ground truth divided by the area of their union. In addition, we employ the Dice coefficient, also known as the F1 score, which is twice the area of intersection divided by the total number of foreground pixels in the predicted and ground-truth masks [32].
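Both metrics can be computed directly from a pair of binary masks, as in the following minimal sketch. The function name `iou_and_dice` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as equally sized 2D lists of 0/1."""
    p = [int(v) for row in pred for v in row]
    t = [int(v) for row in target for v in row]
    intersection = sum(a & b for a, b in zip(p, t))
    pred_area, target_area = sum(p), sum(t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treated here as a perfect match.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [1, 0]]
iou, dice = iou_and_dice(pred, target)
print(f"IoU = {iou:.3f}, Dice = {dice:.3f}")
```

The two scores are related by Dice = 2 · IoU / (1 + IoU), so they rank a single prediction identically, although averages over a data set can differ.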

Training Setup
We implemented the networks using PyTorch [35], an open-source machine learning library for Python. The Adam optimizer [36] was used to train the network weights and biases for 400 epochs with an initial learning rate of 0.001 and a batch size of 16. For the proposed method, κ was set to 1, and the input resolution of the segmentation networks was set to 256 × 256, taking into account the image resolution of the data sets.

Performance Comparison
To evaluate the effect of the self-spatial adaptive weighting method and of the convolutional blocks in the U-Net network, the three kinds of convolution blocks were individually applied to the proposed structure and combined with the proposed SS block. Table 2 compares the segmentation methods in terms of model size and of segmentation results, measured by the IoU and Dice scores, for the three segmentation tasks. The model with the Bottleneck structure, U-Net(B), has the smallest model size, about 60% of the size of the standard U-Net. Similarly, the model with the ResNet framework for convolutional blocks, U-Net(R), is about 95% of the standard size. Applying the SS block to a standard U-Net increases the U-Net's model size by 1.27 MB. Among the convolutional frameworks selected for the proposed method, we evaluate SS-U-Net(R), the ResNet convolutional block-based scheme, which has the best performance in the three segmentation tasks. In addition, the proposed method, SS-U-Net(R), produces the smallest model size, while U-Net++ gives the largest model.
Table 2. Performance comparison for image segmentation using various convolution blocks with the SS scheme. "B", "V", "R", and "SS" represent the Bottleneck structure, VGG blocks, Residual block, and Self-Spatial normalization, respectively. Intersection over union (IoU) and the Dice coefficient are used as comparison metrics (%). Numbers in bold indicate the highest performance in each metric.
As can be seen in Table 2, the network with the Bottleneck framework, U-Net(B), has the lowest IoU and Dice scores in the fetal head and nerve segmentation tasks, and the network with the VGG block, U-Net(V), has the lowest IoU and Dice scores in the cell segmentation. When the proposed SS block is applied, segmentation performance improves for all three types of convolutional blocks.
In particular, the segmentation network using the proposed SS block and Bottleneck framework improved the IoU and Dice scores by 3.01% and 2.89%, respectively. SS-U-Net(R), a network with the proposed SS block and ResNet framework, achieved the highest IoU and Dice scores in the three tasks compared to the other combinations. The network using the ResNet framework improved the segmentation performance by about 1.15% and 0.97% in the IoU and Dice scores, respectively, in the nerve segmentation.
Figure 5 shows the images of the three tasks segmented by networks that combine the proposed SS block with each kind of convolutional framework. To evaluate the segmentation performance, the proposed method is compared with the standard U-Net [14], attention U-Net [18], U-Net++ [11], and a customized wide U-Net architecture, as the authors did in [11]. Wide-U-Net, a model extended from the standard U-Net, has a model size similar to that of the largest model among the compared networks. Table 3 lists the experimental results and shows the effectiveness of the proposed scheme. The compared methods scored very high in fetal head segmentation but had their lowest performance in nerve segmentation. The proposed method scored the highest in both IoU and Dice in the three tasks. The U-Net with the SS block, Att U-Net, and U-Net++ each had the second-highest performance in the three tasks. In addition, SS-U-Net improved performance over the standard U-Net in all three tasks while having a smaller model size. Figure 6 shows the images of the three tasks segmented by the standard U-Net, attention U-Net, U-Net++, wide-U-Net, and the proposed approaches. Figure 6a-i are the results of the first, second, and third tasks, respectively. The yellow area in the figure illustrates the ground truth, and the solid red line represents the contour of the results. It can be easily seen that the network with the proposed scheme has better segmentation performance than the compared methods.

Conclusions
In this paper, a self-spatial adaptive weighting-based U-Net for image segmentation was presented. Widely used convolutional frameworks were employed for the proposed structure: the three kinds of convolution blocks were individually applied to it, and their performances were compared. The experimental results showed that the proposed method could be effectively applied to the existing methods and improved their performance. In particular, it was verified that the proposed scheme, SS-U-Net, was efficient and could provide the best segmentation result by combining the self-spatial adaptive weighting scheme with the ResNet convolution block. Moreover, the proposed approach with a bottleneck structure had the smallest model size among the compared methods, and its performance was improved by the proposed block. Furthermore, the proposed method outperformed the compared methods on different segmentation targets and medical imaging modalities in terms of the IoU and Dice metrics.
It is worth noting that the proposed scheme provided a compact model for SS-U-Net with the Bottleneck block structure while maintaining a performance similar to that of the standard U-Net. Therefore, we believe that the proposed scheme can be a useful tool for image segmentation.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the policy of the institute.

Conflicts of Interest:
The authors declare no conflict of interest.