A Semantic Segmentation Algorithm Using FCN with Combination of BSLIC

Abstract: An image semantic segmentation algorithm is developed that integrates the fully convolutional network (FCN) with the recently proposed simple linear iterative clustering based on a boundary term (BSLIC). To improve segmentation accuracy, the developed algorithm combines the FCN semantic segmentation results with the superpixel information acquired by BSLIC. During the combination process, superpixel semantic annotation is newly introduced and realized by four criteria, which annotate each superpixel region according to the FCN semantic segmentation result. The developed algorithm can not only accurately identify the semantic information of targets in the image, but also achieve high accuracy in localizing small edges. The effectiveness of the algorithm is evaluated on the PASCAL VOC 2012 dataset. Experimental results show that the developed algorithm improves target segmentation accuracy in comparison with the traditional FCN model. With the BSLIC superpixel information involved, the proposed algorithm achieves improvements of 3.86%, 1.41%, and 1.28% in pixel accuracy (PA) over FCN-32s, FCN-16s, and FCN-8s, respectively.


Introduction
Research on image semantic segmentation has developed over decades [1][2][3][4][5][6][7][8][9]. Image semantic segmentation is a process in which an image is divided into several non-overlapping, meaningful regions that are labeled with semantic annotations. In early studies, fully supervised semantic segmentation was developed using probabilistic graphical models (PGMs), such as the generative model [1] and the discriminative model [10]. These models are based on the assumption of conditional independence, which can be too restrictive for many applications. To address this problem, the conditional random field (CRF) model [11] was proposed, which allows us to model the correlation between variables. Moreover, the CRF model has the ability to exploit local texture features and contextual information. However, the CRF model cannot acquire the overall shape features of the image, which can cause misunderstandings during the analysis of a single target. Currently, one of the most popular deep learning techniques for semantic segmentation is the fully convolutional network (FCN) [4]. Unlike traditional recognition [12][13][14][15][16][17] and segmentation [18,19] methods, FCN can be regarded as a CNN variant that has the ability to extract the features of objects in the image. By replacing the fully connected layers with convolutional layers, a classical CNN model, such as AlexNet [20], VGGNet [21], or GoogLeNet [22], is transformed into an FCN model. Compared with CNN-based methods, the obtained model achieves a significant improvement in segmentation accuracy. Despite the power and flexibility of the FCN model, it still has problems that hinder its application in certain situations: small objects are often ignored and misclassified as background, and the detailed structures of an object are often lost or smoothed.
In 2003, Ren et al. [23] proposed a segmentation method named superpixels. A superpixel can be treated as a set of pixels that are similar in location, color, texture, etc. It reduces the number of entities to be labeled semantically and enables feature computation on bigger, more meaningful regions. There are several methods to generate superpixels, such as Turbopixels [24], SuperPB [25], SEEDS [26], and SLIC [27]. By comparison, SLIC is significantly more efficient and accurate than the others. In previous work, BSLIC [28,29] was presented as an improved superpixel generation method based on SLIC. BSLIC generates superpixels with a better trade-off between the regularity of shape and the fitting degree of the edge, so as to keep the detailed structures of an object.
In this paper, we propose a semantic segmentation algorithm that combines FCN with BSLIC. FCN is typically good at extracting the overall shape of an object. However, FCN tends to overlook the detailed structures of an object, leaving room for improvement. Meanwhile, BSLIC is used to efficiently align the SLIC superpixels to image edges and to obtain better segmentation results. In contrast to SLIC, BSLIC adopts a simpler edge-detection algorithm to obtain complete boundary information. To combine the FCN semantic segmentation result with the BSLIC superpixel information, we newly introduce superpixel semantic annotation, which is realized by four criteria. With the four criteria, each superpixel region is annotated according to the FCN semantic segmentation result. By this means, the detailed structures of an object can be kept to some extent. The proposed algorithm can not only extract the high-level abstract features of the image, but also take full advantage of the details of the image to improve the segmentation accuracy of the model. The rest of the paper is organized as follows: In Section 2, the theory related to the proposed algorithm is introduced, including the fully convolutional network and BSLIC superpixel segmentation. The details of the algorithm are described in Section 3, and the experimental results are given in Section 4. In Section 5, an overall conclusion is drawn.

FCN
As a CNN variant, FCN is widely used for pixel-level classification. The output of FCN is a category matrix carrying the semantic information of the image. The core ideas of the FCN model are: (1) converting the fully connected layers to convolution layers with kernel size 1 × 1; and (2) using a deconvolution or upsampling operation to generate a category matrix with the same size as the input image. By this means, the popular FCN-8s, FCN-16s, and FCN-32s are defined in [4], and their generation procedure is described in Figure 1. The first row of Figure 1 shows that the feature maps are derived from the input image by repeated convolution and pooling operations. The convolution calculation extracts features without changing the resolution of the image or feature maps, while each pooling operation halves the resolution of the feature map. Therefore, the resolution of the nth (n = 1, 2, . . . , 5) feature map is 1/2^n of the input image. Based on these feature maps, the semantic segmentation result, at the same resolution as the input image, is obtained by using upsampling and multiscale fusion techniques. Upsampling increases the resolution of a feature map; multiscale fusion acquires finer details from feature maps of different scales. Three different combinations of upsampling and multiscale fusion are used to implement FCN-8s, FCN-16s, and FCN-32s. As shown in Figure 1, the feature maps used to generate the semantic segmentation results come from pool3, pool4, and conv7, which are at 1/8, 1/16, and 1/32 of the input resolution, respectively. The detailed description is as follows:

• FCN-32s
The conv7 feature map is directly upsampled 32 times to obtain the semantic segmentation result.

• FCN-16s
The conv7 feature map is first upsampled two times. Then, the two-time-upsampled conv7 feature map and the pool4 feature map are fused by using the multiscale fusion technique. Finally, the fusion result is upsampled 16 times to obtain the semantic segmentation result.

• FCN-8s
The conv7 feature map is first upsampled four times, and the pool4 feature map is upsampled two times. Next, the four-time-upsampled conv7 feature map, the two-time-upsampled pool4 feature map, and the pool3 feature map are fused by using the multiscale fusion technique. Finally, the fusion result is upsampled eight times to obtain the semantic segmentation result.
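The upsampling-and-fusion paths above can be sketched in a few lines. The following is a simplified illustration, not the original implementation: nearest-neighbor upsampling (via np.kron) stands in for FCN's learned bilinear deconvolution, fusion is plain elementwise addition of per-class score maps, and the inputs are random toy score maps.

```python
import numpy as np

def upsample(score_map, factor):
    """Nearest-neighbor upsampling of an (H, W, C) score map by an integer
    factor. The original FCN uses a learned (bilinear-initialized)
    deconvolution; this is a simplified stand-in."""
    return np.kron(score_map, np.ones((factor, factor, 1)))

def fcn8s_fusion(conv7, pool4, pool3):
    """Sketch of the FCN-8s path: conv7, pool4, pool3 are per-class score
    maps at 1/32, 1/16 and 1/8 of the input resolution, each shaped
    (h, w, num_classes)."""
    fused = upsample(conv7, 4) + upsample(pool4, 2) + pool3  # fuse at 1/8 resolution
    return upsample(fused, 8)                                # back to full resolution

# Toy example: 21 classes (PASCAL VOC) and a 32 x 32 input image.
conv7 = np.random.rand(1, 1, 21)   # 1/32 resolution
pool4 = np.random.rand(2, 2, 21)   # 1/16 resolution
pool3 = np.random.rand(4, 4, 21)   # 1/8 resolution
scores = fcn8s_fusion(conv7, pool4, pool3)
print(scores.shape)  # (32, 32, 21)
```

FCN-32s corresponds to `upsample(conv7, 32)` alone, and FCN-16s to fusing conv7 with pool4 before a 16-time upsampling.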

BSLIC
BSLIC is derived from SLIC, a method to generate superpixels. Essentially, BSLIC is a local K-means clustering algorithm. Assuming that the number of pixels in the given image is N, the number of superpixels to be generated is k, the coordinates of the pixels in the image are x and y, and the components of the CIELAB color space are l, a, and b, the SLIC algorithm proceeds as follows:

• Step 1:
The clustering centers C_i = [l_i, a_i, b_i, x_i, y_i]^T are uniformly initialized in the horizontal and vertical directions separately, with horizontal interval S_h and vertical interval S_v.

• Step 2:
To avoid selecting noise points in the image, each clustering center is moved to the point with the smallest gradient value in its neighborhood.

• Step 3:
In the 2S_0 × 2S_0 neighborhood of each image pixel i, the clustering center with the smallest distance to pixel i is found, and its class tag is assigned to pixel i.

• Step 4:
Each clustering center is updated to the mean of the feature vectors of all pixels in its category.

• Step 5:
Repeat Steps 3 and 4 until the clustering-center error between the last two iterations is no more than 5%.
BSLIC improves on SLIC in three aspects: (1) initializing cluster centers in a hexagonal, rather than square, distribution; (2) additionally choosing some specific edge pixels as cluster centers; and (3) incorporating a boundary term into the distance measurement during K-means clustering. Figure 2 shows the segmentation results of the BSLIC algorithm under different input parameters k and m, where k is the number of superpixels to be generated and m is the weighting factor between the color Euclidean distance and the spatial Euclidean distance. Evaluating the segmentation results leads to the following conclusions: (1) a larger m achieves better regularity for superpixels, while a smaller m achieves better adherence to image boundaries; and (2) a larger k aligns to more detailed boundaries.
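The clustering loop of Steps 1-5 and the role of the weighting factor m can be illustrated with a compact sketch. This is not the BSLIC implementation: centers start on a square grid (BSLIC uses a hexagonal grid plus extra edge-pixel centers), Step 2's gradient perturbation and the boundary term are omitted, and the input's own channels stand in for the CIELAB components.

```python
import numpy as np

def slic_sketch(image, k, m=10.0, iters=5):
    """Minimal SLIC-style local K-means over an (H, W, C) float image."""
    H, W, C = image.shape
    S = int(np.sqrt(H * W / k))                 # Step 1: grid interval
    ys = np.arange(S // 2, H, S)
    xs = np.arange(S // 2, W, S)
    centers = np.array([[y, x] + list(image[y, x]) for y in ys for x in xs],
                       dtype=float)             # rows: [y, x, c1, ..., cC]
    labels = -np.ones((H, W), dtype=int)
    yy, xx = np.mgrid[0:H, 0:W]
    for _ in range(iters):                      # Steps 3-5
        dist = np.full((H, W), np.inf)
        for idx, c in enumerate(centers):
            cy, cx = c[0], c[1]
            # Step 3: search only a local window around each center
            y0, y1 = int(max(cy - S, 0)), int(min(cy + S + 1, H))
            x0, x1 = int(max(cx - S, 0)), int(min(cx + S + 1, W))
            patch = image[y0:y1, x0:x1].astype(float)
            dc = np.linalg.norm(patch - c[2:], axis=2)              # color distance
            ds = np.hypot(yy[y0:y1, x0:x1] - cy, xx[y0:y1, x0:x1] - cx)  # spatial
            d = np.hypot(dc, (m / S) * ds)      # m weights color vs. space
            better = d < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][better] = d[better]
            labels[y0:y1, x0:x1][better] = idx
        for idx in range(len(centers)):         # Step 4: recompute centers
            sel = labels == idx
            if sel.any():
                centers[idx, :2] = [yy[sel].mean(), xx[sel].mean()]
                centers[idx, 2:] = image[sel].mean(axis=0)
    return labels

labels = slic_sketch(np.random.rand(40, 40, 3), k=16)
```

A larger `m` inflates the spatial term, yielding more regular superpixels; a smaller `m` lets the color term dominate, so superpixels adhere more closely to image boundaries, matching the conclusions drawn from Figure 2.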

Overall Framework
To make the FCN model achieve a more accurate and detailed description of target edges, this paper integrates the superpixel edges of the image into the FCN model. The block diagram of the proposed algorithm based on FCN and BSLIC superpixels is shown in Figure 3. First, the FCN model based on the VGG-16 network is trained on the PASCAL VOC 2012 dataset to develop the FCN semantic segmentation model. Then, the trained FCN semantic segmentation model is used to segment the image, producing the pixel-level classification map. At the same time, BSLIC superpixel segmentation is performed on the same image, generating a superpixel segmentation map. Finally, the results of FCN and BSLIC are combined based on superpixel semantic annotation: each BSLIC superpixel is semantically annotated using the pixel-level classification map, and the resulting pixel-level classification of the image is the revised semantic segmentation result, which combines the advantages of high-level semantic information and good edge information.

Superpixel Semantic Annotation
Based on the framework mentioned above, superpixel semantic annotation (Algorithm 1) combines the FCN pixel-level classification map and the BSLIC superpixel segmentation map to produce the revised semantic segmentation result.

Algorithm 1: Superpixel semantic annotation

1. Acquire the FCN pixel-level classification map.

2. Acquire the BSLIC superpixel segmentation map. The parameters are as follows: the collection of superpixels Sp = {Sp_1, Sp_2, . . . , Sp_k, . . . , Sp_K}; the number of semantic categories in superpixel Sp_k is A; the proportion of pixels of semantic category t (t = 0, 1, . . . , 20) among all pixels in superpixel Sp_k is C_t.

3. Generate the superpixel semantic annotation using the four criteria:
For k = 1:K
  Criterion 1: if there is no image edge in superpixel Sp_k and A = 1, then label the superpixel with the FCN semantic result.
  Criterion 2: if there is no image edge in superpixel Sp_k and A > 1, then use the t with the largest C_t to label the superpixel.
  Criterion 3: if there is an image edge in superpixel Sp_k and A = 1, then label the superpixel with the FCN semantic result.
  Criterion 4: if there is an image edge in superpixel Sp_k and A > 1, then: if some C_t > 80% in superpixel Sp_k, use the t with the largest C_t to label the superpixel; else maintain the FCN semantic segmentation result.
End

4. Output the superpixel semantic annotation result.
The core of superpixel semantic annotation is the four criteria, which simplify the analysis and computation. According to the four criteria, the combination of FCN and BSLIC can be classified into four situations. Based on this simple and complete classification, corresponding solutions are designed to combine FCN and BSLIC. The four criteria are described as follows:

• Criterion 1:
Situation: there is no image edge in the superpixel, and the FCN semantics of all pixels are the same; Solution: use the FCN pixel-level classification map to label all the pixels in this superpixel.

• Criterion 2:
Situation: there is no image edge in the superpixel, but the FCN semantics of the pixels are different; Solution: use the category with the largest proportion C_t (t = 0, 1, 2, 3, . . .) to label all of the pixels in this superpixel.

• Criterion 3:
Situation: there are image edges in the superpixel, but the FCN semantics of all the pixels are the same; Solution: use the FCN pixel-level classification map to label the pixels in this superpixel.
• Criterion 4:
Situation: there are image edges in the superpixel, and the FCN semantics of the pixels are different. Solution 1: if any proportion C_t is more than 80%, label all of the pixels in this superpixel with the category of the largest proportion C_t. Solution 2: if no proportion C_t is more than 80%, maintain the FCN pixel-level classification result.
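The four criteria can be sketched as a short routine, assuming the FCN classification map, the BSLIC superpixel index map, and a boolean edge map are available as arrays (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def annotate_superpixels(fcn_labels, sp_labels, edge_map, thresh=0.8):
    """Sketch of the four-criteria superpixel semantic annotation.
    fcn_labels: (H, W) FCN pixel-level class map (0 = background, 1-20 = objects).
    sp_labels:  (H, W) BSLIC superpixel index map.
    edge_map:   (H, W) boolean map of detected image edges.
    thresh is the 80% majority threshold of Criterion 4."""
    result = fcn_labels.copy()
    for sp in np.unique(sp_labels):
        sel = sp_labels == sp
        cats, counts = np.unique(fcn_labels[sel], return_counts=True)
        has_edge = edge_map[sel].any()
        if len(cats) == 1:
            # Criteria 1 and 3 (A = 1): keep the FCN label, edge or no edge
            continue
        if not has_edge:
            # Criterion 2 (no edge, A > 1): majority vote over C_t
            result[sel] = cats[counts.argmax()]
        elif counts.max() / counts.sum() > thresh:
            # Criterion 4, solution 1 (edge, A > 1, dominant label above 80%)
            result[sel] = cats[counts.argmax()]
        # Criterion 4, solution 2: edge present, no dominant label
        # -> the FCN pixel-level result is kept unchanged
    return result
```

For instance, an edge-free superpixel whose pixels are 75% aeroplane and 25% background is relabeled entirely as aeroplane (Criterion 2), while the same mix inside an edge-containing superpixel keeps the original FCN labels (Criterion 4, solution 2).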
As shown in Figure 4, an example of superpixel semantic annotation is given to explain how the four criteria are used. Based on the superpixel segmentation map (Figure 4a) obtained by BSLIC, the detailed explanation is as follows. In Figure 4b, the situation of superpixel Sp_219, which is a local area of the aeroplane, meets Criterion 1. All 447 pixels of superpixel Sp_219 are labeled as aeroplane (category 1) in the FCN pixel-level classification map, so the pixels in this superpixel can be labeled as aeroplane directly.
In Figure 4c, the situation of superpixel Sp_400, which lies at the junction of the wing and the background, meets Criterion 2. The FCN model labels 75.5% of the pixels in the superpixel as background, and 24.5% as aeroplane. Therefore, the pixels in this superpixel are classified as background.
In Figure 4d, the situation of superpixel Sp_178, which is an area of the fuselage, meets Criterion 3. The FCN model labels the pixels in this superpixel as aeroplane, so they can be labeled as aeroplane directly.
In Figure 4e, the situation of superpixel Sp_233, which lies on a fuzzy edge of the image, meets Criterion 4. 499 pixels (91.6%) in this superpixel are labeled by FCN as background. Consequently, the superpixel can be labeled as background directly.

Experimental Results
Experiments are performed in this section to evaluate the proposed algorithm, and the experimental data are analyzed in detail. The experimental platform is: Intel 6700K @ 4.00 GHz CPU, 32 GB memory, Samsung 840 Pro SSD, Windows 7 x64 operating system, Matlab 2014a, and the MatConvNet deep learning toolbox (version beta23).

Training and Testing of FCN Model
The dataset used in our work is PASCAL VOC 2012, with a total of 17,125 images. All targets are divided into 20 foreground categories, plus the background labeled as category 0, as shown in Figure 5. To facilitate the expression of image semantics, PASCAL VOC 2012 assigns a specific color label and number to each category, so that different categories of targets can be distinguished by color or number in semantic segmentation results. Figure 5 shows instance images and the corresponding color labels of the 20 categories. The dataset is used to train and test the FCN network, and the detailed process is as follows. First, divide the PASCAL VOC 2012 dataset. The original dataset contains fewer than 5000 pixel-level labeled images; with the extra annotations provided by [30], there are 10,582 pixel-level labeled images. We take 9522 (90%) images as the training dataset, 530 (5%) images as the validation dataset, and the remaining 530 (5%) images as the test dataset. Second, reconstruct the VGG-16 network for FCN. By converting the fully connected layers of the VGG-16 network into convolution layers, an FCN-VGG16 model is acquired. After that, the FCN model is initialized with the pre-trained full 16-layer version [21].
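The 90/5/5 division described above can be sketched as follows (the seed and helper name are illustrative; the paper does not specify how the split was drawn beyond the counts):

```python
import random

def split_dataset(image_ids, n_val=530, n_test=530, seed=0):
    """Shuffle the image IDs and split off fixed-size validation and test
    sets; the remainder (9522 of 10,582 here) forms the training set."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)        # deterministic shuffle
    n_train = len(ids) - n_val - n_test
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_dataset(range(10582))
print(len(train), len(val), len(test))  # 9522 530 530
```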
Third, in order to obtain a better training result, we run the FCN for 50 epochs on the training dataset, using the pixel-level labels to supervise the FCN-VGG16 model. The classification accuracy over the training epochs is shown in Figure 6. The line marked 'train' shows the classification accuracy on the training dataset over the epochs, and the line marked 'val' shows the classification accuracy on the validation dataset. Due to the randomness of the model parameter initialization, the classification accuracy on the validation dataset may be higher than that on the training dataset in the initial epochs. Because the knowledge learned from the training dataset does not fit the validation dataset perfectly, the 'val' curve falls below the 'train' curve as training proceeds, leading to an intersection point around a certain epoch.
Finally, the validation dataset is used to evaluate the training performance. Semantic segmentation results are obtained from FCN-8s, FCN-16s, and FCN-32s, respectively; some of them are shown in Figure 7. The successfully trained model is used for semantic segmentation in the subsequent process.

Qualitative Comparison
Due to its outstanding performance, FCN-8s is selected to be combined with BSLIC in our work. The proposed algorithm is compared with three other methods: FCN-8s, FCN-16s, and FCN-32s. All images of the test dataset from PASCAL VOC 2012 are processed by the four algorithms. Given the better performance observed in Figure 7, FCN-8s is selected as the reference for comparison. Figure 8 compares the FCN semantic segmentation results with ours. From Figure 8c,d, it is observed that the proposed algorithm reaches the same semantic recognition accuracy as FCN-8s. At the same time, the improved algorithm is more accurate when dealing with the small edges of a target, such as the wheel of the aeroplane, the bird's feet, the sheep's ear, and the rear-view mirror of the car in Figure 8d. In FCN-8s, such small structures are often ignored and misclassified as background. In conclusion, the combination of FCN and BSLIC outperforms FCN-8s, FCN-16s, and FCN-32s, especially in the ability to keep the detailed structure of an object. Table 1 shows the time consumption of the four methods. The proposed method allows FCN and BSLIC to run at the same time, so superpixel generation adds no waiting time; only a little extra time is required to combine FCN and BSLIC using the four criteria, which is why the four criteria are designed to be simple and efficient.

Quantitative Comparison
In practical engineering, pixel accuracy (PA), intersection over union (IoU), and mean intersection over union (mIoU) are used to evaluate the performance of a semantic segmentation technique. PA measures the accuracy of object contour segmentation. IoU measures the accuracy of an object detector on a particular dataset. mIoU, the average of the per-class IoU, reflects the overall enhancement of semantic segmentation accuracy. The formulas for these metrics can be found in [31] and are used in this paper. It is assumed that the total number of classes is (k + 1) and that p_ij is the number of pixels of class i inferred to be of class j; p_ii represents the number of true positives, while p_ij and p_ji are usually interpreted as false positives and false negatives, respectively.
• Pixel Accuracy (PA):
It computes the ratio between the number of properly classified pixels and the total number of pixels:

PA = Σ_{i=0}^{k} p_ii / Σ_{i=0}^{k} Σ_{j=0}^{k} p_ij

• Intersection Over Union (IoU):
The IoU is used to measure whether the target in the image is detected:

IoU_i = p_ii / (Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji − p_ii)

• Mean Intersection Over Union (mIoU):
This is the standard metric for segmentation purposes. It compares the predicted segmentation against the ground truth and is computed by averaging the per-class IoU over the (k + 1) classes:

mIoU = (1 / (k + 1)) Σ_{i=0}^{k} IoU_i
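All three metrics above can be computed from a single confusion matrix; a minimal sketch (the confusion matrix here is a toy example, not data from the paper):

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, per-class IoU, and mIoU from a (k+1) x (k+1) confusion matrix,
    where conf[i, j] is the number of pixels of class i predicted as class j
    (p_ij in the notation above)."""
    tp = np.diag(conf).astype(float)                  # p_ii, true positives
    pa = tp.sum() / conf.sum()                        # pixel accuracy
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp  # FP + FN + TP per class
    iou = tp / np.where(union > 0, union, 1)          # guard empty classes
    return pa, iou, iou.mean()                        # mIoU = mean of IoU

# Toy 3-class confusion matrix (rows: ground truth, columns: prediction).
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 3, 45]])
pa, iou, miou = segmentation_metrics(conf)
print(round(pa, 3))  # 0.9
```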
Based on the previous test dataset from PASCAL VOC 2012, PA, IoU, and mIoU are calculated by (3)-(5), respectively. Table 2 gives the IoU scores of SegNet [32], UNIST_GDN_FCN [33], and our proposed method, while Table 3 gives the IoU scores of FCN-32s, FCN-16s, FCN-8s, and our method. Both Tables 2 and 3 show that our method has an advantage in most of the classification indicators. Because Table 3 is consistent with the results obtained in Table 2, only Table 3 is used to analyze the performance of the proposed method, as follows. The comparison of IoU between our proposed algorithm and the FCN models is shown in Figure 9, which is plotted from the data in Table 3 and gives the IoU of all 21 categories; the abscissa ranges from 0 to 20, representing the different categories in PASCAL VOC 2012. For all four methods, the highest and the lowest IoU scores are observed in category 0 and category 9, respectively. Compared with FCN-32s, FCN-16s, and FCN-8s, the proposed algorithm has a higher IoU score in the categories marked with the superscript '*' in Table 3. Figure 10 shows the difference in IoU score between our algorithm and the FCN models. Our algorithm clearly outperforms the FCN models in the large majority of categories. In particular, the IoU scores of the FCN models in category 10 are 64.79%, 69.18%, and 69.73%, which are 6.25%, 1.86%, and 1.31% lower than that of our algorithm, respectively. A detailed comparison is as follows: Table 4 shows the pair-wise comparison of IoU scores between our method and the three other methods. Compared with FCN-32s, 20 categories are improved by our proposed method, an improvement ratio of 95.24%. Compared with FCN-16s, 15 categories are improved, a ratio of 71.43%. Compared with FCN-8s, 16 categories are improved, a ratio of 76.19%.
These comparisons prove that our method achieves good performance. As shown in Table 5, our algorithm achieves an mIoU score of 62.8%, which outperforms the others. Our algorithm achieves a PA score of 77.14%, which is 3.86%, 1.41%, and 1.28% higher than that of FCN-32s, FCN-16s, and FCN-8s, respectively. The mIoU scores show that the proposed algorithm delivers an average improvement in classification accuracy, while the PA scores indicate that our algorithm fits object contours better. The enhancement of these indicators benefits from the application of BSLIC and the four criteria.

Conclusions
In this paper, an improved image semantic segmentation algorithm is proposed. The BSLIC superpixel edge information is combined with the original FCN model, with superpixel semantic annotation applied in the combination process. The superpixel semantic annotation abides by four criteria, so that the detailed structures of an object can be kept to some extent. The four criteria cover every situation of the superpixel semantic annotation process and give each one a solution to acquire a clearer edge of the target. The algorithm not only inherits the high-level feature extraction ability of the FCN model, but also takes full advantage of the low-level features extracted by BSLIC. Compared with the FCN semantic segmentation models, the pixel accuracy of the proposed algorithm on the PASCAL VOC 2012 dataset is 77.14%, which is 3.86%, 1.41%, and 1.28% higher than that of FCN-32s, FCN-16s, and FCN-8s, respectively. The experimental results prove that our method is effective. The proposed method combines FCN with BSLIC by using the four criteria, and the introduction of other modules might provide further optimized solutions to extend its application.

Conflicts of Interest:
The authors declare no conflict of interest.