Semantic Image Segmentation with Deep Convolutional Neural Networks and Quick Shift

Abstract: Semantic image segmentation, as one of the most popular tasks in computer vision, has been widely used in autonomous driving, robotics and other fields. Currently, deep convolutional neural networks (DCNNs) are driving major advances in semantic segmentation due to their powerful feature representation. However, DCNNs extract high-level feature representations by strided convolution, which makes it impossible to segment foreground objects precisely, especially when locating object boundaries. This paper presents a novel semantic segmentation algorithm combining DeepLab v3+ with the superpixel segmentation algorithm quick shift. DeepLab v3+ is employed to generate a class-indexed score map for the input image, and quick shift is applied to segment the input image into superpixels. Their outputs are then fed into a class voting module to refine the semantic segmentation results. Extensive experiments on the proposed method are performed on the PASCAL VOC 2012 dataset, and the results show that the proposed method can provide a more efficient solution.


Introduction
Semantic image segmentation is a typical computer vision problem. Its task is to assign a category to each pixel in an image according to the object of interest [1]. In the past several years, thanks to large amounts of training images and high-performance GPUs, deep learning techniques, in particular supervised approaches such as deep convolutional neural networks (DCNNs), have achieved remarkable success in various high-level computer vision tasks, such as image classification, object detection and semantic segmentation [2][3][4]. The key advantage of these deep learning techniques is that they learn high-level feature representations in an end-to-end fashion, which are more discriminative than traditional hand-crafted ones. Inspired by the success of deep learning techniques in image classification, researchers explored the capabilities of such networks for pixel-level annotation and proposed many prominent deep learning networks for semantic segmentation.
Nowadays, most DCNNs for semantic segmentation are based on a common pioneer: fully convolutional network (FCN) proposed by Long et al. [5]. It transforms the well-known DCNNs used for image classification, such as AlexNet [6], VGG [7], GoogleNet [8], into fully convolutional ones by replacing the fully connected layers with convolutional ones in order to output spatial feature maps instead of classification probabilities. Those feature maps are then decoded [5] to produce dense pixel-level annotations. FCN is considered a milestone in deep learning techniques for semantic segmentation, since it demonstrates how DCNNs can be trained end-to-end to solve this problem, efficiently learning how to produce dense pixel-level predictions for input of arbitrary sizes. It achieved 20% relative improvement in segmentation accuracy over traditional methods on the PASCAL VOC 2012 dataset [9].
The DeepLab series [10][11][12][13] comprises successful and popular DCNN-based semantic segmentation models. DeepLab v1 [10] introduces atrous convolution [14] into DCNNs to effectively enlarge the receptive field without increasing the number of network parameters. To localize object boundaries, it combines the last layer of the DCNN with a fully connected CRF [15]. Due to these two advanced techniques, it reached 71.6% mIOU on the PASCAL VOC 2012 dataset. Building on DeepLab v1, DeepLab v2 [11] further proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. By employing multiple parallel atrous convolutional layers with different dilation rates, ASPP can exploit multi-scale features, thus capturing objects as well as image context at multiple scales. DeepLab v2 combines atrous convolution, ASPP and the fully connected CRF, achieving 79.9% mIOU on the PASCAL VOC 2012 dataset. DeepLab v3 [12] incorporates an improved ASPP, batch normalization and a better way to encode multi-scale context, reaching 85.7% mIOU on the PASCAL VOC 2012 dataset. The improved ASPP concatenates image-level features, a 1×1 convolution and three 3×3 atrous convolutions with different dilation rates, and batch normalization is used after each of the parallel convolution layers; the fully connected CRF is abandoned in DeepLab v3. DeepLab v3+ [13] uses ASPP together with an Encoder-Decoder structure, where the Decoder module refines the segmentation results at the pixel level. It further explores the Xception model [16] and applies depthwise separable convolution [17] to ASPP and the Decoder module. On the PASCAL VOC 2012 test set and the Cityscapes dataset, it achieved 89.0% and 82.1%, respectively.
Superpixels have been commonly used as a preprocessing step for image segmentation since they were first introduced by Ren et al. [18] in 2003, because they reduce the number of inputs for subsequent processing steps and adhere well to object boundaries. A superpixel algorithm groups pixels into perceptually meaningful atomic regions, enabling feature computation on a meaningful image representation. In the past decades, a large number of superpixel algorithms have been proposed. Quick shift [19] is a mode-seeking clustering algorithm with relatively good boundary adherence: it first initializes the segmentation using medoid shift [20], then moves each data point in the feature space to the nearest neighbor that increases the Parzen density estimate [21]. Simple Linear Iterative Clustering (SLIC) [22] adopts a k-means clustering approach with a distance metric that depends on both spatial and intensity differences to efficiently generate superpixels. Felzenszwalb's method [23] is a graph-based approach to image segmentation: it performs agglomerative clustering of pixels as nodes on a graph such that each superpixel is the minimum spanning tree of its constituent pixels.
Although DeepLab v3+ has achieved good performance in semantic image segmentation, it still has some shortcomings. One of the main problems is that its DCNN consists of strided pooling and convolution layers, which increase the receptive field and aggregate context information while discarding boundary information. However, semantic segmentation needs exact alignment of class maps and thus needs the boundary information to be preserved. To tackle this challenging problem, this paper presents a novel method to refine the object boundaries of the segmentation results output by DeepLab v3+, uniting the benefits of DeepLab v3+ with the superpixel segmentation algorithm quick shift [19]. The main steps are as follows: (i) using DeepLab v3+ to obtain a class-indexed score map of the same size as the input image; (ii) segmenting the input image into superpixels by quick shift; (iii) feeding the outputs of these two modules into the class voting module to refine the object boundaries of the segmentation result. The proposed method improves the semantic segmentation results both qualitatively and quantitatively, especially on object boundaries. Experiments on the PASCAL VOC 2012 dataset verify the effectiveness of the proposed method.
The paper is organized as follows. Section 2 describes the proposed method in detail. Section 3 presents the experimental results of the proposed method on the PASCAL VOC 2012 dataset, and comparisons with other methods. Section 4 discusses and analyses the experimental results. Finally, conclusions are drawn in Section 5.

Methodology
In this section, a robust framework is proposed for semantic image segmentation. It obtains a class-indexed score map with DeepLab v3+ and segments the input image into superpixels with quick shift. The object boundaries of the segmentation result are then refined by a class voting module.

Motivation
(1) It is hard for DCNN-based semantic segmentation methods to produce results with accurate object boundaries. There are two main reasons. First, GPU memory is limited, so DCNNs adopt strided pooling and convolution to reduce the size of feature maps. Second, it is difficult to assign labels to pixels on object boundaries because the cascaded feature maps generated by DCNNs blur them. In order to precisely segment foreground objects from background, DCNNs should have the following two properties. First, they should classify object boundaries precisely. Second, for pixels on object boundaries, the class score computed for the target class should be close to the class scores of other classes.
As shown in Figure 1, the DCNN centers on the red and yellow points of the input image, respectively, to extract features in their respective receptive fields. At the aeroplane's boundaries, the image regions corresponding to the aeroplane and the background pixels overlap greatly, so the features extracted by the DCNN are very similar, making pixels on the aeroplane's boundaries difficult to classify. The softmax loss function used for semantic segmentation is often simply formulated as

L = -\log \frac{e^{x_k}}{\sum_{i=1}^{N} e^{x_i}},    (1)

where N is the total number of classes, x_i is the score of class i, and k is the target class. In each training iteration, DCNNs minimize the loss, i.e. maximize x_k. In order to obtain object boundaries accurately by interpolation, DCNNs should consider not only the class score of the target class but also the class scores of other classes. The loss function above only tries to maximize the score of the target class while ignoring the scores of other classes, so it is difficult for DCNNs to output a proper score for each class.
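For concreteness, the per-pixel softmax loss described above can be sketched in NumPy (an illustrative helper of our own, not the authors' training code):

```python
import numpy as np

def softmax_cross_entropy(scores, k):
    """Per-pixel softmax loss: scores are the class scores x_1..x_N,
    k is the target class index (illustrative sketch)."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()                    # stabilize the exponentials
    log_softmax = z - np.log(np.exp(z).sum())
    return -log_softmax[k]

# Raising only the target score x_k is enough to drive the loss down;
# the loss imposes no other structure on the non-target scores.
low = softmax_cross_entropy([5.0, 0.0, 0.0], k=0)
high = softmax_cross_entropy([2.0, 0.0, 0.0], k=0)
```

Minimizing this quantity rewards a large margin of x_k over the rest, which is exactly why the relative ordering of the non-target scores receives no direct supervision.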
(2) In an image, a foreground object is often composed of a series of regions within which color, lightness and texture change little. DCNN-based semantic segmentation methods directly classify every pixel in the image and have no notion that these regions belong to the same object. Figure 2 shows a semantic segmentation result output by DeepLab v3+. We can see that the areas labeled in green are segmented separately from the regions they should belong to, i.e. they are incorrectly segmented. In order to tackle this problem, we employ a region-based method to postprocess the semantic segmentation results of DeepLab v3+.

Main Process
The framework of the proposed method is shown in Figure 3. It consists of the following three modules: (a) DeepLab v3+, (b) superpixel segmentation with quick shift, and (c) class voting. In (a), we use DeepLab v3+ to obtain a class-indexed score map for the input image, in which each pixel is marked with the index of its predicted class. In (b), we use the quick shift algorithm to segment the image into superpixels, outputting the superpixel index of each pixel. The outputs of (a) and (b) are then fed into (c) to obtain the refined semantic segmentation result for the input image.

As shown in Figure 4, DeepLab v3+ is a novel Encoder-Decoder architecture that employs DeepLab v3 [12] as the Encoder module together with a simple yet effective Decoder module. It uses ResNet-101 as the backbone and adopts atrous convolution in the deep layers to enlarge the receptive field. An ASPP module on top of ResNet-101 aggregates multi-scale contextual information. The Decoder concatenates the low-level features from ResNet-101 with the upsampled deep multi-scale features extracted by ASPP. Finally, it upsamples the concatenated feature maps to produce the final semantic segmentation result.

Training Details
We implement DeepLab v3+ with PyTorch and evaluate it on the PASCAL VOC 2012 dataset. The implemented model uses output_stride = 16 during training and evaluation, without multi-scale or left-right flipped inputs [13]. The dataset contains 20 foreground object classes and one background class. It officially consists of 1,464 images in the train set and 1,449 images in the val set. We also augment the dataset with the additional annotations provided by [24], resulting in a total of 10,582 training images. These 10,582 images are used as the train_aug set to train the model following the training strategy in DeepLab v3+ [13], and the val set is used for evaluation.
DeepLab v3+ preprocesses input images by resizing and cropping them to a fixed size of 513 × 513. When computing mIOU, the annotated object boundaries in the ground truths are not taken into consideration, so mIOU cannot be used to evaluate whether the pixels on object boundaries are classified correctly. In order to evaluate the accuracy of the proposed method in localizing object boundaries, we compute the mIOU of the implemented DeepLab v3+ in two steps. First, we follow the same training process as in [13] with a fixed input size of 513 × 513. Second, we recompute mIOU with the model trained in the first step, but feed it images of arbitrary sizes without preprocessing, and label the object boundaries in the ground truths as background when computing mIOU. In the first step, our implementation reaches 78.89% mIOU on the PASCAL VOC 2012 val set, slightly higher than the result (78.85%) reported in [13], which verifies that our implementation is correct. In the second step, the model achieves 68.34% mIOU on the PASCAL VOC 2012 val set, which is used for comparison with the proposed method in Section 3.
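The boundary relabeling in the second step can be sketched as follows, assuming the standard PASCAL VOC convention that object-boundary pixels carry the void label 255 (an illustrative NumPy helper, not the authors' evaluation code):

```python
import numpy as np

VOID = 255  # PASCAL VOC marks object-boundary pixels with this void label

def boundaries_to_background(gt):
    """Relabel void (boundary) pixels as background (class 0) so that
    they are counted by mIOU instead of being ignored."""
    gt = gt.copy()          # never mutate the caller's annotation
    gt[gt == VOID] = 0
    return gt

# A 2x2 toy ground truth with two boundary pixels.
gt = np.array([[1, 255], [255, 0]], dtype=np.uint8)
relabeled = boundaries_to_background(gt)
```

After this relabeling, any foreground prediction that spills across a boundary is penalized, which is what makes the 68.34% figure sensitive to boundary quality.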

Quick Shift
Quick shift [19] is one of the most popular superpixel segmentation algorithms. Its principle is an iterative mode-seeking procedure that identifies modes in a set of data points, where a mode is the densest location in the feature space formed by all the data points. Given N data points x_1, x_2, ..., x_N ∈ X ⊆ R^d, quick shift first computes the Parzen density estimate [21]

P(x_i) = \frac{1}{N} \sum_{j=1}^{N} k(D(x_i, x_j)),    (2)

where k(·) is the kernel function, usually an isotropic Gaussian window, and D(x_i, x_j) is the distance between data points x_i and x_j. Then, it moves the center of the kernel window to the nearest neighbor of x_i at which the density P increases,

y_i = \arg\min_{j : P(x_j) > P(x_i)} D(x_i, x_j),    (3)

thereby extending the search path to the next data point. When all the data points are connected with one another, a threshold is used to separate modes, so that different clusters of the data points can be separated.
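The density-and-linking procedure above can be sketched on a handful of points (a toy NumPy illustration with function and parameter names of our own; real implementations restrict the neighbor search and run on millions of pixels):

```python
import numpy as np

def quick_shift_links(X, sigma, max_dist):
    """Link each point to its nearest higher-density neighbor (quick shift).

    X: (N, d) array of points. Returns parent[i]: the index of the nearest
    point with strictly higher Parzen density, or i itself if that neighbor
    is farther than max_dist (such points are the modes / cluster roots).
    """
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    # Parzen density estimate with an isotropic Gaussian window.
    P = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)
    parent = np.arange(len(X))
    for i in range(len(X)):
        higher = np.where(P > P[i])[0]            # points of higher density
        if higher.size:
            j = higher[np.argmin(d2[i, higher])]  # nearest of them
            if d2[i, j] <= max_dist ** 2:         # cut over-long links
                parent[i] = j
    return parent

# Two 1-D clusters around 0 and 5; each cluster yields one mode.
links = quick_shift_links([[0.0], [0.2], [0.3], [5.0], [5.1], [5.45]],
                          sigma=0.5, max_dist=1.0)
```

Points whose parent is themselves are the modes; following parents from every point recovers the clusters.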
Quick shift may be used with any feature space, but for the purpose of this paper we restrict it to a 5D feature space for image segmentation, consisting of 3D RGB color information and 2D location information. After computing the Parzen density estimate for each image pixel, quick shift constructs a tree connecting each pixel to its nearest neighbor with a higher density value, i.e. each pixel is linked to the closest higher-density pixel as its parent. This generates a forest of pixels whose branches are labeled with distance values, which specifies a hierarchical segmentation of the image. Superpixels are then identified by cutting all branches whose distance exceeds a given threshold.
We apply quick shift to partition the input image into superpixels, as shown in Figure 5.

Class Voting
DeepLab v3+ outputs a class-indexed score map for the input image, in which each pixel is labeled with the index of its predicted class. At the same time, superpixels of the image are obtained by the quick shift algorithm, so each pixel is also labeled with a superpixel index. Then, the number of pixels belonging to each class in each superpixel is counted. Finally, the Class Voting module assigns each superpixel to the class that contains the maximum number of pixels in it. A pseudo-code implementation is shown in Algorithm 1.

Algorithm 1. Class Voting
    initialize clusterStat(i, j) = 0, i = 0, ..., clusterNum − 1, j = 0, ..., classNum − 1
    initialize clusterVote(i) = 0, i = 0, ..., clusterNum − 1
    for i = 0 to W − 1 do
        for j = 0 to H − 1 do
            clusterStat(quickshift(i, j), deeplabv3plus(i, j)) += 1
        end for
    end for
    for i = 0 to clusterNum − 1 do
        clusterVote(i) = argmax_j clusterStat(i, j), j = 0, ..., classNum − 1
    end for
    for i = 0 to W − 1 do
        for j = 0 to H − 1 do
            segment(i, j) = clusterVote(quickshift(i, j))
        end for
    end for
    return segment
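The class voting step can be vectorized in a few lines of NumPy (an illustrative sketch with function and argument names of our own, not the authors' code):

```python
import numpy as np

def class_voting(superpixels, class_map, num_classes):
    """Assign to every superpixel the majority class of its pixels.

    superpixels: (H, W) superpixel indices (quick shift output).
    class_map:   (H, W) per-pixel class indices (DeepLab v3+ output).
    Returns the refined (H, W) segmentation.
    """
    num_clusters = int(superpixels.max()) + 1
    # stat[i, j]: number of pixels in superpixel i predicted as class j.
    stat = np.zeros((num_clusters, num_classes), dtype=np.int64)
    np.add.at(stat, (superpixels.ravel(), class_map.ravel()), 1)
    vote = stat.argmax(axis=1)   # majority class per superpixel
    return vote[superpixels]     # broadcast each vote back to its pixels

# A 2x3 toy example: superpixel 0 votes class 2, superpixel 1 votes class 3.
sp = np.array([[0, 0, 1], [0, 1, 1]])
cm = np.array([[2, 0, 3], [2, 2, 3]])
refined = class_voting(sp, cm, num_classes=4)
```

`np.add.at` is used instead of plain fancy-indexed addition because it accumulates correctly when the same superpixel/class pair appears more than once.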

Experimental Design
In order to evaluate the effectiveness of the proposed method, we conducted the following experiments: (I) comparing the mIOU of the proposed method against DeepLab v3+; (II) replacing the superpixel segmentation algorithm in module (b) of Figure 3 with SLIC and Felzenszwalb to demonstrate the superiority of quick shift over them. The experiments are run on an NVIDIA GTX TITAN Xp GPU with 12 GB of memory.
Based on the PASCAL VOC 2012 dataset [9], we conducted a large number of experiments to evaluate the performance of the method both qualitatively and quantitatively. The qualitative evaluation supports our claims in an intuitive way. The quantitative evaluation is more convincing and is measured in terms of mean pixel intersection-over-union (mIOU) across the 21 classes, defined as

mIOU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},    (4)

where k + 1 is the number of classes (including the background class), and p_{ij} denotes the number of pixels of class i predicted as class j.
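The mIOU definition above can be computed directly from a confusion matrix (a generic NumPy sketch; the function name is ours):

```python
import numpy as np

def mean_iou(gt, pred, num_classes):
    """mIOU from the confusion matrix: p[i, j] counts pixels of true
    class i predicted as class j; per-class IoU is the diagonal term
    divided by (row sum + column sum - diagonal term)."""
    p = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(p, (gt.ravel(), pred.ravel()), 1)
    tp = np.diag(p).astype(float)
    union = p.sum(axis=1) + p.sum(axis=0) - tp     # |gt_i ∪ pred_i|
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return float(np.nanmean(iou))                  # skip absent classes

# Perfect agreement gives mIOU = 1.0.
assert mean_iou(np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]), 2) == 1.0
```

Classes absent from both the ground truth and the prediction are excluded from the average, matching the usual PASCAL VOC evaluation practice.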

Qualitative Evaluation
We provide some exemplary results on the PASCAL VOC 2012 val set in Figure 6. Note that the details of the object boundaries in Figure 6d are significantly improved over Figure 6c, and the segmentation results in Figure 6c often include parts of the background. In other words, the proposed method can estimate more accurate object boundaries than DeepLab v3+, even when the foreground objects are very small. It also matches the ground truth more consistently and segments foreground boundaries more precisely in images that contain foreground objects in complex environments. The results in Figure 6 imply the effectiveness of the proposed method in localizing object boundaries when conducting semantic segmentation tasks.

Quantitative Evaluation
For the quantitative evaluation of the proposed method, four groups of experiments were conducted: (1) DeepLab v3+, (2) DeepLab v3+ + SLIC, (3) DeepLab v3+ + Felzenszwalb, and (4) DeepLab v3+ + quick shift. To simplify the ablation study, all methods are trained on the VOC 2012 train set and tested on the VOC 2012 val set. Table 1 lists the quantitative results of all the compared methods, with the best one highlighted in bold. As shown in Table 1, the proposed method achieves higher performance than the others. Specifically, it outperforms DeepLab v3+ by 1.26% on the PASCAL VOC 2012 val set, and quick shift is superior to SLIC and Felzenszwalb by 1.85% and 0.8%, respectively.

Why Quick Shift Is Superior to SLIC and Felzenszwalb
When partitioning an image into superpixels, SLIC evenly distributes seed points over the image. Thus, regions that belong to the same object are often forced into different superpixels, which causes misclassification. As shown in Figure 7, it incorrectly segments the sky background as part of the airplane along its boundaries; the wrongly segmented superpixels in Figure 7 are labeled with red circles. Felzenszwalb often oversegments images. As shown in Figure 8, it partitions the image into thousands of superpixels. In extreme cases, the number of superpixels it produces equals the number of pixels in the image, which yields no improvement over the segmentation results of DeepLab v3+. Compared with SLIC and Felzenszwalb, quick shift automatically finds a proper number of clusters and merges similar superpixels, which is why it achieves the best performance.

The Influence of Parameter σ on Segmentation Results
In Equation (2), we use the Gaussian function as the kernel,

k(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right),    (5)

where σ is the width of the Gaussian kernel. A smaller σ makes the density estimate at pixel x_i depend only on local information, which leads to oversegmentation; a larger σ smooths the density at each location, which leads to fewer clusters. As shown in Table 2, we experimented with different values of σ on the PASCAL VOC 2012 train set and found σ = 5 to be experimentally optimal.
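The influence of σ can be reproduced on toy 1-D data with a minimal mode-counting sketch (illustrative code with names of our own; the cut-off tau stands in for the mode-separating threshold of quick shift):

```python
import numpy as np

def quick_shift_modes(points, sigma, tau):
    """Count quick shift modes on 1-D data (illustrative sketch).

    A point is a mode (cluster root) if no higher-density point lies
    within distance tau; sigma is the Gaussian kernel width.
    """
    x = np.asarray(points, dtype=float)
    d2 = (x[:, None] - x[None, :]) ** 2
    dens = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)  # Parzen density
    modes = 0
    for i in range(len(x)):
        higher = np.where(dens > dens[i])[0]
        if higher.size == 0 or d2[i, higher].min() > tau ** 2:
            modes += 1
    return modes

# Two 1-D clusters around 0 and 5.
pts = [0.0, 0.2, 0.3, 5.0, 5.1, 5.45]
few = quick_shift_modes(pts, sigma=0.5, tau=1.0)    # one mode per cluster
many = quick_shift_modes(pts, sigma=0.01, tau=1.0)  # tiny sigma oversegments
```

With a very small σ every point is its own density peak, so the number of clusters explodes, mirroring the oversegmentation behavior described above.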

Conclusions
DCNNs with deep feature representations of images are driving significant advances in semantic segmentation. Nonetheless, their drawback is imprecise localization of object boundaries. Therefore, we propose a novel semantic segmentation algorithm combining DCNNs and quick shift. Compared with DeepLab v3+, the proposed method localizes object boundaries more accurately, which is meaningful in practical applications such as object recognition and autonomous driving. Even though the proposed method works very well in most cases, it still fails when the boundary of an object is difficult to distinguish from the surrounding background, e.g. in a dark environment. In future work, we will investigate the interplay of DCNNs and superpixel algorithms to tackle this problem.