Big Data and Cognitive Computing
  • Article
  • Open Access

10 June 2019

Weakly-Supervised Image Semantic Segmentation Based on Superpixel Region Merging

1 Shanghai Key Lab of Modern Optical Systems, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
2 Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA
3 Information Science and Technology Research, Shanghai Advanced Research Institute, Chinese Academy of Sciences, No. 99 Haike Rd., Zhangjiang, Pudong, Shanghai 201210, China
4 Department of Computer Science, University of California, Davis, CA 95661, USA

Abstract

In this paper, we propose a semantic segmentation method based on superpixel region merging and a convolutional neural network (CNN), referred to as the regional merging neural network (RMNN). Image annotation has always played an important role in weakly-supervised semantic segmentation, and most methods rely on manual labeling. In this paper, super-pixels with similar features are combined, using the relationships between pixels after super-pixel segmentation, to form a number of super-pixel blocks. Rough predictions are generated by a fully convolutional network (FCN) so that certain super-pixel blocks are labeled. We then perceive and find other positive areas iteratively through the marked areas. Because of the super-pixels, this shrinks the feature extraction vectors and reduces the dimensionality of the data. The algorithm not only uses superpixel merging to narrow down the target’s range but also compensates for the lack of pixel-level supervision in weakly-supervised semantic segmentation. In the training of the network, we use the region merging method to improve the accuracy of contour recognition. Our extensive experiments demonstrate the effectiveness of the proposed method on the PASCAL VOC 2012 dataset. In particular, the evaluation results show that the mean intersection over union (mIoU) score of our method reaches 44.6%. Because dilated convolution replaces the pooled downsampling operation, it does not reduce the network’s receptive field, thereby preserving the accuracy of image semantic segmentation. The findings of this work thus open the door to leveraging dilated convolution to improve the recognition accuracy of small objects.

1. Introduction

In the last few years, convolutional neural networks (CNNs) [1,2,3,4] have had widespread applications in various industries. The state-of-the-art semantic segmentation methods [5,6,7,8,9] rely on convolutional neural networks. Image-level marking plays a key role in segmentation. Because fully supervised [10,11] (pixel-level) semantic image segmentation is time-consuming and needs to be supported by high-performance CNNs, weakly-supervised [12,13,14,15] semantic segmentation holds a lot of potential and poses many challenges. For image semantic segmentation, there are two main approaches: one is based on image-level labels, and the other on pixel-level labels. Grangier et al. [13] implemented semantic segmentation of images using a simple CNN model, which proved that CNNs can achieve good results in semantic segmentation. However, it is time-consuming and laborious to produce accurate pixel-level labels for a large amount of image data. Lin et al. [14] used scribble-supervised images to train a convolutional network for semantic segmentation. Dai et al. [15] used bounding boxes to annotate the target area, extracting the position and size features of image regions to supervise the training of convolutional networks. Liu et al. [16] used the deep features learned by a CNN to establish a conditional random field (CRF) model [17,18,19] and used a structured support vector machine (SSVM) to learn the CRF model parameters, avoiding the manual extraction of image features.
The feature representation of images is a key step in image semantic segmentation. Feature-based work includes a random-forest-based classifier [20] and TextonForest [21]. Yan et al. [22] proposed a model for assigning labels to super-pixels by learning related features, which are used to merge superpixel blocks and extract candidate regions. Liu et al. [23] proposed a weakly-supervised method based on graph propagation, which automatically assigns image-level labels using the super-pixel context information. Aminpour and Razzaghi [24] used a two-layer graphical model to assign labels to super-pixels by linking local and global similarity features for weakly-supervised semantic segmentation. These methods all build their models on superpixel segmentation. It is well known that super-pixels can describe local structures in detail, so their application within convolutional network frameworks is increasing. Zhang et al. [25] used the local detail optimization of super-pixels, the mean field inference algorithm, and a quadratic programming relaxation algorithm to optimize the CRF in order to obtain the final label assignment. Hence, superpixels are frequently used for image preprocessing. Improving the performance of super-pixels in weakly-supervised semantic segmentation is the focus of this work.
Furthermore, transfer learning [26] continues to be a popular learning framework because it enables training CNN with a relatively small dataset. Oquab et al. [27] utilized a simple transfer learning procedure to demonstrate how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with a limited amount of training data to achieve state-of-the-art results. Also, Jiang et al. [28] used transfer learning to scale down the disparity in data distribution between training and test data. Şeker [29] used transfer learning to overcome the large number of data markers required by deep learning algorithms. Hence, due to the small size of the PASCAL VOC 2012 [30] dataset used in this work, we will also use transfer learning to train our network.
According to the theoretical hypotheses of [25,31], we present an image semantic segmentation method based on superpixel region merging and a CNN. At the same time, a series of linear constraints is incorporated into the training process to improve recognition of the target’s contour. Furthermore, because manual labeling is very time-consuming on large datasets, we employ superpixel segmentation theory and combine the superpixel regions into larger superpixel blocks based on the region adjacency graph to achieve a comparable labeling effect. Compared to pixel-level annotation [8], a pixel block provides the pixel information of a whole region, which should yield a good effect. Also, compared to the scribble-supervised approach [14], manual annotation time is saved.
The proposed approach works as follows. First, the original image is subjected to simple linear iterative clustering (SLIC) superpixel segmentation [32,33], and then one of three criteria (i.e., full lambda, spectral histogram, or the color-texture model) is used for super-pixel region merging; the merged regions make up the pixel blocks. After merging, we obtain a set of target areas that have already been marked. We then use a graphical model to supervise the merged marked areas and produce predictions. Furthermore, a loss function is used to determine whether the result is as expected. We then feed the results back to the optimization function and optimize the previous step, which includes adjusting the parameters of the region merging and supervising the training of unmarked pixels. Merging the graph model again propagates the spatially constrained pixel blocks to the unlabeled pixel blocks. A fully convolutional network (FCN) simultaneously provides a semantic prediction for the graphical model, although the output is rough at the outset. We use an iterative-feedback mechanism to optimize the graphical model, feed the predicted values back to the optimization function, and update the pixel combinations to achieve the optimal result. All these steps are illustrated in Figure 1.
Figure 1. The flow chart of supervision training based on regional merging neural network (RMNN).
In the remainder of this paper, we introduce the preprocessing SLIC for superpixel region merging and transfer learning of visual geometry group 16 layers (VGG16) related theory in Section 2. We provide the details of regional merging neural network (RMNN) in Section 3. Section 4 presents our experimental results. Finally, we conclude this paper with future directions of research in Section 5.

3. Regional Merger Algorithm

3.1. Regional Merger

In the SLIC method, the larger the number of super-pixels produced, the higher the object segmentation accuracy that can be achieved once the regions are merged. Hence, we iteratively merged very small super-pixels. The region merging method improves image segmentation accuracy while preserving the high-precision boundaries from the initial partition.
A fast and effective approach in our proposed region merging algorithm is to model the image as a region adjacency graph (RAG) [37]. Each super-pixel region is regarded as a node in the graph. When two regions are adjacent, the corresponding nodes are connected by edges with the same weight, and the merging of regions is realized by merging nodes in the graph. In order to merge regions at minimum cost, we use three criteria representing different features (i.e., shape, spectrum, texture) to calculate the difference caused by the image approximation problem and how it affects the cost of the merge; λ is a parameter shared by the three criteria. A sketch of the RAG construction is given below, followed by the three criteria.
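As a concrete illustration (not the authors' implementation), the following Python sketch builds such a region adjacency graph from a superpixel label map; the function name and the convention of weighting each edge by the shared boundary length l are assumptions made here for clarity.

```python
import numpy as np
from collections import defaultdict

def build_rag(labels):
    """Build a region adjacency graph (RAG) from a superpixel label map.

    `labels` is an H x W integer array holding the superpixel id of each
    pixel.  Two superpixels are adjacent when they share at least one
    horizontal or vertical pixel border; the edge weight counts the length
    of that shared boundary (the `l` used by the merging criteria)."""
    edges = defaultdict(int)
    for dy, dx in ((0, 1), (1, 0)):            # right and bottom neighbours
        a = labels[:labels.shape[0] - dy, :labels.shape[1] - dx]
        b = labels[dy:, dx:]
        diff = a != b
        for u, v in zip(a[diff], b[diff]):
            edges[(min(u, v), max(u, v))] += 1
    return dict(edges)                          # {(region_i, region_j): boundary length}
```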
  • Merging criterion 1, Full Lambda:
    $$C_{i,j} = \frac{C_i \cdot C_j}{C_i + C_j}\, h(i,j), \qquad h(i,j) = \left(\mu_i - \mu_j\right)^2 \cdot \frac{1}{l^{\lambda}}$$
    where $C_i$ is the area of region $i$, $\mu_i$ is its mean spectral value, $l$ is the shared boundary of regions $i$ and $j$, and $\lambda$ is the shape parameter.
  • Merging criterion 2, Spectral Histogram [38]:
    $$C_{i,j} = \frac{C_i \cdot C_j}{C_i + C_j}\, h(i,j), \qquad h(i,j) = G(i,j) \cdot \frac{1}{l^{\lambda}},$$
    $$G(i,j) = 2\left(\sum_{\{i,j\}}\sum_{i=1}^{t} f_i \log f_i - \sum_{i=1}^{t}\Big(\sum_{\{i,j\}} f_i\Big)\log\Big(\sum_{\{i,j\}} f_i\Big) + 1.3863\right)$$
    where $G(i,j)$ is the G-statistic value of the two spectral histograms of regions $i$ and $j$, and $f_i$ represents the probability density function.
  • Merging criterion 3, Color-Texture Model [39]:
    $$C_{i,j} = \frac{C_i \cdot C_j}{C_i + C_j}\, h(i,j), \qquad h(i,j) = \left(w_C G_c + w_T G_t\right) \cdot \frac{1}{l^{\lambda}}, \qquad G_c = \frac{1}{D}\sum_{a=1}^{D} G_{C_a}(i,j)$$
    where $G_c$ is the G-statistic value of the two spectral histograms, $D$ represents the number of bands of the image with a value of 0.30–0.76 µm, and $G_{C_a}(i,j)$ represents the histogram distance of regions $i$ and $j$ in the $a$-th band. $G_t = G(i,j)$ is the G-statistic value of the two LBP texture histograms; $w_C$ and $w_T$ are the corresponding weights (these two values are estimated automatically). $C_{i,j}$ is the combined cost of the region pair $(i,j)$, and $h(i,j)$ represents the heterogeneity of the two regions. $l$ is the shared boundary length of the adjacent regions $(i,j)$; $\lambda$ is the influence coefficient of the boundary. When $\lambda = 0$, $l^{\lambda} = 1$, indicating that the boundary does not affect the regional heterogeneity metric. On the other hand, if $\lambda \neq 0$, the longer the boundary, the smaller the heterogeneity.
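To make criterion 1 concrete, the following Python sketch computes the merge cost $C_{i,j}$ for one pair of regions directly from the formula above; the parameter names and the default value of λ are illustrative assumptions, not the authors' code.

```python
import numpy as np

def full_lambda_cost(area_i, area_j, mean_i, mean_j, shared_boundary, lam=0.3):
    """Merging criterion 1 (full lambda): cost C_{i,j} of fusing regions i and j.

    area_*          : region areas C_i, C_j (pixel counts)
    mean_*          : mean spectral (colour) values mu_i, mu_j
    shared_boundary : length l of the common boundary
    lam             : boundary influence coefficient lambda (lam = 0 disables it)
    """
    h = np.sum((np.asarray(mean_i) - np.asarray(mean_j)) ** 2) / (shared_boundary ** lam)
    return (area_i * area_j) / (area_i + area_j) * h
```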
Figure 5 shows the segmentation results obtained by the region merging algorithm under the different criteria, together with some zoomed-in details. It can be seen that the mean intensity of a region’s color reflects the local pixel color features, and the region is obtained by clustering pixels with consistent local characteristics; therefore, the mean color intensity is very important. It can also be seen from Figure 5 that the best aggregation result was obtained by criterion 3. Through fusion, we obtained relatively complete edge information, and the background also achieves a certain degree of aggregation. We evaluated the segmentation results experimentally. The BR values of the three criteria are 59.24%, 92.71%, and 63.15%, respectively, which justifies that the third criterion performs better overall.
Figure 5. Merged graphs when different criteria are used. Panels (a–c) show the results of criteria 1, 2, and 3, respectively. In terms of detail, the third image works best; therefore, criterion 3 was chosen as the merging criterion.
The super-pixel region fusion (Algorithm 1) was performed on the original image; because the target in the original image carries color information, super-pixel image segmentation is used to improve the accuracy of the obtained edges. After the SLIC superpixel segmentation, we obtain a set of clusters $C = \{C_1, C_2, \ldots, C_k\}$ of similar size, and each pixel carries a label, expressed as $C_i = (x_i, y_i, sign_i)$, where $(x_i, y_i)$ are the pixel coordinates in area $C_i$ and $sign_i$ is the super-pixel label.
Algorithm 1. Superpixel region merging
1. Initialize the image by using the step size s to obtain the cluster center set $C = \{C_1, C_2, \ldots, C_k\}$.
2. Move each clustering point to the position with the smallest gradient in its 3 × 3 neighborhood.
3. Set the label $sign_i = -1$ and the distance $d_i = \infty$ for each pixel $i$.
4. Iterate over the points in the collection:
5. For each cluster center $C_i$:
    For each pixel $i$ in a 2s × 2s area around $C_i$
      Calculate the distance $D$ between $C_i$ and $i$
      if $D < d_i$ then
        set $d_i = D$ and $sign_i = i$
      end if
      Add $C_i$ to collection C.
    End for
   End for
6. According to steps 1 through 5, the clusters $C = \{C_1, C_2, \ldots, C_k\}$ are obtained.
7. Calculate the color (spectral) intensity mean $\mu$ (used by criterion 1) of the pixels in each super-pixel cluster $C_i$.
8. Iterate through the super-pixels in C, selecting each super-pixel $C_0$ in set C as the starting point in turn and marking it as accessed.
9. Search the neighborhood of $C_0$ and find all adjacent super-pixels $C_n$. Form the adjacent super-pixel pairs $P_{0n}(C_0, C_n)$ and represent all super-pixel pairs of the $C_0$ neighborhood as neighbors = $\{(C_0, C_{n1}), (C_0, C_{n2}), \ldots, (C_0, C_{nk})\}$, where $k$ is the number of neighborhood super-pixels.
10. Starting from criterion 3, when the area term of a super-pixel pair ($\frac{C_i \cdot C_j}{C_i + C_j}$) is small, the pair has priority for merging (when the regional heterogeneity is the same, the smaller the area of the region, the smaller the error the merge causes for the whole-image approximation). Furthermore, when the areas of the pixel pairs are similar, calculate and compare the heterogeneity $\delta = h(i,j)$ of each adjacent superpixel pair $(C_0, C_n)$ to determine whether to fuse. When $\delta$ is minimal, the fusion condition is satisfied; otherwise, fusion is not performed.
11. If the super-pixel pair is fused, the pixel labels in the fused cluster are updated to $sign_{C_0} = sign_{C_n}$, the edge weight $h(i,j)$ is updated, and the super-pixel $C_n$ is marked as accessed. If no fusion is performed, no processing is done.
12. Traverse the neighborhood set neighbors. If $C_n$ in $P_{0n}(C_0, C_n)$ has been marked as accessed, merging is performed and $sign_{C_0} = sign_{C_n}$; otherwise, do not process it.
13. Repeat steps 7 through 12 until all super-pixels have been accessed.
14. The fusion is completed, and the new clusters are obtained.
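The following Python sketch illustrates how the merging phase of Algorithm 1 (steps 7–13) could be realized as a greedy loop on top of the two helpers sketched earlier (`build_rag` and `full_lambda_cost`); it is a simplified reading of the algorithm, not the authors' code, and the stopping rule (a target number of regions) is an assumption.

```python
import numpy as np

def merge_regions(labels, image, lam=0.3, target_regions=100):
    """Greedy sketch of the merging phase of Algorithm 1: repeatedly fuse the
    adjacent superpixel pair with the smallest criterion-1 cost until the
    number of regions drops to `target_regions` (an assumed stopping rule)."""
    labels = labels.copy()
    while len(np.unique(labels)) > target_regions:
        ids = np.unique(labels)
        # Region statistics: area C_i and mean colour mu_i of every region.
        stats = {i: (int(np.sum(labels == i)), image[labels == i].mean(axis=0))
                 for i in ids}
        rag = build_rag(labels)   # adjacency + shared boundary lengths (earlier sketch)
        # Find the adjacent pair with minimum merge cost.
        best_pair, best_cost = None, np.inf
        for (i, j), l in rag.items():
            c = full_lambda_cost(stats[i][0], stats[j][0],
                                 stats[i][1], stats[j][1], l, lam)
            if c < best_cost:
                best_pair, best_cost = (i, j), c
        i, j = best_pair
        labels[labels == j] = i   # fuse region j into region i
    return labels
```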

3.2. Algorithm Framework

In this paper, image-level label semantic segmentation is used. There are many CNN-based network models, including VGGNet [26], GoogleNet [40], and ResNet [41]; to reduce the computational resources needed in the experiment, we chose the VGG16 network model shown in Figure 6 and converted the fully connected layers in the architecture into convolutional layers. In the model, the earlier layers respond to simpler image features, so the network parameters of the conv1–FC7 layers were first trained on ImageNet and remain unchanged during transfer learning. We replaced the last layer with a layer covering the 20 object classes and one background class of the PASCAL VOC dataset. The stride of this network is 32. However, instead of using the weights of the last layer from ImageNet, we chose to initialize the weights of the last layer with random Gaussian noise; alternatively, the FC8 layer of the VGG16 network can be used.
Figure 6. The convolutional neural network (CNN) model used in this paper. Since our dataset is small and similar to ImageNet, we used the method of freezing the inner layers and training only the last layer for transfer learning. The pre-trained inner-layer parameters (conv1–FC7) were transferred to the PASCAL VOC object classification target. For fine-scaled image edges and details, we initialized the last layer with Gaussian noise.
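As an illustration of the transfer-learning setup described above (the paper itself uses Caffe), the sketch below shows the equivalent freezing and re-initialisation steps in PyTorch/torchvision: conv1–FC7 are frozen, and the last layer is replaced by a 21-way classifier (20 PASCAL VOC classes plus background) initialised with Gaussian noise. It does not reproduce the conversion of the fully connected layers into convolutions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative PyTorch sketch (the paper uses Caffe): load VGG16 pre-trained
# on ImageNet, freeze all existing layers (conv1-FC7), and replace the final
# classifier with a 21-way layer initialised with Gaussian noise.
vgg16 = models.vgg16(pretrained=True)
for p in vgg16.parameters():
    p.requires_grad = False                          # freeze conv1-FC7

num_classes = 21
vgg16.classifier[6] = nn.Linear(4096, num_classes)  # new, trainable last layer
nn.init.normal_(vgg16.classifier[6].weight, mean=0.0, std=0.01)
nn.init.zeros_(vgg16.classifier[6].bias)

# Only the re-initialised last layer is updated during fine-tuning.
optimizer = torch.optim.SGD(vgg16.classifier[6].parameters(), lr=1e-3, momentum=0.9)
```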
First, we used SLIC to segment the original image and applied Algorithm 1 to perform region merging on the generated super-pixels, obtaining a set of candidate regions. Because these regions are aggregated from super-pixels, they have good region and edge properties. The trained FCN is then applied to the region-merged image, and a pixel-wise prediction is generated. At the same time, using the ground truth annotation, we expect the prediction to overlap with the ground truth image. We define the overlap objective function (6) as:
$$\varepsilon_o = \frac{1}{N}\sum_{S}\left(1 - IoU(B, C)\right)\delta\!\left(sign_B, sign_C\right)$$
where $\delta(sign_B, sign_C)$ equals 1 when the semantic label of the candidate region $C$ is the same as the ground truth label of $B$, and 0 otherwise. The objective function is normalized. To obtain an optimal solution, we minimize the objective function (7):
$$\min_{\theta,\, \{sign_C\}} \sum_{i} \varepsilon_o$$
In order to bring the overlap with the ground truth (IoU) closer to 1, we apply the region merging algorithm to train the model. Furthermore, because the merged superpixel regions can contain more edge details, we propose a greedy, iterative scheme to ensure the predicted results are closer to the ground truth. In the experiment, we also update the semantic label $sign_C$ for all candidate regions. The semantic label is updated by selecting the candidate region for the label and then iteratively merging the superpixel regions of its neighborhood so that its cost $\varepsilon_o$ is the smallest among all candidates. Finally, the 20 object-class semantic labels are assigned to the selected areas, and the remaining pixels are assigned the background label.
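A minimal sketch of the per-region overlap term in Equation (6), assuming binary masks for the candidate region C and ground-truth region B; averaging this value over the N candidate regions gives $\varepsilon_o$. The function and argument names are mine.

```python
import numpy as np

def overlap_objective(pred_mask, gt_mask, pred_label, gt_label):
    """Per-region term of Equation (6): (1 - IoU(B, C)) * delta(sign_B, sign_C).
    `pred_mask`/`gt_mask` are boolean masks of the candidate region C and the
    ground-truth region B (names are illustrative)."""
    if pred_label != gt_label:        # delta(sign_B, sign_C) = 0
        return 0.0
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    iou = inter / union if union else 0.0
    return 1.0 - iou                  # contributes (1 - IoU) when labels agree
```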
In order to fix the semantic labels of all candidate regions, we repeatedly perform the region merging and prediction training steps: we fix one group of regions and predict another group of regions. With each iteration, we update the area markers for all images. Figure 7 shows the segmentation maps that are gradually updated during the training phase.
Figure 7. Semantic segmentation graph of different iteration times. The number of iterations is epoch #1, epoch #4, epoch #8, and epoch #10.
According to our proposed Algorithm 2, the final semantic segmentation result is obtained.
Algorithm 2. Superpixel merge semantic segmentation
1. Construct the SLIC segmentation graph.
2. Build the super-pixel region merge map. The parameters are as follows: the superpixel region set is $C = \{C_1, C_2, \ldots, C_k\}$, and the semantic label $sign_C \in \{0, 1, \ldots, 20\}$ of each area $C_i$ is initialized to 0, where 0 is the background label.
3. Iteratively generate the semantic segmentation using the region merging algorithm.
   Loop: for epoch = 1:Epoch
   3.1 Use the FCN to generate a rough prediction map and update the semantic label $sign_C$ of the prediction area $C_i$ from formula (1).
   3.2 Calculate the overlap value between the obtained map and the ground truth (function (6)) and minimize the objective function (7) to determine whether the overlap is close to 1.
   3.3 If it is close to 1, end the iteration and go to step 4. Otherwise, update the parameters of the merged areas.
   3.4 Calculate the adjacent regions of $C_i$, generate the neighborhood set neighbors = $\{(C_i, C_{n1}), (C_i, C_{n2}), \ldots, (C_i, C_{nk})\}$, and perform the super-pixel region merging algorithm. Go back to step 3.1.
4. Output the semantic segmentation result.
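The skeleton below summarises the control flow of Algorithm 2 in Python form. The helpers `slic_segment`, `assign_region_labels`, and `mean_overlap_objective` are hypothetical placeholders for the SLIC step, the label update of step 3.1, and the objective of steps 3.2–3.3; `merge_regions` refers to the earlier sketch. It is not the authors' implementation.

```python
def train_rmnn(images, gt_masks, fcn, epochs=10, tol=0.05):
    """High-level skeleton of Algorithm 2 (helper names and the tolerance
    `tol` are assumptions, not the authors' implementation)."""
    for image, gt in zip(images, gt_masks):
        labels = slic_segment(image)                 # step 1: SLIC superpixels (hypothetical helper)
        labels = merge_regions(labels, image)        # step 2: merged region map
        for epoch in range(epochs):                  # step 3: iterative refinement
            pred = fcn(image)                        # 3.1: rough pixel-wise prediction
            region_labels = assign_region_labels(labels, pred)       # 3.1: update sign_C
            eps = mean_overlap_objective(region_labels, labels, gt)  # 3.2: objective (6)/(7)
            if eps < tol:                            # 3.3: overlap close enough to 1
                break
            labels = merge_regions(labels, image)    # 3.4: re-merge and repeat
    return fcn
```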

4. Experiment Analysis

Our test bed consists of a Xeon E5-2609 v3 CPU, 32 GB of RAM, a 3 TB hard disk, the Windows 10 x64 and Ubuntu 16.04 LTS operating systems, the Matlab R2016a platform, and the Caffe deep learning framework.

4.1. Data Set

We performed image semantic segmentation on the PASCAL VOC 2012 dataset to evaluate our approach. Figure 8 shows that the data set contains 20 object class tags and a single background class tag.
Figure 8. Twenty categories.
A total of 10,582 training images were used, and the VOC 2012 validation set containing a total of 335 images remained unchanged in this work.

4.2. Parameter Sensitivity of Regional Combination

In order to assess the performance of the superpixel region merging algorithm, we studied the effects of different parameters on the segmentation results, for instance, the weight size, the number of super-pixels, and the number of regions after fusion. We tested on the VOC 2012 dataset; the standard metrics [42] are the boundary recall (BR) and the achievable segmentation accuracy (ASA). Boundary recall measures the percentage of natural boundaries recovered by the superpixel boundaries, while achievable segmentation accuracy gives the highest accuracy achievable for object segmentation that uses super-pixels as units [42].
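The sketch below shows one plausible way to compute the boundary recall metric: the fraction of ground-truth boundary pixels that lie within a small tolerance of a superpixel boundary. The tolerance value and implementation details are assumptions; the exact protocol of [42] may differ.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def boundary_recall(superpixel_labels, gt_labels, tol=2):
    """Boundary recall (BR): share of ground-truth boundary pixels lying
    within `tol` pixels of a superpixel boundary (illustrative sketch)."""
    def boundaries(lab):
        b = np.zeros(lab.shape, dtype=bool)
        b[:-1, :] |= lab[:-1, :] != lab[1:, :]   # vertical label changes
        b[:, :-1] |= lab[:, :-1] != lab[:, 1:]   # horizontal label changes
        return b
    gt_b = boundaries(gt_labels)
    sp_b = boundaries(superpixel_labels)
    # Dilate the superpixel boundary map so a hit within `tol` pixels counts.
    sp_near = maximum_filter(sp_b.astype(np.uint8), size=2 * tol + 1).astype(bool)
    return np.logical_and(gt_b, sp_near).sum() / max(gt_b.sum(), 1)
```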
Figure 9 shows how performance changes with the shape parameter α. BR decreases monotonically; when α ≤ 0.2, there is only a slight change. The BR and ASA curves follow nearly the same trend, both decreasing as the number of super-pixels and the value of α increase. In order to keep the number of super-pixels as consistent as possible and avoid a significant reduction of segmentation precision, we set 0.3 ≤ α ≤ 0.9. Also, the number of super-pixels is controlled at 600–1500.
Figure 9. Segmentation performance for different α values: (a) the boundary recall curve for different α values; (b) the achievable segmentation accuracy curve for different α values. We set the function $f(x) = x\log(x)$ and γ = 0.3.
Similarly, we studied the effect of the regional compactness parameter, in the range 0.2–1.0, on regional fusion. The resulting differences are very small, and the BR and ASA curves almost overlap. In addition, the influence of the number of merged regions γ on segmentation quality was also tested. Figure 10 shows the performance changes under the different parameters.
Figure 10. Segmentation performance for different γ values: (a) the boundary recall curve for different γ values; (b) the achievable segmentation accuracy curve for different γ values. We set the function $f(x) = x\log(x)$ and α = 0.3.
In summary, the parameters selected in the iterative regional-fusion process are the shape parameter α and the number of merged regions γ, which have the greatest impact on the merging results. Therefore, α is set in the interval 0.3–0.9 and γ in the interval 10–100. Within these intervals, the region merging process is both fast and accurate.

4.3. Index

Semantic segmentation performance is evaluated following previous work [43]. We used the standard intersection over union (IoU) metric, also known as the Jaccard index [44], to evaluate our results. For each category, it is the ratio between the number of pixels in the intersection of the predicted and ground-truth regions of that category and the number of pixels in their union.
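For reference, a minimal sketch of the per-class IoU computation used in the evaluation (function and argument names are mine):

```python
import numpy as np

def class_iou(pred, gt, cls):
    """IoU (Jaccard index) of one class: |pred ∩ gt| / |pred ∪ gt|."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else float('nan')
```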

4.4. Qualitative Analysis

Following our proposed method, Figure 11 shows the comparison on the VOC 2012 dataset between the proposed RMNN method and constrained convolutional neural networks (CCNN) using size as a constraint [35].
Figure 11. The semantic segmentation obtained by combining region merging and CNN. (a) The original images. (b) Superpixel region merge results. (c) Constrained Convolutional Neural Networks (CCNN). (d) RMNN. (e) The ground truth. See also Table 1.
As shown in Figure 11, the weakly-supervised semantic segmentation maintains the basic overall contour of the target thanks to the strong performance of the superpixel segmentation and region merging algorithm. This is also reflected in the details of the target. General image-level semantic segmentation can only guarantee the outline of the target but cannot identify some of its details; our algorithm, however, improves the recognition of certain details of the target while maintaining its basic shape. In (c) and (d), we can observe that the proposed algorithm is superior to the CCNN method and achieves better accuracy on the contour of the target. Although our algorithm is slightly inferior in the second set of figures, the overall outline of the target is still identified. In summary, our approach is superior to some weakly-supervised semantic segmentation methods such as CCNN, especially in its ability to maintain the contour of the target. Due to the nature of the region merging, however, it is difficult to capture the shape of some small objects; for example, the fifth image in Figure 11 does not identify the distant train, so there is still room for improvement in this respect. We leave this for future work.
We further performed experiments on the labeled PASCAL-CONTEXT dataset [45]. The dataset provides semantic labels for all targets, including grass, sky, and water. The model was trained on roughly 5K images (fine annotations only). We replaced the 21 categories in the original framework with 60 categories and used IoU to assess accuracy. In addition, a CRF was used as a post-processing step in the framework because the semantic labels cover the entire scene. Some examples from PASCAL-CONTEXT are shown in Figure 12.
Figure 12. Results on PASCAL-CONTEXT dataset.
Our results show that the strongly-supervised mean IoU (mIoU) on the PASCAL-CONTEXT dataset was 48.1%, while the score of the proposed super-pixel weakly-supervised method was 45.9%. Although our score was lower than that of strong supervision, it was 1.3% higher than our PASCAL VOC score, which shows that more label annotations can improve the precision of semantic segmentation.

4.5. Quantitative Analysis

In our experiments, we used several indicators (IoU, mIoU, PA) to evaluate our algorithm, and we observed that our method achieves better performance in weakly-supervised segmentation. First, using the IoU metric discussed in Section 4.3, Table 1 compares some contemporary weakly-supervised segmentation methods. Pinheiro and Collobert [34] proposed semantic segmentation based on the multi-instance learning (MIL) framework. Their model pays more attention to the rearrangement of pixels in image classification; hence, the algorithm is sensitive to the initialization of images and improves the accuracy of distinguishing correct pixels by using a small number of smoothing priors. The initial step of RMNN is to perform super-pixel segmentation, and the constrained optimization relies on superpixel merging during iterative training. For each image I, we obtain a set of super-pixel labels $sign_{C_0}$. Our superpixel merging encourages multiple pixels to add constraints by merging to obtain the outline of the target. While this approach encourages certain pixels to take a specific label, it is usually insufficient to correctly label all the pixels. We therefore constrain the target by super-pixel merging to form super-pixel blocks and mark the target foreground as much as possible. This distinguishes the foreground and background labels, iteratively synthesizes larger areas, and ensures that the final marking is as close as possible to the outline of the target. The EM-Adapt method [12] proposes an expectation maximization (EM) approach under weakly-supervised and semi-supervised conditions, and it focuses on the most distinctive part of an object (such as a human face) rather than capturing the entire object (such as the human body). In contrast, our method focuses on the overall contour of the target, but it is less sensitive to the recognition of some small objects. We compared the IoU values obtained for the weakly-supervised semantic segmentation categories on the PASCAL VOC 2012 dataset, and we can conclude that RMNN gives a significant improvement in overall performance over the other methods.
Table 1. Comparison of several weakly-supervised semantic segmentation methods on the PASCAL VOC 2012 dataset; the IoU score for each class of the VOC dataset is presented, and the best performance is highlighted in bold.
The CCNN [35] method proposes a series of constraints for semantic segmentation, using either a single constraint or a superposition of multiple constraints. Because this method can be used in several existing frameworks, we chose size as the constraint to compare with our method. The x-coordinate values are the 20 categories of PASCAL VOC 2012. Figure 13, drawn from the data in Table 1, shows the IoU comparison between CCNN and the proposed RMNN.
Figure 13. RMNN and CCNN segmentation accuracy.
Secondly, we use the mean intersection over union (mIoU) (Equation (8)) and pixel accuracy (PA) [43] in Table 2. PA is the ratio of the number of correctly classified pixels to the total number of pixels; it is a standard indicator used in segmentation to compare ground truth and predicted results. mIoU is obtained by averaging the per-class IoU. Suppose the total number of classes is $(n+1)$ and $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$. Then $p_{ii}$ represents the number of true positives, while $p_{ij}$ and $p_{ji}$ are usually interpreted as false positives and false negatives, respectively [31].
Table 2. The mean intersection over union (mIoU) and pixel accuracy (PA) scores of the four methods.
$$mIoU = \frac{1}{n+1}\sum_{i=0}^{n} IoU_i, \qquad IoU_i = \frac{p_{ii}}{\sum_{j=0}^{n} p_{ij} + \sum_{j=0}^{n} p_{ji} - p_{ii}}$$
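A compact sketch of Equation (8), computing the per-class IoU from a confusion matrix and averaging over the (n + 1) classes; it assumes integer label maps with values 0–20.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """mIoU from Equation (8): IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii),
    averaged over all classes that appear in the confusion matrix."""
    pred, gt = pred.ravel(), gt.ravel()
    conf = np.bincount(num_classes * gt + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    ious = []
    for i in range(num_classes):
        denom = conf[i, :].sum() + conf[:, i].sum() - conf[i, i]
        if denom:
            ious.append(conf[i, i] / denom)
    return float(np.mean(ious))
```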
During the experiments, we considered only image-level labels and used superpixel region merging to improve the results of weakly-supervised semantic segmentation. This saves time annotating the target and does not require the user’s input. Table 3 compares our method with fully supervised methods in terms of mIoU and PA scores.
Table 3. Comparison of mIoU and pixel accuracy (PA) scores with pixel-level (fully supervised) semantic segmentation.
Weakly-supervised semantic segmentation performs poorly in terms of pixel accuracy and mIoU when compared with the more complex systems that use additional segmentation constraints. We believe that the key to this difference in performance is pixel-level segmentation information. Although RMNN works well, it lacks the important factor of object characteristics, so the gap with pixel-level supervision remains large. In future research, we plan to strengthen the image-level approach by extracting the color, texture, and other features of the image.

5. Conclusions and Future Work

This paper proposes a weakly-supervised semantic segmentation method that uses superpixel aggregation as an annotation. The method combines super-pixels with similar features using superpixel color and texture features; this forms an annotation of the target object and achieves the design objective. Specifically, we split the image into super-pixels and examined the detailed features of each superpixel. We experimented with the pros and cons of the three merging criteria and determined that the third criterion works best for merging the target’s super-pixels. Since the dataset in the experiment is PASCAL VOC 2012, there may be problems if a network trained on ImageNet is applied directly to PASCAL VOC, because the source dataset and the target dataset may be very different. This paper therefore uses transfer learning: we fine-tune the network (trained on ImageNet) with the small PASCAL VOC dataset so that it can be used on small datasets. Following the training flow diagram of this paper, we predict the semantics of the merged super-pixel regions, compare them against the ground truth, feed the intermediate results back to the optimization function, and iterate the optimization to the final optimal value. The experimental results show that the proposed RMNN approach can achieve more accurate segmentation results than state-of-the-art weakly-supervised segmentation systems. The mIoU score reached 44.6%.
Although our algorithm achieves a good mIoU score, it can still be improved. One limitation of the proposed approach is that it is not optimal for small objects; in future work, the distance between the elements covered by the convolution kernel of the network model needs to be increased so that the convolution is no longer performed over a contiguous region. We also plan to use dilated convolution to reduce the loss of small-object features. A long-term goal is to integrate the proposed method into popular big data systems (e.g., Spark), as image processing is considered one of the most data-intensive workloads in many scientific domains [46].

Author Contributions

Conceptualization, Q.J.; methodology, Q.J., X.C., and L.J.; project administration, L.J. and S.P.; software, L.J.; writing—original draft preparation, Q.J., and O.T.T.; writing—review and editing, Q.J., D.Z., and J.W.; validation, Q.J., L.J., and D.Z.

Funding

This work is in part supported by an NSFC Award under contract #61775139, the National Natural Science Foundation of China #61332009, and China Post-doctoral Science Foundation #2017M610230.

Acknowledgments

The research was partly supported by the program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning. Dongfang Zhao is in part supported by a Microsoft Azure Research Award and a Google Research Award.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xiao, Y.; Xing, C.; Zhang, T.; Zhao, Z. An Intrusion Detection Model Based on Feature Reduction and Convolutional Neural Networks. IEEE Access 2019, 7, 42210–42219. [Google Scholar] [CrossRef]
  2. Qin, Q.; Vychodil, J. Pedestrian Detection Algorithm Based on Improved Convolutional Neural Network. J. Adv. Comput. Intell. Intell. Inform. 2013, 21, 834–839. [Google Scholar] [CrossRef]
  3. Pfitscher, M.; Welfer, D.; do Nascimento, E.; Cuadros, M.A.; Gamarra, D.F. Users Activity Gesture Recognition on Kinect Sensor Using Convolutional Neural Networks and FastDTW for Controlling Movements of a Mobile Robot. Intell. Artif. 2019, 22, 121–134. [Google Scholar] [CrossRef][Green Version]
  4. Abdelwahab, M.A. Accurate Vehicle Counting Approach Based on Deep Neural Networks. In Proceedings of the 2019 International Conference on Innovative Trends in Computer Engineering (ITCE), Aswan, Egypt, 2–4 February 2019; pp. 1–5. [Google Scholar] [CrossRef]
  5. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  6. Hariharan, B.; Arbelaez, P.; Girshick, R.; Malik, J. Object Instance Segmentation and Fine-Grained Localization Using Hypercolumns. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 627–639. [Google Scholar] [CrossRef] [PubMed]
  7. Hariharan, B.; Arbelaez, P.; Girshick, R.; Malik, J. Simultaneous Detection and Segmentation. Lect. Notes Comput. Sci. 2014, 8695, 297–312. [Google Scholar]
  8. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  9. Mostajabi, M.; Yadollahpour, P.; Shakhnarovich, G. Feedforward semantic segmentation with zoom-out features. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3376–3385. [Google Scholar]
  10. Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning Hierarchical Features for Scene Labeling. IEEE Trans. Pattern Anal. 2013, 35, 1915–1929. [Google Scholar] [CrossRef] [PubMed]
  11. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.H.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  12. Papandreou, G.; Chen, L.-C.; Murphy, K.; Yuille, A.L. Weakly-and semi-supervised learning of a dcnn for semantic image segmentation. arXiv 2015, arXiv:1502.02734. [Google Scholar]
  13. Grangier, D.; Bottou, L.; Collobert, R. Deep convolutional networks for scene parsing. ICML Deep Learn. Workshop 2009, 3, 109. [Google Scholar]
  14. Lin, D.; Dai, J.F.; Jia, J.Y.; He, K.M.; Sun, J. ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3159–3167. [Google Scholar] [CrossRef]
  15. Dai, J.F.; He, K.M.; Sun, J. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 7–13 December 2015; pp. 1635–1643. [Google Scholar] [CrossRef]
  16. Liu, F.Y.; Lin, G.S.; Shen, C.H. CRF learning with CNN features for image segmentation. Pattern Recogn. 2015, 48, 2983–2992. [Google Scholar] [CrossRef]
  17. Shotton, J.; Winn, J.; Rother, C.; Criminisi, A. TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. Int. J. Comput. Vis. 2009, 81, 2–23. [Google Scholar] [CrossRef]
  18. Kohli, P.; Ladicky, L.; Torr, P.H.S. Robust Higher Order Potentials for Enforcing Label Consistency. Int. J. Comput. Vis. 2009, 82, 302–324. [Google Scholar] [CrossRef]
  19. Fulkerson, B.; Vedaldi, A.; Soatto, S. Class Segmentation and Object Localization with Superpixel Neighborhoods. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, Japan, 27 September–4 October 2009; pp. 670–677. [Google Scholar] [CrossRef]
  20. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-Time Human Pose Recognition in Parts from Single Depth Images. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
  21. Shotton, J.; Johnson, M.; Cipolla, R. Semantic texton forests for image categorization and segmentation. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 24–26 June 2008. [Google Scholar]
  22. Liu, X.; Yan, S.; Luo, J.; Tang, J.; Huango, Z.; Jin, H. Nonparametric Label-to-Region by search. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010. [Google Scholar]
  23. Liu, S.; Yan, S.; Zhang, T.; Xu, C.; Liu, J.; Lu, H. Weakly Supervised Graph Propagation Towards Collective Image Parsing. IEEE Trans. Multimed. 2012, 14, 361–373. [Google Scholar] [CrossRef]
  24. Aminpour, A.; Razzaghi, P. Weakly Supervised Semantic Segmentation Using Hierarchical Multi-Image Model. In Proceedings of the 2018 26th Iranian Conference on Electrical Engineering (ICEE), Mashhad, Iran, 8–10 May 2018; pp. 1634–1640. [Google Scholar]
  25. Zhang, L.; Li, H.; Shen, P.Y.; Zhu, G.M.; Song, J.; Shah, S.A.A.; Bennamoun, M.; Zhang, L. Improving Semantic Image Segmentation with a Probabilistic Superpixel-Based Dense Conditional Random Field. IEEE Access 2018, 6, 15297–15310. [Google Scholar] [CrossRef]
  26. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  27. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. In Proceedings of the Computer Vision & Pattern Recognition 2014, Columbus, OH, USA, 23–28 June 2014; pp. 1717–1724. [Google Scholar]
  28. Jiang, Y.; Wu, D.; Deng, Z.; Qian, P.; Wang, J.; Wang, G.; Chung, F.L.; Choi, K.S.; Wang, S. Seizure Classification from EEG Signals using Transfer Learning, Semi-Supervised Learning and TSK Fuzzy System. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 2270–2284. [Google Scholar] [CrossRef] [PubMed]
  29. Seker, A. Evaluation of Fabric Defect Detection Based on Transfer Learning with Pre-trained AlexNet. In Proceedings of the 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 28–30 September 2018. [Google Scholar]
  30. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  31. Zhao, W.; Zhang, H.; Yan, Y.; Fu, Y.; Wang, H. A Semantic Segmentation Algorithm Using FCN with Combination of BSLIC. Appl. Sci. 2018, 8, 500. [Google Scholar] [CrossRef]
  32. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Susstrunk, S. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Trans. Pattern Anal. 2012, 34, 2274–2281. [Google Scholar] [CrossRef]
  33. Wang, S.; Lu, H.C.; Yang, F.; Yang, M.H. Superpixel Tracking. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 1323–1330. [Google Scholar]
  34. Pinheiro, P.O.; Collobert, R. From Image-level to Pixel-level Labeling with Convolutional Networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1713–1721. [Google Scholar]
  35. Pathak, D.; Krahenbuhl, P.; Darrell, T. Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1796–1804. [Google Scholar] [CrossRef]
  36. Zhang, Y.; Li, X.; Gao, X.; Zhang, C. A Simple Algorithm of Superpixel Segmentation with Boundary Constraint. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 1502–1514. [Google Scholar] [CrossRef]
  37. Haris, K.; Efstratiadis, S.N.; Maglaveras, N.; Katsaggelos, A.K. Hybrid image segmentation using watersheds and fast region merging. IEEE Trans. Image Process. 1998, 7, 1684–1699. [Google Scholar] [CrossRef]
  38. Wang, A.P.; Wang, S.G.; Lucieer, A. Segmentation of multispectral high-resolution satellite imagery based on integrated feature distributions. Int. J. Remote Sens. 2010, 31, 1471–1483. [Google Scholar] [CrossRef]
  39. Hu, Z.W.; Wu, Z.C.; Zhang, Q.; Fan, Q.; Xu, J.H. A Spatially-Constrained Color-Texture Model for Hierarchical VHR Image Segmentation. IEEE Geosci. Remote Sens. 2013, 10, 120–124. [Google Scholar] [CrossRef]
  40. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Liu, M.Y.; Tuzel, O.; Ramalingam, S.; Chellappa, R. Entropy-Rate Clustering: Cluster Analysis via Maximizing a Submodular Function Subject to a Matroid Constraint. IEEE Trans. Pattern Anal. 2014, 36, 99–112. [Google Scholar] [CrossRef] [PubMed]
  43. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857. [Google Scholar]
  44. Jaccard, P. The Distribution of Flora in the Alpine Zone. New Phytol. 2010, 11, 37–50. [Google Scholar] [CrossRef]
  45. Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.G.; Lee, S.W.; Fidler, S.; Urtasun, R.; Yuille, A. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 891–898. [Google Scholar]
  46. Mehta, P.; Dorkenwald, S.; Zhao, D.; Kaftan, T.; Cheung, A.; Balazinska, M.; AlSayyad, Y. Comparative evaluation of big-data systems on scientific image analytics workloads. In Proceedings of the 2017 VLDB Endow, Washington, DC, USA, 10–11 August 2017; pp. 1226–1237. [Google Scholar] [CrossRef]
