Weakly Supervised Conditional Random Fields Model for Semantic Segmentation with Image Patches

Image semantic segmentation (ISS) segments an image into regions, each labeled with a semantic category. Most existing ISS methods are based on fully supervised learning, which requires pixel-level labeling for training the model. Such labeling is very time-consuming and labor-intensive, yet still subject to manual errors and subjective inconsistency. To tackle these difficulties, a weakly supervised ISS approach is proposed, in which the challenging problem of label inference from image level to pixel level is addressed using image patches and conditional random fields (CRF). An improved simple linear iterative clustering (SLIC) algorithm is employed to extract superpixels for image segmentation. Specifically, it generates a different number of superpixels for each image, which is used to guide the process of image patch extraction based on the image-level label information. Based on the extracted image patches, the CRF model is constructed for inferring semantic class labels, using the potential energy function to map image-level labels to pixel-level labels. Finally, the patch based CRF (PBCRF) model is used to accomplish the weakly supervised ISS. Experiments conducted on two publicly available benchmark datasets, MSRC and PASCAL VOC 2012, demonstrate that the proposed algorithm yields very promising results compared to several state-of-the-art ISS methods, including some deep learning-based models.


Introduction
Different from conventional image segmentation, by combining image segmentation and object recognition, image semantic segmentation (ISS) divides an image into many image blocks to identify the semantic category of each block [1]. It has been widely applied in semantic information extraction from images for scene understanding and object recognition [2,3].
In general, ISS approaches can be divided into two main categories, i.e., fully supervised and weakly supervised [4]. Fully supervised ISS requires pixel-level labeling of the whole image, which is often done manually. Skilled annotators need nearly 10 min on average to label one picture, which is quite time-consuming and labor-intensive [5]. Considering the difficulty of obtaining pixel-level labels for fully supervised learning, weakly supervised ISS is more desirable: it does not require pixel-level labeling of whole images, so the associated labor cost and time consumption can be reduced significantly. As a result, weakly supervised ISS has become a research hotspot in recent years.

1. We propose an image patch and CRF based weakly supervised ISS algorithm (IPCRFWSS), which can successfully achieve semantic label inference and prediction;
2. We propose an algorithm for automatic estimation of the recommended number of superpixels for different images, which has significantly improved the efficiency and accuracy of image segmentation as it can be used to generate image patches for image-level labels;
3. A PBCRF model is introduced for semantic class inference from image-level to pixel-level labels. With the trained patch based CRF, class correlation and similarity functions are added into the pairwise potential function to improve the accuracy and robustness of semantic label inference;
4. Experimental results on two publicly available datasets have validated the efficacy and efficiency of the proposed approach, which has outperformed quite a few state-of-the-art methods, including some deep learning models.

Conventional Image Segmentation
As a fundamental task for semantic image processing and image understanding, image segmentation divides an image into different non-overlapping regions according to color, texture and other visual properties. At present, image segmentation methods can be roughly divided into three categories [8], i.e., region-based [9], edge-based [10], and cluster-based methods [11]. Among them, region-based segmentation is the most popular.
According to the consistency within a region and the inconsistency between regions, region-based segmentation methods can be further divided into three groups: thresholding [12], region growing [13], and splitting and merging [14] based techniques. The advantage of thresholding is that it is easy to implement and has low computational complexity. However, it ignores the spatial position information of the image, which makes it difficult to balance the segmentation quality between global and local areas. The region growing method determines a suitable region by starting from a seed point and applying a growing criterion, which is often measured by the similarity between the region formed so far and the pixel under processing. This method improves the performance of image segmentation, but it is sensitive to noise and can easily lead to over-segmentation [13]. The splitting and merging method divides the image into many small regions by local similarity, and neighboring small regions are then merged iteratively if they are sufficiently similar to each other [15].

Superpixel Based Image Segmentation
As a newly proposed splitting and merging method, superpixel based image segmentation has achieved great progress in recent years. As first proposed by Ren et al. [16], a superpixel is defined as a sub-region composed of adjacent pixels with similar texture, color and other visual characteristics. Superpixel based image segmentation is the process of clustering pixels into superpixels, and relevant algorithms can be roughly divided into graph-based and gradient descent-based methods. Graph-based methods mainly include normalized cut (NC) [17], superpixel lattices (SL) [18], and the Felzenszwalb and Huttenlocher (FH) algorithm [19]. Typical gradient descent-based methods include watershed [20], Meanshift [21] and the simple linear iterative clustering (SLIC) algorithm [22].
SLIC was first proposed in [22]. Hsu et al. [23] proposed an image segmentation algorithm based on SLIC superpixels, with region merging based on 5-D spectral clustering and boundary-focused region clustering. Ning et al. [24] proposed a novel image segmentation method based on interactive region merging, in which users need to roughly mark the location and region of the target and background. Gu et al. [25] proposed an algorithm that adds the color covariance matrix to the superpixel features to improve the accuracy of image segmentation.
Compared with pixels, the advantages of superpixels are reflected in two aspects: they reduce the number of objects to be processed, and they lower the computational complexity of subsequent processing. The number of superpixels can be controlled by adjusting the parameter K; however, K needs to be set manually. If K is too large, the advantage of superpixel segmentation is lost and unnecessary computational complexity is introduced. If K is too small, the accuracy of the segmentation results is reduced. Therefore, it is challenging to manually set an appropriate number of superpixels.
Image patches are larger image regions obtained by merging superpixels based on weakly supervised information. Each image patch has only one semantic category, but a target region can be composed of multiple image patches. Using image patches rather than superpixels as the basic unit of weakly supervised ISS has two main advantages. First, the number of image patches is much smaller than the number of superpixels, which can greatly reduce the complexity of the algorithm. Second, image patches have neater object boundaries, which can improve the accuracy of semantic label inference.

Image Semantic Segmentation
In the past years, ISS has attracted much attention and become one of the hotspots of computer vision. Existing ISS methods fall into two main categories: fully supervised and weakly supervised semantic segmentation algorithms. The difference between them is that full supervision requires pixel-level labels for learning, while weak supervision only needs image-level labels, which greatly reduces the human and material cost of manual annotation [26]. Weakly supervised semantic segmentation therefore has good application prospects, although many issues remain to be addressed. Weakly supervised semantic segmentation methods can be roughly divided into two categories: traditional methods and deep learning methods.
Traditional semantic segmentation needs a process of feature extraction followed by several different classifiers to complete the segmentation. Duygulu et al. [27] first proposed the concept of Blob-World, and used image-level labels to train a classifier for image semantic segmentation. Zhang et al. [28] proposed an effective support vector machine classifier based on a spatial sparse reconstruction method. The classifier is trained with noisy data and denoised by a subspace reconstruction method, and the optimal parameters are obtained by iterative optimization. Vezhnevets et al. [29] proposed the multi-image model (MIM), which uses a conditional random field model in which the unary (one-dimensional) potential energy function is established on single superpixels and the pairwise (two-dimensional) potential energy function is established on superpixel pairs. The semantic segmentation result is obtained by an approximate solution of the CRF parameters. Liu et al. [30] proposed weakly-supervised dual clustering for image segmentation and label correspondence. Zhang et al. [31] proposed a graph model for recovering the pixel labels based on the appearance similarity of training image superpixels. This model is different from the traditional classifier and has achieved good results when learning multi-class kernel matrices. Wang et al. [32] proposed a probabilistic graph model called TCPR for weakly supervised labeling. This method adds a neighborhood context constraint to the MRF model and can use an automatic inference mechanism to infer category labels.
Appl. Sci. 2020, 10, 1679
Deep learning based semantic segmentation methods generally consist of a general network framework and a segmentation network. Improvements in network structure have also brought great improvements in the precision of image processing. To solve the problem of feature loss caused by the pooling and down-sampling operations of the network framework, Noh et al. [33] proposed a deconvolution neural network, which combined the prediction method of the network with a fully convolutional network to achieve the semantic segmentation task. Farabet et al. [34] proposed multi-scale convolutional neural network based deep learning for semantic segmentation, in which pixel-level features including texture, shape and context information were extracted. Qi et al. [35] proposed a framework to reduce the error in weakly supervised learning with image-level supervision, in which semantic segmentation and object localization are unified to improve segmentation performance. Wei et al. [36] proposed a framework for generating localization maps by hypotheses-aware classification and cross image contextual refinement. Chen et al. [3] proposed to refine the pixel-wise prediction from the last DCNN layer with a fully connected CRF and achieved better segmentation results. Papandreou et al. [37] developed expectation maximization (EM) methods for training semantic image segmentation models under weakly supervised and semi-supervised settings. Wei et al. [38] proposed a simple to complex (STC) framework, which used simple image-level labels to enhance the Initial-DCNNs network, and then used the Enhanced-DCNNs network to complete more complex ISS tasks. Although these DCNN-based methods improved the performance of weakly supervised ISS, they rely on the precision of pre-trained classification networks.

The Proposed Method
We propose a novel framework for weakly supervised ISS based on image patches and CRF. As shown in Figure 1, the flowchart of the proposed framework contains three main parts, i.e., superpixel generation and image segmentation, CRF model construction, and CRF based semantic inference of image patches for ISS. First, the improved SLIC algorithm is used to segment the training images into superpixels, which are merged into image patches based on the weakly supervised information. Second, the class correlation function and similarity function are introduced into the CRF model to construct a CRF model for inferring semantic class labels. Finally, the trained PBCRF model is applied for weakly supervised semantic segmentation of images. Relevant details of these three parts within the proposed framework are presented as follows.

SLIC Superpixel Generation
First, the simple linear iterative clustering (SLIC) algorithm is used to segment the training images into superpixels. Second, superpixels are merged into image patches based on the image-level labels. The termination condition is that the number of patches equals three times the number of image-level labels. Good superpixel generation and merging results greatly help the construction of the CRF model. Therefore, we have improved the SLIC algorithm so that K can be adaptively determined for each input image, based on the color information of the image. To better reflect human visual perception, the color space of the image is converted from red, green and blue (RGB) to hue, saturation and value (HSV). To simplify the calculation, the H, S, and V components in the HSV space are quantized into 16, 5, and 5 levels respectively, which are further combined to yield a one-dimensional feature vector Z as follows.
$$Z = H Q_S Q_V + S Q_V + V \tag{1}$$
where $Q_S$ and $Q_V$ are the quantization grades of S and V respectively ($Q_S = 5$, $Q_V = 5$), so that
$$Z = 25H + 5S + V \tag{2}$$
The median value of all elements in Z is determined as m, which is used as the initial value for K, i.e., $K = \lceil m \rceil$. The brackets around m denote rounding up, which ensures an integer value for K.
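The estimation of K described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes uniform quantization bins for H, S and V (the text only gives the number of levels), and the names `quantize` and `estimate_k` are hypothetical.

```python
import colorsys
import math
from statistics import median

# Quantization levels stated in the text: H -> 16, S -> 5, V -> 5.
H_LEVELS, S_LEVELS, V_LEVELS = 16, 5, 5


def quantize(value, levels):
    """Map a value in [0, 1] to an integer bin in [0, levels - 1].

    Uniform bins are an assumption; the text does not give the bin edges.
    """
    return min(int(value * levels), levels - 1)


def estimate_k(rgb_pixels):
    """Estimate the SLIC superpixel count K from (r, g, b) tuples in 0..255."""
    z_values = []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hq = quantize(h, H_LEVELS)
        sq = quantize(s, S_LEVELS)
        vq = quantize(v, V_LEVELS)
        # Z = H * Q_S * Q_V + S * Q_V + V, i.e. Z = 25H + 5S + V
        z_values.append(hq * S_LEVELS * V_LEVELS + sq * V_LEVELS + vq)
    # K is the median of Z, rounded up to an integer.
    return math.ceil(median(z_values))


# One pure-red and one pure-blue pixel: Z = 24 and Z = 274, median 149.
print(estimate_k([(255, 0, 0), (0, 0, 255)]))  # 149
```

With these levels Z lies between 0 and 399, so the estimated K stays in a range of a few hundred superpixels. A very dark image can give a K close to zero under this sketch, so a practical implementation would probably need a lower bound on K.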

SLIC Merging Based on Image-level Labels
In this process, regional feature similarity is taken as the criterion for superpixel merging. Color, texture and scale invariant feature transform (SIFT) features are extracted to describe each superpixel. The similarity of superpixels is determined from the extracted feature vectors, and adjacent superpixels are merged according to their similarity to obtain the image patches.
Considering the spatial information of two superpixels i and j, and denoting N(i) as the set of superpixels neighboring i, the adjacency matrix B(i, j) can be defined as:
$$B(i, j) = \begin{cases} 1, & j \in N(i) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
In this way, the similarity function $\Psi_{i,j}$ of Equation (4) is defined from $S^c_{ij}$, $S^t_{ij}$ and $S^s_{ij}$, which are the Euclidean distances between the color, texture and SIFT features extracted from the two superpixels i and j, respectively; $\Psi_{i,j}$ denotes the overall distance between the two superpixels, and $\lambda_i$ are the adjusting weights for the three features.
If $\Psi_{i,j}$ is less than a threshold T, the two superpixels are merged. The termination condition is set as P = 3L, where P is the number of target image patches in the image, and L is the number of labelled categories within the image. The flow chart of the superpixel merging algorithm is shown in Figure 2.
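The merging loop can be sketched as below. Several details are assumptions made for illustration only: the overall distance is taken to be a weighted sum of the three Euclidean distances, the most similar adjacent pair is merged first, and the features of a merged region are the size-weighted mean of its parts. The function names and the dictionary layout are hypothetical.

```python
import math


def euclidean(f1, f2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def overall_distance(a, b, weights):
    """Weighted sum of per-feature distances (the weighted-sum form is an assumption)."""
    return sum(w * euclidean(a[k], b[k]) for k, w in weights.items())


def merge_superpixels(features, adjacency, num_labels, threshold, weights):
    """Greedily merge adjacent superpixels until P = 3L regions remain.

    features:  {id: {"color": [...], "texture": [...], "sift": [...]}}
    adjacency: {id: set of neighbouring ids}, assumed symmetric
    Returns {region id: set of original superpixel ids}.
    """
    feats = {i: {k: list(v) for k, v in f.items()} for i, f in features.items()}
    adj = {i: set(n) for i, n in adjacency.items()}
    members = {i: {i} for i in feats}
    target = 3 * num_labels  # termination condition P = 3L
    while len(members) > target:
        # Only adjacent pairs (B(i, j) = 1) are candidates for merging.
        pairs = [(overall_distance(feats[i], feats[j], weights), i, j)
                 for i in adj for j in adj[i] if i < j]
        if not pairs:
            break
        dist, i, j = min(pairs)
        if dist >= threshold:  # no adjacent pair is closer than T
            break
        ni, nj = len(members[i]), len(members[j])
        for k in feats[i]:
            feats[i][k] = [(ni * a + nj * b) / (ni + nj)
                           for a, b in zip(feats[i][k], feats[j][k])]
        members[i] |= members.pop(j)
        for n in adj.pop(j):  # rewire the neighbours of j to i
            adj[n].discard(j)
            if n != i:
                adj[n].add(i)
                adj[i].add(n)
        del feats[j]
    return members


# Six superpixels in a chain, one image-level label (target P = 3).
colors = [0.0, 1.0, 10.0, 11.5, 30.0, 32.5]
features = {i: {"color": [c], "texture": [0.0], "sift": [0.0]}
            for i, c in enumerate(colors)}
adjacency = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
weights = {"color": 1.0, "texture": 1.0, "sift": 1.0}
patches = merge_superpixels(features, adjacency, 1, 5.0, weights)
print(sorted(sorted(m) for m in patches.values()))  # [[0, 1], [2, 3], [4, 5]]
```

Recomputing all pair distances in every iteration keeps the sketch short; a priority queue would be the usual choice for images with hundreds of superpixels.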

The Patch Based CRF Model Construction
Semantic inference has always been crucial to weakly supervised ISS as it directly affects the segmentation results. The extracted image patches are used as the nodes of the CRF to construct an undirected graph G(V, E), where V and E denote the set of nodes and the set of edges connecting them, respectively. Each image patch needs to be assigned a unique class label, which is determined via the CRF model.
On this basis, given the observed field x formed by the extracted patches, the conditional probability distribution of the CRF model for y is defined as:
$$P(y \mid x) \propto \exp\left(-E(x, y)\right), \qquad \hat{y} = \arg\max_{y} P(y \mid x) \tag{5}$$
where $P(y \mid x)$ is a conditional probability, $E(x, y)$ is an energy function, and the final category label assignment $\hat{y}$ is the one that maximizes the posterior probability.
The energy function of the patch based CRF model can be defined as:
$$E(x, y) = w_1 \sum_{i \in V} \varphi_i(y_i, x_i) + w_2 \sum_{(i,j) \in E} \varphi_{ij}(y_i, y_j, x_i, x_j) \tag{6}$$
where $w_1$ and $w_2$ are weights, $\varphi_i(y_i, x_i)$ is a unary potential energy function, which measures the probability that a node i is labeled as $y_i$ for a given x, and $\varphi_{ij}(y_i, y_j, x_i, x_j)$ is a pairwise potential energy function between adjacent nodes i and j.
Since the probability in Equation (5) decreases as the energy increases, the solution $\hat{y}$ of Equation (5) is the labeling that minimizes the energy in Equation (6), thus Equation (5) is equivalent to Equation (7) below.
$$\hat{y} = \arg\min_{y} E(x, y) \tag{7}$$
Let $X = [x_1, \cdots, x_P]$ be an image containing P image patches, where $x_i$ is the i-th image patch. The corresponding semantic labels are $y = [y_1, \cdots, y_P]$, where $y_i \in \{1, \cdots, L\}$ and L denotes the total number of categories. The label information at the image level can be encoded as $l(x_i) = [l_1, l_2, \cdots, l_L]^T$, where $l_i \in \{0, 1\}$; $l_i = 1$ means that the i-th category appears in the image, and $l_i = 0$ means that it does not.
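To make the relation between Equations (6) and (7) concrete, the sketch below evaluates the energy of a labeling and finds the minimizer by exhaustive search on a tiny graph. The unary costs and the Potts-style pairwise term are made-up illustrative values, not the potentials of the paper, and `energy`, `map_labelling` and `potts` are hypothetical names.

```python
from itertools import product


def energy(labels, unary, pairwise, edges, w1=1.0, w2=1.0):
    """E(x, y) = w1 * sum_i phi_i(y_i) + w2 * sum_(i,j) phi_ij(y_i, y_j)."""
    unary_term = sum(unary[i][y] for i, y in enumerate(labels))
    pair_term = sum(pairwise(i, j, labels[i], labels[j]) for i, j in edges)
    return w1 * unary_term + w2 * pair_term


def map_labelling(unary, pairwise, edges, num_labels, w1=1.0, w2=1.0):
    """Exhaustive argmin of the energy; only feasible for a handful of patches."""
    n = len(unary)
    return min(product(range(num_labels), repeat=n),
               key=lambda y: energy(y, unary, pairwise, edges, w1, w2))


def potts(i, j, yi, yj):
    """Illustrative pairwise term: constant penalty when neighbours disagree."""
    return 0.0 if yi == yj else 0.5


# Three patches in a chain, two candidate labels; unary[i][y] is the cost of label y.
unary = [[0.1, 2.0], [1.0, 1.1], [2.0, 0.2]]
edges = [(0, 1), (1, 2)]
best = map_labelling(unary, potts, edges, num_labels=2)
print(best)  # (0, 0, 1)
```

Exhaustive search grows as L^P, which is why it is only usable here as a reference for very small P; the number of patches per image is bounded by 3L, but that is still too large for brute force on real data.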
The PBCRF model assigns similar image patches to the same semantic category and less similar patches to different semantic categories. In the process of assigning each image patch an appropriate semantic label, the unary potential energy function of the CRF is formulated as Equation (8), where $Z(x)$ is the normalization factor and $N(i)$ refers to the set of image patches adjacent to $x_i$; $l$ is the value of the image label, and $l_i(\cdot)$ equals either 0 or 1; $l(x_i^*) \in R^L$ is the true label, $x_i$ is the element of $l_i$.
Furthermore, in order to assign an appropriate semantic label to each superpixel, category correlation and similarity functions are added to the pairwise potential function. The pairwise potential energy function is defined by Equation (9), where $\delta$ adjusts the width of the Gaussian kernel and is set to $\delta = 1$ in the experiments, $\Psi(x_i)$ is the feature descriptor of $x_i$, and $D(x_i)$ is the distance feature of $x_i$.
Category correlation information is very important for semantic label inference. Let $l = [l(x_1), l(x_2), \cdots, l(x_P)]^T \in R^{P \times L}$ be the category labels of the image and L the total number of label categories. The category correlation function is defined by Equation (10), in which $P(l(x_i), l(x_j))$ is the joint probability of the class labels $l(x_i)$ and $l(x_j)$, and $P(l(x_j))$ is the probability of the class label $l(x_j)$. At the same time, the cosine similarity function of Equation (11) is used to measure the similarity between semantic categories. In this way, semantic label inference is transformed into minimizing the energy function of the conditional random field, and the semantic category of each image patch is the result of minimizing the energy function.
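The two ingredients of the pairwise term can be illustrated as follows. Equations (10) and (11) are not reproduced in this text, so the forms used here are assumptions: the correlation is taken as the ratio of the joint probability to the marginal probability, estimated from image-level label vectors, and the cosine similarity is computed between two binary label vectors. Both function names are hypothetical.

```python
import math


def category_correlation(image_labels, a, b):
    """Estimate P(a, b) / P(b) from binary image-level label vectors.

    image_labels: list of 0/1 vectors, one per training image.
    The ratio form is an assumed reading of the description of Equation (10).
    """
    n = len(image_labels)
    p_b = sum(vec[b] for vec in image_labels) / n
    p_ab = sum(vec[a] * vec[b] for vec in image_labels) / n
    return p_ab / p_b if p_b > 0 else 0.0


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


# Four training images, three categories (columns).
image_labels = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
print(category_correlation(image_labels, 0, 1))  # category 0 given category 1
print(cosine_similarity([1, 1, 0], [1, 0, 0]))
```

In this toy set, category 1 appears in three images and co-occurs with category 0 in two of them, so the estimated correlation is 2/3.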

CRF Based Semantic Inference of Image Patches for ISS
After each image patch is assigned the appropriate semantic label, the image patches belonging to the same class are put together. According to Equation (6), the PBCRF model is constructed, and the mapping issue between category labels and image patches is transformed into a problem of minimizing the energy function. The main steps of CRF based semantic label inference are shown in Algorithm 1.
Algorithm 1: Semantic label inference based on CRF
Input: Image patches P of training images, image-level semantic labels and parameters
Output: Semantic segmentation results $\hat{y}$
Step 1: Randomly arrange the training images
Step 2: Construct the undirected graph G(V, E) with superpixels as nodes
Step 3: Calculate the unary potential energy function $\varphi_i(y_i, x_i)$ according to Equation (8)
Step 4: Calculate the class correlation function $t(y_i, y_j)$ and the cosine similarity function $\mu(y_i, y_j)$ by Equation (10) and Equation (11), and the pairwise potential energy function $\varphi_{ij}(y_i, y_j, x_i, x_j)$ by Equation (9)
Step 5: Construct the potential energy function of the patch based CRF by Equation (6)
Step 6: Obtain the semantic segmentation result $\hat{y}$ by minimizing the potential energy function
In Section 3.2, we add the category correlation and similarity information to the pairwise potential energy function for semantic label inference.
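The text does not state which solver is used for the minimization in Step 6. As one possible stand-in, the sketch below uses iterated conditional modes (ICM), a simple local search that relabels one node at a time. The potentials are illustrative placeholder values, and `icm` and `potts` are hypothetical names.

```python
def icm(unary, neighbours, pairwise, w1=1.0, w2=1.0, max_sweeps=20):
    """Iterated conditional modes: an approximate minimizer of the CRF energy.

    unary[i][y]   cost of giving node i the label y
    neighbours[i] list of nodes adjacent to node i
    pairwise(a,b) cost of adjacent nodes taking labels a and b
    """
    num_labels = len(unary[0])
    # Start from the labelling that minimizes the unary term alone.
    labels = [min(range(num_labels), key=lambda y: u[y]) for u in unary]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(unary)):
            def local_energy(y, i=i):
                # Part of the total energy that depends on the label of node i.
                pair = sum(pairwise(y, labels[j]) for j in neighbours[i])
                return w1 * unary[i][y] + w2 * pair
            best = min(range(num_labels), key=local_energy)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # converged to a local minimum
            break
    return labels


def potts(a, b):
    """Illustrative pairwise cost: constant penalty when labels differ."""
    return 0.0 if a == b else 0.5


# The middle node weakly prefers label 1, but both neighbours prefer label 0.
unary = [[0.0, 1.0], [0.6, 0.5], [0.0, 1.0]]
neighbours = [[1], [0, 2], [1]]
print(icm(unary, neighbours, potts))  # [0, 0, 0]
```

ICM only guarantees a local minimum; graph cuts or message passing are common alternatives for this kind of energy, and any of them could take the place of this step.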

Experiments and Discussion
In this section, comprehensive experiments on two publicly available datasets, MSRC and PASCAL VOC 2012, are used to evaluate the performance of our proposed IPCRFWSS method for ISS. Relevant details including the description of the datasets, parameter settings and benchmarking with several state-of-the-art approaches are presented as follows.

Dataset Description
Comparative experiments were conducted on two standard datasets, MSRC and PASCAL VOC 2012, both of which are multi-class datasets including many common natural scenes, as detailed below.
MSRC: A multi-class dataset which contains 591 pictures in 21 categories, of which ~80% have multiple categories. We divide the dataset into training and test sets in the same way as in [39]. As shown in Figure 3, the 21 categories include: aeroplane, building, bike, bird, book, body, boat, cow, car, chair, cat, dog, face, flower, grass, road, sheep, sky, sign, tree, and water.
PASCAL VOC 2012: Serving as the segmentation benchmark for weakly supervised ISS for years [40], this dataset contains 20 object categories and one background category. It contains three parts: training (1464 images), validation (1449 images) and testing (1459 images). In this multi-class dataset, almost every image contains 2 to 4 categories, and most of the images have a complicated background. Figure 4 shows some typical images and the corresponding ground truth. These pictures can be subdivided into four main categories [41], i.e., Vehicles: aeroplane, bicycle, bus, car, motorbike, and train; Animals: bird, cat, cow, dog, horse, and sheep; Household: bottle, chair, dining table, potted plant, sofa, and TV/monitor; and Person, including adults and children though these are not explicitly labelled.

Parameter Settings and Evaluation
To show more intuitively the effect of the initial number of superpixels K on superpixel segmentation, the three most widely used evaluation indicators are adopted. Boundary recall (BR) measures the proportion of the target boundaries recovered by the superpixel boundaries. Achievable segmentation accuracy (ASA) is a performance upper-bound measure. Under-segmentation error (UE) measures the error generated by the algorithm when the segmented image is compared with the ground truth. The definitions of the evaluation indices used in [42] are adopted here. We compared different manually set numbers of initial superpixels on the MSRC dataset to visualize the effect of K on the result of superpixel segmentation, as shown in Figure 5. According to the three evaluation indicators, the performance of image segmentation improves as K increases within a specific range, and then saturates: although the value of K continues to increase, the segmentation performance remains basically unchanged. If the value of K is too large, it increases redundant information and the complexity of subsequent calculation.
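The three indicators can be computed from a superpixel label map and a ground-truth label map as sketched below. The exact definitions of [42] are not reproduced in this text, so commonly used formulations are assumed: ASA assigns each superpixel to its best-overlapping ground-truth segment, UE sums the smaller of the inside and outside parts of each superpixel per ground-truth segment, and BR counts ground-truth boundary pixels that have a superpixel boundary within a small tolerance. All function names are hypothetical.

```python
from collections import Counter


def overlap_counts(superpixels, ground_truth):
    """Pixel count for every (superpixel id, ground-truth id) pair."""
    counts = Counter()
    for sp_row, gt_row in zip(superpixels, ground_truth):
        for s, g in zip(sp_row, gt_row):
            counts[(s, g)] += 1
    return counts


def achievable_segmentation_accuracy(superpixels, ground_truth):
    """ASA: best accuracy reachable by labelling whole superpixels."""
    counts = overlap_counts(superpixels, ground_truth)
    best = {}
    for (s, _), c in counts.items():
        best[s] = max(best.get(s, 0), c)
    return sum(best.values()) / sum(counts.values())


def under_segmentation_error(superpixels, ground_truth):
    """UE (assumed variant): leakage of superpixels across ground-truth segments."""
    counts = overlap_counts(superpixels, ground_truth)
    size = Counter()
    for (s, _), c in counts.items():
        size[s] += c
    leak = sum(min(c, size[s] - c) for (s, _), c in counts.items())
    return leak / sum(counts.values())


def boundary_pixels(labels):
    """Pixels whose right or lower neighbour carries a different label."""
    h, w = len(labels), len(labels[0])
    return {(r, c) for r in range(h) for c in range(w)
            if (c + 1 < w and labels[r][c] != labels[r][c + 1])
            or (r + 1 < h and labels[r][c] != labels[r + 1][c])}


def boundary_recall(superpixels, ground_truth, tolerance=1):
    """BR: share of ground-truth boundary pixels near a superpixel boundary."""
    gt_b = boundary_pixels(ground_truth)
    sp_b = boundary_pixels(superpixels)
    if not gt_b:
        return 1.0
    offsets = range(-tolerance, tolerance + 1)
    hits = sum(1 for (r, c) in gt_b
               if any((r + dr, c + dc) in sp_b for dr in offsets for dc in offsets))
    return hits / len(gt_b)


ground_truth = [[0, 0, 1, 1], [0, 0, 1, 1]]
superpixels = [[0, 0, 0, 1], [0, 0, 0, 1]]  # boundary is one column off
print(achievable_segmentation_accuracy(superpixels, ground_truth))  # 0.75
print(under_segmentation_error(superpixels, ground_truth))          # 0.5
print(boundary_recall(superpixels, ground_truth))                   # 1.0
```

The toy example shows why the indicators are reported together: the shifted boundary is still recovered within the one-pixel tolerance (BR = 1.0), while ASA and UE both register the pixels that leak across the true boundary.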
The accuracy of superpixel segmentation directly affects the results of subsequent ISS, and the value of K determines the size and number of superpixels. If the value of K is too small, good segmentation results cannot be achieved, while too large a K brings redundant information. Therefore, an appropriate K helps to achieve good segmentation results. For visual assessment, we conduct a comparison experiment on the number of superpixels K in SLIC superpixel segmentation. As shown in Figure 6 and Table 1, the proposed algorithm is feasible in generating satisfactory results on different cases.
Figure 6b,e show the results of image segmentation from our algorithm with the parameter K being set as 310 and 332, respectively. To better assess the effectiveness of the algorithm, the experiment compares the segmentation results for the K obtained by our proposed algorithm with those for K around 100 (K is set manually with traditional SLIC). Figure 6a is the result of K = 250, where the head and tail of the plane are not well segmented. Figure 6d shows the segmentation result of K = 250, where the outline of the plane and some details are not segmented very well. Figure 6b,e have better segmentation effects.
In Table 1, although the number of superpixels increases, the F1-score does not increase significantly. The proposed method determines the number of superpixels for each image without multiple attempts to find an appropriate value, which saves running time and improves efficiency.

Comparing with the State-of-the-Art
Here, we perform a group of experiments to evaluate the performance of the proposed weakly supervised ISS method. For reference, we compare it with state-of-the-art methods such as PLSA [43], WSDC [30], Textonboost [39] and MIM [29]; Table 2 shows the segmentation performance on the MSRC dataset. The proposed algorithm is also compared with CCNN [44], EM-Adapt [37], MIL-ILP-seg [45], SN-B [36] and H&M [46]. The experiments with these methods are carried out on the PASCAL VOC 2012 dataset, and the segmentation performance is shown in Table 3. The performance is measured in terms of pixel intersection-over-union (IoU) and mean intersection-over-union (mIoU) across 21 classes. In Tables 2 and 3, each column represents the accuracy of one semantic class, and the last column is the average accuracy over all classes. Values in bold represent the best segmentation performance for that category.
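For readers unfamiliar with the metric, the following is a minimal sketch of per-class IoU and mIoU computed from a confusion matrix, where IoU for a class is TP / (TP + FP + FN). The function names and the toy labels are assumptions for illustration. The sketch does not handle ignored or void pixels, and classes absent from both ground truth and prediction are simply skipped in the mean, which may differ from the evaluation protocol used for Tables 2 and 3.

```python
def confusion_matrix(gt, pred, num_classes):
    """cm[g][p] counts pixels with ground-truth class g predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def per_class_iou(cm):
    """IoU per class: TP / (TP + FP + FN); None if the class never occurs."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c pixels that were missed
        fp = sum(cm[r][c] for r in range(n)) - tp   # other pixels labelled as c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    if not valid:
        raise ValueError("no class occurs in ground truth or prediction")
    return sum(valid) / len(valid)


# Toy example: six pixels, three classes (flattened label maps).
gt = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

ious = per_class_iou(confusion_matrix(gt, pred, num_classes=3))
print(ious)
print(mean_iou(ious))
```

In practice the confusion matrix is usually accumulated over all test images before the ratios are taken, so that large and small images contribute in proportion to their pixel counts.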
As seen in Table 2, our method provides competitive results when compared with the state-of-the-art methods on the MSRC dataset. Although its accuracy is not as high as that of other methods in some categories, its overall mIoU is the best among the group. Specifically, our approach produces the best results in six categories, whilst the second and third best approaches overall, MIM and Textonboost, are dominant in eight and six categories, respectively. This shows that our approach balances performance across classes to achieve good overall results.
For the results on the PASCAL VOC 2012 dataset in Table 3, our results are the second best in terms of mIoU and are quite comparable to the best ones, produced by SN-B, a deep learning-based model. Moreover, our approach significantly outperforms another deep learning model, CCNN, and two other approaches, MIL-ILP-seg and EM-Adapt. SN-B produces the best results in nine categories of objects, whilst our approach produces the best in seven, with an overall mIoU that is only 1% lower. Again, the proposed approach seems to be more robust over different categories.
To show the performance of our proposed algorithm more intuitively, extensive experiments were performed on the PASCAL VOC 2012 dataset. As shown in Figure 7, the experimental results are compared with the ground truth. The comparison in Figure 7 shows that better segmentation results can be achieved when the image object contains only one dominant (merged) superpixel or when the background is relatively simple. In contrast, when the background of the image is more complex, the accuracy of ISS is reduced. In addition, when there are many objects in the image, occlusion or small shadows between these objects affect the result of ISS.

Conclusions
In this paper, a novel PBCRF model is proposed for ISS with image-level labels, which provides an effective solution to the weakly supervised ISS problem. It has three advantages over existing approaches. First, based on the improved SLIC algorithm, optimal numbers of superpixels are automatically estimated for different images, rather than using a fixed parameter, to improve the accuracy of image segmentation. Second, taking an image patch as the basic processing unit of ISS significantly improves the performance and reduces the computational cost. Last but not least, by combining category correlation and similarity information of each semantic category in training the PBCRF model, the inference of semantic labels is transformed into the problem of minimizing a potential energy function. Extensive experiments conducted on the MSRC and PASCAL VOC 2012 segmentation benchmarks have demonstrated that the proposed IPCRFWSS algorithm can produce improved or comparable results in comparison with several state-of-the-art methods, including some deep learning methods. An improved or much higher mIoU, along with a lower variance, also indicates that the proposed approach is more robust across different semantic categories. To further improve the results in semantic image segmentation, we will focus on three topics in the future. The first is the fusion of color, edge and other information for refined segmentation [47,48], and the second is saliency-based extraction of objects from images [49,50]. The third direction is deep learning-based image segmentation and object detection, where convolutional neural networks and other models will be explored [51,52], possibly in combination with the first two topics, such as multiscale segmentation and extreme learning machines [53,54].

Conflicts of Interest:
The authors declare no conflict of interest.