Image Segmentation from Sparse Decomposition with a Pretrained Object-Detection Network

Abstract: Annotations for image segmentation are expensive and time-consuming. In contrast, the task of object detection is generally easier in terms of both the acquisition of labeled training data and the design of training models. In this paper, we combine the idea of unsupervised learning with a pretrained object-detection network to perform image segmentation without using expensive segmentation labels. Specifically, we designed a pretext task based on the sparse decomposition of object instances in videos to obtain the segmentation masks of objects, which benefits from the sparsity of image instances and the inter-frame structure of videos. To improve the accuracy of identifying the 'right' object, we used a pretrained object-detection network to provide the location information of the object instances, and propose an Object Location Segmentation (OLSeg) model with three branches and a bounding box prior. The model is trained on videos and is able to capture the foreground, background and segmentation mask in a single image. The performance gain benefits from the sparsity of object instances (the foreground and background in our experiments) and the provided location information (bounding box prior), which work together to produce a comprehensive and robust visual representation of the input. The experimental results demonstrate that the proposed model effectively boosts performance on various image segmentation benchmarks.


Introduction
Supervised learning has been successfully applied to many complex computer vision tasks [1,2]. However, training deep neural networks requires a tremendous amount of manual labeling and prior knowledge from experts. Especially in image segmentation tasks, the cost of acquiring pixel-level annotations is significantly higher than that of region-level or image-level labels [3]. This motivates the development of weakly supervised image segmentation methods [4]. Different forms of supervision have been explored to promote the performance of image segmentation [5,6].
Previous works [7,8] study forms of supervision for solving image segmentation problems. Co-segmentation regards a set of images of the same category as the supervision information and segments the common objects. Initial works were mainly based on effective computational models [9-11] processing an image pair. More recent co-segmentation methods are formulated as optimization [12], object saliency [13], or data clustering [14-16] problems over more than two images. However, the existing methods still suffer from certain limitations, including hand-crafted features and unscalable priors [17]. Moreover, these methods rely on collections of images to segment the objects and cannot segment a single image. Recent works [18,19] have demonstrated that a pretrained network can provide powerful semantic representations for a given image. Therefore, it is worthwhile to explore the potential of pretrained networks: networks trained for one task that can be reused in scenarios different from their original purposes [20].
Self-supervised learning aims at designing various pretext tasks to assist the main task in exploring the properties of unlabeled data [21-25]. Some methods [26,27] use generative models to reconstruct the inputs. However, most methods [21-23,26,27] train models mainly on static images. Video sequences give rise to characteristic patterns of visual change, including geometric shape, motion tendencies and many other properties. Natural videos can therefore serve as a more powerful signal for both static and dynamic visual tasks. Self-supervised methods based on videos [24,25] utilize optical flow estimation to find correspondences between pixels across video frames. However, all moving objects produce optical flow; therefore, these methods cannot single out the objects of interest in the scene.
To address these limitations, we design a pretext task based on the sparse decomposition of object instances in videos, and use encoder-decoder structures to reconstruct the input to promote image segmentation. Our motivation is based on two observations. (1) The sparsity of image instances: an image typically consists of multiple object instances (such as the foreground and background), and these objects can be sparsely represented by a deep neural network that learns their features. (2) The inter-frame structure of videos: compared with static images, the object instances between continuous frames are highly correlated and can be sparsely represented; video data is also more common and conducive to visual learning. We exploit both ideas, utilizing the sparsity of image instances and the inter-frame structure of videos.
In this paper, we propose an Object Location Segmentation (OLSeg) model with three branches and a bounding box prior. The model is trained on videos and is able to capture the foreground, background and segmentation mask in a single image. We consider a relatively simple scenario where each image consists of a foreground and a background, and construct the model based on the following three aspects. (1) The first is the foreground and background branches. On the one hand, we use an autoencoder [28] in the foreground and background branches to construct the foreground and background, respectively. The encoder in the foreground branch outputs more channels than the encoder in the background branch, in order to express the more complex motion information of the foreground. On the other hand, we apply a gradient loss to smooth the background, which prevents the foreground object from appearing in it. (2) The second is the mask branch. We use a U-Net [29] to generate the mask, and adopt an object loss to focus on the information in the bounding box of the foreground object. Our motivation is that the segmentation mask can be calculated if the object location is given. The location information is obtained from a pretrained object-detection network [30]. In addition, we use a closed loss to ensure that the mask shows smooth contours without holes, and a binary loss to generate a binary mask. (3) The final aspect is image reconstruction. The original image is reconstructed by combining the foreground and background with the binary mask.
We summarize our contributions as follows:

• We designed a pretext task based on the sparse decomposition of object instances in videos for image segmentation. The task benefits from the sparsity of image instances and the inter-frame structure of videos;

• We propose an OLSeg model with three branches and a bounding box prior. The location information of the object is obtained from a pretrained object-detection network. The model, trained on videos, is able to capture the foreground, background and segmentation mask in a single image;

• The proposed OLSeg model is demonstrated to effectively boost performance on various image segmentation benchmarks. The ablation study shows the gains of the different components in OLSeg.

Image Segmentation from Unlabeled Data
Image segmentation from unlabeled data is challenging because no pixel-level annotations are available beyond the given unlabeled images. Recent methods [31,32] explore different forms of supervision to promote segmentation performance. Stretcu et al. [33] matched multiple video frames for image segmentation. Papazoglou et al. [34] relied on optical flow to identify moving objects. Koh et al. [35] iteratively generated object proposals. Wang et al. [36] focused on the relations between pixels across different images. Zhou et al. [37] used optical flow and an attention mechanism to segment video objects. Image co-segmentation requires a set of images containing objects from the same category as a weak form of supervision. Rother et al. [38] minimized an energy function to segment image pairs. Kim et al. [12] explored an optimization problem of saliency to find the common object in multiple images. Joulin et al. [14] formulated the co-segmentation problem as a discriminative clustering task. Joulin et al. [15] used spectral and discriminative clustering for fully unsupervised segmentation. Rubinstein et al. [13] combined visual saliency and dense correspondences to capture the sparsity and visual variability of the common objects in a group of images. Quan et al. [16] used a pretrained network to obtain semantic features, and proposed a manifold ranking method to discover the common objects. Zhao et al. [17] constrained the proportion of the foreground object in the image. These methods segment images based on hand-crafted features and unscalable priors. Our model does not depend on any assumption about the existence of common objects, and is able to segment a single image.

Pretrained Networks
Current deep learning methods have achieved great success on a variety of visual tasks [39,40]. However, these deep frameworks still heavily rely on a large amount of training data and a time-consuming training process. Therefore, some researchers have recently turned to a new direction: network reuse [20]. It would clearly be appealing if a pretrained network could be reused in a new domain different from its initial training purpose, without needing to fine-tune it. Some recent works attempt to train deep models on large image datasets collected from web images [41], which can further leverage the generalization of pretrained networks. Fortunately, benefiting from the large-scale COCO dataset [3], which consists of over 330,000 images across 80 object categories, pretrained networks have revealed powerful object detection ability. Inspired by [42], our proposed OLSeg extracts the location of the foreground object from a pretrained object-detection network without any training or fine-tuning. This operation can be easily implemented and is conducive to object segmentation.

Self-Supervised Learning
Self-supervised learning methods usually construct a pretext task to learn features from raw data, thereby improving the performance of the main task. Some methods use generation-based models [43-45] to obtain latent feature representations of the input. Various pretext tasks have been explored, e.g., predicting relative patch locations within an image [21], recovering part of the data [46], solving jigsaw puzzles [22], colorizing grayscale images [23], counting visual primitives [47] and predicting image rotations [48]. For videos, self-supervised signals come from motion consistency [24,25] and temporal continuity [49,50]. However, video-based pretext tasks estimate optical flow as the cue for moving objects; an object of interest may be stationary and thus cannot be perceived through optical flow. We instead design a pretext task based on the sparse decomposition of object instances in videos, and use encoder-decoder structures to reconstruct the original input.

Object Location Segmentation (OLSeg)
We propose a three-branch OLSeg model with a bounding box prior, as shown in Figure 1. The model consists of a foreground branch, a background branch and a mask branch. For an input image from continuous video frames, the foreground and background branches construct the foreground and background information with an autoencoder [28] each. The output channels of the encoders in the two autoencoders are different: the foreground branch contains more encoder channels than the background branch in order to represent the more complex foreground information. The gradient loss in the background branch smooths the background and eliminates the influence of the foreground. The mask branch with the bounding box prior uses a U-Net [29] to generate the segmentation mask. The bounding box of the object is obtained from a pretrained object-detection network, Yolov4 [30], acting on the input. The location information is used in an object loss to make the U-Net [29] focus more on the area where the foreground object appears. The closed loss and binary loss ensure that the generated mask is binary and free of concave holes. The mask is combined with the foreground and background to reconstruct the input via a reconstruction loss. The contributions of the three branches are complementary to each other for the final decomposition of the object instances. For clarity, the pseudocode for OLSeg is shown in Algorithm 1.
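The three-branch decomposition described above can be sketched end to end. The following is a minimal NumPy stand-in: the real branches are a pair of autoencoders and a U-Net, whereas the toy "networks" here are illustrative lambdas.

```python
import numpy as np

# Minimal sketch of the OLSeg forward pass. The three callables stand in
# for the foreground autoencoder, background autoencoder and mask U-Net.
def olseg_forward(u, f_a1, f_a2, f_u):
    F = f_a1(u)                  # foreground branch output
    B = f_a2(u)                  # background branch output
    M = f_u(u)                   # mask branch output, values in [0, 1]
    R = M * F + (1.0 - M) * B    # reconstruction from the three outputs
    return F, B, M, R

# Toy stand-ins on an 8x8x3 image (illustrative only, not trained networks).
rng = np.random.default_rng(0)
u = rng.random((8, 8, 3))
f_a1 = lambda x: x                              # pretend-perfect foreground
f_a2 = lambda x: np.full_like(x, x.mean())      # smooth, flat background
f_u = lambda x: (x.mean(axis=-1, keepdims=True) > 0.5).astype(x.dtype)

F, B, M, R = olseg_forward(u, f_a1, f_a2, f_u)
```

With a binary mask, each pixel of the reconstruction comes from exactly one of the two branches, which is the separation the losses below enforce.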

The Foreground and Background Branches
The foreground and background branches aim to decompose the object instances into the foreground and background. The foreground branch consists of an autoencoder [28] f_a1, and the background branch consists of an autoencoder [28] f_a2. The two autoencoders [28] are trained separately. Let U = (u_k; k ∈ (1, ..., K)) be continuous video frames. Given an input image u_k, the outputs of f_a1 and f_a2 are obtained using Equations (1) and (2), respectively:

F_k = f_a1(u_k), (1)

B_k = f_a2(u_k), (2)

where F_k and B_k are the foreground and background of the input image u_k, respectively.
The difference between f_a1 and f_a2 is that the output channels of their encoders differ. Generally, the foreground information is more complex and variable than the background information. As shown in Figure 1, the horse is moving while the background is relatively simple across the continuous video frames. Let C_f and C_b be the output channels of the encoders in f_a1 and f_a2, respectively. We use more output channels C_f to express the foreground information.
The key to realizing the foreground and background separation is to ensure that the background does not contain foreground information. The powerful generation ability of autoencoders [28] can lead to the appearance of the foreground in the background. In order to ensure that the background is smooth and clean, we add a gradient loss L_g to the background branch, as in Equation (3):

L_g = ‖∇B_k‖, (3)

where ∇ denotes the spatial image gradient; minimizing L_g suppresses gradients in the background B_k.
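A plausible concrete form of the gradient loss, assuming an L1 penalty on finite-difference image gradients (the paper states only that the background gradient is minimized, not the norm):

```python
import numpy as np

def gradient_loss(B):
    # Mean absolute finite-difference gradient of the background image.
    # The L1 norm is an assumed choice; the paper does not spell it out.
    dy = np.abs(np.diff(B, axis=0))   # vertical differences
    dx = np.abs(np.diff(B, axis=1))   # horizontal differences
    return float(dy.mean() + dx.mean())
```

A perfectly flat background yields zero loss, while any foreground structure leaking into B_k increases it.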
Although the foreground and background extracted from f_a1 and f_a2 are blurred in Figure 1, the outline of the horse is visible in the foreground, and the sky and grass are clean in the background. This demonstrates the ability of the foreground and background branches to decompose the object instances.

The Mask Branch
The mask branch consists of a U-Net [29] f_u that generates the segmentation mask. Multi-scale connections between the encoding and decoding paths of the U-Net [29] ensure efficient integration of information for image segmentation. The output of the U-Net [29] for the input u_k is represented as Equation (4):

M_k = f_u(u_k), (4)

where M_k is the segmentation mask and its values range from 0 to 1.
It is challenging for the mask branch to distinguish the foreground and background of the input. Previous methods [24,25] use optical flow to treat the moving object as the foreground. However, the foreground may be relatively static in continuous video frames. We therefore use the location information of the foreground object to help the mask branch discover the object. Our motivation is that the segmentation mask can be obtained if the object location is known. The operation of extracting object location information is shown in Figure 2. For the given input u_k, we select a Yolov4 [30] trained on the COCO dataset [3] to obtain the bounding box of the object area. A comparison of different pretrained object-detection networks is not considered, because we only need the approximate location of the object. The corresponding location of the bounding box in the input is then mapped to the segmentation mask. The object area in the bounding box for the segmentation mask is denoted A_k (A_k is their union if there are multiple object areas). Let M_k,(i,j) be the value at coordinates (i, j) in the segmentation mask M_k; we introduce an object loss L_o, as in Equation (5), which penalizes mask activations outside the object area, where N_mk and N_ak denote the numbers of coordinates in M_k and A_k, respectively. For the segmentation mask M_k, we set the values outside A_k to 0. The mask branch thus focuses on the object area inside the bounding box to obtain the mask more accurately.
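A hedged sketch of the object loss: it penalizes mask activation outside the union A_k of the detected boxes. The paper's exact normalization over N_mk and N_ak is not reproduced here, so this sketch simply averages over the region outside the boxes.

```python
import numpy as np

def object_loss(M, boxes):
    # M: H x W mask with values in [0, 1].
    # boxes: list of (y0, y1, x0, x1) bounding boxes from the pretrained
    # detector (coordinate convention assumed for this sketch).
    A = np.zeros(M.shape, dtype=bool)
    for (y0, y1, x0, x1) in boxes:
        A[y0:y1, x0:x1] = True          # union of all object areas A_k
    outside = ~A
    if outside.sum() == 0:              # boxes cover the whole image
        return 0.0
    # Penalize any mask activation outside the object area.
    return float(M[outside].mean())
```

Driving this term to zero forces the mask to vanish outside the bounding boxes, matching the paper's rule of setting values outside A_k to 0.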

Figure 2. The operation of extracting object location information. We obtain the bounding box of the object from a pretrained Yolov4 [30]. The bounding box is then mapped into the segmentation mask, and the object area A_k is obtained.
The segmentation mask sometimes shows unclosed concave holes, and small holes inside the object have an adverse impact on segmentation. We adopt a closed loss L_c to produce a smooth mask without concave holes. The idea is that the value at the current coordinates (i, j) in the mask M_k should be consistent with the values at the surrounding adjacent coordinates; the closed loss L_c is calculated as Equation (6). The reconstructed image is obtained by combining the foreground, background and mask. The values of the segmentation mask M_k range from 0 to 1. A low-entropy prediction of the mask branch ensures that either the foreground or the background appears at each location of the reconstructed image, rather than both appearing simultaneously. We minimize a binary loss L_b to achieve entropy minimization, as in Equation (7), which constrains each M_k,(i,j) to be close to 0 or 1.
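The closed and binary losses can be sketched as follows. Both forms are assumptions consistent with the stated goals (neighbor consistency for the closed loss, a low-entropy surrogate M(1 − M) for the binary loss), not the paper's exact equations.

```python
import numpy as np

def closed_loss(M):
    # Encourage each mask value to agree with the average of its four
    # neighbours (edge-padded), so isolated holes are smoothed away.
    pad = np.pad(M, 1, mode="edge")
    neigh = (pad[:-2, 1:-1] + pad[2:, 1:-1]
             + pad[1:-1, :-2] + pad[1:-1, 2:]) / 4.0
    return float(np.abs(M - neigh).mean())

def binary_loss(M):
    # M * (1 - M) is zero exactly when every value is 0 or 1 and is
    # maximal at 0.5, pushing the mask toward a binary prediction.
    return float((M * (1.0 - M)).mean())
```

A constant mask incurs no closed loss, and a strictly 0/1 mask incurs no binary loss, which is the low-entropy behaviour the text describes.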
The loss of the mask branch is thus defined as Equation (8):

L_m = β_o L_o + β_c L_c + β_b L_b, (8)

where β_o, β_c and β_b are hyperparameters that control the scales of L_o, L_c and L_b, respectively.

The Overall Loss
We reconstruct the input image by combining the outputs of the foreground, background and mask branches. The reconstructed image R_k of u_k is represented as Equation (9):

R_k = M_k^b ⊙ F_k + (1 − M_k^b) ⊙ B_k, (9)

where M_k^b is the binary mask obtained from the mask branch, with values close to 0 or 1, and ⊙ denotes element-wise multiplication.
We minimize the reconstruction loss L_r between R_k and the input u_k, as in Equation (10). The overall loss L of OLSeg is described by Equation (11):

L = L_r + α L_g + L_m, (11)

where α is the hyperparameter that balances the gradient loss.
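Putting the loss terms together: the squared-error form of the reconstruction loss below is an assumption (the paper does not state the norm), while the weighted sums follow Equations (8) and (11).

```python
import numpy as np

def reconstruction_loss(R, u):
    # Pixel-wise squared error between reconstruction and input
    # (assumed form of Equation (10)).
    return float(((R - u) ** 2).mean())

def mask_loss(L_o, L_c, L_b, beta_o, beta_c, beta_b):
    # Equation (8): weighted sum of the object, closed and binary losses.
    return beta_o * L_o + beta_c * L_c + beta_b * L_b

def overall_loss(L_r, L_g, L_m, alpha):
    # Equation (11): reconstruction loss plus the alpha-weighted
    # gradient loss and the mask-branch loss.
    return L_r + alpha * L_g + L_m
```

With the hyperparameters selected later in the paper (α = 1.5, β_o = 0.15, β_c = 0.1, β_b = 1), the total is a straightforward weighted sum of the five terms.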

Experiments
In this section, we conduct extensive experiments to verify the performance of the proposed OLSeg model. We first introduce the implementation details in Section 4.1. Subsequently, we study the effect of parameter selection in Section 4.2. Next, we evaluate OLSeg on the image segmentation tasks in Sections 4.3 and 4.4, respectively. Finally, we present the ablation study in Section 4.5.

• The YouTube Objects dataset [51] is a large-scale dataset that includes 10 types of objects (airplane, bird, boat, car, cat, cow, dog, horse, motorbike and train) downloaded from the YouTube website. It contains 5484 videos and a total of 571,089 video frames. The videos include objects entering and leaving the field of view, occlusions, and significant changes in object scale and viewing angle. The dataset provides ground-truth bounding boxes on the object of interest in one frame for each of 1407 video shots as the validation set.

• The Internet dataset [13] is a commonly used object segmentation dataset of approximately 15,000 images downloaded from the internet. It contains 4542 images of airplanes, 4347 images of cars and 6381 images of horses with high-quality annotation masks.

• The Microsoft Research Cambridge (MSRC) dataset [52] contains 14 object classes and about 420 images with accurate pixel-wise labels. Each object appears against different backgrounds, illumination and poses. This is a real-world dataset that is often used to evaluate image segmentation tasks.

Training Details
The PyTorch framework is adopted on a single GPU machine with an NVIDIA TITAN V in all experiments. We train our proposed model on the YouTube Objects dataset [51], and select the appropriate parameters on the validation set. The inputs in the training phase are continuous video frames, in order to exploit the motion cue. The foreground and background branches have the same network structure. We use ResNet18 [53] as the encoder, and 4 deconvolution and convolution operations followed by 3 convolution layers as the decoder [54]. For the mask branch, we use a 5-layer U-Net [29] in which each layer contains 2 convolution operations. Table 1 shows the hyperparameter settings on the YouTube Objects dataset [51]. The encoders output different numbers of channels in the foreground and background branches. The trainable parameters in the foreground, background and mask networks number about 1.8 × 10^7, 1.7 × 10^7 and 1.3 × 10^7, respectively. The floating point operations (FLOPs) of the proposed model are about 1.8 × 10^10. The total training time is about 19 h.

Evaluation Metrics
For the image segmentation task, we use the P and J metrics as in [13]. P refers to the ratio of correctly labeled pixels. J is the Jaccard similarity, which represents the intersection over union of the prediction and the ground truth. Higher values of P and J indicate better model performance. We also adopt the correct localization (CorLoc) metric following previous image segmentation works [33,34], which measures the percentage of images that are correctly localized according to the PASCAL criterion: the intersection-over-union (IoU) overlap ratio of the predicted box and the ground-truth box is greater than 0.5. These metrics are commonly used for image segmentation evaluation.
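These metrics follow standard definitions and can be computed directly; in this sketch, masks are boolean arrays and boxes are (y0, y1, x0, x1) tuples (the coordinate convention is an assumption for illustration).

```python
import numpy as np

def p_metric(pred, gt):
    # P: ratio of correctly labeled pixels.
    return float((pred == gt).mean())

def j_metric(pred, gt):
    # J: Jaccard similarity, i.e., IoU of the foreground masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union)

def box_iou(a, b):
    # IoU of two boxes given as (y0, y1, x0, x1).
    y0, y1 = max(a[0], b[0]), min(a[1], b[1])
    x0, x1 = max(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    union = ((a[1] - a[0]) * (a[3] - a[2])
             + (b[1] - b[0]) * (b[3] - b[2]) - inter)
    return inter / union

def corloc(pred_boxes, gt_boxes, thresh=0.5):
    # CorLoc: percentage of images whose predicted box overlaps the
    # ground-truth box with IoU > 0.5 (the PASCAL criterion).
    hits = [box_iou(p, g) > thresh for p, g in zip(pred_boxes, gt_boxes)]
    return 100.0 * sum(hits) / len(hits)
```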

Parameter Selection
We conduct detailed parameter selection experiments on the validation set of the YouTube Objects dataset [51] to study the influence of hyperparameters.

Hyperparameter α for the Gradient Loss
The separation of the foreground and background is important for reconstructing the input. The model is able to learn powerful feature representations only when the foreground and background are completely separated. We assign a hyperparameter α to balance the gradient loss in the background branch. Figure 4 shows the experimental results for different values of α. The foreground information appears in the background when α is 0.5. We obtain a cleaner background as α increases, which is also conducive to improving the quality of the segmentation mask. The hyperparameter α = 1.5 is used in subsequent experiments.


Hyperparameter β o for the Object Loss
The object loss uses prior knowledge of the object location in the mask branch. The hyperparameter β_o controls the scale of the object loss and affects the quality of the segmentation mask. The experimental results for different β_o are shown in Figure 5. For the mask, the object area in the red box is mapped from the original input. The information outside the red boxes cannot be completely removed when β_o is 0.05. However, a larger β_o may reduce the effect of the other losses and lead to an unclear object boundary. We select β_o = 0.15 to obtain an accurate mask. The binary loss makes the segmentation mask binary to a great extent: if the mask were not binary, pixels at each location of both the foreground and background would contribute to the reconstructed image, which is contrary to the purpose of separation.

Results and Visualization
We evaluated OLSeg on the validation set of the YouTube Objects dataset [51]. The dataset provides ground-truth bounding boxes on the object of interest. We used the CorLoc metric as in [33,34] for quantitative analysis. For the purpose of this evaluation, we automatically fitted a bounding box to the largest connected component in the pixel-level segmentation output by our model. We compared the proposed model with the object segmentation methods [33-35] in Table 2. OLSeg outperformed the others in 6 out of 10 classes and achieved the best overall performance. However, the results for birds showed poor performance due to the lack of constraints to account for background complexity. The foreground we defined may also contain multiple instances: for motorbikes, masks of all instances in the images were obtained, such as humans and motorbikes. In addition, we report the test complexity of these methods. The proposed model processed each image in 0.03 s, which is faster than the other methods. The reason is that our model only needs a forward pass of the U-Net [29] to obtain the segmentation mask, without relying on multiple images or complex operations. Our model does, however, take much longer during its training phase and requires a large amount of training data. Figure 8 shows the visualization results on the validation set of the YouTube Objects dataset [51]. OLSeg extracted the segmentation mask, foreground and background for the 10 classes. The generated background is smooth and contains little foreground information, and we obtain clear segmentation boundaries in the mask. Multiple objects can also be accurately segmented, such as the cows and horses. However, as shown in the third row on the right of Figure 8, it is difficult to extract the mask when the color of the horse is similar to the background color. Overall, the proposed model produces high-quality segmentation results in multiple classes.


Evaluation on the Internet Dataset
We ran the trained OLSeg to evaluate the image segmentation performance on the Internet dataset [13]. We used two metrics, P (precision) and J (Jaccard similarity), and compared our results with five previously proposed co-segmentation methods [12-16]. The results of the different methods are shown in Table 3. The performance on airplanes was slightly better than on cars and horses. For the P of airplanes, cars and horses, OLSeg is 0.87%, 1.73% and 0.26% higher than the method in [16], respectively. The J obtained by OLSeg is also greatly improved. The comparison shows that our proposed model outperformed the other methods on all three object classes. We compared the proposed model with the methods in [12-15] to visualize the segmentation performance. The qualitative results of the different methods on the Internet dataset [13] are shown in Figure 9. Compared with the other methods, OLSeg achieves a clearer segmentation boundary. Where the segmentation mask cannot be obtained by Kim et al. [12], the proposed model still acquires a realistic mask. The proposed model has a powerful ability to adapt to the size of the foreground object, and even an airplane image with a complex background is well segmented. The visualization results show the effectiveness of OLSeg.

Evaluation on the MSRC Dataset
We evaluated the proposed model on the MSRC dataset [52]. We show the metrics P (average precision) and J (average Jaccard similarity) of our model as well as five related segmentation methods [12-16] in Table 4. OLSeg achieves a P of 87.56% and a J of 65.85%, which are 1.35% and 2.53% better than the method in [16], respectively. Our overall P and J are higher than those of the compared methods. The comparisons in Table 4 demonstrate that the proposed model produces good segmentation results on multiple object classes. We further report the qualitative results of the different methods on the MSRC dataset [52] in Figure 10. The proposed model is able to accurately segment the foreground object despite large variations in style, color, texture, scale and position. For complex objects such as bikes, OLSeg can also segment their finer contours. The tree is not very distinctive from the background in terms of color, but it is still successfully segmented. The text inside the sign is well removed thanks to the closed loss. There is a clear visual improvement of OLSeg over the other compared methods. These experimental results demonstrate the generalization ability of the proposed model.

Ablation Study
We performed an ablation study of the proposed OLSeg on the Internet and MSRC datasets; the results of P and J are shown in Table 5. Specifically, the contributions of different components in OLSeg were investigated. We did not analyze the contribution of the object loss because it is decisive for mask generation: without the object loss, the segmentation mask is cluttered and performs poorly in distinguishing the foreground from the background. We removed the gradient loss, the closed loss and the binary loss, respectively, and observed the effect on the final results. We can see that the gradient loss contributes most to the performance. This result is reasonable, since a clean background cannot be recovered without the gradient loss, which affects the quality of the mask. The closed loss plays an important role in preventing masks with concave holes and noise. The binary loss also contributes to the final results. The ablation study in Table 5 demonstrates the effectiveness of the proposed OLSeg model.

Conclusions
In this paper, we designed a pretext task of decomposing object instances in videos for image segmentation, and proposed an OLSeg model with three branches and a bounding box prior. The pretext task benefits from the sparsity of image instances and the inter-frame structure of videos. The proposed model is trained on videos and is able to capture the foreground, background and segmentation mask in a single image. The constraints in the foreground and background branches ensure the generation of the foreground and background. The mask branch uses the bounding box prior, and consists of multiple losses to produce an accurate segmentation mask for the object of interest. This is consistent with the assumption that the segmentation mask can be obtained if the object location is known. The experimental results show that our model achieves better segmentation performance than the compared methods on various image segmentation datasets.
The parameters in OLSeg are roughly estimated by grid search in this work. In future work, more adaptive methods for parameter optimization are worthy of further study. In addition, we will explore more efficient networks and other constraints to obtain more robust masks for the objects of interest in unlabeled data.

Figure 1.
Figure 1. Overview of OLSeg. The model consists of the foreground and background branches, each with an autoencoder [28], and a mask branch with a U-Net [29]. The gradient loss in the background branch removes the foreground to a great extent. The mask branch with the bounding box prior uses an object loss, a closed loss and a binary loss. The outputs of the three branches work together to reconstruct the input.

Algorithm 1
Training OLSeg. Input: An autoencoder f_a1, an autoencoder f_a2, a U-Net f_u and a pretrained Yolov4 f_p; a batch of continuous video frames U = (u_k; k ∈ (1, . . ., K)); coordinates (i, j); the numbers of coordinates N_mk and N_ak in M_k and A_k; the encoder channels C_f, C_b; and the loss-balancing hyperparameters α, β_o, β_c, β_b.
1: for k = 1 to K do
2:
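Algorithm 1 is only partially reproduced above; the following skeleton sketches one training iteration under the same notation. The networks, the loss functions and the optimizer update are stand-ins passed in by the caller, not the paper's exact implementation.

```python
import numpy as np

def train_step(u_k, f_a1, f_a2, f_u, f_p, losses, weights):
    # One OLSeg training iteration (optimizer update omitted).
    # losses: dict with keys "gradient", "object", "closed", "binary".
    # weights: (alpha, beta_o, beta_c, beta_b).
    F, B, M = f_a1(u_k), f_a2(u_k), f_u(u_k)   # three branch outputs
    boxes = f_p(u_k)                           # pretrained Yolov4 detections
    R = M * F + (1.0 - M) * B                  # reconstruction, Equation (9)
    alpha, b_o, b_c, b_b = weights
    return float(((R - u_k) ** 2).mean()       # reconstruction term
                 + alpha * losses["gradient"](B)
                 + b_o * losses["object"](M, boxes)
                 + b_c * losses["closed"](M)
                 + b_b * losses["binary"](M))
```

In training, this scalar would be backpropagated through the two autoencoders and the U-Net, while the detector f_p stays frozen.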

Figure 3.
Figure 3. The experimental results of different channel combinations in the foreground and background branches. Appropriate channel combinations can achieve a cleaner foreground and background.

Figure 4.
Figure 4. The experimental results of different α for the gradient loss. The foreground information is completely filtered out of the background as α increases.

Figure 5. Figure 6.
Figure 5. The experimental results of different β_o for the object loss. Appropriate β_o eliminates the information outside the red boxes and ensures a clear object boundary.

Hyperparameter β_c for the Closed Loss
The closed loss aims to promote the aggregation of the segmentation mask and to form a smooth object shape without concave holes. The hyperparameter β_c controls the proportion of the closed loss. The experimental results of different β_c for the closed loss are shown in Figure 6. The segmentation mask only retains the contour of the object and cannot form a closed area when β_c is small. The hyperparameter β_c = 0.1 ensures that the mask is closed and has a clear boundary.

Figure 7.
Figure 7. The experimental results of different β_b for the binary loss. The binary mask is obtained when β_b is 1.

Figure 8.
Figure 8. The visualization results on the validation set of the YouTube Objects dataset. For each class, we show the segmentation mask, foreground and background extracted by OLSeg.

Figure 9.
Figure 9. The qualitative results of different methods on the Internet dataset. OLSeg produces improved results compared to the other methods in [12-15].

Figure 10.
Figure 10. The qualitative results of different methods on the MSRC dataset. OLSeg achieves a clear improvement over the other compared methods in [12-15].

Table 1.
The hyperparameter settings on the YouTube Objects dataset.

Table 2.
Comparisons of CorLoc (%) on the YouTube Objects dataset. Time (s) denotes the processing speed per image.

Table 3.
Comparisons of P and J (%) on the Internet dataset. P and J denote the precision and Jaccard similarity, respectively.

Table 4.
Comparisons of P and J (%) on the MSRC dataset. P and J denote the average precision and average Jaccard similarity, respectively.

Table 5.
Ablation study of OLSeg on the Internet and MSRC datasets.