Sequential Clique Optimization for Unsupervised and Weakly Supervised Video Object Segmentation

Abstract: A novel video object segmentation (VOS) algorithm, which segments out multiple objects in a video sequence in an unsupervised or weakly supervised manner, is proposed in this work. First, we match visually important object instances to construct salient object tracks through a video sequence without any user supervision. We formulate this matching process as the problem of finding maximal weight cliques in a complete k-partite graph and develop the sequential clique optimization algorithm to determine the cliques efficiently. Then, we convert the resultant salient object tracks into object segmentation results and refine them based on Markov random field optimization. Second, we adapt the sequential clique optimization algorithm to perform weakly supervised video object segmentation. To this end, we develop a sparse-to-dense network to convert the point cliques into segmentation results. The experimental results demonstrate that the proposed algorithm provides comparable or better performance than recent state-of-the-art VOS algorithms.


Introduction
Video object segmentation (VOS) is the task of classifying each pixel in video frames into target objects or backgrounds. VOS can be categorized according to the level of user supervision: unsupervised, weakly supervised, semi-supervised, and interactive VOS. Unsupervised VOS, in general, attempts to segment out primary objects from the background without any user annotations, where a 'primary' object [1] refers to the most salient one in a video. In contrast, in the other settings, users provide annotations to facilitate VOS and obtain desired segmentation results. In interactive VOS, a user provides annotations repeatedly to refine results. Semi-supervised VOS tracks and segments a target object manually annotated in the first frame. Finally, weakly supervised VOS requires weaker supervision (e.g., points [2], scribbles, boxes [3,4], and video tags [5,6]) than pixel-level accurate annotations for target objects.
Many VOS algorithms adopt deep learning models with advances in regularization techniques [7-10]. One early unsupervised VOS algorithm [11] uses motion and appearance information to produce segmentation results automatically, but it does not consider the appearance frequency of objects. In other words, it may fail to detect primary objects that have less distinct features but appear frequently in the sequence. Some algorithms [12,13] address this problem by considering the frequently appearing characteristics of primary objects, but they are designed to extract only a single primary object. Thus, they share the limitation that they cannot handle multiple primary objects systematically. The semi-supervised VOS algorithms in [14,15] can address the multi-object problem using annotations in the first frames, but they need to fine-tune the networks for each video sequence, which is computationally expensive. Even though recent semi-supervised algorithms [16,17] do not need fine-tuning, the semi-supervised approach still demands that annotators provide accurate pixel-level masks for target objects in the first frames, which is impractical.
In this paper, we extend the work in a conference paper [18] to develop a novel unsupervised VOS algorithm, which segments out multiple primary objects without any mask annotations, and a weakly supervised VOS algorithm, which may reduce the effort of masking target objects in the first frame. For this purpose, we develop sequential clique optimization (SCO), which can be employed in both unsupervised and weakly supervised VOS. First, we extract object instances in each frame based on the instance-wise segmentation technique [19]. Then, we perform instance matching in order to construct salient object tracks. This is similar to finding multiple cliques in a complete k-partite graph [20] of object instances. Each clique should contain the instances over frames that correspond to an identical object. Thus, the instances should be similar to one another. However, finding the optimal multiple cliques with maximal similarity weights is NP-hard. Hence, we develop the clique optimization process, called SCO, which considers both node and edge energies. SCO constructs the most salient object track by selecting one object instance from each frame, and multiple salient object tracks can be extracted by repeating the process. We convert these salient object tracks into VOS results. Then, we perform the segmentation refinement based on two-class Markov random field (MRF) optimization to improve the segmentation results. Furthermore, the proposed SCO can be adapted to perform weakly supervised VOS, which accepts only a number of point clicks on a target object in the first frame. To this end, we develop the sparse-to-dense network (SD-Net) to obtain dense segmentation masks from sparse points. The experimental results demonstrate that the proposed unsupervised and weakly supervised algorithms provide comparable or better performance than the recent state-of-the-art VOS algorithms on the DAVIS [21] benchmark dataset.
The main contributions of this paper are summarized as follows:

• We propose the SCO process to extract multiple primary objects effectively. It determines a clique efficiently with $O(NT^2)$ complexity, where $T$ is the number of frames in a video, and $N$ is the number of instances in each frame.
• The proposed algorithm can extract multiple primary objects effectively, whereas most conventional algorithms assume a single primary object.
• We extend the preliminary work [18] of this paper to achieve weakly supervised VOS using the SD-Net, which yields segmentation results using only a few point clicks instead of dense masks for target objects. The proposed SD-Net also improves the unsupervised VOS results of the preliminary work.
• We develop two segmentation refinement methods to improve the unsupervised VOS results based on MRF optimization and the SD-Net.

Unsupervised VOS
The objective of unsupervised VOS is to separate foreground objects from the background in a video sequence without any user supervision, such as annotation masks. Many unsupervised VOS algorithms assume that there exists a primary object in a video sequence and attempt to extract such a single primary object. To this end, various cues, including motion boundaries [22], saliency maps [23], and object proposals [12], have been used to localize and segment a primary object. Papazoglou and Ferrari [22] constructed Gaussian mixture models (GMMs) for foreground and background from the regions delineated by motion boundaries and then used the models to segment out moving objects. By adopting the boundary prior, which assumes that image boundary regions tend to belong to the background, Jang et al. [23] computed foreground and background probability maps. They also developed the alternate convex optimization to minimize a hybrid energy function, including the antagonistic energy. Object proposal techniques also have been employed to estimate the initial regions of a primary object. Koh and Kim [12] estimated the initial primary regions from object proposals and refined them based on the augmentation and reduction process.
Deep learning techniques have been employed for unsupervised VOS. Jain et al. [11] proposed end-to-end networks to yield pixel-wise segmentation results. They considered both appearance and motion information. Tokmakov et al. [24] developed fully convolutional networks to learn motion patterns. They used synthetic video sequences to augment training data. In [13,25,26], differentiable attention models have been adopted to detect recurring primary objects. For instance, Lu et al. [13] developed the co-attention Siamese network. Yang et al. [25] computed dense correspondences between frames in an embedding space to segment out foreground objects. Zhou et al. [26] proposed the motion-attentive transition network, which transforms appearance features into motion features at each convolutional stage. Additionally, human eye gaze has been predicted to initialize primary object regions [27]. Zhuo et al. [28] achieved unsupervised online VOS by combining salient motion detection results and object proposals. Ventura et al. [29] employed recurrent neural networks to consider the spatiotemporal features of primary objects. Moreover, object detection methods [30,31] can be adopted for unsupervised VOS.

Semi-Supervised VOS
Semi-supervised VOS segments and tracks target objects, which are annotated by users in the first frames. Recently, deep learning models have been adopted in semi-supervised VOS. For example, in [14], CNNs were fine-tuned by exploiting user annotations in the first frames. In [32], segmentation masks, propagated from previous frames, were refined by deep networks. Sun et al. [33] performed reinforcement learning to obtain a reliable region of interest. For fast VOS, Yin et al. [34] reduced the number of training iterations needed to fine-tune the deep neural networks to the target objects. Furthermore, some algorithms [16,17] perform segmentation without fine-tuning. Chen et al. [16] classified pixels based on a nearest-neighbor criterion by employing features extracted from embedding networks. Voigtlaender et al. [17] performed end-to-end embedding learning of the multiple object segmentation task with a cross-entropy loss.

Proposed Unsupervised VOS Algorithm
We segment out foreground objects in video frames $\mathcal{I} = \{I_1, \ldots, I_T\}$, where $I_t$ is the $t$th frame, and $T$ is the number of frames in the sequence. The output is the corresponding sequence of pixel-wise maps, which locate the foreground objects in the frames. Figure 1 shows an overview of the proposed algorithm. First, we generate the object instances in each frame. Second, we construct a complete k-partite graph using the set of object instances. Gray lines in Figure 1 represent positive edges that connect the object instances in different frames. Third, we extract the salient object tracks by finding cliques in the graph and refine each object track. Finally, we convert the salient object tracks into VOS results.

Figure 1. An overview of the proposed algorithm (stages: input frames, instance generation, complete k-partite graph, finding salient object tracks, segmentation results).

Generating Object Instances
To detect object instances without requiring user annotations, we employ the instance-aware semantic segmentation method, FCIS [19]. Figure 2b illustrates examples of object instances in a frame in the "Boxing-fisheye" sequence. Let $O_t = \{o_{t,\theta} \mid \theta \in \mathbb{N}_{N_t}\}$ denote the set of detected object instances in frame $I_t$, where $\mathbb{N}_m = \{1, 2, \ldots, m\}$ is the finite index set, and $N_t$ is the number of object instances in frame $I_t$. In general, each frame contains a different number of object instances. The $\theta$th object instance $o_{t,\theta}$ in frame $I_t$ has two attributes: the saliency score $s_{t,\theta}$ and the feature vector $f_{t,\theta}$.
For the saliency score $s_{t,\theta}$, we adopt the boundary prior to estimate a foreground distribution map for frame $I_t$. More specifically, we divide $I_t$ into SLIC superpixels [35]. Then, we construct a three-ring graph of the superpixels; two superpixels $v_i$ and $v_j$ are connected if there is a sequence $(v_i = v_{i_1}, v_{i_2}, \ldots, v_{i_k} = v_j)$ of boundary-sharing superpixels from $v_i$ to $v_j$ for $k \le 4$. The edge weight between two connected superpixels is computed by summing up the difference between the average LAB colors and the difference between the average optical flow vectors [36]. By assigning nonzero restart probabilities to the superpixels on the image boundary, we obtain a background distribution map using the random walk with restart (RWR) simulation. Finally, we obtain the foreground distribution map by inverting the background distribution map, as illustrated in Figure 2c. We then determine $s_{t,\theta}$ by averaging the foreground probability values of the pixels within the instance $o_{t,\theta}$.
Moreover, we use the bag of visual words (BoW) method to define the feature vector $f_{t,\theta}$. To obtain the bag of visual words, we extract the LAB colors from 40 video sequences and quantize them into 300 codewords using the K-means algorithm. We then build the histogram of the codewords for the pixels within $o_{t,\theta}$ and normalize it into the feature vector $f_{t,\theta}$.
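The two instance attributes described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `saliency_score` and `bow_feature` are hypothetical, the foreground map and the per-pixel codeword indices are assumed to be computed elsewhere (RWR and K-means are not reproduced), and the histogram is assumed to be normalized to unit sum.

```python
def saliency_score(fg_prob, mask):
    """Average foreground probability over the pixels of one instance.

    fg_prob: 2D list of foreground probabilities (assumed precomputed).
    mask:    2D list of 0/1 flags marking the pixels of the instance.
    """
    values = [p
              for prob_row, mask_row in zip(fg_prob, mask)
              for p, m in zip(prob_row, mask_row) if m]
    return sum(values) / len(values) if values else 0.0


def bow_feature(codeword_ids, num_codewords=300):
    """Normalized codeword histogram for the pixels of one instance.

    codeword_ids: one codeword index per pixel (assumed to come from a
    K-means quantizer trained on LAB colors, which is not shown here).
    """
    hist = [0.0] * num_codewords
    for c in codeword_ids:
        hist[c] += 1.0
    total = sum(hist)
    # Assumption: unit-sum normalization.
    return [h / total for h in hist] if total else hist


fg = [[0.2, 0.8], [0.6, 1.0]]
instance_mask = [[0, 1], [1, 0]]
s = saliency_score(fg, instance_mask)      # mean of 0.8 and 0.6
f = bow_feature([0, 0, 2], num_codewords=3)
```

With a unit-sum histogram, two instances of different sizes remain comparable, which is what the histogram distance between instances relies on.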

Problem Formulation
The set of all object instances from the video sequence, $V = \bigcup_{t=1}^{T} O_t$, includes many non-salient objects as well as salient ones. From this set, we attempt to extract as many salient objects as possible, while excluding non-salient ones, assuming that a salient object should have distinct features and appear frequently throughout the entire video sequence.
We construct a complete k-partite graph $G = (V, \mathcal{E})$ using the set of object instances [20]. Thus, each object instance becomes a node in the graph $G$. Note that the sets $O_1, O_2, \ldots, O_T$ form a partition of $V$, since $O_t \cap O_\tau = \emptyset$ for $t \neq \tau$. Moreover, we define the edge set as $\mathcal{E} = \{(o_{t,i}, o_{\tau,j}) \mid t \neq \tau\}$. In other words, every pair of object instances in different frames is connected by an edge in $\mathcal{E}$, whereas two object instances in the same frame are not connected in graph $G$. As a result, $G$ is complete k-partite [20], where $k = T$. For example, Figure 3a illustrates the complete k-partite graph for four frames, i.e., $k = 4$. We assign a weight to the edge $(o_{t,i}, o_{\tau,j})$ by

$$w(o_{t,i}, o_{\tau,j}) = \exp\left(-\frac{d_{\chi^2}(f_{t,i}, f_{\tau,j})}{\sigma^2}\right) \qquad (1)$$

where $d_{\chi^2}$ denotes the chi-square distance, which is often used for comparing two histograms, and $\sigma^2 = 0.01$ is a scaling parameter.
We perform the instance matching in order to construct multiple salient object tracks, where each salient object track is obtained by selecting one object instance (one node) from each frame (each node subset $O_t$). This process of finding object tracks is equivalent to finding multiple cliques in the complete k-partite graph $G$. Notice that selecting one node from each frame satisfies the condition of a clique [20]: every pair of nodes within the clique is connected. When the clique corresponds to the track of an identical object in the video sequence, the features of the member nodes should be similar to one another. Therefore, we select the clique to maximize the sum of the edge weights in Equation (1).
Let $\Theta_p = \{\theta_t\}_{t=1}^{T}$ denote the $p$th clique, which is represented by the sequence of node indices. Here, $\theta_t \in \mathbb{N}_{N_t}$ is the index of the selected node from the instance set $O_t$ at the $t$th frame. Then, we define the similarity $E_{similarity}(\Theta_p)$ of the clique $\Theta_p$ as

$$E_{similarity}(\Theta_p) = \sum_{t=1}^{T} \sum_{\tau=t+1}^{T} w(o_{t,\theta_t}, o_{\tau,\theta_\tau}) \qquad (2)$$

which is the sum of all edge weights in $\Theta_p$. Assuming that the features of an identical object do not change drastically across the frames, the clique $\Theta_p$ should have a maximal similarity score. Moreover, object instances in the clique, representing a salient object track, should have high saliency scores. We hence define the saliency $E_{saliency}(\Theta_p)$ of the clique $\Theta_p$ as

$$E_{saliency}(\Theta_p) = \sum_{t=1}^{T} s_{t,\theta_t} \qquad (3)$$

We then attempt to find the set of maximal weight cliques $\Theta^* = \{\Theta_p^*\}_{p=1}^{M}$ that maximizes the sum of the similarity scores,

$$\Theta^* = \arg\max_{\{\Theta_p\}_{p=1}^{M}} \sum_{p=1}^{M} E_{similarity}(\Theta_p) \qquad (4)$$

subject to the constraint that $\Theta_p$ is more salient than $\Theta_q$ if $p < q$. However, even the unconstrained problem in Equation (4) is NP-hard [37]. There are $\prod_{t=1}^{T} N_t$ possible cliques, which makes an exhaustive search infeasible. Some approximation methods, e.g., the local search [37] and binary integer program [38], have been developed to obtain suboptimal cliques in complete k-partite graphs; however, these methods are still computationally expensive and do not consider node energy (e.g., $E_{saliency}$ in this work). Instead, we develop an efficient optimization technique, called sequential clique optimization (SCO), to find the clique that considers both the node energy $E_{saliency}$ and the edge energy $E_{similarity}$.
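As an illustration of the edge weight, the following Python sketch assumes the exponential form $w = \exp(-d_{\chi^2}/\sigma^2)$ and one common definition of the chi-square distance, $d_{\chi^2}(f, g) = \frac{1}{2}\sum_k (f_k - g_k)^2 / (f_k + g_k)$. Both the factor of 1/2 and the small `eps` guard are assumptions, and the function names are hypothetical.

```python
import math


def chi_square_distance(f, g, eps=1e-12):
    """Chi-square distance between two histograms of equal length.

    Assumption: the symmetric form with a 1/2 factor; eps avoids a
    division by zero for bins that are empty in both histograms.
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(f, g))


def edge_weight(f, g, sigma2=0.01):
    """Similarity weight between two instances in different frames."""
    return math.exp(-chi_square_distance(f, g) / sigma2)


same = edge_weight([0.5, 0.5], [0.5, 0.5])      # identical histograms
different = edge_weight([1.0, 0.0], [0.0, 1.0]) # disjoint histograms
```

Because $\sigma^2 = 0.01$ is small, the weight decays quickly: only instances with nearly identical color histograms receive weights close to one.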

SCO
For efficient instance matching, we propose SCO, which extracts the most salient object track. It selects an object instance in each frame that corresponds to one identical salient object. Then, after removing all instances in the track from $V$, we repeat the process to extract the next salient object track, and so on. In this section, we consider the finding of one clique, $\Theta_p$, which represents the most salient object track, and omit the subscript $p$ from all notations for the sake of simplicity.
In SCO, we first initialize the clique $\Theta^{(0)}$ to maximize the saliency $E_{saliency}$ in Equation (3). Specifically, the $t$th element in $\Theta^{(0)}$ is determined by

$$\theta_t^{(0)} = \arg\max_{\theta \in \mathbb{N}_{N_t}} s_{t,\theta} \qquad (5)$$

Then, at iteration $\kappa$, we update $\theta_t$ by selecting the node that is the most similar to the nodes in the other frames,

$$\theta_t' = \arg\max_{\theta \in \mathbb{N}_{N_t}} \sum_{\tau \neq t} w(o_{t,\theta}, o_{\tau,\theta_\tau}) \qquad (6)$$

and then set $\theta_t$ to be $\theta_t'$ for each $t$ sequentially from one to $T$. We repeat this sequential update of the nodes in all frames until the node indices no longer change, i.e., $\Theta^{(\kappa)} = \Theta^{(\kappa-1)}$. This process is guaranteed to converge since $E_{similarity}(\Theta^{(\kappa)})$ is a monotonically non-decreasing function of $\kappa$ and the number of possible cliques is finite. To summarize, SCO initializes the clique to maximize the saliency $E_{saliency}$ and then refines it iteratively to achieve a local maximum of $E_{similarity}$. Thus, at the initialization, the clique consists of salient object instances across the frames, which may not represent an identical object. However, as the iteration goes on, the clique converges to a salient object track, in which the nodes represent an identical object and thus exhibit high similarity weights in general. Algorithm 1 summarizes the proposed SCO process. In most cases, fewer than 10 iterations are required for the convergence.

Algorithm 1 Sequential Clique Optimization (SCO)
Input: Complete k-partite graph G with node saliency scores and edge weights
  for each frame I_t do
    Initialize the node index θ_t in clique Θ via Equation (5)
  end for
  repeat
    for each frame I_t do
      Find θ'_t via Equation (6) and set θ_t to θ'_t
    end for
  until node indices are unaltered
Output: Optimized clique Θ = {θ_1, θ_2, ..., θ_T}

Let $\Theta_1$ denote the most salient object track, obtained by this SCO process. To extract the next track $\Theta_2$, we exclude the nodes in $\Theta_1$ from $G$ and perform SCO again. This is repeated to yield the set of tracks $\{\Theta_1, \Theta_2, \ldots, \Theta_M\}$ until no node remains in $G$. In general, if $p < q$, $\Theta_p$ is more salient than $\Theta_q$. Thus, the subscript $p$ in $\Theta_p$ is the saliency rank of the track. Figure 3b,c illustrate the first two tracks $\Theta_1$ and $\Theta_2$, respectively.
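The SCO procedure for a single clique can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: `sco`, `toy_weight`, and the toy data are hypothetical, the edge weight is supplied as a callable, the current node is kept on ties so that the loop terminates, and `max_iters` is a safety cap.

```python
import math


def sco(saliency, weight, max_iters=10):
    """Return one clique as a list with one node index per frame.

    saliency[t][i]       : saliency score of instance i in frame t.
    weight(t, i, tau, j) : edge weight between instance i of frame t
                           and instance j of frame tau.
    """
    num_frames = len(saliency)
    # Initialization: pick the most salient instance in every frame.
    theta = [max(range(len(s)), key=lambda i: s[i]) for s in saliency]
    for _ in range(max_iters):
        changed = False
        for t in range(num_frames):
            def score(i):
                # Similarity of candidate i to the current nodes elsewhere.
                return sum(weight(t, i, tau, theta[tau])
                           for tau in range(num_frames) if tau != t)
            best = max(range(len(saliency[t])), key=score)
            # Assumption: keep the current node on ties.
            if score(best) > score(theta[t]):
                theta[t] = best
                changed = True
        if not changed:
            break
    return theta


# Toy data: one scalar "feature" per instance stands in for a histogram.
feats = [[0.0, 1.0], [0.1, 0.95], [0.8, 0.05]]
sal = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.6]]


def toy_weight(t, i, tau, j):
    return math.exp(-abs(feats[t][i] - feats[tau][j]) / 0.1)


track = sco(sal, toy_weight)
```

In this toy example, the initialization selects instance 0 in every frame because it is the most salient, but the update step replaces the choice in the last frame with instance 1, whose feature agrees with the nodes selected in the other frames.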

Salient Object Track Refinement
The track selection is greedy in the sense that, if an object instance is mistakenly included in a track $\Theta_p$, it cannot be included in a later track $\Theta_q$ even when it indeed belongs to $\Theta_q$. To alleviate this problem, we perform postprocessing to maximize the sum of the similarity scores as follows. In each frame $I_t$, we match the object instances in $O_t$ to the tracks in $\{\Theta_p\}_{p=1}^{M}$. The matching cost $C(o_{t,i}, \Theta_p)$ between an instance $o_{t,i}$ in $O_t$ and a track $\Theta_p$ is defined as the sum of the feature distances from $o_{t,i}$ to all object instances in $\Theta_p$, except for the instance in the same frame $I_t$. After computing the matching costs, we find the optimal matching pairs using the Hungarian algorithm and update the tracks to include the matched instances. This is performed for all frames. As a result, we obtain the set of refined salient object tracks $\{\tilde{\Theta}_1, \tilde{\Theta}_2, \ldots, \tilde{\Theta}_M\}$.
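The per-frame rematching step can be illustrated as below. Note the substitution: to stay dependency-free, this sketch finds the minimum-cost assignment by brute force over permutations, which gives the same result as the Hungarian algorithm on small inputs but does not scale. The names `refine_frame` and `toy_dist` are hypothetical, and the number of tracks is assumed not to exceed the number of instances in the frame.

```python
from itertools import permutations


def refine_frame(t, tracks, num_instances, dist):
    """Reassign the instances of frame t to the tracks at minimum cost.

    tracks[p][tau]     : index of the instance of track p in frame tau.
    dist(t, i, tau, j) : feature distance between two instances.
    """
    def cost(i, p):
        # Sum of distances to the track members in all other frames.
        return sum(dist(t, i, tau, j)
                   for tau, j in enumerate(tracks[p]) if tau != t)

    # Brute-force stand-in for the Hungarian algorithm.
    best = min(permutations(range(num_instances), len(tracks)),
               key=lambda perm: sum(cost(i, p) for p, i in enumerate(perm)))
    for p, i in enumerate(best):
        tracks[p][t] = i
    return tracks


# Toy data: frame 1 lists the two objects in the opposite order.
feats = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]


def toy_dist(t, i, tau, j):
    return abs(feats[t][i] - feats[tau][j])


tracks = [[0, 0, 0], [1, 1, 1]]   # both tracks are wrong in frame 1
refine_frame(1, tracks, num_instances=2, dist=toy_dist)
```

After the call, the two tracks swap their instances in frame 1, so each track contains instances with a consistent feature.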

Disappearance Detection
Next, we detect object disappearing events in each refined salient object track. Notice that, when an object disappears or is fully occluded in some frames, noisy objects are selected instead in those frames. For simple notations, let $\tilde{\Theta} = \{\tilde{\theta}_t\}_{t=1}^{T}$ denote a refined salient object track. We determine whether to discard $o_{t,\tilde{\theta}_t}$ at frame $I_t$ from $\tilde{\Theta}$. To this end, for each $\tau \neq t$, we compare the edge weight $w(o_{\tau,\tilde{\theta}_\tau}, o_{t,\tilde{\theta}_t})$ against the average edge weight $\bar{w}$ in the track. Specifically, we count the number of object instances $o_{\tau,\tilde{\theta}_\tau}$ for $\tau \neq t$ that satisfy $w(o_{\tau,\tilde{\theta}_\tau}, o_{t,\tilde{\theta}_t}) < \bar{w}$. If the number is larger than $0.7T$, we declare $o_{t,\tilde{\theta}_t}$ as noisy and discard it. On the other hand, the proposed algorithm can detect reappearing objects automatically, since every pair of object instances in different frames is connected in the complete k-partite graph. In other words, reappearing object instances are connected to all object instances in different frames, so that they are selected by SCO, in general, without requiring any postprocessing. Figure 4 shows an example of object disappearance and reappearance. In this example, a bicycle and its rider are the primary objects. In Figure 4c, the rider on the bicycle disappears in a frame due to occlusion by another human body, which is a noisy object. The proposed disappearance detection method identifies the noisy human body and excludes it from the salient object tracks. Moreover, in Figure 4d, the proposed SCO automatically recognizes the object's reappearance and declares the reappearing rider and bicycle as the primary objects.
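The counting rule for disappearance detection can be sketched as follows. The function name `detect_disappearance` and the toy data are hypothetical, and the average edge weight is assumed to be taken over all pairs of nodes in the track.

```python
import math


def detect_disappearance(track, weight, ratio=0.7):
    """Return the frame indices whose selected instance is declared noisy.

    track[t]             : instance index selected in frame t.
    weight(t, i, tau, j) : edge weight between two instances.
    """
    num_frames = len(track)
    pair_w = {(t, tau): weight(t, track[t], tau, track[tau])
              for t in range(num_frames)
              for tau in range(num_frames) if t != tau}
    # Assumption: the average is taken over all node pairs of the track.
    w_bar = sum(pair_w.values()) / len(pair_w)
    noisy = []
    for t in range(num_frames):
        below = sum(1 for tau in range(num_frames)
                    if tau != t and pair_w[(t, tau)] < w_bar)
        if below > ratio * num_frames:
            noisy.append(t)
    return noisy


# Toy track: the instance in the last frame looks unlike the others.
track_feats = [0.0, 0.0, 0.0, 0.0, 5.0]
track = [0, 0, 0, 0, 0]


def toy_weight(t, i, tau, j):
    return math.exp(-abs(track_feats[t] - track_feats[tau]))


noisy_frames = detect_disappearance(track, toy_weight)
```

Here the last frame has four below-average edges, which exceeds $0.7 \times 5 = 3.5$, so only that frame is flagged.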

Object Track Selection
We develop four schemes to choose segmentation results from the object tracks in $\{\tilde{\Theta}_1, \ldots, \tilde{\Theta}_M\}$: SCO-F, SCO-M, SCO-OF, and SCO-OM.

• SCO-F: The first track $\tilde{\Theta}_1$ extracts the primary object in a video in general. Thus, SCO-F selects $\tilde{\Theta}_1$. However, it may fail to extract spatially connected objects. For example, given a motorbike and its rider, it may detect only one of them. Therefore, SCO-F additionally picks another salient object track $\tilde{\Theta}_p$, only when $\tilde{\Theta}_1$ and $\tilde{\Theta}_p$ are spatially adjacent in most frames in a video.
• SCO-M: To handle multiple primary objects, which may not be spatially connected, we choose multiple tracks from $\{\tilde{\Theta}_1, \tilde{\Theta}_2, \ldots, \tilde{\Theta}_M\}$. To this end, we compute the mean saliency score of the object instances in each track and discard the tracks whose mean scores are lower than a pre-specified threshold $\delta$. We fix $\delta = 0.1$ in all experiments.
• SCO-OF: The aforementioned SCO-F is an offline approach that constructs the global $T$-partite graph for an entire video. In contrast, SCO-OF is an online approach that uses the $t$-partite graph for frames $I_1, \ldots, I_t$ to obtain the segmentation result for the current frame $I_t$. In other words, SCO-OF uses the information in the current and past frames only to achieve VOS.
• SCO-OM: SCO-OM is the online version of SCO-M.
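The SCO-M selection rule amounts to a threshold on the mean saliency of each track, as in the sketch below. The function name `select_tracks_sco_m` is hypothetical, and every frame is assumed to contribute an instance to each track (instances discarded by disappearance detection are not handled here).

```python
def select_tracks_sco_m(tracks, saliency, delta=0.1):
    """Keep the tracks whose mean instance saliency is at least delta.

    tracks[p][t]   : instance index of track p in frame t.
    saliency[t][i] : saliency score of instance i in frame t.
    """
    kept = []
    for track in tracks:
        mean_s = sum(saliency[t][i] for t, i in enumerate(track)) / len(track)
        if mean_s >= delta:
            kept.append(track)
    return kept


saliency = [[0.5, 0.05], [0.7, 0.1]]
selected = select_tracks_sco_m([[0, 0], [1, 1]], saliency)
```

In this example, the first track has a mean saliency of 0.6 and is kept, whereas the second has a mean of 0.075 and is discarded.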

Pixel-Wise Segmentation Refinement
The foreground region masks of the object instances from FCIS are noisy in general since the pooling layers in FCIS cause the loss of object details. We hence refine the region masks of the object instances at the pixel level based on the MRF optimization in [32]. In this section, we consider the refinement of an object instance $o_{t,\theta}$ in frame $I_t$.
We design a weighted graph for MRF, by employing the pixel set $X$ in frame $I_t$ as the node set. We construct the edge set $H$ by connecting each pixel $x \in X$ to its four neighbors. Then, we determine the segmentation label map $L$ by dichotomizing each pixel $x$ into either foreground ($L(x) = 1$) or background ($L(x) = 0$) based on the cost function

$$C(L) = \sum_{x \in X} C_{unary}(x, L) + \gamma \sum_{(x,y) \in H} C_{pairwise}(x, y, L) \qquad (8)$$

where $C_{unary}$ and $C_{pairwise}$ are the unary and pairwise costs, respectively, and $\gamma$ is a balance parameter.
To define the unary cost $C_{unary}$, we use two GMMs for the foreground and background. To build the GMMs, we use the RGB colors of the pixels in the initial foreground region $o_{t,\theta}$ and the initial background region $o^c_{t,\theta}$, respectively. Each GMM is a full-covariance Gaussian mixture with 10 components. The unary cost in Equation (9) is then computed from $p(x; M_{L(x),k})$, which denotes the probability density function of the $k$th Gaussian component $M_{L(x),k}$ of the foreground GMM (when $L(x) = 1$) or the background GMM (when $L(x) = 0$).
In Equation (9), $F$ and $B$ denote the sets of definite foreground pixels and definite background pixels, respectively. In this work, we attempt to refine only the segmentation labels of the pixels near object boundaries. To this end, we define the definite foreground set $F$ and the definite background set $B$ using the SLIC superpixels [35] in frame $I_t$. Specifically, $F$ consists of the superpixels that are fully included in $o_{t,\theta}$. Similarly, $B$ is composed of the superpixels that are fully included in $o^c_{t,\theta}$. Then, we try to refine only the pixels in the other superpixels, which overlap with both $o_{t,\theta}$ and $o^c_{t,\theta}$. Note that, to preserve the foreground pixels in $F$, $C_{unary}(x, L)$ in Equation (9) has a zero cost if $x \in F$ is labeled as the foreground, i.e., $L(x) = 1$. In other words, by minimizing $C_{unary}(x, L)$, the pixels in $F$ are encouraged to be labeled as the foreground. Similarly, the pixels in $B$ are likely to be labeled as the background.
Next, we define the pairwise cost to encourage neighboring pixels to have the same labels. It is based on the distance $d(x, y) = d_c(x, y) + d_o(x, y)$, which is the sum of the RGB color difference $d_c(x, y)$ and the optical flow difference $d_o(x, y)$. These are computed from $c(x)$ and $u(x)$, the RGB color and the optical flow vectors for pixel $x$, respectively. We normalize both $d_c(x, y)$ and $d_o(x, y)$ into the range $[0, 1]$ to balance their magnitudes.
We repeat the following two steps. First, we optimize the label map $L^*$ by minimizing the MRF cost function in Equation (8) via the graph-cut algorithm [39]. Second, we update the GMMs based on $L^*$ and the corresponding cost function. We terminate the iteration when there is no change in $L^*$. We then determine the refined segmentation region for the object instance $o_{t,\theta}$ using the converged optimal map $L^*$.

Complexity Analysis
Let us analyze the computational complexity of the proposed SCO algorithm. For the convenience of analysis, we fix the number of object instances in each frame to $N$. Note that SCO has two steps: initialization and update. In the initialization in Equation (5), $N - 1$ comparisons are made to find the maximum saliency in each frame, requiring $O(NT)$ comparisons in total. In the update step in Equation (6), $N(T - 2)$ additions and $N - 1$ comparisons are performed for each frame in one iteration. Thus, the update step demands $O(KNT^2)$ complexity, where $K$ is the number of iterations and is restricted to be less than 10 in this work.
We repeat the SCO process $N$ times to extract $N$ object tracks. Thus, the complexity is $O(KN^2T^2)$. Then, in the refinement, the Hungarian matching of $O(N^3)$ complexity is performed for each frame. Hence, the complexity of the refinement is $O(N^3T)$. Moreover, the complexity of the disappearance detection is $O(NT^2)$. Finally, the overall complexity of the proposed SCO algorithm can be approximated as $O(KN^2T^2)$, since $T$ is larger than $N$ in general. This complexity is significantly lower than that of the binary integer program in [38], which requires $O(2^{T^2N^2})$ complexity in the worst case because of the depth-first node selection.
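The operation counts in this analysis can be checked with a few lines of arithmetic. The helper name `sco_operation_counts` and the example values of $N$, $T$, and $K$ are hypothetical and chosen only for illustration.

```python
def sco_operation_counts(N, T, K):
    """Operation counts for extracting one clique with SCO.

    N: instances per frame, T: frames, K: update iterations.
    """
    init_comparisons = (N - 1) * T            # Equation (5), all frames
    update_additions = K * T * N * (T - 2)    # Equation (6), K sweeps
    update_comparisons = K * T * (N - 1)
    return init_comparisons, update_additions, update_comparisons


init_c, upd_a, upd_c = sco_operation_counts(N=5, T=50, K=10)
# The additions dominate and grow as K * N * T^2.
```

For these example values, the update step needs 120,000 additions against 2,000 comparisons, which shows why the $KNT^2$ term dominates the cost of one clique.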

Proposed Weakly Supervised Algorithm
In supervised VOS, a user can provide annotations about an object of interest, which is the target object to be segmented out. In particular, semi-supervised VOS [14,32] assumes that the accurate pixel-wise segmentation mask of a target object is available. Interactive VOS [40] repeatedly allows a user to check the segmentation results, select a frame with erroneous results, and give refining annotations, such as scribbles, for improving the results. For the annotations, point clicks also can be adopted. For example, the interactive image segmentation algorithm in [41] obtains an initial segmentation result when a user inputs the first click on a target object. After considering the segmentation result, the user provides a new click to refine the result. This refinement is performed recursively until the user stops clicking. Notice that point clicks are also used to refine the segmentation results in interactive VOS algorithms [16]. Moreover, the four extreme point clicks, composed of the left-most, right-most, top, and bottom pixels of a target object, are used for object segmentation in [2].
In this section, we propose a weakly supervised VOS algorithm, which takes point clicks on target objects in the first frame, as illustrated in Figure 5b, but does not require repetitive interaction to refine the results. In this regard, the proposed algorithm requires weaker supervision than the semi-supervised and interactive cases. To obtain the segmentation results in Figure 5c from the weakly supervised points, we train the sparse-to-dense network (SD-Net) for binary classification, which separates each target object from the background. In this work, the proposed SD-Net is adopted to achieve weakly supervised VOS and also to refine the segmentation results of unsupervised VOS.

Network Architecture
Figure 6 shows the architecture of the proposed SD-Net, which has two encoders, a feature mixer, and a decoder. The two encoders are based on a Siamese structure and thus share parameters. For each encoder, we modify ResNet50 [42] to take a four-channel input: an RGB image and a point map. The spatial resolution of the input is 384 × 640. The final block of the encoder is also modified to maintain the spatial resolution by employing the dilated convolution with stride 1. We adopt the Siamese structure to use features of the first frame and the annotated point map. Thus, the top encoder takes the first frame and its annotated point map, while the bottom encoder takes the current frame and its point map warped from the previous frame. Then, they extract feature vectors from their respective inputs.
In the feature mixer, the two features, extracted from the first frame and the current frame, are concatenated, and the concatenated feature passes through a 3 × 3 convolution layer and the ReLU activation. To mix features of the first and current frames more efficiently, we use two modules in parallel: the squeeze and excitation module [43] and the dilated convolution. The squeeze and excitation module adaptively recalibrates channel-wise features to determine which channels are significant for the segmentation task. Additionally, the dilated convolution increases the receptive field of the concatenated feature vector. The larger receptive field is beneficial since the spatial locations of a target object in the first frame and the current frame are generally different from each other due to the movements of the target object.
Next, the decoder draws segmentation inferences on the current frame from the blended feature of the feature mixer and the output of the bottom encoder. We design two kinds of decoder modules to consider both the blended feature of the first and current frames and the feature of the current frame only. First, the ASPP module [44] is adopted to extract multi-scale features with different receptive fields from the blended feature. Using the ASPP module, SD-Net can enlarge the receptive field and exploit various scale information without increasing the number of parameters or the number of computations. Moreover, two upsample modules are used in series to extract effective multi-scale features from the current frame, where each upsample module consists of one upsample layer and three convolutional layers. The upsample modules take the multi-scale intermediate features of the bottom encoder through skip connections. Then, the output vectors of the ASPP module and the upsample modules are concatenated and fed into the decoder combiner, which has three convolutional layers with the batch normalization and the ReLU activation. The combiner yields a segmentation probability map for the target object. By thresholding the map, a binarized segmentation result for the target object is obtained.
To train the proposed SD-Net, we use two datasets: (1) the YouTube2018 dataset [45], which is the largest VOS dataset, and (2) the training set in the DAVIS2017 dataset. YouTube2018 and DAVIS2017 are used for pre-training and fine-tuning, respectively. For each video, we randomly choose two frames that contain an identical target object. One of them becomes a reference frame, and the other becomes a target frame. Then, for each frame, we produce a point map by sampling points from the ground-truth mask for the target object. More specifically, we sample 50 points randomly from the mask in the reference frame, while we choose one point randomly from every 50 mask pixels in the target frame. We then use the reference frame and its point map as input to the top encoder. On the other hand, to mimic inaccurate optical flow warping, we deform the point map of the target frame using random rotation, scaling, translation, and erosion. Then, we input the target frame and its deformed point map to the bottom encoder. We adopt the pixel-wise cross-entropy loss between a predicted probability map and the ground-truth binary mask. We use the Adam optimizer [46] with learning rates of $10^{-4}$ and $10^{-5}$ for pre-training and fine-tuning, respectively. We decrease the learning rates by a factor of 0.1 every 20 epochs. The training is repeated for 50K iterations with an RTX 2080Ti GPU.
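The point sampling used to build the training inputs might look like the following sketch. It is an illustration under assumptions: the names `sample_points` and `make_training_point_sets` are hypothetical, masks are given as lists of pixel coordinates, "one point from every 50 mask pixels" is interpreted as drawing one fiftieth of the mask pixels at random, and the random deformation of the target point map is omitted.

```python
import random


def sample_points(mask_pixels, num_points, rng):
    """Draw up to num_points distinct pixels from a mask."""
    k = min(num_points, len(mask_pixels))
    return rng.sample(mask_pixels, k)


def make_training_point_sets(ref_mask, tgt_mask, seed=0):
    """Sample sparse points for the reference and target frames.

    ref_mask, tgt_mask: lists of (row, col) pixels of the target object.
    """
    rng = random.Random(seed)
    ref_points = sample_points(ref_mask, 50, rng)
    # Assumption: one sampled point per 50 mask pixels.
    tgt_points = sample_points(tgt_mask, len(tgt_mask) // 50, rng)
    return ref_points, tgt_points


mask = [(y, x) for y in range(20) for x in range(20)]   # 400 pixels
ref_pts, tgt_pts = make_training_point_sets(mask, mask)
```

With a 400-pixel mask, the reference frame receives 50 points and the target frame receives 8 points, which mimics the sparser point map obtained by warping at inference time.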

SD-Net for Weakly Supervised VOS
Given a number of point clicks (50 clicks in this work) on each target object in the first frame, we perform the segmentation of multiple target objects throughout all frames in a video sequence. For each object, an identical pair of the first frame and the corresponding point map are fed into both top and bottom encoders in the Siamese network in Figure 6. In this way, the segmentation results for the multiple objects in the first frame, $O_1 = \{o_{1,p} \mid p \in \mathbb{N}_M\}$, are obtained from the decoder, where $M$ is the number of the target objects. Then, we initialize object tracks $\{\Theta_p\}_{p=1}^{M}$, where $\Theta_p = \{\theta_1 = p\}$. In other words, each initial track $\Theta_p$ contains the single index of the target object in the first frame $I_1$.
From the second frame $I_2$, we extend the object tracks by employing warped segmentation masks from SD-Net, as well as object instance masks from FCIS. For each target object $p$, we generate a warped point map by randomly selecting one point for every 50 pixels in $o_{1,p}$ and then transferring those points from the first frame to the second frame using optical flow vectors [36]. Then, by employing SD-Net, we obtain the warped segmentation mask in the second frame. We then decide whether to add the warped segmentation mask to the set of object instances $O_2$. After computing the intersection over union (IoU) ratios between the warped segmentation masks and the object instance masks, we find optimal matching pairs using the Hungarian algorithm. For each matching pair, if the IoU ratio is smaller than 0.6, we add the warped segmentation mask to the set of object instances $O_2$. By adding these segmentation masks, we can boost the recall rate of target objects.
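The decision of whether to add a warped mask to the instance set can be sketched as below. This is only a sketch: masks are represented as sets of (row, col) pixels, the function names are hypothetical, and an exhaustive search over one-to-one assignments stands in for the Hungarian algorithm (it finds the same optimum but is practical only for a handful of masks). The sketch handles matched pairs only, as in the description above.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def best_assignment(warped, instances):
    """Exhaustive one-to-one matching that maximizes the total IoU.

    Stand-in for the Hungarian algorithm; suitable for small sets only.
    """
    n, m = len(warped), len(instances)
    if n <= m:
        candidates = (list(enumerate(p)) for p in permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n), m))
    best_total, best_pairs = -1.0, []
    for pairs in candidates:
        total = sum(iou(warped[i], instances[j]) for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs

def extend_instances(warped, instances, thr=0.6):
    """Append each matched warped mask whose IoU with its partner is below thr."""
    extended = list(instances)
    for i, j in best_assignment(warped, instances):
        if iou(warped[i], instances[j]) < thr:
            extended.append(warped[i])
    return extended
```

A warped mask that agrees well with its matched instance is therefore treated as redundant, while a poorly matched one is kept as an additional candidate.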
Given the set of object instances $O_2$ in frame $I_2$, we extend the object tracks by modifying the SCO process. Specifically, for the track $\Theta_p = \{\theta_1 = p\}$ of target object $p$, its second element $\theta_2$ is determined by the selection rule (13), which maximizes the similarity of the clique in a greedy manner. Then, we have the extended track $\Theta_p = \{\theta_1 = p, \theta_2\}$. The selected instance $o_{2,\theta_2}$ is excluded from the set of object instances $O_2$, and the track extension is performed for the next target object. This is repeated to extend the tracks for all target objects, i.e., $\{\Theta_p\}_{p=1}^{M}$. Then, the object track refinement in Section 3.2.3 is performed to yield the refined object tracks $\{\Theta_p\}_{p=1}^{M}$.
We sequentially perform this processing from the second frame to the last frame in the video sequence to extend the target object tracks. For frame $I_t$, $t \geq 2$, the selection rule in (13) is generalized to determine the $t$-th element $\theta_t$. Finally, the refined track $\Theta_p$ contains the segmentation results of target $p$ in the video sequence.
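The greedy, per-frame track extension described above can be sketched as follows. The exact selection rule is not reproduced in this section, so the sketch assumes that the clique similarity of a candidate is the sum of its pairwise similarities to the instances already on the track; the function `extend_tracks` and the `similarity` callback are hypothetical.

```python
def extend_tracks(tracks, candidates, similarity):
    """Pick one candidate instance per track, greedily and without reuse.

    tracks: list of lists of instance features (one list per target object).
    candidates: features of the object instances in the current frame.
    similarity: hypothetical pairwise similarity function.
    Returns the chosen candidate index for each track (None if none is left).
    """
    remaining = set(range(len(candidates)))
    picks = []
    for track in tracks:
        if not remaining:
            picks.append(None)  # no instance left for this target
            continue
        # Assumed clique similarity: sum over the instances already on the track.
        best = max(sorted(remaining),
                   key=lambda j: sum(similarity(f, candidates[j]) for f in track))
        picks.append(best)
        remaining.discard(best)  # a selected instance is excluded for later targets
    return picks
```

Because targets are processed in order and each selection removes an instance, the result depends on the processing order, which is the usual trade-off of a greedy assignment.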

SD-Net for Segmentation Refinement in Unsupervised VOS
SD-Net is also adopted to refine the segmentation results in unsupervised VOS.
For each frame, we generate two point maps by randomly choosing 50 points from the segmentation mask for the first point map and one point from every 50 mask pixels for the second point map. We then input the frame and the first point map to the top encoder, while we use the same frame and the second point map as input to the bottom encoder. Then, we obtain the output of SD-Net as the refined segmentation result.

Experimental Results
Given a video sequence, the proposed algorithm can yield segmentation results for each frame, which delineate target objects at the pixel level, in both unsupervised and weakly supervised scenarios. Target objects are automatically segmented in the unsupervised scenario, while they are extracted using point clicks in the first frames in the weakly supervised scenario. We compare the proposed algorithm with
We use the DAVIS dataset [21]. Note that the DAVIS dataset has two versions, DAVIS2016 and DAVIS2017. DAVIS2016 is a benchmark for evaluating VOS algorithms. It consists of 50 video sequences, which are divided into training and validation videos. These videos are challenging due to various factors, including appearance change, fast motion, and motion blur. Each video contains a single object or spatially connected objects, e.g., a motorbike and its rider, which appear repeatedly in the sequence. Spatially connected objects are also regarded as primary objects. DAVIS2016 was extended to DAVIS2017. It includes 90 train-validation sequences: 60 are training sequences, while 30 are validation sequences. We evaluate the proposed algorithm on the validation sets in DAVIS2016 and DAVIS2017 unless specified otherwise. Note that DAVIS2017 is more challenging than DAVIS2016, since multiple objects, which are not connected, correspond to different targets.

Ablation Studies
In Tables 1-5, we conduct various ablation studies on the validation sets in DAVIS2016 and DAVIS2017. To assess the proposed algorithm on DAVIS2016, we adopt SCO-F, SCO-M, and SCO-OF. On the other hand, we use only SCO-M and SCO-OM, which segment out multiple objects, for DAVIS2017, whose sequences contain multiple object instances. For the evaluation metrics, we employ the region similarity J and the contour accuracy F [21].
The region similarity J is defined as the IoU ratio $J = \frac{|S_p \cap S_{gt}|}{|S_p \cup S_{gt}|}$, where $S_p$ and $S_{gt}$ are an estimated segment and the ground-truth, respectively. Additionally, the contour accuracy F is the F-measure, which is the harmonic mean of the contour precision and recall rates. In these metrics, there are two statistics: 'mean' measures the average score and 'recall' denotes the proportion of the frames whose scores are higher than 0.5.
Table 1 lists the J and F performances according to the refinement methods on DAVIS2016. The segmentation refinement can be performed using two methods: (1) MRF in Section 3.3.2 and (2) SD-Net in Section 4.3. Without any refinement, SCO-F yields a mean J of 78.8% and a mean F of 75.7%. MRF increases these scores by 2.1% and 1.3%, while SD-Net increases them by 2.7% and 3.9%. Moreover, when both MRF and SD-Net are used sequentially, SCO-F provides the best J and F performances of 81.9% and 79.9%, respectively. The studies on SCO-OF and SCO-M exhibit similar improvements due to the refinement methods. Table 2 shows the ablation studies on DAVIS2017. Again, both MRF and SD-Net improve the segmentation accuracy of SCO-M and SCO-OM on DAVIS2017. Thus, in the following experiments, both MRF and SD-Net are used for the refinement, unless specified otherwise.
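The evaluation metrics defined above can be computed as in the following minimal sketch. It is not the official DAVIS evaluation code: the function names are hypothetical, masks are sets of (row, col) pixels, the contour precision and recall are taken as given inputs, and returning 1.0 when both masks are empty is an assumption.

```python
def region_similarity(pred, gt):
    """J: IoU between an estimated segment and the ground truth (pixel sets)."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0  # assumption for empty masks

def f_measure(precision, recall):
    """F: harmonic mean of the contour precision and recall rates."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0

def summarize(scores, thr=0.5):
    """Return the 'mean' and 'recall' statistics of per-frame scores."""
    mean = sum(scores) / len(scores)
    recall = sum(s > thr for s in scores) / len(scores)
    return mean, recall
```

For instance, `summarize` applied to the per-frame J scores of a sequence gives the two statistics reported for J in the tables; the same function applies to per-frame F scores.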
Table 3 compares the performances when different methods are used to compute the saliency scores of object instances. We replace the RWR-based saliency in Section 3.1 with the state-of-the-art salient object detection algorithm [50]. This salient object detection algorithm does not improve performance since it does not consider motion information. Additionally, we compute the RWR-based saliency without using the optical flow ('w/o OF'). Without the optical flow, unreliable saliency maps are obtained, degrading the performances severely.
In Table 4, we analyze the efficacy of each component of the proposed algorithm through two ablation studies. First, we measure the performance of SCO-M without the salient object track refinement. Second, we do not perform the disappearance detection. Let us refer to these settings as 'w/o SOTR' and 'w/o DD'. Table 4 provides the J and F scores on DAVIS2017 for these settings. In this test, we use only SCO-M since the two components are not applied in SCO-OM. Without track refinement or disappearance detection, the J and F scores are lowered. Thus, these components are essential in the proposed SCO-M.
Table 5 shows the J and F scores on DAVIS2017 according to the feature settings. In the proposed algorithm, a color-based BoW is employed to describe the feature of each instance. In this test, instead of the BoW, we use deep features extracted from VGG16 [51] and ResNet50 [42]. To generate a feature of an object instance, we feed the rectangular patch containing the object instance to each baseline network and extract the output of the last pooling layer. For the deep features, we use two metrics, i.e., the chi-square distance and cosine similarity, to compute edge weights in the graph. In Table 5, we observe that deep features degrade the performances regardless of the metrics. This is because deep semantic features yield high similarity weights between different objects in the same class. This is undesirable in VOS applications since different objects should be clearly distinguished from each other.
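The two metrics compared for the deep features can be written as follows. This is a sketch under stated assumptions: the chi-square distance is given in one common form (with a factor of 1/2 and a small `eps` to avoid division by zero), which may differ from the exact variant used in the experiments, and the mapping from a distance to an edge weight is not shown because it is not specified here.

```python
import math

def chi_square_distance(p, q, eps=1e-12):
    """Chi-square distance between two non-negative feature vectors.

    Assumed convention: 0.5 * sum((p_i - q_i)^2 / (p_i + q_i)).
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0
```

A small chi-square distance or a cosine similarity near 1 indicates similar features, so either can serve as the basis of an edge weight between two object instances.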

Assessment of Unsupervised VOS Algorithm
Table 6 compares the proposed algorithm with the conventional unsupervised VOS algorithms on the validation set in DAVIS2016. The scores of the conventional algorithms are from the DAVIS dataset's website [21]. Note that the proposed SCO-F achieves comparable performances to the recent state-of-the-art VOS algorithms AnDiff [25] and MATNet [26]. In particular, SCO-F yields the highest recall score of the region similarity J, which is as high as 96.2%. As compared with SCO-F, SCO-M yields lower performances, since it selects non-primary objects, as well as primary ones, in some videos. Additionally, the online version SCO-OF even surpasses the offline approach LSMO [24], as well as the online UOVOS [28].
Algorithm       J mean   J recall   F mean   F recall
[11]            0.707    0.835      0.653    0.738
ARP [12]        0.762    0.911      0.706    0.835
LSMO [24]       0.782    0.891      0.759    0.847
AGS [27]        0.797    0.911      0.774    0.858
COSNet [13]     0.805    0.931      0.795    0.895
AnDiff [25]     0.817    0.909      0.805    0.851
MATNet [26]     0.824    0.945      0.807    0.902
UOVOS [28]      0.739    0.885      0.680    0.806
Zhao [47]       0.634    0.703      0.602    0.627
FEM-Net [48]    0.799    0.939      0.769    0.883
Wang [49]       0
The video sequences in DAVIS2016 have multiple attributes that describe the challenging factors. In Table 7, we analyze the performances according to the nine attributes: low resolution (LR), scale variation (SV), fast motion (FM), camera shake (CS), dynamic background (DB), motion blur (MB), occlusions (OCC), out of view (OV), and appearance change (AC). For the evaluation, we compute the average of the mean J and mean F scores (mean J&F) on the validation set in DAVIS2016. For the LR, FM, CS, and AC attributes, SCO-F experiences no or negligible performance losses as compared with the overall J&F score of 80.9%, which is computed by averaging the mean J and mean F scores of SCO-F in Table 6. However, the DB and MB attributes decrease the performances of SCO-F since the refinement methods are less effective in the presence of motion blur and dynamic background. Nevertheless, except for the OCC attribute, the proposed SCO-F provides better performances than the conventional algorithms.
Table 8 compares the proposed algorithm with RVOS [29], which yields multiple segment tracks, on the validation set in DAVIS2017. We see that the proposed SCO-OM and SCO-M outperform RVOS. The experimental results in Tables 6-8 indicate that the proposed algorithm is more effective than the existing unsupervised VOS algorithms at segmenting both a single primary object and multiple primary objects.

Assessment of Weakly Supervised VOS Algorithm
Table 9 compares the proposed weakly supervised algorithm with existing weakly supervised algorithms on the validation sets in DAVIS2016 and DAVIS2017. The scores of the conventional algorithms [2-4,6] are from their respective papers. In Table 9, 'Annotation' denotes the types of annotations, which are provided in the first frame. We compare the proposed weakly supervised algorithm with two existing weakly supervised VOS algorithms in [3,4] that take the bounding box annotation. Even though the point-click annotation in the proposed algorithm requires less human effort than the bounding box annotation in [3,4], the proposed weakly supervised algorithm achieves more accurate VOS and provides the best performances on both DAVIS2016 and DAVIS2017. Moreover, the proposed algorithm outperforms [2,6], which take four points for a target object per frame and category labels, respectively.
Table 9 also shows the performance of the proposed algorithm on DAVIS2017, when only the number of target objects is provided without point clicks. In other words, supervision is limited to the number of target objects. Then, the proposed algorithm selects as many salient object tracks as the provided number of targets. Even with this minimal supervision, the proposed algorithm achieves comparable performances to SiamMask [4].
Figure 7 shows the mean J&F scores on DAVIS2017 according to the number of point clicks in the inference. We see that the performance generally increases as the number of point clicks becomes larger, but it is saturated when more than 30 points are used. We fix the number of point clicks to 50 in this work. Moreover, the proposed algorithm outperforms SiamMask [4] using only five point clicks.
Note that the proposed algorithm requires 50 point clicks on each target object in the first frame of a video sequence. Then, SD-Net in Figure 6 uses and propagates the information to segment the object in all frames in the sequence. We can generalize this weakly supervised algorithm straightforwardly to perform interactive VOS, in which repetitive user interactions are provided to refine the segmentation results. In the first interactive segmentation round, given 50 point clicks for each object in the first frame, the proposed SD-Net obtains the segmentation results for all frames. In the next round, we find the frame with the worst segmentation result and then provide additional point clicks to SD-Net to refine the inaccurate result. Then, SD-Net propagates the refined result bi-directionally to both ends of the sequence. This is repeated until a desired level of segmentation is achieved. Table 10 shows the segmentation performances according to the number of interaction rounds. The performances increase quickly and saturate at approximately the fifth round.
Figure 8 shows the qualitative results of the proposed weakly supervised algorithm on DAVIS2017. The first column illustrates the point clicks, annotated by users on the first frames, and the other columns are the corresponding segmentation results in the first and subsequent frames. We see that multiple target objects, either connected or not, are segmented out well using the point clicks.
Table 11 shows the performance on the SegTrack v2 dataset [52]. SegTrack v2 contains 14 low-resolution videos with 24 generic foreground objects. We perform the experiments on the full videos in SegTrack v2 using the proposed weakly supervised algorithm with 30 initial point clicks. Notice that the proposed SD-Net is trained on DAVIS2017 and YouTube2018 and is not fine-tuned on SegTrack v2. In Table 11, the mean J score is averaged across all instances. The scores of the conventional algorithms are from their respective papers. We see that, despite requiring weaker supervision, the proposed algorithm achieves a higher score than RGMP [53] and a comparable score to [54][55][56][57].
Figure 8. From top to bottom: "Dogs-jump", "Gold-fish", "Horsejump-high", "Loading", "Motocross-jump", "Pigs", "Lab-coat", and "Soapbox".
Table 11. Comparison of the proposed weakly supervised algorithm with the conventional semi-supervised algorithms on SegTrack v2.

Running Time Analysis
We measure the running time of the proposed SCO algorithm for finding cliques in a complete k-partite graph. In this test, we use the "Boxing-fisheye" sequence in the DAVIS2017 dataset. We use a computer with a 2.6 GHz CPU. The running time of SCO is affected by two factors: (1) the number N of object instances in a frame and (2) the number T of frames in a sequence. Figure 9a shows the running times according to N, when T is fixed to 50. Figure 9b plots the running times according to T, when N is limited to 10. The proposed algorithm is faster than the binary integer program in [38], which consumes about 1 s when N = 10 and T = 50.
Table 12 analyzes the running times of SCO-F. The proposed algorithm performs FCIS [19] for generating the object instances and also the optical flow estimation [36], saliency estimation, and feature extraction in each frame. Then, it performs SCO for the global optimization. In this analysis, the number of frames is 80, and the number of object instances in each frame is 10. SCO takes 0.21 s for the entire sequence, which is negligible. Then, the proposed algorithm also performs two segmentation refinement methods based on MRF and SD-Net in each frame. In total, the proposed algorithm takes 1.79 s per frame (SPF). It is comparable to MATNet [26] (0.75 SPF) and faster than UOVOS [28] (9.96 SPF).

Conclusions
We proposed a novel algorithm to segment out objects in a video sequence in both unsupervised and weakly supervised scenarios by solving the problem of finding cliques in a complete k-partite graph. We first generated the object instances in each frame. Then, we chose a salient instance from each frame to construct the salient object track. For this purpose, we developed the SCO technique using both the saliency and similarity energies. By applying SCO repeatedly, we obtained multiple salient object tracks. Finally, we transformed these tracks into VOS results. For weakly supervised VOS, we adapted SCO and developed SD-Net to produce segmentation results by exploiting point clicks on the target objects in the first frame. The experimental results showed that the proposed algorithm provides comparable or better performances than the state-of-the-art VOS algorithms on the DAVIS2016 and DAVIS2017 datasets.
In spite of its achievements, the proposed algorithm still has a limitation. To obtain the set of object instances, the proposed algorithm uses FCIS, which is trained on still-image instance segmentation. In future work, we plan to develop an instance segmentation network for video sequences that takes advantage of the temporal context information in different frames. Instead of frame-by-frame instance segmentation, we expect that the instance segmentation results from multiple frames can provide more reliable salient instance segments to perform the proposed SCO process. Moreover, we will improve SD-Net by adding the transformer in the feature mixer to fuse the two features of the first frame and the current frame more effectively.

Figure 6. The architecture of the proposed SD-Net.

Figure 7. The mean J&F performances on DAVIS2017 according to the number of point clicks [19].

Figure 9. The running times according to (a) the number of object instances and (b) the number of frames.

4: $\theta_t \leftarrow \arg\max_{\theta \in \mathbb{N}_{N_t}} s_{t,\theta}$
5: end for
6: repeat
7: for each frame $I_t$ do

Table 1. Ablation studies on the validation set in DAVIS2016 according to the refinement methods.

Table 2. Ablation studies on the validation set in DAVIS2017 according to the refinement methods.

Table 3. Performances on the validation set in DAVIS2017 according to the saliency estimation methods: 'w/o OF' denotes that optical flow is not used in the proposed saliency estimation.

Table 4. Ablation studies on the validation set in DAVIS2017: 'w/o SOTR' and 'w/o DD' mean that the salient object track refinement and the disappearance detection are not used, respectively.

Table 5. Performances on the validation set in DAVIS2017 according to the feature settings. The performance of the proposed setting is boldfaced.

Table 6. Comparison of the proposed SCO algorithm with the conventional unsupervised VOS algorithms on the validation set in DAVIS2016. The best results are boldfaced, and the second best ones are underlined.

Table 7. Attribute-based performance comparison on the validation set in DAVIS2016.

Table 8. Comparison of the proposed SCO algorithm with the conventional unsupervised algorithms on the validation set in DAVIS2017.

Table 9. Comparison of the proposed weakly supervised algorithm with the conventional weakly supervised algorithms on the validation sets in DAVIS2016 and DAVIS2017 according to annotation types: '# of targets' denotes the number of target objects.

Table 10. Performance on the validation set in DAVIS2017 according to the number of interaction rounds.

Table 12. Running times in seconds per frame (SPF).