Complementary Segmentation of Primary Video Objects with Reversible Flows

Segmenting primary objects in a video is an important yet challenging problem in computer vision, as it exhibits various levels of foreground/background ambiguities. To reduce such ambiguities, we propose a novel formulation that exploits foreground and background context as well as their complementary constraint. Under this formulation, a unified objective function is further defined to encode each cue. For implementation, we design a Complementary Segmentation Network (CSNet) with two separate branches, which can simultaneously encode the foreground and background information along with joint spatial constraints. The CSNet is trained end-to-end on massive images with manually annotated salient objects. By applying CSNet to each video frame, the spatial foreground and background maps can be initialized. To enforce temporal consistency effectively and efficiently, we divide each frame into superpixels and construct a neighborhood reversible flow that reflects the most reliable temporal correspondences between superpixels in far-away frames. With such flow, the initialized foregroundness and backgroundness can be propagated along the temporal dimension so that primary video objects gradually pop out and distractors are well suppressed. Extensive experimental results on three video datasets show that the proposed approach achieves impressive performance in comparison with 18 state-of-the-art models.


INTRODUCTION
Segmenting primary objects aims to delineate the physical boundaries of the most perceptually salient objects in an image or video. By perceptual saliency, we mean that the objects should be visually salient in image space while being present in most of the video frames. This is a useful assumption that works under various unconstrained settings, thus benefiting many computer vision applications such as action recognition, object class learning, video summarization, video editing and content-based video retrieval.
Despite impressive performance in recent years [4], [5], [6], [7], [8], [9], primary object segmentation remains a challenging task, since real-world images exhibit various levels of ambiguity in determining whether a pixel belongs to the foreground or the background. The ambiguities are more serious in video frames due to video attributes that represent specific situations, such as fast motion, occlusion, appearance change and cluttered background [10]. In particular, these attributes are not exclusive, so a sequence can be annotated with multiple attributes. As shown in Fig. 1, due to camera and/or object motion, the primary objects may suffer from motion blur (e.g., the last dog frame), occlusion (e.g., the second dog frame) and even go out of view (e.g., the last two turtle frames). Moreover, the primary objects may co-occur with various distractors in different frames (e.g., the turtle video frames), making it difficult for them to consistently pop out throughout the whole video.
A preliminary version of this work has been published in ICCV 2017 [1].
Fig. 1: Primary objects may co-occur with or be occluded by various distractors. They may not always be the most salient ones in each separate frame but can consistently pop out in most video frames (frames and masks are taken from the datasets VOS [2] and Youtube-Objects [3], respectively).
To address these issues, there exist three major types of models, which can be roughly categorized into interactive, weakly-supervised and fully-automatic ones. Interactive models require manually annotated primary objects in the first frame or several selected frames before the automatic segmentation [11], [12], [13], while weakly-supervised ones often assume that the semantic tags of primary video objects are known before segmentation so that external cues like object detections can be used [14], [15]. However, the requirement of interaction or semantic tags prevents their usage in processing large-scale video data [16].
arXiv:1811.09521v1 [cs.CV] 23 Nov 2018
Fig. 2: Framework of the proposed approach. The framework consists of two major modules. The spatial module trains CSNet to simultaneously initialize the foreground and background maps of each frame. This module operates on GPU to provide pixel-wise predictions for each frame. The temporal module constructs neighborhood reversible flow so as to propagate foregroundness and backgroundness along the most reliable inter-frame correspondences. This module operates on superpixels for efficient temporal propagation. Note that E(·) is the cross-entropy loss that enforces F → G and B → 1 − G. The proposed complementary loss Ω(F, B) contains the intersection loss Ω∩(F, B) and the union loss Ω∪(F, B) for a complementary constraint. F, B and G are the foreground, background and ground truth, respectively. λ∩ and λ∪ are the corresponding weights. More details about CSNet are given in Section 3.2.
Beyond these two kinds of models, fully-automatic models aim to directly segment primary objects in a single video [17], [18], [16], [19] or co-segment the primary objects shared by a collection of videos [20], [21], [22] without any prior information about the objects. Although CNNs have achieved impressive progress in object segmentation, insufficient video data with pixel-level annotations may prevent the end-to-end training of a complex spatiotemporal model.
In view of the remarkable performance in image-based primary object segmentation, an easy way is to extend the image-based models to videos by considering spatial attributes and the additional temporal cues of primary video objects [17], [23], [24]. Such spatiotemporal attributes, like attractive appearance, better objectness, distinctive motion from the surroundings and frequent occurrence in the whole video, mainly focus on foreground features and have attracted much attention from most models [25], [26], [27]. In fact, the background is symbiotic with the foreground and carries much implicit information. Thus some models pay more attention to background cues, such as boundary connectivity [25], [28], surroundings [29], and even complex dynamic background modeling [30]. Naturally, this leads to several models [17], [31] that consider both foreground and background cues to assist foreground segmentation. However, there exist two issues. On one hand, the complexity of primary objects sometimes renders these attributes insufficient (e.g., distractors share common visual attributes with targets), so these models may fail on certain videos in which the assumptions do not hold. On the other hand, these models either ignore foreground/background or only utilize one to facilitate the other, which may miss some important cues and result in more ambiguities between foreground and background.
Moreover, temporal coherence is an important issue for primary video object segmentation, and directly applying image-based algorithms to videos is vulnerable to inconsistent segmentation. To reduce such inconsistency, costly processing steps are usually adopted, such as object/trajectory tracking and sophisticated energy optimization models [17], [24], [31], [32]. In particular, pixel-wise optical flows are widely used to propagate information between adjacent frames. Unfortunately, optical flows are often inaccurate in case of sudden motion changes or occlusions, so errors may accumulate over time. Moreover, correspondences restricted to adjacent temporal windows may prevent long-term information from being propagated effectively.
Considering all these issues, this paper proposes a novel approach that effectively models the complementary nature of foreground/background in primary video object segmentation, and efficiently propagates information temporally within neighborhood reversible flow (NRF). Firstly, the problem of primary object segmentation is formulated as a novel objective function that explicitly considers foreground and background cues as well as their complementary relationships. In order to optimize the function and obtain the foregroundness and backgroundness predictions, a Complementary Segmentation Network (CSNet) with multi-scale feature fusion and foreground/background branching is proposed. Then, to enhance the temporal consistency of the initial predictions, NRF is further proposed to establish reliable, non-local inter-frame correspondences. These two techniques constitute the spatial and temporal modules of the proposed framework, as shown in Fig. 2.
In the spatial module, CSNet is trained on massive annotated images as an optimizer of the proposed complementary objective so as to simultaneously handle two complementary tasks, i.e., foregroundness and backgroundness estimation, with two separate branches. By using CSNet, we can obtain the initialized foreground and background maps on each individual frame. To efficiently and accurately propagate such spatial predictions between far-away frames, we further divide each frame into a set of superpixels and construct neighborhood reversible flow so as to depict the most reliable temporal correspondences between superpixels in different frames. Within such flow, the initialized spatial foregroundness and backgroundness are efficiently propagated along the temporal dimension by solving a quadratic programming problem that has an analytic solution. In this manner, primary objects can efficiently pop out and distractors can be further suppressed. Extensive experiments on three video datasets show that the proposed approach runs efficiently and achieves impressive performance compared with 18 state-of-the-art models (7 image-based & non-deep, 6 image-based & deep, 5 video-based).
This paper builds upon and extends our previous work in [1] with further discussion of the algorithm, analysis and expanded evaluations. We further formulate the segmentation problem as a new objective function based on the constraint relationship between foreground and background, and optimize it using a new complementary deep network.
The main contributions of this paper include:
1) We formulate the problem of primary object segmentation as a novel objective function based on the relationship between foreground and background, and incorporate the objective optimization problem into end-to-end CNNs. In this manner, the two dual tasks of foreground and background segmentation can be simultaneously addressed, and primary video objects can be segmented from complementary cues.
2) We construct neighborhood reversible flow between superpixels, which effectively propagates foreground and background cues along the most reliable inter-frame correspondences and leads to more temporally consistent results.
3) Based on the proposed method, we achieve impressive performance compared with 18 existing image-based and video-based models, achieving state-of-the-art results.
In the rest of this paper, we first conduct a brief review of previous studies on primary/salient object segmentation in Section 2. Then, we present the technical details of the proposed spatial initialization module in Section 3 and the temporal refinement module in Section 4. Experimental results are shown in Section 5. Finally, we conclude with a discussion in Section 6.

RELATED WORK
Good primary video object segmentation builds on good performance on each individual frame. In this section, we give a brief overview of recent works on salient object segmentation in images and primary/semantic object segmentation in videos.

Salient Object Segmentation in Images
Salient object segmentation in images is a research area that has developed greatly over the past twenty years, in particular since 2007 [33].
Early approaches treat salient object segmentation as an unsupervised problem and focus on low-level and mid-level cues, like contrast [25], [34], focusness [35], spatial property [36], [37], spectral information [38], objectness [29], etc. Most of the cues build upon foreground priors. For example, the widely used contrast prior assumes that salient regions present high contrast over the background in a certain context [36], [39], and the focusness prior considers that a salient object is often photographed in focus to attract more attention. From the opposite perspective, the background prior was first proposed by Wei et al. [37], who assume that the image boundaries are mostly background and build a saliency detection model based on two background priors, i.e., boundary and connectivity. After that, several further approaches [28], [40], [41], [42] appeared. Unfortunately, these methods usually require a prior hypothesis about salient objects, and their performance depends heavily on the reliability of the prior. Besides, methods that only use low-level/mid-level cues have difficulty detecting salient objects in complex scenes because they are unaware of image content.
Recently, learning-based methods, especially deep network methods (i.e., CNN-based models and FCN-based models), have attracted much attention because of their ability to extract high-level semantic information [6], [8], [43]. In [8], two neural networks, DNN-L and DNN-G, are proposed to respectively extract local features and conduct a global search for generating the final saliency map. In [7], Li and Yu introduce a neural network with fully connected layers to regress the saliency degree of each superpixel by extracting multi-scale CNN features. However, these CNN-based models with fully connected layers operate at the patch level and may produce blurry saliency maps, especially near the boundaries of salient objects. Thus in [44], fully convolutional networks that operate at the pixel level are applied to salient object segmentation. After that, various FCN-based salient object segmentation approaches have been explored [45], [46], [47] and obtain impressive performance.
However, most of the methods focus on independent foreground or background features, and only several models [48], [49] pay attention to both of them. To the best of our knowledge, few models explicitly model the constraint relationship between them, although it may be very helpful in complex scenes. Therefore, in this work, we simultaneously consider foreground and background cues as well as their complementary relationships and optimize their joint objective by using the powerful learning ability of deep networks.

Primary/Semantic Object Segmentation in Videos
Different from salient object segmentation in images, primary video object segmentation faces more challenges and criteria (e.g., spatiotemporal consistency) due to the additional temporal attributes.
Motion information (e.g., motion vectors, feature point trajectories and optical flow) is usually used in the spatiotemporal domain to facilitate primary/semantic video object segmentation and enhance the spatiotemporal consistency of segmentation results [50], [51], [52]. For example, Papazoglou and Ferrari [18] first initialize foreground maps with motion information and then refine them in the spatiotemporal domain so as to enhance the smoothness of foreground objects. Zhang et al. [16] use optical flow to track the evolution of object shape and present a layered Directed Acyclic Graph based framework for primary video object segmentation. In a further step, Tsai et al. [32] utilize a multi-level spatial-temporal graphical model with optical flow and supervoxels to jointly optimize segmentation and optical flow in an iterative scheme. The re-estimated optical flow (i.e., object flow) is used to maintain object boundaries and temporal consistency. Nevertheless, there still exist several issues. Firstly, some models [51], [53], [54] are built upon certain assumptions, for instance that foreground objects should move differently from their surroundings in a good fraction of the video, or should be spatially dense and change smoothly across frames in shape and location. Such models may fail on videos that contain complex scenarios in which the assumptions do not hold. Secondly, pixel-wise optical flow is usually computed between adjacent frames, since their similarity yields more accurate flow estimates. However, this makes it hard to obtain valuable cues from far-away frames, and adjacent frames may not offer useful cues due to occlusion, blur, out-of-view objects, etc.
Recently, a number of approaches have attempted to address video object segmentation via deep neural networks. Due to the lack of sufficient video data with per-frame pixel-level annotations, most of them exploit temporal information on top of image segmentation approaches for video segmentation. One popular idea is to calculate a kind of correspondence flow and propagate it between frames [55], [56], [57]. In [55], a Spatio-Temporal Transformer GRU based on optical flow is proposed to temporally propagate labeling information between adjacent frames for semantic video segmentation. In [57], a deep feature flow is presented to propagate deep feature maps from key frames to other frames, which is jointly trained with video recognition tasks. Although these methods are helpful for transferring image-based segmentation networks to videos, the propagation flows are still limited by adjacent frames or training complexity.
Therefore, in our work, we enhance inter-frame consistency by constructing neighborhood reversible flow (NRF) instead of optical flow to efficiently and accurately propagate the initialized predictions between adjacent key frames, which is simple but effective for popping out the consistent primary object in the whole video.

INITIALIZATION WITH COMPLEMENTARY CNNS
In this section, starting from the complementary nature of foreground and background, we reformulate the problem of primary video object segmentation as a new objective function. Then we design complementary CNNs to conduct deep optimization of the objective function and yield the initial foreground and background estimation.

Problem Formulation
Typically, a frame I consists of the foreground area F and the background area B with F ∩ B = Ø and F ∪ B = I, i.e., the foreground and background should be complementary in image space. Considering that foreground objects and background distractors usually have different visual characteristics (e.g., clear versus fuzzy edges, large versus small sizes, high versus low objectness), we can attack the problem of primary object segmentation in the frame I from a complementary perspective, estimating foreground and background maps separately. In this manner, the intrinsic characteristics of foreground and background regions can be better captured by two models with different focuses. Keeping this in mind, we propose the following formulation to explicitly consider foreground and background cues:
$$\min_{W_F, W_B} \; L(F, B, G) + \Omega(F, B), \quad \text{s.t.} \;\; F = \phi_F(I; W_F), \;\; B = \phi_B(I; W_B), \qquad (1)$$
where the maps F and B are two binary matrices representing the foreground and background areas, and G is the ground-truth map that equals 1 for a foreground pixel and 0 for a background pixel. W_F and W_B are two sets of parameters for the foreground and background prediction models φ_F and φ_B. For the sake of simplicity, the values of F and B are assumed to be in the range [0, 1]. The first term L(F, B, G) is the empirical loss defined as
$$L(F, B, G) = E(F, G) + E(B, 1 - G), \qquad (2)$$
where E(·) is the cross-entropy loss that enforces F → G and B → 1 − G. Ideally, salient objects and background regions can be perfectly detected by minimizing these two losses. However, errors always exist even when two extremely complex models are used. In this case, conflicts and unlabeled areas may arise in the predicted maps (e.g., both F and B equal 1 or 0 at the same location).
To reduce such errors, we refer to the constraint relationship F ∩ B = Ø and F ∪ B = I and incorporate the complementary loss Ω(F, B):
$$\Omega(F, B) = \lambda_\cap \Omega_\cap(F, B) + \lambda_\cup \Omega_\cup(F, B), \qquad (3)$$
where Ω∩(·) and Ω∪(·) are two losses with non-negative weights λ∩ and λ∪ that encode the constraints F ∩ B = Ø and F ∪ B = I, respectively. Here, λ∩ and λ∪ are both set to 0.4. The intersection loss term Ω∩(·), defined in (4), tries to minimize the conflicts between F and B. In (4), |I| indicates the number of pixels in the image I and p is a pixel with predicted foregroundness F(p) and backgroundness B(p). σ∩ is a positive weight that controls the penalty of conflicts. The minimum value of (4) is reached when F(p)·B(p) = 0, implying that at least one map has a zero prediction at every location.
Similarly, the union loss term Ω∪(·), defined in (5), tries to maximize the complementary degree between F and B. We can see that the minimum complementary loss is reached when F(p) + B(p) = 1 (i.e., perfectly complementary predictions). The parameter σ∪ is a positive weight that controls the penalty of non-complementary predictions.
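As a rough illustration of how the empirical loss (2) and the complementary loss (3) fit together, the NumPy sketch below evaluates both on a pair of predicted maps. The exact forms of the intersection and union terms are assumptions here: the intersection term is taken as the mean of (F(p)·B(p))^σ∩ and the union term as the mean of |1 − F(p) − B(p)|^σ∪, chosen only because they attain their minima at F(p)·B(p) = 0 and F(p) + B(p) = 1 as stated above. The function names and the default σ values are hypothetical.

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy E(pred, target) over all pixels."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))

def complementary_loss(F, B, lam_cap=0.4, lam_cup=0.4,
                       sigma_cap=1.0, sigma_cup=1.0):
    """Weighted sum of intersection and union penalties, as in (3).

    ASSUMPTION: sigma is used as an exponent and the per-pixel penalties are
    (F*B)^sigma and |1 - F - B|^sigma; the text does not show these forms.
    """
    inter = float(np.mean((F * B) ** sigma_cap))               # conflicts
    union = float(np.mean(np.abs(1.0 - F - B) ** sigma_cup))   # unlabeled / overlap
    return lam_cap * inter + lam_cup * union

def total_loss(F, B, G):
    """Empirical loss (2) plus complementary loss (3)."""
    return cross_entropy(F, G) + cross_entropy(B, 1.0 - G) + complementary_loss(F, B)

# Toy check: perfectly complementary predictions incur no complementary penalty.
G = np.array([[1.0, 0.0], [0.0, 1.0]])
print(complementary_loss(G, 1.0 - G))
print(complementary_loss(np.ones_like(G), np.ones_like(G)))
```

In a real training setup these terms would be expressed in a deep learning framework so that gradients flow back to both branches; the sketch only shows which quantities are penalized.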

Deep Optimization with Complementary CNNs
Given the empirical loss (2) and the complementary loss (3), we can derive two models φ_F(·) and φ_B(·) for per-frame initialization of the foreground and background maps by solving the optimization problem of the objective function (1). Toward this end, we need to first determine the form of the models and the algorithm for optimizing their parameters.
Considering the impressive capability of convolutional neural networks (CNNs), we propose to solve the optimization problem in a deep learning paradigm. The architecture of the proposed CNN can be found in Fig. 3, which starts from a shared trunk and ends with two separate branches, i.e., a foreground branch and a background branch. Main configurations and details are shown in Table 1. For simplicity, only the foreground branch is illustrated in Table 1, as the background one adopts the same architecture. Note that this network simultaneously handles two complementary tasks as well as their relationships, and is therefore denoted as the Complementary Segmentation Network (CSNet). The parameters of the shared trunk are initialized from ResNet50 [58] and are used to extract low-level to high-level features that are shared by foreground objects and background distractors. We remove the pooling layer and the fully connected layer after the ReLU layer of res5c, and introduce two pooling blocks (see Fig. 3) to provide features from additional levels and reduce parameters. In order to integrate both local and global context, we sum up the different levels of features output by layers Res3, Res4 and Res5 and the two pooling blocks after appropriate up/down-sampling operations. After that, a residual block with a 3×3 CONV layer and a 1×1 CONV layer is used to post-process the integrated features as well as increase their nonlinearity. Finally, the shared trunk takes a 320 × 320 image as the input, and outputs a 40 × 40 feature map with 512 channels.
After the shared trunk, the features are fed into two separate branches that address two complementary tasks, i.e., foreground and background estimation. Note that the two branches share the same input and architecture, but produce complementary outputs. In each branch, the shared features pass through a sequence of convolution blocks. These blocks all consist of 1×1 and 3×3 CONVs, but with different dilations. We then concatenate the output of each block to constitute feature maps at 40 × 40 resolution with 1280 channels. These features, which have a wide range of spatial context and abstraction levels, are finally fed into several CONV layers for dimensionality reduction and post-processing, and upsampled to produce output segmentation maps of size 161 × 161. With such designs, the foreground branch mainly focuses on detecting salient objects, while the background one suppresses distractors. In addition to the empirical loss defined in (2), the two additional losses (4) and (5) are also adopted to penalize the conflicts and the lack of complementarity of the output maps for more accurate predictions.
Fig. 3: Architecture of the proposed CSNet. Note that layers Res1 and Res2/3/4/5 correspond to layers conv1 and conv2_x/3_x/4_x/5_x in [58], respectively. More details are shown in Table 1.
In the training stage, we collect massive images with labeled salient objects from four datasets for image-based salient object detection [7], [40], [59], [60]. We down-sample all images to 320 × 320 and their ground-truth saliency maps to 161 × 161. For the pretrained ResNet50 trunk the learning rate is set to 5 × 10⁻⁷, while for the two branches it is 5 × 10⁻⁶. We train the network with a mini-batch of 4 images, using the SGD optimizer with momentum 0.9 and weight decay 0.0005.

EFFICIENT TEMPORAL PROPAGATION WITH NEIGHBORHOOD REVERSIBLE FLOW
The per-frame initialization of foregroundness and backgroundness can only provide a location prediction of the primary objects and background distractors in the spatial domain. However, the concept of primary objects is defined from a more global spatiotemporal perspective: they are not only salient within a frame but also consistent across frames and throughout the whole video. As mentioned earlier, the primary video object should be spatiotemporally consistent, i.e., the salient foreground regions should not change dramatically along the time dimension. This implies that there still exists a large gap between the frame-based initialization results and the video-based primary objects. Therefore, we need to further infer the primary objects that consistently pop out in the whole video [2] according to the spatiotemporal correspondence of visual signals. In this process, two key challenges need to be addressed:
1) How to find the most reliable correspondences between various (nearby or far-away) frames?
2) How to infer the consistent primary objects based on the spatiotemporal correspondences and the initialization results?
To address these two challenges, we propose a neighborhood reversible flow algorithm that finds neighborhood reversible superpixel pairs across frames and propagates information along them. Details of our solutions are discussed in the remainder of this section.

Neighborhood Reversible Flow
The proposed Neighborhood Reversible Flow (NRF) propagates information along reliable correspondences established among several key frames of the video, thus preventing errors from accumulating quickly and involving larger temporal windows for more effective context exploitation.
Instead of pixel-level correspondence, NRF operates on superpixels to achieve region-level matching and higher computational efficiency. Given a video V = {I_u}, u = 1, ..., K, we first apply the SLIC algorithm [61] to divide a frame I_u into N_u superpixels, denoted as {O_ui}. For each superpixel, we compute its average RGB, Lab and HSV colors as well as its horizontal and vertical positions. These features are then normalized into the same dynamic range [0, 1].
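The feature extraction step above can be sketched as follows, assuming a superpixel label map is already available (for instance from a SLIC implementation) and that the color channels have already been scaled to [0, 1]. The function name `superpixel_features` is hypothetical, and the color-space conversions to Lab and HSV are left to the caller, who can stack them along the channel axis.

```python
import numpy as np

def superpixel_features(image, labels):
    """Average color and normalized position of each superpixel.

    image:  (H, W, C) float array with values in [0, 1]; C = 3 for RGB only,
            or 9 if RGB, Lab and HSV channels are stacked (assumed pre-scaled).
    labels: (H, W) integer array with superpixel ids 0 .. N-1.
    Returns an (N, C + 2) array: C mean channel values, then mean x and mean y.
    """
    h, w = labels.shape
    n_sp = int(labels.max()) + 1
    flat = labels.ravel()
    areas = np.maximum(np.bincount(flat, minlength=n_sp), 1)

    ys, xs = np.mgrid[0:h, 0:w]
    channels = [image[..., c] for c in range(image.shape[2])]
    # Horizontal and vertical positions, normalized to [0, 1].
    channels += [xs / max(w - 1, 1), ys / max(h - 1, 1)]

    feats = [np.bincount(flat, weights=ch.ravel(), minlength=n_sp) / areas
             for ch in channels]
    return np.stack(feats, axis=1)

# Toy example: a 2x2 image with two horizontal superpixels.
img = np.zeros((2, 2, 3))
img[0, :, 0] = 1.0                      # top row is pure red
lab = np.array([[0, 0], [1, 1]])
print(superpixel_features(img, lab))
```

Using `np.bincount` with weights keeps the per-superpixel averaging vectorized, which matters when this is repeated for every frame.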
Based on the features, we need to address two fundamental problems: 1) how to measure the correspondence between a superpixel O_ui from the frame I_u and a superpixel O_vj from the frame I_v, and 2) which frames should be referred to for a given frame? Inspired by the concept of neighborhood reversibility in image search [62], we compute the pair-wise ℓ1 distances between {O_ui}, i = 1, ..., N_u, and {O_vj}, j = 1, ..., N_v. After that, we denote the k nearest neighbors of O_ui in the frame I_v as N_k(O_ui | I_v). As a consequence, two superpixels O_ui and O_vj are k-neighborhood reversible if they reside in the list of k nearest neighbors of each other. That is,
$$O_{ui} \in N_k(O_{vj} \mid I_u) \quad \text{and} \quad O_{vj} \in N_k(O_{ui} \mid I_v). \qquad (6)$$
From (6), we find that the smaller k is, the more tightly the two superpixels are temporally correlated. Therefore, the correspondence between O_ui and O_vj can be measured by the flow f_ui,vj defined in (7), where k_0 is a constant to suppress weak flow and k is a variable. A small k_0 builds sparse correspondences between I_u and I_v (e.g., k_0 = 1), while a large k_0 causes dense correspondences. In this study, we empirically set k_0 = 15 and represent the flow between I_u and I_v with a matrix F_uv of size N_u × N_v, in which the component at (i, j) equals f_ui,vj. Note that we further normalize F_uv so that each row sums to 1. Considering the highly redundant visual content between adjacent frames, for each video frame I_u we pick its adjacent keyframes {I_t | t ∈ T_u} to ensure sufficient variation in content and depict reliable temporal correspondences. In this paper, we refer to the interval d_k of the annotated video frames, which usually contain the most critical information of the whole video, to determine the interval of adjacent keyframes. We then estimate the flow matrices between a frame I_u and the frames {I_t | t ∈ T_u}, where T_u can be empirically set to {u − 2d_k, u − d_k, u + d_k, u + 2d_k}.
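A minimal NumPy sketch of the flow construction described above is given below. For each pair of superpixels it finds the smallest k for which the pair is k-neighborhood reversible (the larger of the two mutual nearest-neighbor ranks), turns it into a weight, and row-normalizes the result. The weighting function exp(−2k/k_0) for k ≤ k_0 and 0 otherwise is an assumption made for illustration, as are the function names and the choice to drop keyframe indices that fall outside the video.

```python
import numpy as np

def neighborhood_reversible_flow(feat_u, feat_v, k0=15):
    """Row-normalized flow matrix F_uv between two frames.

    feat_u: (Nu, D) superpixel features of frame I_u, values in [0, 1].
    feat_v: (Nv, D) superpixel features of frame I_v.
    """
    # Pair-wise l1 distances, shape (Nu, Nv).
    dist = np.abs(feat_u[:, None, :] - feat_v[None, :, :]).sum(axis=2)
    # rank_uv[i, j]: 1-based rank of O_vj among the neighbors of O_ui in I_v.
    rank_uv = dist.argsort(axis=1).argsort(axis=1) + 1
    # rank_vu[i, j]: 1-based rank of O_ui among the neighbors of O_vj in I_u.
    rank_vu = dist.argsort(axis=0).argsort(axis=0) + 1
    # Smallest k for which the pair is k-neighborhood reversible.
    k = np.maximum(rank_uv, rank_vu)
    # ASSUMED weighting: exponential decay in k, cut off beyond k0.
    flow = np.where(k <= k0, np.exp(-2.0 * k / k0), 0.0)
    row_sum = flow.sum(axis=1, keepdims=True)
    return np.divide(flow, row_sum, out=np.zeros_like(flow), where=row_sum > 0)

def keyframe_neighbors(u, d_k, num_frames):
    """Indices T_u = {u-2d_k, u-d_k, u+d_k, u+2d_k}, dropping out-of-range ones."""
    candidates = [u - 2 * d_k, u - d_k, u + d_k, u + 2 * d_k]
    return [t for t in candidates if 0 <= t < num_frames]

feat = np.array([[0.0], [0.4], [1.0]])
print(neighborhood_reversible_flow(feat, feat, k0=1))   # only mutual nearest neighbors survive
print(keyframe_neighbors(u=5, d_k=10, num_frames=40))
```

With k_0 = 1 only mutual nearest neighbors keep a non-zero weight, which corresponds to the sparse case mentioned in the text; larger k_0 admits progressively weaker correspondences.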

Temporal Propagation of Spatial Features
The flow {F_uv} depicts how superpixels in various frames are temporally correlated, which can be used to further propagate the spatial foregroundness and backgroundness. Typically, such temporal refinement can obtain impressive performance by solving a complex optimization problem with constraints like spatial compactness and temporal consistency. However, the time cost also grows very high [15]. Considering the requirement of efficiency in many real-world applications, we propose to minimize an objective function that has an analytic solution. For a superpixel O_ui, its foregroundness x_ui and backgroundness y_ui can be initialized as
$$x_{ui} = \frac{1}{|O_{ui}|} \sum_{p \in O_{ui}} X_u(p), \qquad y_{ui} = \frac{1}{|O_{ui}|} \sum_{p \in O_{ui}} Y_u(p), \qquad (8)$$
where p is a pixel with foregroundness X_u(p) and backgroundness Y_u(p), and |O_ui| is the area of O_ui. For the sake of simplicity, we represent the foregroundness and backgroundness scores of all superpixels in the u-th frame with column vectors x_u and y_u, respectively. As a result, we can propagate such scores from I_v to I_u according to F_uv:
$$x_{u|v} = F_{uv}\, x_v, \qquad y_{u|v} = F_{uv}\, y_v. \qquad (9)$$
After the propagation, the refined foregroundness vector x̂_u and backgroundness vector ŷ_u can be obtained by solving the minimization problem in (10), where λ_c is a positive constant whose value is empirically set to 0.5. Note that we adopt only the ℓ2 norm in (10) so as to efficiently compute the analytic solution in (11). By observing (9) and (11), we find that the propagation process actually calculates the average foregroundness and backgroundness scores within a local temporal slice under the guidance of neighborhood reversible flow. After the temporal propagation, we turn the superpixel-based scores into pixel-based ones as in (12), where M_u is the importance map of I_u that depicts the presence of primary objects, and δ(p ∈ O_ui) is an indicator function which equals 1 if p ∈ O_ui and 0 otherwise. Finally, we calculate an adaptive threshold which equals 20% of the maximal pixel importance to binarize each frame, and a morphological closing operation is then performed to fill in the black areas in the segmented objects.
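The propagation pipeline described in this subsection can be sketched in a few lines of NumPy. Several pieces are assumptions rather than the paper's stated formulas: the refinement is taken to minimize ||x − x_u||² + λ_c Σ_v ||x − x_u|v||², whose closed-form minimizer is a weighted average; the pixel-wise importance is taken to be x̂·(1 − ŷ) of the enclosing superpixel; and the threshold comparison is taken as "greater than or equal". All function names are hypothetical, and the final morphological closing is omitted.

```python
import numpy as np

def pool_to_superpixels(score_map, labels, n_sp):
    """Average a pixel-wise map inside each superpixel (initialization step)."""
    flat = labels.ravel()
    sums = np.bincount(flat, weights=score_map.ravel(), minlength=n_sp)
    areas = np.bincount(flat, minlength=n_sp)
    return sums / np.maximum(areas, 1)

def propagate(flow_uv, scores_v):
    """Carry superpixel scores from frame I_v to frame I_u along the flow."""
    return flow_uv @ scores_v

def refine(scores_u, propagated, lam_c=0.5):
    """Closed-form minimizer of the ASSUMED quadratic objective:
    ||x - x_u||^2 + lam_c * sum_v ||x - x_{u|v}||^2.
    """
    total = np.asarray(scores_u, dtype=float).copy()
    for s in propagated:
        total += lam_c * np.asarray(s, dtype=float)
    return total / (1.0 + lam_c * len(propagated))

def importance_map(x_hat, y_hat, labels):
    """ASSUMED combination: high foregroundness and low backgroundness."""
    return (x_hat * (1.0 - y_hat))[labels]

def binarize(importance, ratio=0.2):
    """Adaptive threshold at a fraction of the maximal pixel importance."""
    return importance >= ratio * importance.max()

labels = np.array([[0, 0], [1, 1]])
x_u = pool_to_superpixels(np.array([[1.0, 1.0], [0.2, 0.0]]), labels, 2)
x_hat = refine(x_u, [propagate(np.eye(2), x_u)])
print(binarize(importance_map(x_hat, np.array([0.0, 0.9]), labels)))
```

Because the assumed objective is a sum of squared ℓ2 distances, setting its gradient to zero gives the weighted average directly, which is why no iterative solver is needed.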

EXPERIMENTS
In this section, we first describe the experimental settings, including datasets and evaluation metrics, in Section 5.1. Then, based on the datasets and metrics, we quantitatively compare our primary video object segmentation method with 18 state-of-the-art approaches in Section 5.2. After that, in Section 5.3 we further demonstrate the effectiveness of our approach by offering a more detailed exploration and dissecting various parts of our approach, as well as its running time and failure cases.

Experimental Settings
We test the proposed approach on three widely used video datasets, which define primary video objects in different ways. Details of these datasets are described as follows:
1) SegTrack V2 [53] is a classic dataset in video object segmentation that is frequently used in many previous works. It consists of 14 densely annotated video clips with 1,066 frames in total. Most primary objects in this dataset are defined as the ones with irregular motion patterns.
2) Youtube-Objects [3] contains a large number of Internet videos, and we adopt its subset [69] that contains 127 videos with 20,977 frames. In these videos, 2,153 keyframes are sparsely sampled and manually annotated with pixel-wise masks according to the video tags. In other words, primary objects in Youtube-Objects are defined from the perspective of semantic attributes.
3) VOS [2] contains 200 videos with 116,093 frames. On 7,467 uniformly sampled keyframes, all objects are pre-segmented by 4 subjects, and the fixations of another 23 subjects are collected in eye-tracking tests. With these annotations, primary objects are automatically selected as the ones whose average fixation densities over the whole video fall above a predefined threshold. If no primary objects can be selected with the predefined threshold, objects that receive the highest average fixation density are treated as the primary ones. Different from SegTrack V2 and Youtube-Objects, primary objects in VOS are defined from the perspective of human visual attention.
On these three datasets, the proposed approach, denoted as CSP, is compared with 18 state-of-the-art models for primary and salient object segmentation, including:
1) Image-based & Non-deep (7): RBD [28], SMD [65], MB+ [42], DRFI [4], BL [63], BSCA [41], MST [64].
In the comparisons, we adopt two sets of evaluation metrics: the Intersection-over-Union (IoU) and the Precision-Recall-F_β. Similar to [2], the precision, recall and IoU scores are first computed on each video and then averaged over the whole dataset to generate the mean Average Precision (mAP), mean Average Recall (mAR) and mean Average IoU (mIoU). In this manner, the influence of short and long videos is balanced. Furthermore, a single F_β score is obtained from mAR, mAP and a parameter β, whose square is set to 0.3 to emphasize precision more than recall in the evaluation.
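The two-stage averaging described above can be illustrated with a minimal Python sketch. It is not the evaluation code used in the experiments: the function names, the flat 0/1 mask representation and the handling of empty masks are assumptions made for illustration, and the F_β expression is the commonly used weighted harmonic mean, which is assumed here because the text does not spell out the formula.

```python
def precision_recall_iou(pred, gt):
    """Precision, recall and IoU of two binary masks given as flat 0/1 sequences."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    # Assumed convention: a score is 0 when its denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def dataset_scores(videos, beta_sq=0.3):
    """videos: list of videos; each video is a list of (pred_mask, gt_mask) pairs.

    Scores are averaged within each video first and then across videos,
    so that short and long videos carry equal weight.
    """
    per_video = []
    for frames in videos:
        scores = [precision_recall_iou(p, g) for p, g in frames]
        per_video.append(tuple(mean(s[i] for s in scores) for i in range(3)))
    m_ap = mean(v[0] for v in per_video)
    m_ar = mean(v[1] for v in per_video)
    m_iou = mean(v[2] for v in per_video)
    # Assumed standard form: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    denom = beta_sq * m_ap + m_ar
    f_beta = (1 + beta_sq) * m_ap * m_ar / denom if denom else 0.0
    return {"mAP": m_ap, "mAR": m_ar, "mIoU": m_iou, "F_beta": f_beta}
```

With β² = 0.3, a drop in precision lowers F_β more than an equal drop in recall, which matches the stated intent of emphasizing precision.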

Comparisons with State-of-the-art Models
The performances of our approach and 18 state-of-the-art models on three video datasets are shown in Table 2. Some representative results of our approach are shown in Fig. 4. From Table 2, we find that on the larger datasets, Youtube-Objects and VOS, our approach obtains the best F_β and mIoU scores, while on SegTrack V2 it ranks second (behind NLC). This can be explained by the fact that SegTrack V2 contains only 14 videos, in most of which the primary objects have irregular motion patterns. Such videos often perfectly meet the assumption of NLC on the motion patterns of primary objects, making it the best approach on SegTrack V2. However, when the scenarios extend to datasets like VOS that are constructed without such "constraints" on motion patterns, the performance of NLC drops sharply as its assumption may sometimes fail (e.g., VOS contains many videos with only static primary objects and distractors as well as slow camera motion, see Fig. 4). These results further indicate that it is necessary to conduct comparisons on larger datasets of daily videos (like VOS) so that models with various kinds of assumptions can be fairly evaluated. Moreover, some approaches (BL and MB+) outperform our approach in recall on the three datasets, and some other approaches (NLC, ACO and FST) achieve better or comparable precision on SegTrack V2. However, the other evaluation scores of these approaches are much worse than those of our method on the three datasets. That is, none of these approaches simultaneously outperforms our approach in both recall and precision, so our approach often has better overall performance, especially on larger datasets. This may imply that the proposed approach is more balanced than previous works. By analyzing the results on the three datasets, we find that this phenomenon may be caused by the complementary tasks handled in CSNet. By propagating both foregroundness and backgroundness, some missing foreground information may be retrieved, while the mistakenly popped-out distractors can be suppressed again, leading to balanced recall and precision.
From Table 2, we also find that there exist inherent correlations between salient image object detection and primary video object segmentation. As shown in Fig. 4, primary objects are often the most salient ones in many frames, which explains why deep models like ELD, RFCN and DCL outperform many video-based models like NLC, SAG and GF. However, there are several key differences between the two problems. First, primary objects may not always be the most salient ones in all frames (as shown in Fig. 1). Second, inter-frame correspondences provide additional cues for separating primary objects from distractors, which offers a new way to balance recall and precision. Third, primary objects may sometimes be close to the video boundary due to camera and object motion, making the boundary prior widely used in many salient object detection models no longer valid (e.g., the bear in the last row of the last column of Fig. 4). Last but not least, salient object detection needs to distinguish a salient object from a fixed set of distractors, while primary object segmentation needs to consistently pop out the same primary object from a varying set of distractors. To sum up, primary video object segmentation is a more challenging task that needs to be further explored from the spatiotemporal perspective.

Detailed Performance Analysis
Beyond performance comparison, we also conduct several experiments on VOS, the largest of the three datasets, to find out how the proposed approach works in segmenting primary video objects. Moreover, in addition to the aforementioned four metrics, an additional metric, i.e., the temporal stability measure T [10], is applied to evaluate the temporal aspect of primary video object segmentation. After all, mIoU only measures how well the pixels of two masks match, while F_β measures the accuracy of contours. Neither of them considers the temporal aspect. However, video object segmentation is conducted in the spatiotemporal dimensions, so the temporal stability measure is an appropriate choice for evaluating the temporal consistency of segmentation results. The main quantitative results can be found in Table 3. In Table 3, the first group is our previous work in [1] and the second group is our current work extended from [1]. In order to illustrate the effect of each component in our approach, the two groups of tests are based on the same parameters except for the last case R-Init.+NRFp, which is the final result obtained by applying some data augmentation and parameter adjustment on top of the case R-Init.+NRF.

Performance of Complementary CNNs
In this section, a detailed analysis is given to further verify the effectiveness of the proposed complementary CNN branches and complementary loss.
Impact of two complementary branches. To explore the impact of the two complementary network branches, we evaluate the foreground maps and background maps initialized by the two branches, as well as their fusion maps. As shown in Table 3, in the first group the evaluation scores of the cases V-Init.FG and V-Init.BG are closely matched owing to their identical branch structures, while those of their fusion maps increase to different degrees, which suggests that the complementary characteristics of the initialized foreground and background maps can contribute to and constrain each other to generate more accurate predictions. Then what will happen if we abandon the background branch? To this end, we conduct two additional experiments in our previous work [1]. First, if we remove the background branch and retrain only the foreground branch, the final performance decreases by about 0.9%. Second, if we retrain a network with two foreground branches, the final F_β and mIoU scores decrease from 0.806 to 0.800 and from 0.710 to 0.700, respectively. These experiments indicate that, beyond learning more weights, the background branch has indeed learned some useful cues that are ignored by the foreground branch, which are expected to be high-level visual patterns of typical background distractors. These results also validate the idea of training deep networks by simultaneously handling two complementary tasks.
Therefore, a network structure with two similar branches is still adopted in this extended work. The difference is that the new network incorporates additional design elements and is built on the deeper and more effective ResNet50 instead of the simpler VGG16. From Table 3, we can find that the initialized results are distinctly improved when the VGG16 backbone is replaced by ResNet50. The aforementioned four evaluation scores all increase, e.g., the F_β and mIoU scores increase from 0.776 to 0.791 and from 0.684 to 0.710, respectively, although the temporal stability performance is affected. This reveals the better performance of our new network structure and, at the same time, hints that a favourable per-frame initialization does not guarantee a good video initialization, since videos additionally require temporal consistency. Thus, it is necessary to conduct optimization in the temporal dimension, which will be explained in the next subsection.
Effect of the complementary loss. Apart from the network structure, another main difference is that the two complementary CNN branches in CSNet are also constrained by our complementary loss. To verify its effectiveness, we optimize two sets of foreground and background prediction models based on the new network structure, one with the constraint of the penalty term in the objective function and the other without. Based on the two sets of models, we can initialize a foreground and a background map for each video frame. The quantitative evaluations of the initialization results correspond to the cases R-Init.FGp/BGp/(FGp+BGp) and the cases R-Init.FG/BG/(FG+BG) in Table 3, respectively. From Table 3 we can find that, compared to the predictions without the penalty term constraint, the foreground and background models with the additional complementary loss achieve better performance in predicting both foreground maps and background maps, as shown by the better F_β and mIoU scores. Moreover, some visual examples are shown in Fig. 5. If we only use the empirical loss (2), some background regions may be wrongly classified as foreground (e.g., the first three columns in Fig. 5) while some foreground details may be missed (e.g., the last three columns in Fig. 5). By incorporating the additional complementary loss (3), these mistakes can be fixed (see Fig. 5(d)(e)). Thus, the complementary loss is effective for localizing boundaries and suppressing background distractors. These results validate the effectiveness of handling two complementary tasks with explicit consideration of their relationships.
Combining the two differences, the F_β and mIoU scores of the initialization results increase by about 3.6% and 7.2% over those output by our previous network CCNN (the case V-Init.(FG+BK)), i.e., from 0.800 to 0.829 and from 0.689 to 0.739, respectively, together with increased mAP and mAR scores. This also indicates that the combination of the complementary loss and the two complementary branches of foreground and background is effective.
In particular, the complementary CNN branches in both of our networks show impressive performance in predicting primary video objects compared with the other 6 deep models when their pixel-wise predictions are directly evaluated on VOS. By analyzing the results, we find that this may be caused by two reasons: 1) using more training data, and 2) simultaneously handling complementary tasks, whose effectiveness has just been verified. To explore the first reason, we retrain CCNN on the same MSRA10K dataset used by most deep models. In this case, the F_β (mIoU) scores of the foreground and background maps predicted by CCNN decrease to 0.747 (0.659) and 0.745 (0.658), respectively. Note that both branches still outperform RFCN on VOS in terms of mIoU (although their F_β scores are slightly worse).

Effectiveness of Neighborhood Reversible Flow
Through the above complementary network branches, the salient foreground and background maps within each frame are well obtained, but the initialization cannot ensure the temporal consistency of the segmented objects. For example, the initialized predictions by CSNet outperform the ones by CCNN in terms of mAP, mAR, F_β and mIoU, but become inferior in temporal stability mT. Thus, Section 4 proposes to improve the temporal relationship by finding and propagating reliable inter-frame correspondences with the neighborhood reversible flow, so that consistent salient subsets are enhanced and accidental distractors are suppressed. Consequently, the final primary video objects with spatiotemporal consistency are obtained.
Fig. 6: Performance of the proposed neighborhood reversible flow. The first row shows video sequences from Youtube-Objects [3], the second row shows the corresponding initialized foreground maps, and the third row shows the results optimized by neighborhood reversible flow.
Effectiveness of Neighborhood Reversible Flow. To verify this idea, we compare the results initialized by CCNN (CSNet) with the results optimized by neighborhood reversible flow (V-Init. (R-Init.)+NRF). As shown in Table 3, the temporal stability measure mT of the optimized results in the CCNN (CSNet) cases decreases from 0.121 (0.122) to 0.109 (0.108) compared with the initialized predictions (V-Init.(FG+BK), R-Init.(FGp+BKp)). At the same time, the other evaluation scores are also improved, e.g., the mIoU score increases from 0.689 to 0.710. The superiority becomes more obvious if we directly compare V-Init.+NRF with V-Init.FG, since the fusion of foreground and background is actually performed during the neighborhood reversible flow step in our approach. This means that by propagating along the neighborhood reversible flow, the spatial subsets of primary objects within each frame can be refined from a temporal perspective and the inter-frame temporal consistency can be enhanced. Thus, the primary video objects with favourable spatiotemporal consistency can pop out. As shown in Fig. 6, the primary object in most video frames is initialized as the horse, while objects that only last for a short while are mistakenly classified as foreground due to their spatial saliency in certain frames. Fortunately, the distractors are well suppressed by the optimization of neighborhood reversible flow (see the third row in Fig. 6). Thus, by propagating salient cues across frames, background objects can be effectively suppressed, preserving only the real primary one.
To further demonstrate the effectiveness of neighborhood reversible flow, we test our approach with two new settings based on CCNN. In the first setting, we replace the correspondence from Eq. (7) with the cosine similarity between superpixels. In this case, the F_β and mIoU scores of our approach on VOS drop to 0.795 and 0.696, respectively. Such performance is still better than that of the initialized foreground maps but worse than the performance when using the neighborhood reversible flow (F_β = 0.806, mIoU = 0.710). This result indicates the effectiveness of neighborhood reversibility in temporal propagation.
Fig. 7: Influence of parameters k_0 and λ_c on our approach.
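To make the contrast in the first setting concrete, the following Python sketch compares a dense cosine similarity with a sparse weight that is nonzero only for mutually close superpixels. Eq. (7) is not reproduced in this section, so `mutual_knn_weight` is a hypothetical stand-in: the mutual k-nearest-neighbour test, the Euclidean distance and the exponential decay are all assumptions made for illustration, not the definition used in our approach.

```python
import math


def cosine_similarity(a, b):
    """Dense correspondence: every pair of superpixels gets some weight."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_of(target, query, candidates):
    """1-based rank of candidates[target] by Euclidean distance to query."""
    dists = [math.dist(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: dists[i])
    return order.index(target) + 1


def mutual_knn_weight(i, j, feats_u, feats_v, k0):
    """Hypothetical reversible weight between superpixel i of frame u and j of frame v.

    The weight is nonzero only if each superpixel lies within the k0 nearest
    neighbours of the other; the decay exp(-2k / k0) is an assumed choice.
    """
    k = max(rank_of(j, feats_u[i], feats_v), rank_of(i, feats_v[j], feats_u))
    return math.exp(-2.0 * k / k0) if k <= k0 else 0.0
```

Under these assumptions, the mutual test discards one-sided matches that a dense similarity would still propagate, which is one plausible reading of why reversibility helps in the comparison above.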
In the second setting, we set λ_c = +∞ in Eq. (10), implying that primary objects in a frame are solely determined by the foreground and background propagated from other frames. Even though the spatial predictions of each frame are actually ignored in the optimization process, the F_β (mIoU) score of our approach on VOS only decreases from 0.806 (0.710) to 0.790 (0.693). This result shows that the inter-frame correspondences encoded in the neighborhood reversible flow are quite reliable for efficient and accurate propagation along the temporal dimension.
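The role of λ_c in this setting can be illustrated with a small Python sketch. Eq. (10) is not reproduced in this section, so the sketch assumes a generic per-superpixel quadratic objective, (x - s)^2 + λ_c (x - p)^2, where s is the spatial prediction and p is the value propagated from other frames; the function name `fuse` and this particular objective are assumptions, not the exact formulation of our approach.

```python
import math


def fuse(spatial, propagated, lambda_c):
    """Closed-form minimiser of (x - s)^2 + lambda_c * (x - p)^2 per superpixel.

    Setting the derivative to zero gives x = (s + lambda_c * p) / (1 + lambda_c).
    """
    if math.isinf(lambda_c):
        # Limit case: the spatial prediction is ignored entirely.
        return list(propagated)
    return [(s + lambda_c * p) / (1.0 + lambda_c)
            for s, p in zip(spatial, propagated)]
```

In this assumed form, λ_c = 0 keeps the per-frame prediction unchanged, λ_c = 0.5 weights the propagated value half as much as the spatial one, and λ_c = +∞ reproduces the second setting, in which only the propagated foregroundness and backgroundness remain.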
It is worth mentioning that the predictions in the initialization process are all pixel-wise, while the temporal optimization via neighborhood reversible flow is conducted on superpixel-wise foreground/background maps in order to reduce time consumption, i.e., the predictions need to be converted from pixels to superpixels and finally converted back to pixels. However, the superpixel-wise predictions are relatively coarse and may affect the subsequent process. To explore this effect, we convert the foreground/background maps and their fusion maps in the cases R-Init.FGp/BKp from pixel-wise to superpixel-wise, as shown in Table 4. Fortunately, both the F_β and mIoU scores of the foreground (background) maps decrease only slightly, by 0.003 (0.004), and the mT scores increase by 0.009, while the negative effect on the fusion maps mainly manifests in the mT scores, which can be improved by the propagation of neighborhood reversible flow. The trade-off is therefore worthwhile. Meanwhile, this also hints at the important effect of neighborhood reversible flow on temporal stability or consistency.
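The conversion between pixel-wise and superpixel-wise maps can be sketched as follows. The text does not specify how the pooling is done, so averaging the pixel values inside each superpixel is an assumption, and the function names and the nested-list map representation are chosen only for illustration.

```python
def pixels_to_superpixels(values, labels):
    """Average a pixel-wise map over each superpixel (assumed mean pooling).

    values, labels: 2-D lists of the same shape; labels hold superpixel ids.
    Returns a dict mapping each superpixel id to its mean value.
    """
    sums, counts = {}, {}
    for row_v, row_l in zip(values, labels):
        for v, l in zip(row_v, row_l):
            sums[l] = sums.get(l, 0.0) + v
            counts[l] = counts.get(l, 0) + 1
    return {l: sums[l] / counts[l] for l in sums}


def superpixels_to_pixels(sp_values, labels):
    """Broadcast each superpixel value back to all of its pixels."""
    return [[sp_values[l] for l in row] for row in labels]
```

A round trip through these two functions replaces every pixel by the mean of its superpixel, which illustrates why the superpixel-wise maps are coarser than the original predictions.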
Parameter setting. In the experiment based on CCNN, we smoothly vary two key parameters used in NRF, including the k_0 used in constructing the neighborhood flow and the λ_c that controls the strength of temporal propagation. As shown in Fig. 7, a larger k_0 tends to bring slightly better performance, while our approach performs the best when λ_c = 0.5. In experiments, we set k_0 = 15 and λ_c = 0.5 in constructing the neighborhood reversible flow.
Selection of color spaces. In constructing the flow, we represent each superpixel with three color spaces. As shown in Table 5, a single color space performs slightly worse than their combinations. In fact, using multiple color spaces has been proved to be useful in detecting salient objects [4], and multiple color spaces make it possible to assess temporal correspondences from several perspectives with only a small growth in time cost. Therefore, we choose to use the RGB, Lab and HSV color spaces in characterizing a superpixel.
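As an illustration of characterizing a superpixel in the three color spaces, the Python sketch below builds a 9-D descriptor from the mean color of a superpixel. The descriptor layout, the choice of converting the mean RGB color (rather than averaging per-pixel Lab or HSV values) and the function names are assumptions; the Lab conversion uses the standard sRGB to CIE Lab formulas with a D65 white point.

```python
import colorsys


def _srgb_to_linear(c):
    """Undo the sRGB gamma for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3.0 * delta ** 2) + 4.0 / 29.0


def rgb_to_lab(r, g, b):
    """sRGB in [0, 1] to CIE Lab (D65 white point)."""
    r, g, b = (_srgb_to_linear(c) for c in (r, g, b))
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883
    fx, fy, fz = _lab_f(x), _lab_f(y), _lab_f(z)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def superpixel_descriptor(pixels):
    """pixels: list of (r, g, b) tuples in [0, 1] belonging to one superpixel.

    Returns the mean color expressed as (R, G, B, L, a, b, H, S, V).
    """
    n = len(pixels)
    mean_rgb = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    return mean_rgb + rgb_to_lab(*mean_rgb) + colorsys.rgb_to_hsv(*mean_rgb)
```

Because Lab values span a much larger numeric range than RGB and HSV values in [0, 1], a practical distance between such descriptors would likely need per-channel normalization; that step is left out of this sketch.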

Running Time
We test the speed of the proposed approach with a 3.4GHz CPU (using only a single thread) and an NVIDIA TITAN Xp GPU (without batch processing). The average time cost of each key step of our approach in processing 400 × 224 frames is shown in Table 6. Note that the majority of the implementation runs on the Matlab platform with several key steps written in C (e.g., superpixel segmentation and feature conversion between pixels and superpixels). We find that our approach takes only 0.20s to process a frame without multi-test and no more than 0.75s with it, which is much faster than many video-based models (e.g., 19.0s for NLC, 6.1s for ACO, 5.8s for FST, 5.4s for SAG and 4.7s for GF). This may be because we only build correspondences on superpixels with the neighborhood reversibility, which is very efficient. Moreover, we avoid using complex optimization objectives and constraints. Instead, we use only simple quadratic optimization objectives so as to obtain analytic solutions. The high efficiency of our approach makes it possible to use it in some real-world applications.

Failure Cases
Beyond the successful cases, we also show some failures in Fig. 8. We find that failures can be caused by the way of defining primary objects. For example, the salient hand in Fig. 8 (a) is not labeled as a primary object because the corresponding videos from Youtube-Objects are tagged with "dog". Moreover, shadows (Fig. 8 (b)) and reflections (Fig. 8 (c)), generated by the target object and the environment, are other reasons that may cause unexpected failures because their saliency is similar to that of the target object. Our approach also fails easily when some regions of the target salient object are similar to the background (Fig. 8 (d)). In particular, successful segmentation is also very hard for some minuscule objects, e.g., the crab in water (Fig. 8 (e)). Such failures need further exploration in the future.
Fig. 8: Failure cases of our approach. Rows from top to bottom: video frames, ground-truth masks and our results.

CONCLUSION
In this paper, we propose a simple yet effective approach for primary video object segmentation. Based on the complementary relationship of foreground and background, the problem of primary object segmentation is formulated as the optimization of an objective function. According to the proposed objective function, a complementary convolutional neural network is designed and trained on massive images from salient object datasets to handle complementary tasks. With the trained models, the foreground and background in a video frame can be effectively predicted from the spatial perspective. After that, such spatial predictions are efficiently propagated via the inter-frame flow that has the characteristic of neighborhood reversibility. In this manner, primary objects in different frames can gradually pop out, while various types of distractors can be well suppressed. Extensive experiments on three video datasets have validated the effectiveness of the proposed approach.
In future work, we intend to improve the proposed approach by fusing multiple ways of defining primary video objects, such as motion patterns, semantic tags and human visual attention. Moreover, we will try to develop a completely end-to-end spatiotemporal model for primary video object segmentation by incorporating a recursive mechanism.

Fig. 4: Representative results of CSP. Red masks are the ground truth and green contours are the segmented objects.

TABLE 3: Detailed performance of our approaches. The first test group is our previous work in [1] and the second group is our current work. V-Init./R-Init.: corresponding results initialized by the previous/current network. FG (FGp)/BG (BGp): foreground/background estimation with (without) the constraint of complementary loss. NRF (NRFp): Neighborhood Reversible Flow (with multi-test). CE: cross-entropy. Comple.: complementary loss. mT: mean temporal stability metric, the smaller the better. Bold and underline indicate the 1st and 2nd performance in each column.

Fig. 5: Foreground and background maps initialized by CSNet as well as their interaction and union maps. (a) Video frames, (b) foreground maps and (c) background maps generated by CSNet without the complementary loss, (d) foreground maps and (e) background maps generated by CSNet with the complementary loss.

TABLE 2: Performances of our approach and 18 models. Bold and underline indicate the 1st and 2nd performance in each column. ImageN: Image-based & Non-deep. ImageD: Image-based & Deep.

TABLE 4: Performance of superpixel-wise initialization by CSNet on VOS. FGp: foreground branch, BKp: background branch. Sup. is short for superpixel.

TABLE 5: Performance of our CCNN-based approach on VOS when using different color spaces in constructing the neighborhood reversible flow.

TABLE 6: Speed of key steps in our approach. Mark + means using multi-test.