Saliency Detection with Moving Camera via Background Model Completion

Detecting saliency in videos is a fundamental step in many computer vision systems. Saliency refers to the significant target(s) in the video; the object of interest is further analyzed for high-level applications. Saliency can be segregated from the background when the two exhibit different visual cues, so saliency detection is often formulated as background subtraction. However, saliency detection is challenging. For instance, a dynamic background can result in false positive errors, while camouflage results in false negative errors. With moving cameras, the captured scenes are even more complicated to handle. We propose a new framework, called saliency detection via background model completion (SD-BMC), that comprises a background modeler and a deep learning background/foreground segmentation network. The background modeler generates an initial clean background image from a short image sequence. Based on the idea of video completion, a good background frame can be synthesized even when a changing background and moving objects co-exist. We adopt a background/foreground segmenter that, although pre-trained with a specific video dataset, can also detect saliency in unseen videos. The background modeler can adjust the background image dynamically when the segmenter's output deteriorates during the processing of a long video. To the best of our knowledge, our framework is the first to adopt video completion for background modeling and saliency detection in videos captured by moving cameras. The F-measure results, obtained from the pan-tilt-zoom (PTZ) videos, show that our proposed framework outperforms some deep learning-based background subtraction models by 11% or more. With more challenging videos, our framework also outperforms many high-ranking background subtraction methods by more than 3%.


Introduction
High-level applications such as human motion analysis [1] and intelligent transportation systems [2] demand the localization of targets in the video. For instance, in video surveillance, humans are detected for motion recognition; in intelligent transportation systems, vehicles are located. This foremost task can be achieved via saliency detection. Assuming that the background scene possesses invariant characteristics, a target is detected by its deviating visual cues. One approach is to formulate the task as background/foreground segmentation. With the estimated background scene model, the foreground (i.e. saliency) is segmented by a pixelwise background subtraction algorithm.
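As a minimal illustration of this formulation, a toy pixelwise background subtraction can be written as follows (the threshold and the color-distance test are illustrative; practical methods use richer per-pixel models):

```python
import numpy as np

def subtract_background(frame, background, tau=30.0):
    """Toy pixelwise background subtraction: a pixel is foreground when its
    color deviates from the background model by more than a threshold tau."""
    diff = np.linalg.norm(frame.astype(np.float32) -
                          background.astype(np.float32), axis=-1)
    return diff > tau  # boolean foreground (saliency) mask
```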
However, the two assumptions, an invariant background and a deviating foreground, may be violated in some circumstances. For instance, background motion and illumination change can result in false detection (false positives): with the pre-generated background model, background pixels may be predicted as foreground pixels when their image features differ from the background model. On the other hand, camouflage and intermittent object motion lead to missed detection: this false negative error arises because foreground pixels may be erroneously identified as background pixels if their image features are similar to the background model.
The saliency detection task becomes more challenging with the use of a moving camera, as the assumption of a static background is violated. Videos can be captured by a pan-tilt-zoom (PTZ) camera or a free-moving (e.g. hand-held) camera. Systems that can handle such videos are of interest; they demand sophisticated techniques for generating and maintaining the background model, as well as for foreground segmentation.
Researchers have proposed various background subtraction algorithms. Many are deterministic, i.e. background/foreground segmentation is achieved based on hand-crafted features. One of the earliest approaches is to adopt statistical models [3,4]. Elgammal et al. [5] utilized a kernel estimator to characterize the probability density function (pdf) of the background pixels. Some researchers have presented surveys on background subtraction techniques [6,7]. Sobral and Vacavant [8] evaluated 29 background subtraction methods. In general, background subtraction comprises three main parts: the background modeler, the background/foreground classifier, and background updating.
Another approach is to use neural networks for saliency detection. Their cognitive power comes from structures simulating the complex connectivity of neurons. Maddalena and Petrosino [9] proposed Self Organizing Background Subtraction (SOBS), in which the background scene is modeled with the weights of the neurons; the network compares the current image frame with the background model and outputs a pixelwise background/foreground classification. Recently, a popular approach is to develop deep learning models, such as convolutional neural networks (CNN). The layered structure can accommodate a multi-scale representation, with which image data are transformed and abstract features are extracted. Wang et al. [10] proposed a basic CNN model, from which multi-resolution CNN and cascaded CNN architectures were designed for object segmentation. Lim and Keles [11] proposed an encoder-decoder network for object segmentation; the encoder is a triple CNN for multi-scale feature extraction, and the concatenated feature map is fed to a transposed convolutional network in the decoder. They further proposed another model [12] that adds a feature pooling module on top of the encoder.
In this paper, we propose a new framework, called saliency detection via background model completion (SD-BMC), that comprises a background modeler and a deep learning background/foreground segmentation network. Our framework can detect saliency in videos captured by a moving camera. The results, obtained from benchmark datasets, show that our proposed framework outperforms many high-ranking background subtraction models. Figure 1 shows an overview of SD-BMC, which performs two main tasks: generation of the initial background model, and continuous saliency detection with updating of the background model. Our contributions can be summarized as follows:
• Inspired by the filling of missing pixels via inpainting, we adopt a video completion module for modeling the background scene. To generate a clean background frame, foreground objects are substituted by the estimated background colors. Guided by optical flow, the video completion module can generate a good background model for video captured by a moving camera, which is not possible for other existing methods.
• We adopt BSUV-Net 2.0 [13] for background/foreground segmentation. Although the model is pre-trained with the CDNet [14] video dataset, it can also segment foreground in unseen videos. However, most of the videos in CDNet are captured by static cameras, and BSUV-Net 2.0 still produces some false positive and false negative errors on moving-camera videos. Therefore, we replace the background frame generation method of BSUV-Net 2.0 with our video completion-based background modeler.
• We propose a framework that comprises the video completion-based background modeler and the enhanced BSUV-Net 2.0 foreground segmentation network. To thoroughly evaluate the new framework, we create our own video dataset with videos captured by PTZ cameras and free-moving cameras. The results show that our framework outperforms many high-ranking background subtraction models.
The paper is organized as follows. Related research on background subtraction, in particular with videos captured by moving cameras, is reviewed in the following section. Section 3 elaborates our saliency detection framework. We compare our framework with other high-ranking background subtraction algorithms; quantitative and visual results are presented in section 4.
Discussion is also made on the performance of all these methods. Finally, we draw the conclusion in section 5.

Related work
Many methods have been proposed for segmenting foreground in videos captured by stationary cameras. In this section, we review sophisticated methods that were proposed to handle videos captured by moving cameras. Moving cameras can be categorized into two types: constrained moving cameras and freely moving cameras. For instance, the PTZ camera belongs to the first category.
In the second category, examples are hand-held cameras, smartphones, and cameras mounted on drones. Methods developed for constrained cameras may not perform well with freely moving cameras.
Hishinuma et al. [15] considered the camera's small pan/tilt motion as translational. The translation amount is computed from the correlation of the FFT phase terms of stationary background blocks.
The synthesized still background model is then used for foreground segmentation. In [16], camera motion is compensated by calculating the homography transformation between two image frames. A scene model, which is a panoramic background, is then generated from the motion-compensated video. Foreground objects are detected by comparing the panoramic background with individual image frames of the video. Szolgay et al. [17] proposed a method for detecting moving objects in video taken by a wearable camera. Global camera motion is estimated first by a hierarchical block matching algorithm and then refined by a robust motion estimator. Foreground is identified as the difference between motion-compensated image frames. Tao and Ling [18] proposed a neural network for segmenting foreground in videos captured by PTZ cameras. Deep learning features are extracted by a pre-trained network. A homography matrix is estimated from the previous image frames and the current image frame with a semantic attention-based deep homography estimator. The warped previous frames, the current frame, and their features are fed into a fusion network for foreground mask prediction. Komagal and Yogameena [19] reviewed the methods and datasets for foreground segmentation research with PTZ cameras.
With a freely-moving camera, both the background and the foreground are changing, and the assumptions of background modeling may be violated. For instance, when background and foreground motions are similar, the background model is contaminated with foreground colors; in another scenario, inaccurate camera motion estimation gives rise to false positive errors. Yun et al. [20] proposed an adaptive scheme that can update the background model in accordance with the changes of the background. The scheme compensates for three types of change: background motion produced by the moving camera, foreground motion, and illumination change. Knowing that an explicit camera motion model is not reliable, Sajid et al. [21] proposed an online framework in which both background and foreground models are continuously updated. Background motion is estimated with a low-rank approximation. Motion and appearance models are combined to produce the background/foreground classification. Zhu and Elgammal [22] proposed a multi-layered framework for background subtraction. In each layer, both motion and appearance models are estimated and used for foreground detection. The probability map is inferred by a kernel density estimator [5]. Finally, the segmented foreground is generated from the multi-layered outputs by multi-label graph-cut. Chapel and Bouwmans [23] reviewed moving object detection methods with moving cameras. They grouped the methods into two categories in accordance with the scene representation: single-plane and multi-plane. Methods in the first group may generate a panoramic background by image mosaicking; some detect moving objects via motion segmentation. The multi-plane approach estimates several planes (which may or may not be real) as the scene representation; matched feature points are located and eventually used for background/foreground classification.
We adopt the background-centric approach for saliency detection. Instead of modeling the background based on camera motion compensation, which may be inaccurate, we generate and update the background dynamically via video completion and continuous monitoring of the foreground segmentation result. Our background modeler, based on optical flow information, can generate a much better background frame for video captured by a moving camera than other methods. As demonstrated in our results, our framework, with the cascade of the background modeler and the deep learning foreground segmenter, outperforms many high-ranking background subtraction models in saliency detection.
Various video datasets have been created for background subtraction research. The CDNet 2014 dataset [14] contains videos grouped under 11 categories. Each video record provides the original image sequence and the corresponding ground truths. Many of the videos were captured in challenging scenes; for instance, the "PTZ" category contains four videos captured by PTZ cameras. The Hopkins 155 dataset [24] contains indoor and outdoor panning videos. Perazzi et al. [25] proposed three versions of the Densely Annotated Video Segmentation (DAVIS) dataset; some videos were captured by a shaking camera. The SegTrack v2 dataset [26] contains videos captured by moving cameras with ground truths of the moving objects. Labeled and Annotated Sequences for Integral Evaluation of SegmenTation Algorithms (LASIESTA) [27] contains real indoor and outdoor videos with pan, tilt, or shaking cameras. For our experimentation, we create our own dataset, which comprises videos captured by PTZ cameras and moving cameras, extracted from various publicly available video datasets.

Saliency detection framework
The saliency detection framework SD-BMC is shown in Figure 1. First, we initialize the system with the first 100 frames. In this step, a background image generated with a median filter is used by the foreground segmenter (BSUV-Net 2.0 [13]) to create the masks. These masks, together with the initial image sequence, are input to the video completion-based background modeler (FGVC [28]). From the sequence of completed frames, the most recent one is selected as the background. In the saliency detection stream, the initial background frame and the current image sequence are input to the foreground segmenter, and the background model is updated based on the foreground segmentation result. This stream, with feedback, continues until all video frames are processed.
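The following is a minimal sketch of this processing loop under stated assumptions: `segment` and `complete` are hypothetical wrappers (not the authors' code) around the foreground segmenter and the video completion modeler, and the feedback is simplified to re-completing every 100-frame section, consistent with the sectioning described in the evaluation below.

```python
import numpy as np

def sd_bmc(frames, segment, complete, section_len=100):
    """Sketch of the SD-BMC loop. `segment(frames, bg)` is assumed to return
    per-frame binary foreground masks (e.g. wrapping BSUV-Net 2.0), and
    `complete(frames, masks)` is assumed to return the completed,
    foreground-free sequence (e.g. wrapping FGVC)."""
    # Bootstrap: a temporal-median background is enough to obtain the rough
    # masks that the video completion step needs for initialization.
    init = frames[:section_len]
    median_bg = np.median(init, axis=0).astype(frames.dtype)
    masks = segment(init, median_bg)
    # The most recent completed frame becomes the initial clean background.
    background = complete(init, masks)[-1]
    all_masks = []
    for start in range(0, len(frames), section_len):
        section = frames[start:start + section_len]  # last one may be shorter
        masks = segment(section, background)
        all_masks.extend(masks)
        # Feedback: re-complete this section so the background model keeps
        # tracking the moving-camera scene for the next section.
        background = complete(section, masks)[-1]
    return all_masks
```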

Background modeler
Many background modeling algorithms can estimate a clean background frame even if the image sequence contains moving objects. However, if the foreground objects stay too long, phenomena like ghosts appear in the background image. The problem becomes more complicated with video captured by a moving camera. Deep learning-based methods have been proposed for background modeling; for instance, Farnoosh et al. [29] proposed a variational autoencoder (VAE) framework for background estimation from videos recorded by a fixed camera. In our experimentation on moving-camera videos, blurry pixels always exist in its final background images.
We adopt and modify the video completion method FGVC [28] for background modeling. The algorithm can generate a clean background image, paying attention to the masks corresponding to the foreground objects and also to the changing scene between adjacent image frames. Figure 2 shows our video completion-based background modeler. In part (a), the color video sequence and the corresponding binary masks are input to the background modeler; the masks mark the foreground regions that need to be completed. Next, in part (b), optical flow between adjacent frames is computed with FlowNet2 [30]. Moreover, flow between some non-adjacent frames is also computed, which helps to estimate the missing background colors when the camera motion is large. The background flow is predicted from the color video sequence, while the foreground flow is predicted from the masks. In each flow field, flow edges are extracted; guided by the flow edge map, a completed optical flow field is generated. In part (c), a set of candidate pixels is computed for each missing pixel. Most of the missing pixels can be filled by inpainting via fusion of the candidate pixels. After that, the network uses Poisson reconstruction to generate the initial completed background frame. Finally, in part (d), the modeler fixes the remaining missing pixels with a number of inpainting iterations until no missing pixel remains.
Experimentation was performed to determine the length of the video sequence for background generation. If it is too short, there may not be enough candidate pixels for background color synthesis; if it is too long, background generation takes long and the actual saliency detection is delayed. First, we chose 30 frames for background generation. Figure 3 shows the background modeling on one video; ghosts can be seen in the background frame. Then, we lengthened the initialization sequence to 100 frames, with which most of the foreground pixels can be substituted with the background colors. Table 1 compares the F-measure of saliency detection on the PTZ category of the CDNet 2014 dataset. Based on these results, we fix the length to 100 frames.

Table 1. F-measure of saliency detection with two settings for background modeling.

Length of initialization sequence    F-measure
30 frames                            0.8062
100 frames                           0.8147
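To convey the core idea of part (c), here is a greatly simplified, single-pass stand-in for the flow-guided candidate filling, written with plain numpy and precomputed flows. The real FGVC pipeline additionally completes the flow fields, fuses multiple candidates, applies Poisson reconstruction, and iterates inpainting on the leftover pixels (part (d)).

```python
import numpy as np

def propagate_background(frames, masks, flows_fwd, flows_bwd):
    """Each masked foreground pixel borrows the color that optical flow maps
    to the same scene point in a neighboring frame, provided that pixel is
    valid background there. flows_fwd[t] maps frame t to t+1 and
    flows_bwd[t] maps frame t to t-1; both have shape (T, H, W, 2), with the
    unused edge entries ignored."""
    T, H, W, _ = frames.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = frames.astype(np.float32)
    valid = ~masks                       # True where colors are background
    for t in range(T):
        for s, flow in ((t + 1, flows_fwd), (t - 1, flows_bwd)):
            if not 0 <= s < T:
                continue
            # Round flow targets to the nearest pixel in source frame s.
            xs2 = np.clip(np.round(xs + flow[t, ..., 0]).astype(int), 0, W - 1)
            ys2 = np.clip(np.round(ys + flow[t, ..., 1]).astype(int), 0, H - 1)
            need = ~valid[t] & valid[s][ys2, xs2]   # fillable from frame s
            out[t][need] = out[s][ys2[need], xs2[need]]
            valid[t] |= need
    return out, valid
```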

Foreground segmentation
We adopt BSUV-Net 2.0 [13] as the foreground segmenter. As shown in Figure 4, it has a U-Net-like [31] structure. Based on BSUV-Net [32], BSUV-Net 2.0 further improves background subtraction performance on complicated videos with more spatio-temporal data augmentations.
The encoder-decoder structure contains five convolutional blocks in the downsampling path, four convolutional blocks in the upsampling path, and links between them via concatenation. The details of the configuration are shown in Table 2.

Table 2. Layer configuration of the foreground segmenter (SD: spatial dropout layer; BN: batch normalization).
An empty background frame, a recent background frame, the current frame, and the corresponding foreground probability maps (FPM) are needed for background/foreground separation; the input has a total of 12 channels. To avoid overfitting and increase the generalization of the network, a batch normalization layer follows each convolution layer in the encoder and each transposed convolution layer in the decoder. Spatial dropout layers are also used before max-pooling for further regularization. Finally, the network uses the sigmoid function to obtain per-pixel prediction values for the output binary saliency map.
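To make the structure concrete, the following is a minimal PyTorch sketch of a U-Net-like segmenter in the same spirit: a 12-channel input, convolutional blocks with batch normalization, spatial dropout before max-pooling, transposed convolutions with skip concatenations, and a sigmoid output. The channel widths and block contents are illustrative assumptions; the actual BSUV-Net 2.0 configuration is the one given in Table 2.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class MiniSegmenter(nn.Module):
    def __init__(self, in_channels=12):
        super().__init__()
        chs = [32, 64, 128, 256, 512]   # illustrative widths, not the paper's
        self.drop = nn.Dropout2d(0.2)   # spatial dropout before pooling
        self.pool = nn.MaxPool2d(2)
        self.enc = nn.ModuleList()
        prev = in_channels
        for c in chs:                   # five encoder blocks
            self.enc.append(conv_block(prev, c))
            prev = c
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for c in reversed(chs[:-1]):    # four decoder blocks
            self.up.append(nn.ConvTranspose2d(prev, c, 2, stride=2))
            self.dec.append(conv_block(2 * c, c))  # 2*c after concatenation
            prev = c
        self.head = nn.Conv2d(prev, 1, 1)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:   # keep skip, then downsample
                skips.append(x)
                x = self.pool(self.drop(x))
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return torch.sigmoid(self.head(x))  # per-pixel foreground probability
```

For the 320 × 240 inputs used later in the evaluation, `MiniSegmenter()(torch.randn(1, 12, 240, 320))` yields a 1 × 1 × 240 × 320 probability map.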
Tezcan et al. [13] simulated some changes, e.g. changes that look like videos captured by a PTZ camera, for data augmentation in training the model. However, as shown in our experimental results, BSUV-Net 2.0 is still not good enough at saliency detection with videos captured by PTZ and freely-moving cameras. This is because its background modeling method cannot generate a fairly good background frame for complicated videos. Therefore, in our saliency detection stream, we disable the default background modeling method and instead use the video completion-based background modeler, which generates a better background frame.

Datasets
We test our saliency detection framework on the CDNet 2014 dataset [14] and our customized dataset. As shown in Table 3, CDNet 2014 comprises 11 categories, each of which contains 4 to 6 videos.
Each video record provides the original image sequence and the corresponding ground truths. Some videos, e.g. those in the PTZ category, were captured in challenging scenes.
Our customized dataset comprises 22 videos from the FBMS dataset [33] and 8 videos from the LASIESTA dataset [27]. The videos were captured by hand-held cameras and PTZ cameras. For the videos selected from the FBMS dataset, we manually defined the ground truths for 20 continuous frames randomly chosen after the 100th frame in each video. For the videos from the LASIESTA dataset, each record provides a number of ground truth images. Table 4 shows the details of our customized dataset.
Table 4. Customized dataset categories and video scenes.

Evaluation metrics
To evaluate the performance of our framework and other baseline methods, we compute the standard change detection metrics used in the CDNet 2014 evaluation: Recall, false negative rate (FNR), percentage of wrong classifications (PWC), and F-measure.
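For reference, the sketch below computes these metrics from binary prediction and ground truth masks, following the standard CDNet 2014 definitions (TP, FP, FN, TN denote true/false positives and negatives):

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """Compute Recall, FNR, PWC, and F-measure from boolean masks."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    recall = tp / max(tp + fn, 1)
    precision = tp / max(tp + fp, 1)
    return {
        "Recall": recall,
        "FNR": fn / max(tp + fn, 1),                           # false negative rate
        "PWC": 100.0 * (fp + fn) / max(tp + fp + fn + tn, 1),  # % wrong classifications
        "F-measure": 2 * precision * recall / max(precision + recall, 1e-12),
    }
```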

Performance evaluation
We implement SD-BMC in Python with PyTorch. The computing platform comprises an Intel Xeon Silver 4108 CPU (1.8 GHz, 16 cores) and an HPC cluster with NVIDIA RTX 2080Ti 11GB × 8 GPU nodes. The background frame, either in the initialization or in the updating process, is generated from a sequence of 100 image frames. Therefore, each video is partitioned into sections of 100 frames; if the last section contains fewer than 100 frames, we input all the remaining frames into our framework. We resize the original image sequence and the ground truth images to a resolution of 320 × 240.
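A small sketch of this preprocessing, assuming OpenCV is used for reading and resizing (an implementation detail not stated in the paper):

```python
import cv2

def load_sections(paths, section_len=100, size=(320, 240)):
    # Read frames, resize to the evaluation resolution, and split the video
    # into consecutive sections; the final section may hold fewer frames.
    frames = [cv2.resize(cv2.imread(p), size) for p in paths]
    return [frames[i:i + section_len] for i in range(0, len(frames), section_len)]
```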
We compare SD-BMC with six background subtraction algorithms: BSUV-Net [32], BSUV-Net 2.0 [13], Fast BSUV-Net 2.0 [13], PAWCS [34], SuBSENSE [35], and ViBe [36]. Tezcan et al. [32] first proposed BSUV-Net: background frames are estimated from the video, and the current frame and the background frames are input to a fully-convolutional neural network for background subtraction. They proposed the second version of the model [13], trained with data simulating spatio-temporal changes, and also developed Fast BSUV-Net 2.0 [13], a real-time version of the model. St-Charles et al. proposed SuBSENSE [35] and PAWCS [34] for change detection; the background model is a codebook generated based on the persistence of pixel features, and both are among the high-ranking methods on CDNet 2014. Barnich et al. [36] adopted the bag-of-words approach and proposed ViBe, an efficient background subtraction method: at each pixel location, some samples are randomly selected from the image sequence and stored as background colors, and the background model is also updated by a random process.

Quantitative and visual results
Table 5 shows the numerical results of SD-BMC on the CDNet 2014 dataset. Table 6 shows the average results of BSUV-Net, BSUV-Net 2.0, Fast BSUV-Net 2.0, and SD-BMC; the bold numbers represent the best results. Table 7 compares the results of BSUV-Net, BSUV-Net 2.0, Fast BSUV-Net 2.0, and SD-BMC on the PTZ category of CDNet 2014. Figure 5 shows some visual results of BSUV-Net 2.0 and SD-BMC. The results in Table 7 clearly indicate that SD-BMC outperforms the other three models in all evaluation metrics on PTZ videos. When a single numeric result, the F-measure, is chosen for ranking, SD-BMC outperforms all other methods by more than 11%.
Table 8. Average evaluation metrics of BSUV-Net, BSUV-Net 2.0, PAWCS, SuBSENSE, ViBe, and SD-BMC on the customized dataset.

The videos in the customized dataset are more challenging. We classify the videos, according to their content, into 3 groups: animals, people, and things. SD-BMC achieves the best average Recall, FNR, PWC, and F-measure. We select a single numeric result, the F-measure, for assessing the performance of all methods on individual videos. As shown in Table 9, SD-BMC achieves the best F-measure on many videos. The average F-measures in the "animals" and "things" groups are higher than those of all other methods, while in the "people" group the average F-measure is slightly lower than that of BSUV-Net 2.0. As shown in Figure 6, SD-BMC can detect saliency very close to the ground truth; the second best method, BSUV-Net 2.0, produces more false positive and false negative errors. Overall, SD-BMC outperforms BSUV-Net 2.0 by more than 3%.

Comparative analysis
According to the results on the CDNet 2014 dataset, SD-BMC outperforms the other methods in the PTZ category, while achieving comparable results in the other video categories. The reason is that our video completion-based background modeler, together with the feedback scheme, can generate clean and up-to-date background images. FGVC is a non-scene-specific method that can generalize to unseen videos; as it captures temporal and spatial information, the modeler can generate much better background images. On the contrary, the empty background images used in BSUV-Net 2.0 are very blurry, which significantly affects the final saliency detection result. PAWCS and SuBSENSE, which are designed for fixed cameras, produce even worse background frames. Figure 7 shows the comparison of the background images.

Conclusion
We propose a new framework, SD-BMC, for the detection of salient regions in each video frame. The framework contains two major modules: the video completion-based background modeler and the deep learning-based foreground segmenter network. To enable our framework for long-term saliency detection, the background modeler can adjust the background image dynamically via a feedback mechanism. SD-BMC can best segment foreground in videos captured by moving cameras. To demonstrate this capability, we create our customized dataset with challenging videos captured by PTZ cameras and hand-held cameras. The results, obtained from the PTZ videos, show that our proposed framework outperforms some deep learning-based background subtraction models by 11% or more. With more challenging videos, our framework also outperforms many high-ranking background subtraction methods by more than 3%.
Although the results show that SD-BMC is superior to other deterministic as well as deep learning-based background subtraction methods, there are still ways to further improve it. In this work, we focused on designing the saliency detection framework for moving-camera videos. In the future, we will work on new models for other challenging scenarios. The background modeling process can be made faster in order to tackle abrupt changes. Also, the foreground segmenter can adopt a teacher-student structure: while the complex teacher model is used in the training process, testing will be performed by a simpler student model. The lite model can be used for real-time saliency detection.