Spatio-Temporal Attention Model for Foreground Detection in Cross-Scene Surveillance Videos

Foreground detection is an important theme in video surveillance. Conventional background modeling approaches build sophisticated temporal statistical models to detect foreground based on low-level features, while modern semantic/instance segmentation approaches generate high-level foreground annotations but ignore the temporal relevance among consecutive frames. In this paper, we propose a Spatio-Temporal Attention Model (STAM) for cross-scene foreground detection. To fill the semantic gap between low-level and high-level features, appearance and optical flow features are synthesized by attention modules during the feature learning procedure. Experimental results on the CDnet 2014 benchmark validate the model, which outperformed many state-of-the-art methods in seven evaluation metrics. With the attention modules and optical flow, its F-measure increased by 9% and 6%, respectively. The model, without any tuning, showed its cross-scene generalization on the Wallflower and PETS datasets. The processing speed was 10.8 fps at a frame size of 256 × 256.


Introduction
Detecting foreground plays an important role in an intelligent surveillance system. It is often integrated with various tasks, such as tracking objects, recognizing their behaviors, and alerting when abnormal events occur. However, foreground detection suffers from non-stationary scenes in surveillance videos, especially in two potentially serious cases: (1) illumination variation, such as outdoor sunlight changes and indoor lights turning on/off; and (2) physical motion, such as ripples on a water surface, atmospheric disturbance, swaying trees, and the motion of indoor artificial objects, including fans, escalators, and auto-doors. If the actual background contains a combination of the factors mentioned above, foreground detection becomes even more difficult.
In order to eliminate illumination changes and dynamic backgrounds, early studies focused on statistical distributions to build the background model [1][2][3][4]. To cover the variation caused by illumination change, the background model must occupy a large range of intensity, which makes the detection insensitive. Local features can represent spatial characteristics [5][6][7][8][9][10] but cannot adapt to many non-ideal cases, such as texture-less backgrounds. In addition, conventional algorithms handle gradual illumination changes by updating the statistical background models progressively over time. In practice, this kind of model update is usually relatively slow to avoid mistakenly integrating foreground elements into the background model, making it difficult to adapt to sudden illumination changes and burst motion. Modern deep-learning-based semantic or instance segmentation approaches can provide high-level semantic annotation for each frame, but ignore the temporal relevance. On the other hand, the obstacle to introducing a more sophisticated learning technique is that foreground detection is a scene-dependent and pixel-wise processing procedure [11]. One line of work represents image pixels by a local spatial distribution (proximal pixels) and color information to build both background and foreground KDE models competitively in a decision framework. Heikkilä and Pietikäinen [5] used a local binary pattern (LBP) to subtract the background and detect moving objects in real time. [21] modeled appearance changes by incrementally learning a tensor subspace representation, adaptively updating the sample mean and an eigenbasis for each unfolding matrix. In our previous research, we focused on co-occurrence pixel-pair background models [22][23][24][25]. These models employed an alignment of supporting pixels for the target pixel which held a stable intensity subtraction across training frames without any restriction on locations.
The intensity subtraction of the pixel pairs allowed the background model to tolerate noise and be illumination-invariant.

CNN-Based Foreground Detection
A surveillance video can be split into frames and then segmented into foreground and background frame by frame. Instance segmentation approaches based on deep convolutional networks have great potential in this task. These approaches can be roughly divided into two families. One relies on R-CNN proposals, a bottom-up pipeline in which segmentation results are generated from the proposals and then labeled by a classifier [26,27]. The other family relies on semantic segmentation results [28,29], where instance segmentation follows semantic segmentation by classifying pixels into different instances. A state-of-the-art method, Mask R-CNN [30], built upon object detectors [31], also depends on proposals, but its features are shared by the class predictors, box predictors, and mask generators, and all results are collected in parallel.
The first approach for background subtraction using a CNN was proposed by Braham and Van Droogenbroeck [32]. A background image was generated by a temporal median operation over N video frames. Afterwards, a scene-specific CNN was trained with corresponding image patches from the background image, video frames, and ground truth pixels. A patch extracted around a pixel is fed through the network, and the resulting score is compared with a threshold to assign the pixel either a background or a foreground label. However, the network is scene-specific, i.e., it can only process a certain scene and needs to be retrained for other video scenes. Another approach is DeepBS [33], which utilizes a trained CNN and a spatial median filter to realize foreground detection across video scenes. This approach runs fast, but the foreground is detected from each frame independently, so the temporal relevance of neighboring frames is ignored. In Cascade CNN [34], CNN branches processing images at different sizes are cascaded together, which helps the cascade detect foreground objects at multiple scales; temporal information has not been taken into consideration in this model either. A recent study [35] proposed a probabilistic model of the features discovered by stacked denoising autoencoders. The model divides each video frame into patches that are fed to a stacked denoising autoencoder, which is responsible for extracting significant features from each image patch. Then, a probabilistic model decides whether the given feature vector describes a patch belonging to the background or the foreground.

Attention Model
Evidence from the human perception process [36] illustrates the importance of the attention mechanism, which uses top-down information to guide the bottom-up feed-forward process. The attention of the human brain is, at any particular moment, always focused on one part of the scene while ignoring the other parts; it can be regarded as a resource allocation model. Tentative efforts have been made towards applying attention to deep neural networks. The Deep Boltzmann Machine (DBM) [12] contains top-down attention through its reconstruction process in the training stage. The attention mechanism has also been widely applied to recurrent neural networks (RNN) and long short-term memory (LSTM) to tackle sequential decision tasks [12,13]: top-down information is gathered sequentially and decides where to attend in the next feature learning steps. In image classification, the top-down attention mechanism has been applied using different methods: sequential processing, region proposals, and control gates. Sequential processing [36,37] models image classification as a sequential decision. This formulation allows end-to-end optimization using RNN and LSTM and can capture different kinds of attention in a goal-driven way. Li [38] proposed a pyramid attention model for semantic segmentation that contains a feature pyramid and global attention. The former merges features at various scales, while the latter guides the low-level features in fusing with high-level ones.

Attention-Guided Weight-Able Connection Encoder-Decoder
High-level features have a larger receptive field, contain global context, and are good at scene classification, but they are weak at predicting labels for every pixel at input resolution [38]. Low-level features, in contrast, carry much fine-grained information which can help high-level features reconstruct objects' details during the up-sampling process. U-net is an efficient structure to combine these features [39,40]: it propagates information from the down-sampling layers to all corresponding symmetric up-sampling layers. However, U-net concatenates the encoder and decoder features without any selection, so it cannot determine whether the chosen features are necessary for foreground segmentation or not. The design of the proposed attention structure is inspired by the recent development of a semantic segmentation model [38], which employs high-level features to re-weight the fine-grained features channel-wise. The proposed model merges the decoder and encoder features through a series of attention processes during the decoder phase. In detail, high-level features provide global information that guides attention modules to select (weight) the low-level features that contribute to the binary prediction of an input image: the encoder features are re-weighted by the decoder layers at pixel level and concatenated with the latter.

Model Structure
As illustrated in Figure 1, the model combines spatial and temporal information, and attention modules are employed to mix encoder features together with decoder ones. The blocks in green represent the encoder layers; "IConv" and "OConv" are two encoders fed with the static image and optical flow, respectively. The blocks in pink and orange represent the decoder layers and attention modules. The plus sign in green means pixel-level addition, while the plus sign in red represents the concatenation operation. For example, given two feature maps of dimension m × m × n, addition outputs an m × m × n tensor, while concatenation outputs an m × m × (2 × n) tensor. Table 1 shows the details of each layer in STAM. The model is fed with a 256 × 256 × 3 static image and a 256 × 256 × 1 optical flow, and outputs a 256 × 256 × 1 foreground mask. "IConv" and "OConv" are two encoders with the same structure and eight convolution layers each. The decoder also has eight layers, with up-sampling performed in each layer, and seven attention modules are applied to mix the features. The stride of every convolution is two in both the encoder and decoder, but one in the attention modules. Dropout is utilized to avoid over-fitting in the first three layers of the decoder: nodes in these layers are dropped with a 50% probability in the training phase. Table 1. Filter size and output size of each layer in encoders, decoders, and attention modules.
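The dimension bookkeeping for the two plus signs can be checked with a small NumPy sketch (the shapes here are illustrative, not taken from Table 1):

```python
import numpy as np

# Two feature maps of shape (m, m, n): pixel-level addition (green plus)
# keeps the shape, while channel concatenation (red plus) doubles the
# channel dimension.
m, n = 8, 16
a = np.random.rand(m, m, n)
b = np.random.rand(m, m, n)

added = a + b                                   # shape (m, m, n)
concatenated = np.concatenate([a, b], axis=-1)  # shape (m, m, 2*n)

print(added.shape)         # (8, 8, 16)
print(concatenated.shape)  # (8, 8, 32)
```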

The Proposed Attention Module
The design of the proposed attention structure is inspired by a semantic segmentation model [38], which employs high-level features to re-weight the fine-grained features channel-wise. Different from [38], the proposed model merges the decoder and encoder features through a series of attention processes during the decoder phase. In detail, high-level features provide global information that guides attention modules to weight the low-level features that contribute to the binary prediction of the input image: the encoder features are re-weighted by the decoder layers at pixel level and concatenated with the latter. As shown in Figure 2, the proposed attention modules merge the high-level and low-level features guided by the former. Y1 and Y2 are the features from the image encoder and optical flow encoder, respectively, and X is the decoder feature. H, W, and C are the height, width, and channel numbers of a feature map. The module applies a single convolution operation conv() to X, followed by a sigmoid activation function σ that constrains the weights to the range 0 to 1, where b is the bias of the convolution operator. It then uses those weights f_weights to re-weight the sum of the encoder features. Finally, the decoder feature X and the re-weighted features are concatenated into f_output as the input of the next convolutional layer:

f_weights = σ(conv(X) + b)
f_output = concat(X, f_weights ⊗ (Y1 ⊕ Y2)) (1)

where ⊗ and ⊕ denote pixel-wise multiplication and summation, and concat(,) is a concatenation of two features.
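The module described above can be sketched as follows; modeling conv() as a 1 × 1 convolution (a per-pixel matrix product), the filter shapes, and the random inputs are illustrative assumptions, not the exact configuration of Table 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_module(X, Y1, Y2, W, b):
    """Sketch of the attention module: X is the decoder feature,
    Y1 and Y2 the image- and flow-encoder features, all (H, W, C)."""
    # f_weights = sigma(conv(X) + b); conv() is modeled here as a 1x1
    # convolution, i.e. a per-pixel matrix product (an assumption).
    f_weights = sigmoid(X @ W + b)
    # Re-weight the pixel-wise sum of the two encoder features.
    f_reweighted = f_weights * (Y1 + Y2)
    # f_output = concat(X, re-weighted features) along the channel axis.
    return np.concatenate([X, f_reweighted], axis=-1)

H_, W_, C = 4, 4, 8
X = np.random.rand(H_, W_, C)
Y1 = np.random.rand(H_, W_, C)
Y2 = np.random.rand(H_, W_, C)
out = attention_module(X, Y1, Y2, np.random.rand(C, C), np.zeros(C))
print(out.shape)  # (4, 4, 16)
```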

Loss Function
STAM is fed with a static image x_img and its optical flow image x_of, and then a foreground mask G(x_img, x_of) is generated. The Manhattan (L1) distance is measured between the generated mask and the ground truth mask y, so the loss function of STAM is

L_STAM = ||y − G(x_img, x_of)||_1

STAM is trained by minimizing L_STAM. It detects the foreground in each video frame from the spatio-temporal input without any post-processing such as median filtering.
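A minimal sketch of this loss, assuming the Manhattan distance is averaged over pixels (the paper does not state the normalization):

```python
import numpy as np

def l_stam(pred_mask, gt_mask):
    """L1 (Manhattan) loss between the generated mask G(x_img, x_of)
    and the ground-truth mask y; the per-pixel averaging is an
    assumption for numerical convenience."""
    return np.mean(np.abs(pred_mask - gt_mask))

# A completely wrong binary mask gives loss 1.0, a perfect one gives 0.0.
print(l_stam(np.ones((2, 2)), np.zeros((2, 2))))  # 1.0
print(l_stam(np.ones((2, 2)), np.ones((2, 2))))   # 0.0
```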

Motion Cue
Sequences of ordered frames allow the estimation of motion as either instantaneous image velocities or discrete image displacements. Most pixel-level motion estimation methods are based on optical flow. The optical flow field represents the motion vector of each pixel between two images taken at times t and t + ∆t; it can also be understood as the projection of the three-dimensional motion field onto the two-dimensional imaging plane. A large number of effective optical flow algorithms are widely used in motion estimation tasks [41,42]. These methods are called differential since they are based on local Taylor series approximations of the image signal: they use partial derivatives with respect to the spatial and temporal coordinates. Optical flow indicates the global movement of the scene and the local movement of objects, and moving areas have a high probability of carrying foreground objects in the real world. It therefore provides prior knowledge to guide where the model should focus in a scene. In this work, we employed Lucas and Kanade's optical flow method [43], which makes use of the spatial intensity gradient of the images to find a good match using a type of Newton-Raphson iteration. This technique is fast because it examines few potential matches between the images.
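A minimal single-window Lucas-Kanade sketch in NumPy illustrates the differential idea; the actual method of [43] additionally uses Newton-Raphson style iteration, which this sketch omits:

```python
import numpy as np

def lucas_kanade(I0, I1):
    """Single-window Lucas-Kanade sketch: stacks the brightness-constancy
    equations Ix*vx + Iy*vy + It = 0 for every pixel in the window and
    solves them in the least-squares sense."""
    Ix = np.gradient(I0, axis=1)   # spatial gradient, x direction
    Iy = np.gradient(I0, axis=0)   # spatial gradient, y direction
    It = I1 - I0                   # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    v, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return v                       # estimated displacement (vx, vy)

# A linear intensity ramp shifted right by 0.5 px should yield vx close to 0.5.
I0 = np.tile(np.arange(10.0), (10, 1))
v = lucas_kanade(I0, I0 - 0.5)
print(v)
```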

Model Training
For scene-specific models, even if each training set contains enough samples to achieve a high F-measure, the model could over-fit the specific scene, so its generalization capability is limited. We therefore avoid training a scene-specific model for every scene and instead use all of the scenes in CDnet 2014 to train a single model. Following the training setting in DeepBS [33], we randomly select 5% of the samples with their ground truths from each subset of CDnet 2014 as training data. The remaining 95% of the samples are used to test the model. The optical flow of every video is extracted at a 50% down-sampling ratio in advance. All the frames, ground truths, and optical flow images are resized to 256 × 256. Since the optical flow image has only one channel, we extend the number of channels to 3 to match the original frame. The proposed model is trained for about 4.5 h over 100 epochs with 28 samples as a mini-batch. As the optimizer, we use Adam with β1 = 0.95, β2 = 0.90, and a small learning rate of 3 × 10−5. The parameters of the proposed model are initialized randomly without any pre-trained model. The model is trained on two RTX 2080 Ti GPUs with Ubuntu 16.04 LTS and TensorFlow.
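The channel extension and the 5%/95% split described above can be sketched as follows (the subset size of 1000 frames and the fixed seed are hypothetical):

```python
import numpy as np

# The single-channel optical flow image is replicated to 3 channels so
# that it matches the shape of the RGB frame.
flow = np.random.rand(256, 256, 1).astype(np.float32)
flow_3ch = np.repeat(flow, 3, axis=-1)   # (256, 256, 3)

# 5% of the samples of each subset go to training, the rest to testing.
rng = np.random.default_rng(0)
indices = rng.permutation(1000)          # hypothetical subset of 1000 frames
n_train = int(0.05 * len(indices))
train_idx, test_idx = indices[:n_train], indices[n_train:]
print(flow_3ch.shape, len(train_idx), len(test_idx))  # (256, 256, 3) 50 950
```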

Data Preparing and Experiment Setting
All the testing results on different scenes are given by the single STAM model. Segmented foreground was obtained without any post-processing.
In order to test the foreground detecting in cross-scenes, Wallflower [44] and PETS [45] datasets were introduced. We applied STAM trained on CDnet 2014 to test these two datasets without any additional training phase.
For the ablation experiments, we removed the attention modules from STAM and concatenated the encoder and decoder features directly, called STAM NoAtt . We also removed the encoder layers associated with optical flow from STAM, so that the output foreground mask relied only on the static image, called STAM NoOF .  We computed seven different evaluation metrics for each algorithm compared on CDnet 2014, shown in Table 2. The STAM-based method surpassed the state-of-the-art algorithms in most of the metrics. The Precision of STAM was 0.9851, while Cascade CNN ranked second with 0.8997 and DeepBS third with 0.8332; STAM improved precision by 9-15%. For Recall and FNR, Cascade CNN surpassed STAM, but by less than 1%. For F-measure, STAM outperformed the second-ranked Cascade CNN by 4%. Meanwhile, the STAM containing the attention mechanism and temporal information exceeded the models that excluded these parts, with the F-measure increasing by 9% and 6%, respectively.  Table 3 presents the F-measures computed for STAM and the state-of-the-art approaches on different subsets. STAM gained the highest F-measure scores among the compared algorithms in six out of 11 categories: bad weather, intermittent object motion, shadow, thermal, turbulence, and light switch. The visualized results are provided in Figure 5, and the average F-measures of all the methods are illustrated in Figure 6. Note that all the testing results of STAM on different scenes were given by this single model, while Cascade CNN was trained in a scene-specific style following its original experimental setting: for example, a model was trained on the PTZ subset of CDnet 2014 and tested on PTZ, while for another subset, such as Baseline, the model was retrained and tested on Baseline. Such models could over-fit a specific scene, whereas STAM handled all the sub-scenes in CDnet 2014 without retraining.
However, the proposed model still brought improvement in scenes like bad weather, shadow, thermal, and overall performance. More importantly, compared to the state-of-the-art cross-scene single model DeepBS [33], the proposed model achieved significant improvements in all the seven metrics.
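For reference, the seven CDnet 2014 metrics discussed above can be computed from pixel-level confusion counts using their standard definitions:

```python
def cdnet_metrics(tp, fp, tn, fn):
    """The seven CDnet 2014 evaluation metrics from pixel-level
    true/false positive and negative counts (standard CDnet
    definitions; PWC is the percentage of wrong classifications)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"Recall": recall, "Specificity": specificity, "FPR": fpr,
            "FNR": fnr, "PWC": pwc, "Precision": precision,
            "F-Measure": f_measure}

# Hypothetical counts: 90 true foreground hits, 10 misses, 10 false alarms.
print(cdnet_metrics(90, 10, 890, 10))
```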

Cross-Scene Segmentation Results on Wallflower and PETS
We directly applied the STAM trained on CDnet 2014 to Wallflower and PETS without any tuning to test its capability to cope with cross-scene segmentation. There were seven different scenes in Wallflower, and only one hand-segmented ground truth was provided for each scene. Since no foreground is present in the ground truth of "Moved Object", we excluded this scene from the experiments. Table 4 illustrates the quantitative results on Wallflower, showing that STAM performed better than DeepBS on two subsets and gained the best overall F-measure. Quantitative comparisons on the PETS dataset are exhibited in Table 5. On PETS, we compared STAM with several background modeling approaches, including the newly proposed CPB+HoD approach [25]. The F-measure of STAM was comparable with that of the standard background modeling approach GMM without any training on PETS, but failed to outperform the CPB+HoD and ViBe approaches. The reason is that the proposed model emphasizes generalization performance through training on a big dataset, but at the same time it may fail to preserve small details in a specific scene. In the PETS dataset, most of the foreground objects are quite small compared with those in Wallflower, so performance was not as good as on Wallflower. Figure 7 illustrates some samples segmented by the proposed model on the Wallflower and PETS datasets, which also indicates the weakness of STAM in preserving the details of very small foreground objects.
The test speed of STAM is 10.8 fps at a frame size of 256 × 256 on a single GTX 1080 Ti with 32 GB of RAM and the Ubuntu 16.04 LTS operating system.

Figure 7. Foreground detection in Wallflower and PETS. The first three rows are from Wallflower while the other three rows are from PETS.

Conclusions
We proposed a Spatio-Temporal Attention Model for cross-scene foreground detection. The benefit of the proposed model is that appearance and motion features, and low-level and high-level features, are synthesized by attention modules through feature learning. The ablation experiments validated the model: with optical flow it achieved a 6% better F-measure than without it, and with attention a 9% better F-measure than without it. The proposed model surpassed state-of-the-art methods in the cases of bad weather, intermittent object motion, shadow, thermal, turbulence, and light switch. It improved the overall precision by 9% and the F-measure by 4% over the scene-specific model Cascade CNN. The quantitative and visualized performance on the Wallflower and PETS benchmarks shows its promising cross-scene generalization ability without any additional training. Furthermore, it shows promise for processing surveillance videos in real time.