Unsupervised Multi-Object Detection for Video Surveillance Using Memory-Based Recurrent Attention Networks

Abstract: Video surveillance has become ubiquitous with the rapid development of artificial intelligence. Multi-object detection (MOD) is a key step in video surveillance and has been widely studied for a long time. Most existing MOD algorithms follow the "divide and conquer" pipeline and use popular machine learning techniques to optimize algorithm parameters. However, this pipeline is usually suboptimal since it decomposes the MOD task into several sub-tasks and does not optimize them jointly. In addition, the frequently used supervised learning methods rely on labeled data, which are scarce and expensive to obtain. Thus, we propose an end-to-end Unsupervised Multi-Object Detection framework for video surveillance, where a neural model learns to detect objects from each video frame by minimizing the image reconstruction error. Moreover, we propose a Memory-Based Recurrent Attention Network to ease detection and training. The proposed model was evaluated on both synthetic and real datasets, exhibiting its potential.


Introduction
Video surveillance aims to analyze video data recorded by cameras. It has been widely used in crime prevention, industrial processes, traffic monitoring, sporting events, etc. A key step in video surveillance is object detection, i.e., locating multiple objects with bounding boxes in each video frame. This is crucial for downstream tasks such as recognition, tracking, behavior analysis, and event parsing.
Multi-object detection (MOD) from visual data has been extensively studied for many years by the computer vision community. Classical methods such as Deformable Part Models (DPMs) [1] follow the "divide and conquer" pipeline, in which a sliding window approach is first used to generate image regions, then a classifier (e.g., a Support Vector Machine [2]) is employed to categorize each region into object/non-object, and finally post-processing is applied to refine the bounding boxes of object regions (e.g., removing outliers, merging duplicates, and rectifying boundaries). To improve both the efficiency and performance of MOD, methods based on Region-based Convolutional Neural Networks (R-CNNs) [3][4][5][6] were proposed and perform well on various popular object detection datasets [7][8][9][10][11]. In contrast to previous methods, they selectively generate only a small number of image region proposals and use Convolutional Neural Networks (CNNs) [12,13] as more expressive classifiers. However, as this "divide and conquer" pipeline breaks the MOD problem down into several sub-problems and optimizes them separately, the resulting solutions are usually sub-optimal. To jointly optimize the MOD problem, Huang et al. [14], Redmon et al. [15] and Liu et al. [16] formulated object detection as a single regression problem that directly maps the image to object bounding boxes, achieving end-to-end learning that greatly simplifies the MOD process.
Nevertheless, all of the above methods rely on supervised learning that requires labeled data, while manually labeling object bounding boxes is very expensive. Moreover, unlike general MOD tasks, in video surveillance we are more interested in a specific class of objects, and the background usually does not change over time and thus can be easily extracted to ease detection. To this end, we propose a novel framework to achieve unsupervised end-to-end learning of MOD for video surveillance. We summarize our contributions as follows:

• We propose an Unsupervised Multi-Object Detection (UMOD) framework, where a neural model learns to detect objects from each video frame by minimizing the image reconstruction error.

• We propose a Memory-Based Recurrent Attention Network (MRAN) to improve detection efficiency and ease model training.

• We assess the proposed model on both a synthetic dataset (Sprites) and a real dataset (DukeMTMC [17]), exhibiting its advantages and practicality.

Unsupervised Multi-Object Detection
The UMOD framework is composed of four modules, including: (i) an image encoder extracting input features from the input image; (ii) a recurrent object detector recursively detecting objects using the input features; (iii) a parameter-free renderer reconstructing the input image using the detector outputs; and (iv) a reconstruction loss driving the learning of Modules (i) and (ii), in an unsupervised and end-to-end fashion.

Image Encoder
Firstly, we use a neural image encoder NN enc to compress the input image X into the input feature C:

C = NN enc (X; θ enc ), (1)

where X ∈ [0, 1] H×W×D has height H, width W, and channel number D; C ∈ R M×N×S has height M, width N, and channel number S; and θ enc denotes the network parameters to be learned. By making C contain significantly fewer elements than X and taking it as the input for the succeeding modules, we can largely reduce the computational complexity of object detection.
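To make the compression concrete, the following NumPy sketch mimics what the encoder does dimensionally: a 128 × 128 × 3 image is reduced to an 8 × 8 × 32 feature map. The learned FCN is replaced here by average pooling plus a fixed random channel projection, so this is an illustration of the shapes only, not the paper's network; all names (`toy_encoder`, `pool`, `s_channels`) are ours.

```python
import numpy as np

def toy_encoder(x, pool=16, s_channels=32, seed=0):
    """Stand-in for NN_enc: compress an H x W x D image into an
    M x N x S feature map (M = H // pool, N = W // pool).
    A real encoder would be a learned FCN; here we use average
    pooling plus a fixed random 1x1 "conv" for illustration."""
    h, w, d = x.shape
    m, n = h // pool, w // pool
    # average-pool non-overlapping pool x pool blocks
    pooled = x[:m * pool, :n * pool].reshape(m, pool, n, pool, d).mean(axis=(1, 3))
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((d, s_channels))  # fixed random channel projection
    return pooled @ proj  # shape (M, N, S)

x = np.random.rand(128, 128, 3)   # X in [0,1]^{HxWxD}
c = toy_encoder(x)
print(c.shape)                    # (8, 8, 32): far fewer elements than X
```

The point is that C (8 × 8 × 32 = 2048 values) is over twenty times smaller than X (128 × 128 × 3 = 49,152 values), which is what makes running the recurrent detector on C cheap.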

Recurrent Object Detector
Based on the observation that different objects usually share common patterns in video surveillance MOD tasks, we iteratively apply the same neural model, namely the recurrent object detector, to extract objects from the input feature C. This not only regularizes the model, but also reduces the number of parameters, thereby maintaining learning efficiency as the object number increases.
The recurrent object detector consists of a recurrent module NN rec and a neural decoder NN dec . In the t-th iteration (t ∈ {1, 2, . . ., T}, where T is the maximum detection step), the detector first updates its state vector h t ∈ R R (throughout this paper, we assume vectors are in row form) via NN rec (parameterized by θ rec ):

h t = NN rec (C, h t−1 ; θ rec ). (2)

Although NN rec could be naturally represented as a Recurrent Neural Network (RNN) [18][19][20] (C needs to be vectorized), we model NN rec using a novel architecture to improve network efficiency, which is discussed in Section 3.
Given h t , the detector output Y t can then be generated through a neural decoder NN dec (parameterized by θ dec ):

Y t = NN dec (h t ; θ dec ), (3)

where Y t includes the object confidence y c t , layer y l t , pose y p t , shape Y s t , and appearance Y a t . As the sampling process is not differentiable, a Straight-Through Gumbel-Softmax estimator [21] is employed to reparameterize both distributions so that back-propagation can still be applied. We have found by experiments that discretizing y l t and Y s t is crucial for obtaining interpretable output variables. The mid-level representation defined above is both flexible and interpretable. As shown below, the output variables can be directly used to reconstruct the input image, through which their interpretability is enforced.
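The forward pass of a Straight-Through Gumbel-Softmax sample can be sketched as follows. This is an illustration only: the actual estimator in [21] also defines a gradient path through the soft sample, which requires an autodiff framework and is not expressible in plain NumPy.

```python
import numpy as np

def st_gumbel_softmax(logits, tau=1.0, rng=None):
    """Forward pass of a Straight-Through Gumbel-Softmax sample.
    Returns a one-hot vector whose index approximately follows the
    categorical distribution softmax(logits).  In a framework with
    autodiff, gradients would flow through the soft sample y_soft
    while the hard sample is used in the forward pass."""
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick
    g = -np.log(-np.log(rng.uniform(1e-20, 1.0, size=logits.shape)))
    z = (logits + g) / tau
    y_soft = np.exp(z - z.max())
    y_soft = y_soft / y_soft.sum()
    y_hard = np.zeros_like(y_soft)
    y_hard[np.argmax(y_soft)] = 1.0   # discretized output
    return y_hard

sample = st_gumbel_softmax(np.array([2.0, 0.5, 0.1]))
print(sample)  # a one-hot vector, e.g. [1. 0. 0.]
```

Lowering the temperature `tau` makes the soft sample closer to one-hot; the straight-through trick lets the model use the discrete sample at run time while still receiving (biased) gradients during training.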

Renderer
Given the detector outputs {Y t | t = 1, 2, . . ., T} without training labels, how can we define a training objective? Our solution is to first convert all these outputs into a reconstructed image using a differentiable renderer, and then use back-propagation to minimize the reconstruction error. To force the model to learn to produce the desired detector outputs, we make the renderer deterministic and parameter-free. In this case, correct detector outputs correspond to a correct reconstruction.
Firstly, we use the object pose y p t to scale and shift its shape Y s t and appearance Y a t by using a Spatial Transformer Network (STN) [22], where T s t ∈ {0, 1} H×W×1 is the transformed shape and T a t ∈ [0, 1] H×W×D is the transformed appearance. Then, by using the object confidence y c t and layer y l t , we compose I image layers (I ≤ T), where the i-th layer can be possessed by several objects and is obtained by combining the layer's foreground mask (of size H×W×1) and foreground (of size H×W×D) through element-wise multiplication (if the operands have different sizes, we simply broadcast them).
Finally, by using these layers, the input image can be reconstructed in an iterative way, i.e., for i = 1, 2, . . ., I, the i-th layer's foreground is composited over the previous result X(i−1) according to its foreground mask, where we initialize X(0) as the background (note that the background is assumed easy to extract or known in advance) and take X(I) as the final reconstructed image. Illustrations of the UMOD framework and the renderer are shown in Figures 1 and 2, respectively. Note that our rendering process can be accelerated, since matrix operations can be used to parallelize the composition of layers (defined in Equations (6) and (7)). Although handling occlusion requires iterations (defined in Equation (8)) that still cannot be parallelized, we can use fewer layers by setting a smaller I. This is reasonable since occlusion usually happens among a few objects, and it is unnecessary to allocate a layer to each object (non-occluded objects can share the same layer).
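The iterative compositing can be sketched in NumPy as follows. This is a minimal illustration assuming each layer is already given as a foreground mask and a foreground image (the exact construction of those layers from confidences and layer assignments follows Equations (6) and (7), which we do not reproduce here); function and variable names are ours.

```python
import numpy as np

def composite_layers(background, masks, foregrounds):
    """Iterative alpha-style compositing used by the parameter-free
    renderer: each layer's foreground is painted over the running
    image X(i-1) wherever its mask is on, starting from X(0) = background.
    masks[i] is an H x W x 1 foreground mask in [0,1];
    foregrounds[i] is an H x W x D foreground image."""
    x = background.copy()
    for m, f in zip(masks, foregrounds):
        x = m * f + (1.0 - m) * x   # later layers occlude earlier ones
    return x

h, w = 4, 4
bg = np.zeros((h, w, 3))                               # black background
mask = np.zeros((h, w, 1)); mask[1:3, 1:3] = 1.0       # a 2x2 object
fg = np.ones((h, w, 3)) * np.array([1.0, 0.0, 0.0])    # red appearance
out = composite_layers(bg, [mask], [fg])
print(out[2, 2], out[0, 0])  # [1. 0. 0.] inside the object, [0. 0. 0.] outside
```

Because layers are composited back-to-front, occlusion falls out of the iteration order, which is why resolving occlusion requires the sequential loop while the per-layer composition itself is parallelizable.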

Loss
With the reconstructed image X(I), we can then define the loss ℓ for each sample to drive the learning of the image encoder and the recurrent object detector:

ℓ = MSE(X, X(I)) + λ · (1/T) ∑ t s x t s y t , (9)

where MSE(·, ·) is the Mean Squared Error for reconstruction, and λ > 0 is the coefficient of the tightness constraint (1/T) ∑ t s x t s y t , which penalizes object scales in order to avoid loose bounding boxes.
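A minimal sketch of this objective, assuming the per-object scales s x t and s y t are collected into two lists (the function name and argument names are ours):

```python
import numpy as np

def umod_loss(x, x_rec, scales_x, scales_y, lam=1.0):
    """Reconstruction MSE plus the tightness penalty
    lambda * (1/T) * sum_t s^x_t * s^y_t, which discourages
    loose (oversized) bounding boxes."""
    mse = np.mean((x - x_rec) ** 2)
    tightness = np.mean(np.asarray(scales_x) * np.asarray(scales_y))
    return mse + lam * tightness

x = np.zeros((8, 8, 3))
x_rec = np.full((8, 8, 3), 0.1)
loss = umod_loss(x, x_rec, scales_x=[0.5, 0.2], scales_y=[0.4, 0.3], lam=1.0)
print(round(loss, 4))  # 0.14 = 0.01 (MSE) + 0.13 (mean of 0.2 and 0.06)
```

Note the trade-off λ controls: with λ too large the model shrinks boxes at the expense of reconstruction quality; with λ = 0 nothing prevents a box from covering the whole frame.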

Memory-Based Recurrent Attention Networks
When using an RNN to model the recurrent module NN rec defined in Equation (2), it can suffer from two issues: (i) to avoid repeated detection, the detector state h t must carry information from the previous detections (for t′ < t), which couples memory and computation, thereby making the detection of the current object less effective; and (ii) to extract features for a specific object, the detector must learn to focus on a local area of the input feature C, making training more difficult. To this end, we propose a Memory-Based Recurrent Attention Network (MRAN), which overcomes Issue (i) by directly taking the input feature as an external memory, and overcomes Issue (ii) by explicitly employing the attention mechanism.
Concretely, we initialize the memory as C 0 = C, which is then sequentially read and written by the recurrent object detector, so that all messages from the past detections are recorded by C t instead of h t . In iteration t, the detector first reads from the previous memory C t−1 , then updates its state h t , and finally writes new contents into the current memory C t . Thus, in contrast to Equation (2), the recurrent module NN rec has the form:

(C t , h t ) = NN rec (C t−1 , h t−1 ; θ rec ). (10)

We set NN rec defined in Equation (10) as an MRAN, where location-based addressing is first adopted to explicitly impose attention on the input feature. The attention weight W t is generated by an attention network NN att :

W t = NN att (C t−1 ; θ att ), (11)

where W t ∈ [0, 1] M×N satisfies ∑ m,n W t,m,n = 1 (by using a softmax output layer). Then, letting c t−1,m,n ∈ R S be a feature vector of C t−1 , we define the read operation as:

r t = ∑ m,n W t,m,n c t−1,m,n , (12)

where the read vector r t ∈ R S represents the attended input features relevant to the current detection.
Next, the detector state is updated through a linear transformation followed by a tanh function, where r t is taken as the input feature (instead of C t−1 ):

ĥ t = Linear(r t ; θ upd ), (13)
h t = tanh(ĥ t ). (14)

Finally, we use h t to generate an erase vector e t ∈ [0, 1] S and a write vector v t ∈ R S :

[ê t , v t ] = Linear(h t ; θ wrt ), (15)
e t = sigmoid(ê t ), (16)

and define the write operation as:

C t,m,n = C t−1,m,n ⊙ (1 − W t,m,n e t ) + W t,m,n v t . (17)

An illustration of the MRAN is shown in Figure 3. Although the attention is now imposed on the input feature C t , we would like to further impose it on the input image X so that the detector only attends to a local image region rather than the whole image. Therefore, we use a Fully Convolutional Network (FCN) [23] (containing only convolution layers) as the image encoder NN enc . By controlling the receptive field of c t,m,n , X can also be attentively accessed by the detector. Another advantage of using an FCN is that, through parameter sharing, it can well capture the regularity among different objects (they usually have similar patterns).
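One MRAN detection step can be sketched in NumPy as follows. This is an illustration, not the paper's implementation: `att_logits` and `w_upd` stand in for the learned attention network NN att and the linear state update, and the erase/write heads are simplified (an NTM-style write, consistent with the stated similarity to the Neural Turing Machine); all names are ours.

```python
import numpy as np

def mran_step(c_prev, att_logits, w_upd):
    """One sketched MRAN step over an M x N x S memory:
    location-based attention, attentive read, tanh state update,
    and an NTM-style erase/write back into the memory."""
    m, n, s = c_prev.shape
    # attention weights W_t: softmax over all M*N locations, sums to 1
    w = np.exp(att_logits - att_logits.max())
    w = (w / w.sum()).reshape(m, n, 1)
    # read: r_t = sum_{m,n} W_{t,m,n} * c_{t-1,m,n}
    r = (w * c_prev).sum(axis=(0, 1))
    # state update: h_t = tanh(Linear(r_t))
    h = np.tanh(r @ w_upd)
    # simplified erase/write heads derived from h_t (hypothetical)
    e = 1.0 / (1.0 + np.exp(-h))          # erase gate in [0,1]^S
    v = h                                  # write content
    c_next = c_prev * (1.0 - w * e) + w * v
    return c_next, h, r

rng = np.random.default_rng(0)
c0 = rng.standard_normal((8, 8, 32))       # memory C_0 = C
c1, h1, r1 = mran_step(c0, rng.standard_normal(64), rng.standard_normal((32, 32)))
print(c1.shape, h1.shape)  # (8, 8, 32) (32,)
```

The key property the sketch preserves is that the memory, not the state vector, carries information across detection steps: the write erases the attended region of C so the next step's read attends elsewhere.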
Our MRAN (Equations (11)-(17)) is similar to the Neural Turing Machine [24,25]. As the detector uses interface variables to interact with the external memory, messages from previous detections do not need to be encoded into its working memory h t , thereby improving the detection efficiency.

Experiments
The goals of our experiments were: (i) to investigate the importance of the layered representation and the MRAN in our model (note that setting up a supervised counterpart for our model would be difficult, as computing the supervised loss requires finding the best matching between the detector outputs and the ground-truth data, which is itself an optimization problem); and (ii) to test whether our model is well-suited for video surveillance data taken from cameras. For Goal (i), we created a synthetic dataset, namely Sprites, and were interested in the configurations below:

• UMOD-MRAN. UMOD with MRAN, which is our standard model as described in Sections 2 and 3.
• UMOD-MRAN-noOcc. UMOD-MRAN without occlusion reasoning, which is achieved by fixing the layer number I to 1.
• UMOD-RNN. UMOD with RNN, which is achieved by setting the recurrent module NN rec as a Gated Recurrent Unit [20] as described in Section 2.2, thereby disabling the external memory and attention.
• AIR. Our implementation of the generative model proposed in [26], which can be used for MOD through inference.
For Goal (ii), we evaluated UMOD-MRAN on the challenging DukeMTMC dataset [17], and compared results with those of the state-of-the-art.
There are some common settings for the implementation of the above configurations. For the image encoder NN enc defined in Equation (1), we set it as an FCN, where each convolution layer was composed as Convolution→Pooling→ReLU and the convolution stride was set to 1 for all layers. For the decoder NN dec defined in Equation (3), we set it as a fully-connected network (FC), with ReLU as the activation function for each hidden layer. We also set the object scale coefficients η x = η y = 0.4. For the renderer, we set the layer number I = 3 (except for UMOD-MRAN-noOcc and AIR, where I = 1). For the loss defined in Equation (9), we set λ = 1. To train the model, we minimized the average loss on the training set with respect to all network parameters Θ = {θ enc , θ rec , θ dec } using Adam [27] with a learning rate of 5 × 10 −4 . Early stopping was used to terminate training.
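For reference, a single Adam update with the learning rate used here (5 × 10 −4 ) can be sketched as follows. This is a textbook Adam step, not the paper's training code; a real run would loop over mini-batches of the reconstruction loss and apply this update to every parameter in Θ.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=5e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v), bias correction, then a scaled step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m = np.zeros(1); v = np.zeros(1)
theta, m, v = adam_step(theta, np.array([2.0]), m, v, t=1)
print(theta)  # parameter moved against the gradient by roughly lr
```

On the first step the bias-corrected update is approximately lr in magnitude regardless of gradient scale, which is one reason Adam tolerates the widely varying gradient magnitudes of a reconstruction loss.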

Sprites
As a toy example, we wanted to see whether the model could robustly handle occlusion and infer object existence, position, scale, shape, and appearance, thereby generating accurate object bounding boxes. Therefore, we created a new Sprites dataset composed of 1M color images, each of which is of size 128 × 128 × 3, comprising a black background and 0-3 sprites that can occlude each other. Each sprite is a 21 × 21 × 3 color patch with a random scale, position, shape (diamond/rectangle/triangle/circle), and color (cyan/magenta/yellow/blue/green/red).
To deal with this task, for the UMOD configurations we set the detection step T = 4 and the background X(0) = 0. Please refer to Table 1 for the other hyper-parameters. The mini-batch size was set to 128 for training.

Figure 4 shows the training curves. UMOD-MRAN and UMOD-MRAN-noOcc converge significantly faster than UMOD-RNN, indicating that, with the MRAN, UMOD can be trained more easily. However, in terms of final validation loss, UMOD-MRAN and UMOD-RNN are slightly better than UMOD-MRAN-noOcc, meaning that, without the layered representation that models occlusion, the input images cannot be well reconstructed. To visualize the detection performance, UMOD-MRAN was compared against the other configurations on sampled images. Qualitative results are shown in Figure 5. We can see that UMOD-MRAN performs well and can robustly infer the existence, layer, position, scale, shape, and appearance of each object. UMOD-RNN performs slightly worse than UMOD-MRAN, since it sometimes fails to recover the occlusion order (Columns 2, 5, and 7). However, with a layer number I = 1, UMOD-MRAN-noOcc and AIR perform even worse, since they cannot handle occlusion (Columns 3, 5, 6, and 9) and sometimes miss detections (Columns 1, 2, 7, and 8); we conjecture that the model has learned to suppress the occluded outputs, as adding their pixel values to a single layer probably causes a high reconstruction error. To quantitatively assess the model, we also evaluated the different configurations with the commonly used MOD metrics, including the Average Precision (AP) [9], Multi-Object Detection Accuracy (MODA), Multi-Object Detection Precision (MODP) [28], average False Alarm number per Frame (FAF), total True Positive number (TP), total False Positive number (FP), total False Negative number (FN), Precision (TP/(TP + FP)), and Recall (TP/(TP + FN)). Results are presented in Table 2.
UMOD-MRAN outperforms all other configurations with respect to all metrics. Without the layered representation, the performances of UMOD-MRAN-noOcc and AIR are largely affected by higher FNs (1702 and 1964, respectively), which again suggests that the detector outputs are suppressed when only a single image layer is used. Moreover, when the attention and external memory are not explicitly modeled, UMOD-RNN and AIR perform slightly worse than UMOD-MRAN and UMOD-MRAN-noOcc (in all metrics), respectively, which means that incorporating these two priors can well regularize the model so that it learns to extract more desirable outputs.

DukeMTMC
To examine the performance of our model when applied to real-world data that are highly variable and complex, we assessed UMOD-MRAN on the challenging DukeMTMC dataset [17]. It is a video surveillance dataset comprising eight videos at 60 fps with a resolution of 1080 × 1920, collected from eight fixed cameras that record people's movements at different places on the Duke University campus. Each video is divided into a training set (50 min), a hard test set (10 min), and an easy test set (25 min).
For UMOD-MRAN, the detection step T was set to 10, and the IMBS algorithm [29] was used to extract the background X(0). Please see Table 1 for the other hyper-parameters. For training, we set the mini-batch size to 32. To ease processing, we resized the input images to 108 × 192. We trained a single model and evaluated it on the easy test sets of all scenarios. Note that we did not evaluate our model on the hard test sets, as they have very different data statistics from the training sets.
Qualitative results are shown in Figure 6. We can see that UMOD-MRAN performs well under various scenarios. Table 3 reports the quantitative results. UMOD-MRAN outperforms the DPM [1] with respect to all metrics. It reaches an AP of 87.2%, which is significantly higher than that of the DPM (79.3% AP), and is also very competitive with the recently proposed Faster R-CNN [5] (89.7% AP). Although the CRAFT methods [30,31] perform the best (with 91.1% and 92.0% APs, respectively), our model is the first that is free of any training labels or extracted features.

Visualizing the UMOD-MRAN
To further understand the model, we visualize the inner workings of UMOD-MRAN on Sprites, as shown in Figure 7. Both the memory C t and the attention weight W t are visualized as M × N (8 × 8) matrices (brighter pixels indicate higher values), where for C t the matrix consists of its mean values along the last dimension. The detector output Y t is visualized as (y c t Y s t Y a t ) ∈ [0, 1] U×V×D . At detection step t, the memory C t−1 produces an attention weight W t , through which the detector first reads from C t−1 and then writes to C t . We find that, at each detection step, the memory content (bright region in C t−1 ) related to the associated object (Y t ) is erased (becomes dark) by the write operation, thereby preventing the detector from reading it again in the next detection step.

Related Work
Recently, unsupervised learning has been used in several works to extract desired patterns from images. For example, Kulkarni et al. [32], Chen et al. [33] and Rolfe [34] focused on finding lower-level disentangled factors; Le Roux et al. [35], Moreno et al. [36] and Huang and Murphy [37] focused on extracting mid-level semantics; and Eslami et al. [26], Yan et al. [38], Rezende et al. [39], Stewart and Ermon [40] and Wu et al. [41] focused on discovering higher-level semantics. However, unlike these methods, the proposed UMOD-MRAN focuses on MOD tasks. It uses a novel rendering scheme to handle occlusion and integrates memory and attention to improve efficiency, making it well-suited for real applications such as video surveillance.

Conclusions
In this paper, we propose a novel UMOD framework to tackle the MOD task for video surveillance.The main advantage of our model over other popular methods is that it is free of any training labels or extracted features.Another important advantage of our model is that the MRAN module can largely improve the detection efficiency and ease model training.The proposed model was evaluated on both synthetic and real datasets, exhibiting its superiority and practicality.
For future work, we would like to extend our model in two aspects.First, it is useful to incorporate the idea of "adaptive computation time" [42] into our framework so that the recurrent object detector can adaptively choose an appropriate detection step T for efficiency.Second, it is intriguing to model the object dynamics by employing temporal RNNs so that our model can directly deal with multi-object tracking problems for video surveillance.

Figure 2 .
Figure 2. Illustration of the renderer that converts the detector outputs to the reconstructed image, where the detection step T = 4 and the layer number I = 2.

Figure 3 .
Figure 3. Illustration of the Memory-Based Recurrent Attention Network (MRAN), where the detection step T = 3 and the green/blue bold lines denote the attentive read/write operations on the memory.

Figure 4 .
Figure 4. Training curves of different configurations on Sprites (Note that we do not compare the loss of AIR since it uses a different training objective).

Figure 5 .
Figure 5. Qualitative results of different configurations on Sprites. For each configuration, the reconstructed images are shown, with the detector outputs at the bottom (produced at detection steps 1-4 from left to right).
(i) a few people (Column 4 in Rows 1-3); (ii) many people (Column 5 in Rows 1 and 2); (iii) occluded people (Column 2 in Row 2); (iv) people near the camera (Column 4 in Row 1); (v) people far from the camera (Column 6 in Row 2); (vi) people with different shapes/appearances (Column 8 in Row 1); and (vii) people who are hard to distinguish from the background (Column 6 in Row 1).

Figure 6 .
Figure 6. Qualitative results in different scenarios on DukeMTMC.

Table 1 .
Model hyper-parameters for Sprites and DukeMTMC.

Table 2 .
Detection performances of different configurations on Sprites.

Table 3 .
Detection performance of the UMOD-MRAN compared with those of the state-of-the-art methods on DukeMTMC.