An Efficient Anomaly Detection System for Crowded Scenes Using Variational Autoencoders

Abstract: Anomaly detection in crowded scenes is an important and challenging part of intelligent video surveillance systems. As deep neural networks have succeeded in feature representation, the features they extract represent the appearance and motion patterns in different scenes more specifically than the hand-crafted features typically used in traditional anomaly detection approaches. In this paper, we propose a new baseline framework for anomaly detection in complex surveillance scenes, based on a variational auto-encoder with convolution kernels that learns feature representations. First, the raw frame series are provided as input to our variational auto-encoder, without any preprocessing, to learn the appearance and motion features of the receptive fields. Then, multiple Gaussian models are used to predict the anomaly scores of the corresponding receptive fields. Our proposed two-stage anomaly detection system is evaluated on a video surveillance dataset for large scenes, the UCSD pedestrian datasets, and yields competitive performance compared with state-of-the-art methods.


Introduction
With the wide use of video surveillance systems, conventional manual analysis for labelling abnormal events in the large amount of video data captured from crowd surveillance and public place monitoring is time-consuming and inefficient. Therefore, an intelligent surveillance system that can recognize and detect anomalies is urgently needed and has become a hotspot of computer vision research and applications [1][2][3][4].
However, anomaly detection and localization is still a challenging problem in intelligent video surveillance, although great progress has been made in feature extraction, behavior modeling, and anomaly measuring. The most challenging issue is that the definition of an anomaly is indefinite in most real-world surveillance videos. In general, events that are significantly different from common events are defined as anomalies, which means anomalies are defined by the normal events rather than by their own categories or details. An event that is anomalous in one scene (such as a person running) may not be anomalous in a second scene, since the normal events in the second scene may include running people whereas those in the first do not. As a result, anomalies are too few and too dissimilar to be modeled effectively. Anomaly detection for crowded scenes is essentially novelty detection, which is also known as a one-class, semi-supervised learning problem [5][6][7], since the training data of the existing datasets contains only normal events while the data to be verified contains both normal and abnormal events.
Traditional solutions in the literature concentrated on the analysis of local or individual spatiotemporal patterns in the scene [8][9][10]. Therefore, various feature descriptors were designed to extract low-level features from the appearance and motion cues. Some works [11,12] used the popular low-level features including the histogram of oriented gradients (HOG), the 3D spatiotemporal gradients, and the histogram of oriented flows (HOF) to describe the patterns of the minimal units. However, adopting the hand-crafted generic feature extractors rather than specific descriptors learned from the scene is a clear limitation. Some other systems [5,13] were based on the analysis of the motion information in the scene. In these works, the local trajectories or the optical flows of the pixels were computed and modeled in order to describe the motion patterns. However, these approaches lacked stability when dealing with complex scenes, as the accuracy of the pixels trajectories extraction significantly degraded when a dynamic occlusion occurred among multiple objects.
Recently, deep learning approaches have achieved remarkable success in various computer vision tasks, such as object classification and object detection [14,15], and these applications were based on supervised learning that required labels. In the meantime, unsupervised learning based approaches such as the auto-encoder and the variational auto-encoder (VAE) have been widely used for feature extraction [16][17][18]. These works have shown that, compared with traditional methods, rich and specific features can be learned. Therefore, in anomaly detection tasks, the hand-crafted feature extractors are now being replaced by auto-encoders. Based on these learned features, some works reconstructed or generated a whole new frame by using a fully convolutional network (FCN) [19], and the total deviation between the generated frame and the original frame was used to predict the anomaly [6,20]. However, these methods can hardly locate abnormal events in frames. Some other works used probability estimation models, such as one-class SVM models [21] and Gaussian models [22], to predict the anomaly scores of the learned features. In these works, all the features shared the same model, even though they were extracted from different regions.
In this paper, we propose a baseline framework of an anomaly detection system for complex surveillance scenes by using a VAE with convolution kernels, which is inspired by the convolutional auto-encoder and the FCN architecture. In the first stage, the still frame series are provided as input to our VAE, and the appearance and motion features of the receptive fields, which are densely distributed throughout the frames, are extracted by the encoder network. In the second stage, in contrast to the solutions based on reconstruction or generation, our system locates the anomalies by using multiple multivariate Gaussian models to predict the anomaly score of each receptive field. Each multivariate Gaussian model is fitted to the corresponding feature representations, which means that the receptive fields at different locations have their own Gaussian models. Furthermore, according to the principle of the VAE, our feature vectors are more independent and decoupled than those extracted by original auto-encoders and convolutional auto-encoders. Finally, an averaging operation is used to handle the overlapping parts of the receptive fields. The proposed anomaly detection system is evaluated on challenging large-scale surveillance scene datasets and compared with several methods. The experiments show that our method outperforms most previous methods and yields competitive performance compared with two state-of-the-art methods.
Our main contributions are as follows:
• We propose a new baseline of the anomaly detection and localization framework using the variational auto-encoder to learn the discriminative feature representations of appearance and motion patterns.
• We extract the feature representations of all receptive fields at one time and model a unique Gaussian for each feature.
• The variational auto-encoder is used to decouple the components of the feature from each receptive field as much as possible so that it is easier and more accurate to model the Gaussian for the features.
The remainder of this paper is organized as follows. Section 2 reviews the related work of anomaly detection and localization. A detailed description of our proposed method is given in Section 3: first the overall framework, then the variational auto-encoder with convolution kernels for feature extraction and the anomaly estimation at last. Section 4 presents experimental results and comparisons. The conclusion is finally summarized in Section 5.

Hand-Crafted Features Based Method
Generally, three modules can be identified in hand-crafted feature based anomaly detection methods: (i) extracting features from the normal patterns; (ii) modeling the distribution of the extracted features; (iii) identifying the outliers as anomalies based on the model. For the feature extraction module, various feature descriptors have been designed. In some works, low-level trajectory features from a sequence of images were utilized to describe normal motion patterns [23][24][25]. However, these methods focused on anomalies caused by a crowd instead of a single object as the fundamental unit. These trajectory features were mainly based on crowd tracking, so these methods were unable to handle single-object anomaly detection. In addition to these trajectory features, some other low-level spatiotemporal features were widely used, such as the histogram of oriented flows (HOF) [26] and the histogram of oriented gradients (HOG) [27]. Kratz et al. [28] used the distribution of spatiotemporal gradients to represent the rich motion information in local spatiotemporal motion patterns. In the work of [29], a motion feature represented by the histogram of the optical flow was used as a low-level feature for the motion-pattern description. To model the extracted features, Adam et al. [30] utilized an exponential distribution to characterize the flow probability matrix. Kim and Grauman [31] applied the mixture of probabilistic principal component analyzers (MPPCA) algorithm to model the local activity patterns with the optical flow as a low-level measure. Mahadevan et al. [32] learned a model for normal crowd features based on mixtures of dynamic textures (MDT), and Li et al. [33] used a conditional random field (CRF) to integrate the outputs of the model on this basis. In order to model the appearance and motion features from principal component analyzers, Feng et al. [34] constructed a deep Gaussian mixture model (GMM).
Besides the works above, some sparse coding or dictionary learning based methods were used to encode the normal patterns. In the work of [35], a normal dictionary was learned from an over-complete normal basis set, and the sparse reconstruction cost was then used to measure the normalness of the testing sample. In order to accelerate both the training and testing processes, Lu et al. [36] learned multiple dictionaries to encode normal size-invariant patches from multiscale frames. Yu et al. [37] captured the low-rank property of the bases in the dictionary learning phase, and a weighted sparse reconstruction method was then used to measure the abnormality of testing samples.

Deep Learning Based Method
In recent years, deep learning approaches have been successfully applied to many computer vision tasks [14,15], as well as in the field of anomaly detection [38,39].
In some works, convolutional auto-encoders or fully convolutional networks were used to reconstruct or generate a new set of frames or feature maps [6,20,40]. For a sequence of video frames without anomalies, Liu et al. [6] trained a fully convolutional network (FCN) model that resembled the U-Net to predict the next frame. The deviations between the predicted frame and its ground-truth frame were then used to predict the anomalies in the detection phase. Instead of the latent codes from the middle of auto-encoders, Ribeiro et al. [20] used the output of a convolutional auto-encoder, which could be considered the reconstruction of the input frame sequences. As the auto-encoder was trained on normal video sequences, the reconstruction error was applied as an anomaly score. However, due to the good capacity and generalization of deep neural networks, the assumption that abnormal events would trigger larger reconstruction errors or generation deviations does not necessarily hold. Consequently, the reconstruction errors or generation deviations of normal and abnormal patterns are unstable and have no fixed measurement range. For this reason, we focus on the approaches that extract features by an auto-encoder and detect anomalies by estimating the probabilities of the features [21,41,42]. Sabokrou et al. [42] structured a deep convolutional neural network with the kernels trained by a sparse auto-encoder. Taking the cubic patches captured from the original images as inputs, the feature maps from three intermediate layers and the last layer were pushed into their corresponding Gaussian classifiers. In the work of [21], three stacked denoising auto-encoders were proposed to learn spatial features, temporal features and their fusion. Then three one-class SVM models were used to evaluate the learned features and predict the anomaly score of each patch.
Apart from the preprocessing of cropping the input frames into patches, the main problem of these methods is that all the features share the same Gaussian classifiers or one-class SVM models, even though the features are extracted from different regions of the input.
Furthermore, in the work of [22], part of a pre-trained convolutional neural network (CNN) was used as the feature extractor in the form of a fully convolutional network (FCN), which can extract the features of each receptive field without cropping the input frames into patches. However, the pre-trained CNN was trained as a classifier on other databases that consist of static natural images, and the default number of input channels was set to three for RGB images; we are therefore skeptical of the reasonableness of using pre-trained networks.

Overall Scheme
Anomaly detection is the identification of events with low probabilities, which represent the irregular shape or motion patterns in the video frames. Thus, identifying the irregular appearance and motion patterns is the essential issue in anomaly detection. Because labels for the anomalous frames and pixels are insufficient, supervised learning based feature extraction methods barely work, despite their success in other tasks with specific categories. Therefore, a semi-supervised method is required to model the normal patterns, including the background of the scene and the regular shapes and motions. The workflow of the proposed detection method is outlined in Figure 1. In our work, a series of frames instead of a single frame is used as the input, and each training frame series consists of several normal frames. Then, a convolution-based variational auto-encoder is constructed to learn appearance and motion representations of the normal frame series, as a method of semi-supervised learning. Following the principle of the convolutional neural network, the feature vector at each location of the feature map, which is the output of the encoder, is considered as the appearance and motion representation of its corresponding receptive field. Then, for the feature vectors of the receptive fields at the same location in all inputs, a multivariate Gaussian is fitted so that the receptive field at each location has its own Gaussian model. Once the encoder network and the Gaussian models are trained, receptive fields of low probability under the corresponding Gaussian are considered abnormal. Given a test frame series, the feature vector of each receptive field is extracted by the encoder network, and its negative log-likelihood under its own Gaussian is computed. In the following we describe the proposed system in detail.

Convolutional VAE Architecture and Feature Extraction
The network architecture of our convolution-based variational auto-encoder (VAE) [18] is shown in Figure 2. In our work, a frame sequence is represented by a set of regional appearance and motion feature vectors, which are extracted densely from the corresponding receptive fields by the convolution-based VAE. To take both appearance and motion patterns into account, a series of frames is used as the input. Specifically, suppose the training dataset has $T$ frames, all free of abnormal events, and let $I_t$ be the $t$-th frame in the video. The pixel-wise average of frame $I_{t-1}$ and its previous frame $I_{t-2}$ is denoted by $I^{(-1)}_{t-1}$, and likewise the pixel-wise average of frame $I_{t+1}$ and its next frame $I_{t+2}$ is denoted by $I^{(+1)}_{t+1}$. Thus, the sequence $\mathbf{I}_t = \{I^{(-1)}_{t-1}, I_t, I^{(+1)}_{t+1}\}$ is used as a multichannel image to detect anomalies in frame $I_t$, and the sequence set is $\mathbf{I} = \{\mathbf{I}_t \mid t = 3, 4, \ldots, T-2\}$.
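The input construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name `build_sequences` is hypothetical, frames are assumed to be grayscale and already loaded as an array, and array indices start at 0 whereas the text counts frames from 1.

```python
import numpy as np

def build_sequences(frames):
    """Stack each frame with averaged past and future context as 3 channels.

    frames: array of shape (T, H, W), grayscale, 0-based indexing.
    Returns an array of shape (T - 4, 3, H, W); entry k corresponds to the
    1-based frame index t = k + 3 used in the text.
    """
    frames = np.asarray(frames, dtype=np.float32)
    T = frames.shape[0]
    if T < 5:
        raise ValueError("need at least 5 frames to form one sequence")
    past = 0.5 * (frames[1:T - 3] + frames[0:T - 4])   # mean of I_{t-1} and I_{t-2}
    center = frames[2:T - 2]                            # I_t
    future = 0.5 * (frames[3:T - 1] + frames[4:T])      # mean of I_{t+1} and I_{t+2}
    return np.stack([past, center, future], axis=1)

# Toy example: 6 constant frames whose pixel value equals the frame index.
toy = np.arange(6, dtype=np.float32)[:, None, None] * np.ones((6, 2, 2), dtype=np.float32)
seqs = build_sequences(toy)
```

With 6 frames, only two sequences can be formed, which matches the index range t = 3, ..., T − 2 in the text.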
In the feature description step, our convolution-based VAE has a deep fully convolutional architecture that contains an encoder network and a decoder network. The encoder can be divided into three blocks according to the size of the convolution kernel. The first block consists of three convolutional layers with the same kernel size 7 × 7 and the same stride 1 × 1, and a max pooling layer with a 2 × 2 kernel and 2 × 2 stride. The second block has two convolutional layers with kernel size 3 × 3 and stride 1 × 1. The third block contains three convolutional layers with the same kernel size 1 × 1 and stride 1 × 1. Then, following the VAE, which assumes that the posterior distribution of each latent variable takes an approximately Gaussian form, the encoder outputs a mean map $\mu$ and a standard deviation map $\sigma$, which are the parameters of the posterior distributions of the latent variables. The reparameterization trick is used to generate samples from the posterior distributions of the latent variables, $\mu + \sigma \otimes \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\otimes$ denotes the element-wise product. Next, the samples are pushed into the decoder network. The decoder mirrors the structure of the encoder in reverse order, with deconvolutional layers in place of the convolutional layers.
Specifically, for the $t$-th sequence $\mathbf{I}_t$ with resolution $h_0 \times w_0$, our encoder network outputs a mean tensor $\mu_t$ and a standard deviation tensor $\sigma_t$, where $\mu_t, \sigma_t \in \mathbb{R}^{h_l \times w_l \times c_l}$; $h_l$ and $w_l$ are the height and width of the tensor map, $c_l$ is the number of channels, and $l$ is the index of the network layer.
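The sampling step can be illustrated with a minimal NumPy sketch of the reparameterization trick. The tensor shape and the helper name `reparameterize` are assumptions chosen for the example; in a real network this step would run inside an automatic differentiation framework so that gradients flow through the mean and standard deviation maps.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 5, 3))      # hypothetical mean map of size h_l x w_l x c_l
sigma = np.ones((4, 5, 3))    # hypothetical standard deviation map, same size
z = reparameterize(mu, sigma, rng)

# With sigma = 0 the sample collapses onto the mean map.
z_det = reparameterize(mu, np.zeros_like(mu), rng)
```

Because the randomness is isolated in eps, the sample is a deterministic, differentiable function of the two maps, which is what allows the encoder to be trained by gradient descent.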
Following the work on the VAE, the loss function is mainly composed of two parts. The first part is the reconstruction loss, which makes the generated frame series $\hat{\mathbf{I}}_t$ close to the original series $\mathbf{I}_t$. In addition, as we assume that the distribution of each feature representation, described by a mean map and a standard deviation map, takes an approximately Gaussian form with an approximately diagonal covariance, the second part is the Kullback-Leibler (KL) divergence, which measures the difference between two probability distributions.
Specifically, the KL divergence can be computed and differentiated in closed form without estimation, as a sum over the $K = h_l \times w_l \times c_l$ elements of the mean tensor (equivalently, of the standard deviation tensor). The total loss function combines the reconstruction loss and the KL divergence.
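For orientation, the two parts are commonly written as follows in the VAE literature. This is a hedged sketch, not necessarily the exact form used here: it assumes a squared-error reconstruction term, a standard normal prior on the latent variables, and equal weighting of the two terms; the normalization constants may differ.

```latex
\begin{align}
\mathcal{L}_{\mathrm{rec}} &= \bigl\lVert \mathbf{I}_t - \hat{\mathbf{I}}_t \bigr\rVert_2^2, \\
\mathcal{L}_{\mathrm{KL}} &= \frac{1}{2} \sum_{k=1}^{K} \left( \mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1 \right), \\
\mathcal{L} &= \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{KL}},
\end{align}
```

where $\mu_k$ and $\sigma_k$ are the $k$-th elements of the mean and standard deviation tensors. The KL term vanishes exactly when $\mu_k = 0$ and $\sigma_k = 1$ for every $k$, which is the pressure that pushes the feature components toward independence.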
In the training process, the Adam optimizer [43] is used for parameter optimization and the learning rate is set to 0.0001. As the network is fully convolutional, the resolutions of the input and the output are the same. Consequently, there is no need to crop or resize the input video frames.
Different from an ordinary auto-encoder, the variational auto-encoder aims to learn the posterior distributions of the feature representations, in the form of the parameters mean $\mu$ and standard deviation $\sigma$. From the viewpoint of numerical simulation, the mean $\mu$ can be treated as a statistical representation of the input data and the standard deviation $\sigma$ as a noise intensity regulator. Therefore, for the input series $\mathbf{I}_t$, we define the mean tensor $\mu_t$ from the encoder as the feature representation $f_t$ of the appearance and motion patterns. Suppose the mean tensor $\mu_t$ from the encoder at the $l$-th layer has size $h_l \times w_l \times c_l$; the feature representation $f_t$ then consists of $h_l \times w_l$ feature vectors with $c_l$ dimensions. Specifically, for each position $(i, j)$, where $i \in [1, h_l]$ and $j \in [1, w_l]$, the feature vector can be written as $f_t(i, j) = [f_t(i, j, 1), f_t(i, j, 2), \ldots, f_t(i, j, c_l)]$. According to the architecture of the convolutional neural network, each feature vector $f_t(i, j)$ is derived from a specific receptive field, which is a sub-region of the original sequence $\mathbf{I}_t$. In other words, instead of cropping the sequence into patches and extracting the features one by one, we implicitly divide the sequence $\mathbf{I}_t$ into densely overlapping patches containing appearance and motion information and extract the features $f_t = \{f_t(i, j) \mid i = 1, 2, \ldots, h_l;\ j = 1, 2, \ldots, w_l\}$ from all the receptive fields at one time.
In general, the features learned by our convolutional VAE are further statistical representations of those learned by an ordinary convolutional auto-encoder, which means our features can represent the appearance and motion patterns that the receptive fields belong to rather than just some specific receptive field samples. On the other hand, according to the assumption and the corresponding constraint of the VAE, the elements of our feature vector tend to be independent and decoupled, compared with the features learned by a conventional auto-encoder.
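The size of the sub-region behind each feature vector follows from the kernel sizes and strides of the encoder. The sketch below computes it with the usual receptive-field recurrence, assuming square kernels, no dilation, and the layer list described earlier for the encoder; the function name `receptive_field` is illustrative and padding, which the text does not specify, is ignored because it does not change the field size.

```python
def receptive_field(layers):
    """Return (size, stride) of the receptive field of one output location.

    layers: list of (kernel, stride) pairs ordered from input to output.
    Assumes square kernels and no dilation.
    """
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride                # spacing of neighbouring outputs, in input pixels
    return size, jump

encoder = (
    [(7, 1)] * 3 + [(2, 2)]   # block 1: three 7x7 convolutions, then 2x2 max pooling
    + [(3, 1)] * 2            # block 2: two 3x3 convolutions
    + [(1, 1)] * 3            # block 3: three 1x1 convolutions
)
size, stride = receptive_field(encoder)
print(size, stride)  # prints "28 2"
```

Under these assumptions each feature vector summarizes a 28 × 28 pixel patch, and neighbouring feature vectors are 2 pixels apart, so adjacent receptive fields overlap heavily.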

Anomaly Detection and Localization
In the training phase, all the frames $\{I_t \mid t = 1, 2, \ldots, T\}$ and the frame series $\{\mathbf{I}_t \mid t = 3, 4, \ldots, T-2\}$ are free of abnormal events, and thus the features extracted from them are considered normal features. As mentioned above, the appearance and motion patterns in each receptive field of a video sequence $\mathbf{I}_t$ are represented by a feature vector $f_t(i, j) = [f_t(i, j, 1), f_t(i, j, 2), \ldots, f_t(i, j, c_l)]$ from the encoder of our fully convolutional network, where $(i, j)$ is the corresponding location on the feature map. To model all the normal features from the training frame series and to check whether upcoming features are abnormal or not, Gaussians are constructed to fit the normal features. Different from other works that used only one model to fit all the feature vectors, we construct different Gaussian models to fit the normal feature vectors extracted from each receptive field. Besides this, considering the residual correlation among the elements of a feature vector, a multivariate Gaussian is adopted to improve the accuracy. Specifically, for each location $(i, j)$ of the feature tensor, an exclusive multivariate Gaussian model $G_{ij}$ with $c_l$ variables is fitted to all normal feature vectors $\{f_t(i, j) \mid t = 3, 4, \ldots, T-2\}$ extracted from all frame series $\{\mathbf{I}_t \mid t = 3, 4, \ldots, T-2\}$, where $T$ is the number of normal frames: $G_{ij} = \mathcal{N}(\mu_{G_{ij}}, \Sigma_{G_{ij}})$, where $\mu_{G_{ij}} \in \mathbb{R}^{c_l}$ is the mean vector, $\Sigma_{G_{ij}} \in \mathbb{R}^{c_l \times c_l}$ is the covariance matrix and $c_l$ is the dimension of $f_t(i, j)$. Therefore, we define these Gaussian models $\{G_{ij} \mid i = 1, 2, \ldots, h_l;\ j = 1, 2, \ldots, w_l\}$ as our reference models for normal appearance and motion patterns.
In the anomaly detection phase, input sequences are first pushed into the encoder network, which extracts $h_l \times w_l$ feature vectors. These feature vectors are verified by the corresponding Gaussian models $G_{ij}$. Observations of low probability under these Gaussian models are declared anomalies. In our work, the log-likelihood of a feature vector $f^{\mathrm{test}}_t(i, j)$ under the Gaussian model $G_{ij}$ is used to measure the anomaly, so that the value of the abnormality map at location $(i, j)$ is the negative log-likelihood of the feature vector, $-\log p\bigl(f^{\mathrm{test}}_t(i, j) \mid \mu_{G_{ij}}, \Sigma_{G_{ij}}\bigr)$. In addition, the location of the anomalies can be mapped from the feature map layer back to the original frame. According to the convolution and pooling processes of the CNN, all the feature vectors $f^{\mathrm{test}}_t$ are extracted from the corresponding receptive fields, which overlap each other in the sequence $\mathbf{I}^{\mathrm{test}}_t$ for the $t$-th frame. As the kernels of the convolution and pooling processes have fixed sizes, the abnormality of each receptive field in the original $t$-th frame $I^{\mathrm{test}}_t$ can be mapped backward from the abnormality at each location of the abnormality map. For the regions where the receptive fields overlap in the original frame, an averaging operation is used to calculate the final anomaly score.
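Fitting one Gaussian per location amounts to computing a sample mean and a sample covariance along the time axis for every (i, j). The following NumPy sketch shows one way to do this; the function name `fit_gaussians` and the small `ridge` term are assumptions added for the example (the ridge keeps each covariance invertible and is not part of the description above).

```python
import numpy as np

def fit_gaussians(features, ridge=1e-6):
    """Fit one multivariate Gaussian per feature-map location.

    features: array of shape (N, h, w, c) holding the normal feature
              vectors f_t(i, j) of N training sequences.
    Returns means of shape (h, w, c) and covariances of shape (h, w, c, c).
    """
    features = np.asarray(features, dtype=np.float64)
    n, h, w, c = features.shape
    means = features.mean(axis=0)
    centered = features - means
    # Sum of outer products over the N samples, separately for every (i, j).
    covs = np.einsum("nhwa,nhwb->hwab", centered, centered) / max(n - 1, 1)
    covs += ridge * np.eye(c)   # hypothetical regulariser, see lead-in
    return means, covs

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 2, 3, 4))   # toy features: N=200, h=2, w=3, c=4
means, covs = fit_gaussians(feats)
```

Keeping a separate model per location lets the same appearance be normal in one part of the frame and abnormal in another, for example a pedestrian on the walkway versus on the grass.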
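The scoring and back-mapping steps can be sketched as follows. The negative log-likelihood is the standard expression for a multivariate Gaussian; the back-projection assumes that the receptive field of location (i, j) starts at pixel (i · stride, j · stride), which depends on padding details not given in the text. Both function names are illustrative.

```python
import numpy as np

def anomaly_map(test_features, means, covs):
    """Negative log-likelihood of each f_test(i, j) under its own Gaussian G_ij."""
    h, w, c = test_features.shape
    diff = test_features - means
    scores = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            _, logdet = np.linalg.slogdet(covs[i, j])
            maha = diff[i, j] @ np.linalg.solve(covs[i, j], diff[i, j])
            scores[i, j] = 0.5 * (maha + logdet + c * np.log(2.0 * np.pi))
    return scores

def project_to_frame(scores, frame_shape, rf_size, rf_stride):
    """Average the scores of all receptive fields that cover each pixel.

    Assumes the field of location (i, j) starts at pixel
    (i * rf_stride, j * rf_stride); fields are clipped at the frame border.
    """
    total = np.zeros(frame_shape)
    count = np.zeros(frame_shape)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            r0, c0 = i * rf_stride, j * rf_stride
            total[r0:r0 + rf_size, c0:c0 + rf_size] += scores[i, j]
            count[r0:r0 + rf_size, c0:c0 + rf_size] += 1
    return total / np.maximum(count, 1)   # pixels covered by no field stay at 0

# Toy check: two overlapping 2x2 fields on a 2x3 frame.
frame_scores = project_to_frame(np.array([[0.0, 2.0]]), (2, 3), rf_size=2, rf_stride=1)
```

In the toy check, the middle column is covered by both fields and therefore receives the average of their two scores.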

Experimental Results and Comparisons
We evaluate the performance of the proposed method mainly on a large scene surveillance dataset: the UCSD Anomaly Detection Dataset (http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm). The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd density in the walkways was variable, ranging from sparse to very crowded. The UCSD dataset includes two subsets, Ped1 and Ped2, each with a training set and a testing set. Specifically, Ped1 contains 34 training video sequences and 36 testing video sequences, with a frame resolution of 158 × 238 pixels. In Ped1, people walk towards and away from the camera, so foreshortening effects occur. Ped2 contains 16 training video sequences and 12 testing video sequences with pedestrian movement parallel to the camera plane, and its frame resolution is 240 × 360 pixels. All frames in the training set are normal and contain only pedestrians. In addition to normal frames, the testing set contains abnormal frames with bikers, skaters, small carts or people walking on the grass as anomalies. We also run an additional experiment on the ShanghaiTech Campus dataset (https://svip-lab.github.io/dataset/campus_dataset.html) to evaluate our method. The ShanghaiTech Campus dataset includes 13 scenes, each with complex lighting conditions and camera angles. The resolution of each color frame is 856 × 480 pixels. As in the UCSD dataset, all the training videos are normal and contain only pedestrians, and the testing frames for each scene contain abnormal events, such as bikers and people running and fighting.
All experiments are carried out on a dedicated GPU server with an Intel Xeon E5-2620 CPU running at 2.1 GHz, 128 GB of RAM and an Nvidia TITAN X GPU, running Ubuntu Mate 16.04. We use the PyTorch library, an open-source machine learning library for Python, to implement our anomaly detection architecture.

Visualization of the Feature Distribution
Given a sequence $\mathbf{I}$ as input, our VAE outputs a feature map of size $h_l \times w_l \times c_l$, which means we extract a $c_l$-dimensional feature vector at each location $(i, j)$. To observe the distribution of the feature vectors, we set the feature dimension $c_l$ to 3 as an example and select a fixed location $(i^*, j^*)$. Then we can scatter the feature vectors $[f(i^*, j^*, 1), f(i^*, j^*, 2), f(i^*, j^*, 3)]$, which are extracted from the same receptive field of both the training and testing sequence sets, in a three-dimensional grid.
As shown in Figure 3, all the feature vectors $[f(i^*, j^*, 1), f(i^*, j^*, 2), f(i^*, j^*, 3)]$ extracted from the same receptive field of the input series are plotted as scatter points in a three-dimensional coordinate system. The distribution of both the training and testing scatter points has an almost fusiform 3D shape, which is typical of a multivariate Gaussian distribution. Moreover, the contours of the projections on the three planes also indicate that the distribution of the scatter points is Gaussian. We then examine the relationship between the receptive patches and the feature points, taking three testing samples as examples. The feature vectors near the center come from the normal samples (the pedestrian and background) and from some anomaly samples that resemble them (the cyclist). The points far from the center come from the anomaly samples (the car). This distribution trend indicates that fitting the feature vectors with a Gaussian is feasible.

Qualitative and Quantitative Results
For the UCSD Anomaly Detection Dataset, the receiver operating characteristic (ROC) curves, the equal error rate (EER), and the area under curve (AUC) are used to compare our results with state-of-art methods. Two measures at frame level and pixel level are used, which are introduced in [32] and widely used in later works. In essence, both of the two measures focus on the anomaly of a frame. For frame level evaluation, a frame is considered an anomaly if at least one pixel is recognized as an anomaly, whether the recognition result is correct or wrong. For pixel level evaluation, a frame is considered to contain anomalies if at least 40% of anomaly ground truth are covered by the regions that are detected by the algorithm. In other words, the anomaly of a frame can be determined only when the anomaly object in it is accurately located.
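The two criteria can be expressed as simple per-frame decision rules. The sketch below assumes a per-pixel anomaly score map, a binary ground-truth mask, and a single threshold; the function names are hypothetical, and the 40% coverage value is the one stated in the text.

```python
import numpy as np

def frame_level_positive(score_map, threshold):
    """A frame is flagged as abnormal if at least one pixel exceeds the threshold."""
    return bool((np.asarray(score_map) > threshold).any())

def pixel_level_true_positive(score_map, gt_mask, threshold, min_cover=0.4):
    """An abnormal frame counts as correctly detected only if the detection
    covers at least min_cover of the ground-truth anomalous pixels."""
    gt_mask = np.asarray(gt_mask, dtype=bool)
    if not gt_mask.any():
        return False   # no anomaly in this frame, so no true positive is possible
    detected = np.asarray(score_map) > threshold
    covered = np.logical_and(detected, gt_mask).sum() / gt_mask.sum()
    return bool(covered >= min_cover)

scores = np.array([[0.9, 0.1], [0.1, 0.1]])
gt_large = np.array([[1, 1], [1, 0]])   # one of three anomalous pixels detected
gt_small = np.array([[1, 1], [0, 0]])   # one of two anomalous pixels detected
```

In this toy example the frame is a frame-level detection in both cases, but it is a pixel-level true positive only for `gt_small`, where half of the annotated region is covered.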
We compare our anomaly detection method with several methods. Specifically, we consider some classical methods that are widely cited as baselines for the UCSD Anomaly Detection Dataset, which include the sparse combination learning framework (SCLF) in [36], the mixture of probabilistic principal component analyzers (MPPCA) approach in [31], the social force model (SF) in [44] and their extension (SF+MPPCA) in [32], the mixture of dynamic textures (MDT) in [32] and the Adam method in [30]. In addition to these classical baselines, we also consider two state-of-the-art methods, the sparse reconstruction method in [37] and the Appearance and Motion DeepNet (AMDN) method in [21]. Figures 4 and 5 plot the ROC curves of the various algorithms for comparison. By varying the threshold parameter, we obtain a series of anomaly detection results and their corresponding false positive rates (FPR) and true positive rates (TPR). The ROC curve is then plotted from the series of coordinate points composed of the FPRs and TPRs. The ROC curves of the baseline methods are taken from the original papers (when available). The frame-level evaluation results show that our method outperforms most previous methods and yields competitive performance compared with the two state-of-the-art methods AMDN [21] and sparse reconstruction [37]. Moreover, in the pixel-level evaluation results, which reflect the accuracy of anomaly localization, our method outperforms all the competing approaches. In addition to the ROC curves, the evaluation criteria also include two numerical indices, AUC and EER at frame level and pixel level, and the results are presented in Figure 6 and Table 1. Note that a lower EER and a higher AUC indicate better performance. Compared with the sparse reconstruction method [37], which achieved an outstanding result without using deep learning, our method achieves only about a 1% AUC increase and the same EER for frame-level detection.
However, for pixel-level detection, our method achieves about a 6% AUC increase, which means our method locates the anomalies more accurately. Compared with the AMDN method [21], which had superior performance by using deep neural network based auto-encoders to learn feature representations, our method achieves about a 3% AUC increase and a 4% EER reduction for Ped1 frame-level detection, and about a 1.5% AUC increase and a 2% EER reduction for Ped2 frame-level detection. For pixel-level detection, our method achieves a 3% AUC increase.
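The threshold sweep behind the ROC curve, and the two summary indices derived from it, can be sketched as below. This is a generic illustration rather than the evaluation code used for the reported numbers: it assumes one anomaly score and one binary label per frame, that both classes are present, and it approximates the EER by the sampled operating point where the false positive rate is closest to the miss rate.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Sweep a threshold over the scores and return (fpr, tpr) arrays.

    scores: anomaly score per frame (higher means more anomalous).
    labels: 1 for abnormal frames, 0 for normal frames.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    tpr = np.array([(scores[labels] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    return fpr, tpr

def auc_and_eer(fpr, tpr):
    """Area under the ROC curve (trapezoid rule) and an approximate EER."""
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    # EER: operating point where the false positive rate equals the miss rate.
    idx = int(np.argmin(np.abs(fpr - (1.0 - tpr))))
    eer = float((fpr[idx] + 1.0 - tpr[idx]) / 2.0)
    return auc, eer

fpr, tpr = roc_curve_points([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
auc, eer = auc_and_eer(fpr, tpr)
```

For the perfectly separable toy scores above, the curve passes through the top-left corner, giving an AUC of 1 and an EER of 0.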
We report some examples of anomalies detected with our method on the UCSD dataset in Figure 7. Our detection results are marked in red and the manually labeled ground truth is marked in green. Besides the obvious single anomalies, such as the bikers and the vehicles in (a), (b), (f), and (g), our method also works well in other complex scenes. In the scenes (c), (h), and (i), two abnormal events occur simultaneously. In (d), as our method can learn the appearance and motion feature representations, the skater, who looks almost the same as a pedestrian, can also be detected. In (e) and (j), the anomalies are surrounded by normal pedestrians.
Some failure cases that impede the performance of our method are reported in Figure 8. In Figure 8a, a man is walking along the street with nothing, but a box suddenly appears in his hand at the end of the sequence, and our system wrongly detects him as an anomaly. In Ped1 sequence (b), a biker looks almost the same as a pedestrian from some particular angles, which causes our system to fail to detect it. In Ped2 sequence (c), the part of the biker that appears at the bottom of the frame is correctly detected as an anomaly. However, our system misses the biker that appears in the middle of the frame, since the biker is similar to the grass background and the edge of the bike is too thin to activate our system.
In addition to the UCSD Anomaly Detection Dataset, we also evaluate the performance of the proposed method on a newer color surveillance dataset, the ShanghaiTech Campus dataset. Since this dataset has not been widely used, we compare our anomaly detection method with two methods, Conv-AE [40] and FFP [6]. The area under the curve (AUC) summarizes the ROC curve as a scalar for performance evaluation. Following the work in [6], we use the frame-level AUC for performance evaluation.
The AUC values of these two methods are taken from FFP [6] and listed together with those of our method in Table 2. For the 13 different scenes in the dataset, we trained a separate model for each scene instead of a single model on all 13 scenes together as FFP did, because the definition of an anomaly differs between these scenes. We can see that our method outperforms the other two methods, which demonstrates the effectiveness of our method for color scenes.

Run-Time Analysis
We compare the running time of our method with the other approaches on the UCSD dataset. Different image resolutions affect the time required to process each frame. Specifically, the resolutions are 158 × 238 for the UCSD Ped1 dataset and 240 × 360 for the UCSD Ped2 dataset. Table 3 reports the average running time per frame during the test phase. Since the original implementations of the other methods are not publicly available, we report the running times taken from [21,36], specifying the working environment. Inevitably, the improvement in accuracy obtained with deep neural networks comes at the price of an increased computational cost. Compared with AMDN [21], which also adopted a deep architecture, our method is faster. The main reason is that our method benefits from a fully convolutional neural network that extracts the feature representations of all the receptive fields in the input frames at one time, whereas AMDN crops the input into patches and repeats the same feature extraction process for each patch. As shown in Table 3, with a GPU, our method is time-efficient for anomaly detection.

Conclusions
In this paper, we introduce a novel unsupervised learning approach for a video anomaly detection system based on convolutional auto-encoder architectures. We focus on the anomalies that occur in outdoor scenes, considering the challenging, publicly available UCSD anomaly detection datasets. The fundamental advantage of our approach is the use of a variational convolutional auto-encoder. On the one hand, our approach can extract features independently of the prior knowledge built into hand-crafted features (the input of our detection system is raw pixels) and dispenses with any object-level analysis, such as object detection and tracking. On the other hand, we omit the process of cropping the input into patches by relying on the convolution principle of the convolutional neural network, which makes our framework simple and clear. We demonstrate the effectiveness and robustness of the proposed approach, showing performance competitive with existing methods.
In fact, the approach we present for the anomaly detection system can be viewed as a baseline for using a variational auto-encoder to detect anomalies in surveillance video. Further research directions include enriching the input with more temporal and contextual information and combining the feature extraction with the final anomaly decision. Besides, we can learn from the deep neural network frameworks for object detection and classification tasks to design more sophisticated frameworks, in order to represent the multiple patterns in the input video.