Research on Anomaly Detection of Surveillance Video Based on Branch-Fusion Net and CSAM

As the monitor probes are used more and more widely these days, the task of detecting abnormal behaviors in surveillance videos has gained widespread attention. The generalization ability and parameter overhead of the model affect how accurate the detection result is. To deal with the poor generalization ability and high parameter overhead of the model in existing anomaly detection methods, we propose a three-dimensional multi-branch convolutional fusion network, named “Branch-Fusion Net”. The network is designed with a multi-branch structure not only to significantly reduce parameter overhead but also to improve the generalization ability by understanding the input feature map from different perspectives. To ignore useless features during the model training, we propose a simple yet effective Channel Spatial Attention Module (CSAM), which sequentially focuses attention on key channels and spatial feature regions to suppress useless features and enhance important features. We combine the Branch-Fusion Net and the CSAM as a local feature extraction network and use the Bi-Directional Gated Recurrent Unit (Bi-GRU) to extract global feature information. The experiments are validated on a self-built Crimes-mini dataset, and the accuracy of anomaly detection in surveillance videos reaches 93.55% on the test set. The result shows that the model proposed in the paper significantly improves the accuracy of anomaly detection in surveillance videos with low parameter overhead.


Introduction
To maintain public safety, more and more surveillance probes are deployed at road intersections, shopping centers, subway entrances, and other public locations, resulting in a flood of surveillance videos [1,2]. Law enforcement officials usually spend a lot of time watching surveillance videos to find crimes, but still frequently miss important details. People urgently need an algorithm that automatically detects abnormal behaviors in surveillance video to lessen the workload of law enforcement officials while also dealing with abnormal criminal behaviors faster and more correctly [3][4][5].
Video anomaly detection methods can be divided into four categories: reconstructionbased method, prediction-based method, classification-based method, and regression-based method. The reconstruction-based method works by training normal video data to generate a distribution representation of normal data [6,7]. The prediction-based method assumes that a continuous normal video has a regular contextual connection and that future frames can be predicted by learning the dependencies [8]. However, there are no such dependencies between frames in abnormal videos [9]. The classification-based method treats the detection of anomalies as a classification problem [10]. After learning multiple distribution patterns of normal samples, samples that do not follow these distribution patterns are classified as abnormal. The above three categories of methods are suitable for frame-level detection of video, but not for large-scale detection [11].
Therefore, video anomaly detection is mostly performed by the regression-based method. The method's idea is to use the anomaly score as an evaluation indicator. After Multiple-instance learning. The surveillance video is divided into a fixed number of video segments, and the segments are placed in positive and negative bags. Each video segment is called an instance in the bags [15]. Then, the feature extraction network is used to extract features from the fixed number of video clips in the two bags. Next, the multilayer perceptron is used to calculate the anomaly scores of the features. Finally, the scores are used to determine whether there is abnormal behavior.
To address the problems, we propose a feature extraction network named "Branch-Fusion Net". The multi-branch structure not only reduces parameter overhead but also allows the network to understand the input feature maps from different perspectives to improve the generalization ability.
The features extracted by networks usually contain features that are not relevant to the detection task [16]. To suppress useless features and enhance important features, we propose a simple yet effective Channel Spatial Attention Module (CSAM). The CSAM is an end-to-end generic module that can be seamlessly integrated into three-dimensional convolutional neural networks.
Since the convolutional neural networks are suitable for extracting local features [17], we combine the Branch-Fusion Net and the CSAM as the local feature extraction network and use the Bi-Directional Gated Recurrent Unit (Bi-GRU) for the global feature extraction.
The contributions of our work include: (1) We propose the Branch-Fusion Net, which not only greatly reduces parameter overhead, but also improves the generalization ability of feature extraction by understanding the input feature maps from multiple perspectives. The network achieves state-of-the-art performance for behavior recognition tasks on multiple benchmarks. (2) We propose a simple yet effective CSAM to ignore useless features during the model training. The CSAM focuses attention on key channels and spatial feature areas in turn. Adding the CSAM to the mainstream 3D convolutional neural networks can significantly improve the feature extraction effect. (3) We establish a surveillance video dataset called 'Crimes-mini' containing five categories of abnormal behaviors. The model we propose achieves the detection accuracy of 93.55% on the test set.

Anomaly Detection Methods
There are four methods to deal with anomaly detection, including the reconstructionbased method, prediction-based method, classification-based method, and regression-based method. Table 1 shows the advantages and disadvantages of the four methods.

Category Judgment Basis Advantages Disadvantages
Reconstructionbased method Small reconstruction errors for normal video frames and large reconstruction errors for abnormal video frames.
High accuracy in detecting location-related anomalies.
There are reconstruction errors for anomalous video frames.

Prediction-based method
Small prediction errors for normal video frames and large prediction errors for abnormal video frames.
High accuracy in detecting motion-related anomalies.
It ignores the fact that normal video frames can be unpredictable.

Classification-based method
The samples that do not follow the normal sample distribution are considered abnormal.
The distribution of normal samples can be well-learned.
If the normal sample distribution is complex, the classification model may fail.

Regression-based method
The samples with anomaly scores above the threshold are considered abnormal.
The model is simple and suitable for large-scale video anomaly detection.
The threshold for video anomalies is not easy to determine.
Reconstruction-based method. Hasan et al. propose two Auto-Encoder-based methods that can perform anomaly detection without supervision [18], but the abnormal samples are sometimes subject to reconstruction errors. To address the shortcoming, Gong et al.
propose an improved Auto-Encoder [19], named the "Memory-augmented Auto-Encoder (MemAE)". The method obtains the encoding from the encoder based on the input, and then uses it as a query to retrieve the most relevant memory item for reconstruction. To better remember normal samples, Park et al. propose a memory module that can be updated according to the scheme [20].
Prediction-based method. Medel et al. propose the Convolutional Long Short-Term Memory Network (Conv-LSTM) [21], which incorporates convolution operation into the Long Short-Term Memory Network (LSTM). Anomaly detection is achieved by reconstructing past frames and predicting future frames. Lu et al. combine the Variational Auto-Encoder (VAE) with the Conv-LSTM to propose the Convolutional Variational Recurrent Neural Network (Conv-VRNN) for generating future frames [22]. To address the possible blurring of future frames, Mathieu et al. [23] use alternating convolution and Rectified Linear Unit (RLU) to generate future frames and propose a method to fuse multiple feature learning strategies to generate clear future frames. Ye et al. propose a novel deep Predictive Coding Network (AnoPCN) [24], which implements anomaly detection by unifying reconstruction and prediction methods into a framework.
Classification-based method. Sabokrou et al. [25] propose a single classification-based video anomaly detection method inspired by the Generative Adversarial Network (GAN) for training models. Based on Sabokrou's study, Wu et al. propose a deep single classification neural network that is capable of obtaining compact single-class classifiers considering only normal samples [26]. Xu et al. propose an adaptive intra-frame classification network [27]. The network extracts and classifies the input appearance and motion features into several sub-regions, and then classifies the sub-regions. If the test classification result of the sub-region is different from the true classification, the video is considered abnormal.
Regression-based method. Sultani et al. propose an anomaly detection method based on multiple-instance learning by scoring abnormal and normal video clips and then picking out the clips with the highest anomaly scores to train the model [28]. Since the hinge loss function used for multiple-instance learning training is not smooth, Kamoona et al. propose a loss function to make the model robust to output anomaly scores [29]. Zhu et al. output anomaly scores by feeding the calculated optical flow into a sequence-enhancement network [30].
Among the four methods, only the regression-based method is suitable for dealing with large-scale video data, which can be applied in practical scenarios. Therefore, we use multiple-instance learning of the regression-based method for anomaly detection. With the multiple-instance learning for anomaly detection, how to efficiently extract the features is crucial to the detection result.

Feature Extraction Network for Video
The networks for feature extraction in videos can be divided into a two-stream network architecture and a three-dimensional convolutional neural network architecture.
The three-dimensional convolutional neural network architecture was firstly proposed in [12]. The 3D convolution adds a temporal dimension to the 2D convolution to extract features in both temporal and spatial dimensions. The output of the 3D convolution is still a 3D feature map. Specifically, the 3D convolution moves in three dimensions: height, width, and depth, for multiple video frames. At each position, element-by-element multiplication and addition provide a value. Since the filter is sliding through 3D space, the output values are also arranged in 3D space. The C3D network is shown in Figure 2, and the size of each layer is shown in Table 2.

Layer Name
Size Stride The three-dimensional convolutional neural network architecture can directly utilize three-dimensional convolution to extract spatiotemporal features, so the network structure is simple. However, the increase in the dimensionality of 3D convolution leads to a large number of parameters in the network.
The two-stream architecture (as shown in Figure 3) was firstly proposed in [31]. Spatial Stream Convnet and Temporal Stream Convnet are used to extract the spatial and temporal features of the video, respectively, and finally, the two networks are fused by late fusion. Specifically, the input of Spatial Stream Convnet is a single video frame, which is responsible for extracting the appearance features of the frames. Multiple optical stream frames are input into Temporal Stream Convnet. The addition of an optical stream makes it easier to capture motion information in the neural network, and directly provides the model with motion information between frames. However, the optical flow needs to be pre-extracted, and it leads to the spatiotemporal feature extraction being time-consuming.

The Network Architecture
To introduce our model, we show the architecture in Figure 4, and a summary of the size of each stage is shown in Table 3.  Table 3. The size of each stage in our model.

Stage
Output Size The Size of the Stage The Branch-Fusion Net consists of five stages: Conv1, Conv2, Conv3, Conv4, and Conv5. We combined the Branch-Fusion Net and the CSAM as the local feature extrac-tion network and used the Bi-Directional Gated Recurrent Unit (Bi-GRU) for the global feature extraction.

Branch-Fusion Net
We noted that group convolution not only reduces the parameter overhead but also understands the input features from different perspectives after dividing different channels into groups, so we proposed the multi-branch structure based on group convolution. We firstly introduce how group convolution reduces the parameter overhead.
For the feature map of C × D × H × W, we can generate N × D × H × W by using N convolution kernels of C × d × h × w (assuming that the image size remains the same after pooling). Then, we assumed that the size of the input feature map was still C × D × H × W (Channel × Depth × Height × Width) and the number of output feature maps was N. If we divide it into 2 groups (as shown in Figure 5), the size of the input feature map is C/2 × D × H × W for each group, and the number of output feature maps of each group is N/2. The size of each convolution kernel is C/2 × d × h × w. The number of convolutional kernels in each group is N/2. The kernels only convolve with the input feature map of the same group. Therefore, the total number of convolutional kernels was N × C/2 × d × h × w. It can be seen that the total number of parameters was reduced to 1/2 of the original. When the number of groups is G, the number of parameters is reduced to 1/G. Since each group in group convolution understands the input features independently and information does not circulate among different groups, this leads to the fact that each branch will understand the features from a different perspective.
Since group convolution groups input features from channel dimensions, which is not conducive to extracting global channel features, we improved group convolution by proposing the multi-branch structure, as shown in Figure 6. The first convolutional layer consisted of 128(4 × 32) convolutional kernels of 1 × 1 × 1. The number of convolution kernels was greatly reduced compared to the number of channels of input features. The size of the input feature map of the Branch-Fusion Net block was 256 × 16 × 64 × 64 (Channel × Depth × Height × Width), and the output feature map of each group was 4 × 16 × 64 × 64. We used 1 × 1 × 1 convolution for global feature preservation and completed the dimensionality reduction by making the total number of all grouped convolution kernels smaller than the number of channels in the input feature map, which greatly reduced the number of parameters of the network. The second layer is the group convolution layer, which consisted of 128(4 × 32) convolution kernels of 3 × 3 × 3. The number of channels in the input feature map is the same as the number of convolution kernels. In this layer, we performed group convolution of the input feature maps with branches as groups. The feature map of 128 × 16 × 64 × 64 was divided into 32 groups from the channel dimension, and each group had a feature map of 4 × 16 × 64 × 64. Each group of input features was convolved by 3 × 3 × 3 to obtain an output feature map of 4 × 16 × 64 × 64. By group convolution, we further reduced the number of parameters of the network and improved the generalization ability of the network by understanding the input feature from multiple perspectives.

Channel-Spatial Attention Module
The CSAM includes the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), as shown in Figure 7. The CAM is used to focus attention on the channels that have a greater impact on the final result at different times, and the SAM is used to focus attention on the regions of temporal and spatial features that are favorable for classification, which is complementary to the CAM. For the intermediate feature map of three-dimensional convolutional neural networks, the CSAM generates two attention maps along two independent dimensions, the channel and spatial, and then multiplies the attention by the input feature maps to perform adaptive feature refinement on the input feature maps.

Channel Attention Module
In the convolutional neural network, each convolutional kernel has a different impact on the features. We used channel attention to focus our attention on the channel that has a great impact on the final result at different times. The CAM consists of two pooling layers, a multilayer perceptron, and an activation function. The structure is shown in Figure 8. In Figure 8, denotes the bitwise summation and denotes the sigmoid activation function. Firstly, the spatiotemporal information of the input features was aggregated using three-dimensional average pooling and three-dimensional maximum pooling to generate two different spatiotemporal context descriptors. Then, the descriptors were fed into a multilayer perceptron with shared weights to obtain two feature maps. Finally, the two feature maps were summed element-by-element and activated by the sigmoid function to obtain the final channel attention weights. The CAM is computed as: where MLP denotes the two-layer neural network and the sigmoid activation function. Since the input was a video sequence containing temporal information, the CAM was three-dimensional. The input was changed from (batchsize, channel, height, width) to (batchsize, channel, time_sequential, height, width).

Spatial Attention Module
The SAM was designed to focus attention on regions of spatiotemporal features that are favorable for classification, and it complements the CAM. The SAM consists of two pooling layers, a convolutional layer, and an activation function. The structure is shown in Figure 9.

Spatial Attention Ms
Channel-refined feature FC ×T×H×W 1×T×H×W
Firstly, the channel information of the input features was aggregated using average pooling and maximum pooling to generate two different channel context descriptors. Then, the two different channel context descriptors were stitched together, and the information was aggregated by the convolutional layer with the convolutional kernel of 7 × 7 denotes the sigmoid activation function. Firstly, the spatiotemporal information of the input features was aggregated using three-dimensional average pooling and three-dimensional maximum pooling to generate two different spatiotemporal context descriptors. Then, the descriptors were fed into a multilayer perceptron with shared weights to obtain two feature maps. Finally, the two feature maps were summed element-by-element and activated by the sigmoid function to obtain the final channel attention weights. The CAM is computed as: where MLP denotes the two-layer neural network and the sigmoid activation function.
Since the input was a video sequence containing temporal information, the CAM was three-dimensional. The input was changed from (batchsize, channel, height, width) to (batchsize, channel, time_sequential, height, width).

Spatial Attention Module
The SAM was designed to focus attention on regions of spatiotemporal features that are favorable for classification, and it complements the CAM. The SAM consists of two pooling layers, a convolutional layer, and an activation function. The structure is shown in Figure 9. Firstly, the channel information of the input features was aggregated using average pooling and maximum pooling to generate two different channel context descriptors. Then, the two different channel context descriptors were stitched together, and the information was aggregated by the convolutional layer with the convolutional kernel of 7 × 7 × 7. Finally, the attention weights were obtained by activating the sigmoid function. The spatial attention is calculated as: where f 7×7×7 denotes the convolution kernel of 7 × 7 × 7 and σ denotes the sigmoid activation function. Similar to the CAM, the average pooling, maximum pooling, and convolution in spatial attention were three-dimensional.

Bi-GRU
We first introduce the Gated Recurrent Unit (GRU), as shown in Figure 10b. The calculation process in the GRU is summarized as follows: The Gated Recurrent Unit (GRU) uses two gate functions to ensure that important features are not lost after long-term propagation. z t is the update gate responsible for controlling the amount of data that can be saved to the current moment in the forward parameter information, and r t is a reset gate to control how much of the past information to forget.
The idea of the Bi-Directional Gated Recurrent Unit (Bi-GRU) is to ensure the features obtained at time t have both the past and the future information. The network was divided into two independent GRUs, and the input sequences were input to the two GRUs in forward and reverse order for feature extraction. The two output vectors were stitched together to form the final feature representation. The structure of the Bi-GRU is shown in Figure 10a.

Dataset
UCF-Crimes is a large dataset containing 128 h of surveillance video. It includes 1900 surveillance videos, divided into 14 categories of behavior. There exist some serious problems in the dataset, such as the low resolution and repetition of some video frames.
In this paper, we selected five categories in UCF-Crimes: arson, burglary, explosion, road accident, and stealing, and established the Crimes-mini dataset after the preprocessing operations, including cropping, de-duplication, regrouping, and expansion. The training set, validation set, and test set were randomly selected in the ratio of 6:2:2. Some of the sample images in the dataset are shown in Figure 11. The number of videos of each anomaly in our dataset is shown in Table 4.

Evaluation Index
In this paper, we chose the accuracy rate as the important evaluation metric for our experiments. Since the overall performance of the model is also critical, we considered parameter overhead and F1 values in our analysis. The accuracy and F1 values are shown in Figure 12. Accuracy is calculated as: F1 value is calculated as: Recall is calculated as: Precise is calculated as: Parameter overhead is used to measure the size of the model. The lower the parameter overhead, the fewer parameters required to save the model. Parameter overhead of the 3D convolutional layer is calculated as: where C 0 denotes the number of output channels, C i denotes the number of input channels, k w denotes the convolutional kernel width, k h denotes the convolutional kernel height, k d denotes the length of the convolutional kernel in the time dimension, +1 denotes bias, brackets denote the number of parameters of a convolutional kernel, and C 0 denotes that the layer has C 0 convolutional kernels. Parameter overhead of the fully connected layer is calculated as: where I × O denotes the number of weights of O layers, and the number of bias is O.

Hyperparameter Setting
The setting of hyperparameters is critical, and it directly affects how well the model is trained. Therefore, we experimented with the setting of hyperparameters on the validation set.

Number of Branches
We used the Branch-Fusion Net as the local feature extraction network. The generalization ability of the network directly affects our detection accuracy, and the number of branches determines the generalization ability. Therefore, we set the number of branches to 16, 24, 32, and 48, and analyzed the effects of different branch numbers on the generalization ability. The experimental result for different branch numbers is shown in Figure 13. The experimental result shows that the detection accuracy was the highest when the number of branches was 32. When the number of branches was 16 or 24, the lack of generalization ability of the network led to low accuracy due to the small number of branches. When the number was 48, the accuracy was slightly lower than 32 branches, and the parameter overhead was the largest. Therefore, we chose 32 as the number of branches.

Learning Rate Setting
The learning rate is an important hyperparameter when we train the model. For the gradient descent method, the choice of learning rate is critical. If it is too large, it will cause the model to fail to converge, and if it is too small, it will converge too slowly. We conducted experiments at different learning rates, and the experimental result is shown in Figure 14. As can be seen from the figure, when the number of branches was 32, the F1 value was generally higher than in the other three cases, which is consistent with the results of our previous experiment. The accuracy rate and F1 value reached the maximum when the number of branches was 32 and the learning rate was 0.0005, which were 93.22% and 89.87%, respectively.

Number of Layers of Bi-GRU
We used the Bi-GRU to accomplish the global feature extraction, so its performance is also critical to the overall model. We trained Bi-GRU on the test set for 60 rounds in different layers. The result is shown in Figure 15.
The experimental results show that global features extracted using the single-layer network are not enough, while the features extracted by the three layers are too abstract. Both will interfere with the subsequent training and make it less effective. When Bi-GRU used two layers, it could effectively complete the extraction of global complex information.

Structure Setting of CSAM
The CSAM includes two modules, the CAM and the SAM. To find the optimal connection for these two modules, we experimented with behavior recognition tasks with different connections using Branch-Fusion Net on the test set, and the experimental result is shown in Table 5.
According to the data in the table, adding either the CAM or the SAM to Branch-Fusion Net alone will improve the performance of the network, but the combination of the two is more effective than using individual modules. After connecting the two in series and parallel, we found that the "CAM + SAM" series connection was the most effective.

Comparison Experiments with Branch-Fusion Net
To evaluate the generalization capability of Branch-Fusion Net, we reproduced the mainstream feature extraction networks on the multiple benchmarks (UCF-101, HMDB51, and Kinetics) to conduct a comparison experiment for the behavior recognition task [32]. The Kinetics dataset is much larger than the UCF-101 dataset and the HMDB51 dataset, so it is more challenging. Therefore, we removed the networks (two-stream, C3D, and R3D) with poor feature extraction ability on the UCF-101 dataset and HMDB51 dataset, and then showed the TOP-1 accuracy and TOP-5 accuracy of the remaining networks on the Kinetics dataset. The experimental result is shown in Tables 6 and 7. Table 6. Accuracy (%) comparison on UCF-101 and HMDB datasets.

Ablation Experiment of CSAM
The suppression of useless features and the highlighting of useful features will be reflected in the feature representation capability of the network with the CSAM added, and we chose the main 3D convolutional neural networks for the ablation experiment. The result is shown in Table 8. The experimental result shows that the parameter overhead hardly increased after adding the CSAM to the networks, but there was a significant improvement in the accuracy rate, which indicates that the CSAM outperformed all the baselines without bells and whistles, demonstrating the general applicability of the CSAM across different architectures.

Comparison Experiment with Our Model
To evaluate the performance of our models (the Branch-Fusion Net with CSAM for the local feature extraction network and the Bi-GRU for the global feature extraction network), we conducted a comparison experiment with good feature extraction networks (P3D [43], R3D, and R (2 + 1)D) on the test set. The experimental result is shown in Figure 16. The experimental results show that our proposed model achieved the highest accuracy with the lowest parameter overhead. The low parameter overhead indicates that the model requires low memory, and the 93.55% accuracy indicates that our proposed model can adequately extract the key features in abnormal videos.

Conclusions
We proposed a model based on multiple-instance learning, which included the local feature extraction network and the global feature extraction network. For the local feature extraction network, we proposed the Branch-Fusion Net to reduce parameter overhead and improve the generalization ability at the same time. To prevent useless features from interfering with the model training, we proposed the CSAM to suppress useless features and enhance important features. For the global feature extraction network, we used a twolayer Bi-GRU to complete the global feature extraction. To make the model more suitable for mobile devices, we will reduce the parameter overhead using model compression techniques in the future.