Applied Sciences
  • Article
  • Open Access

25 December 2022

Crowd Density Level Estimation and Anomaly Detection Using Multicolumn Multistage Bilinear Convolution Attention Network (MCMS-BCNN-Attention)

1 Department of Computer Science, Xiamen University, Xiamen 361005, China
2 Wayamba University of Sri Lanka, Kuliyapitiya 60200, Sri Lanka
* Author to whom correspondence should be addressed.
This article belongs to the Section Computing and Artificial Intelligence

Abstract

The detection of crowd density levels and anomalies is a hot topic in video surveillance, especially for human-centric actions and activity-based movements. In some respects, a variation in the density level is itself considered an anomaly in an event. Crowd behaviour identification relies on a computer-vision-based approach and basically deals with the spatial information of the foreground video content. In this work, we focused on a deep-learning-based, attention-oriented classification system for identifying several basic movements in public places, especially human flock movement, sudden motion changes and panic events, in several indoor and outdoor settings. The important spatial features were extracted by a bilinear CNN and a multicolumn multistage CNN from morphologically preprocessed video frames. Finally, abnormal events and crowd density levels were distinguished by combining an attention feature with multilayer CNN features and modifying the fully connected layer for several categories (binary and multiclass). We validated the proposed method on several video surveillance datasets, including PETS2009, UMN and UCSD. The proposed method achieved accuracies of 98.62%, 98.95%, 96.97%, 99.10% and 98.38% on the UCSD Ped1, UCSD Ped2, PETS2009, UMN Plaza1 and UMN Plaza2 datasets, respectively, with the different pretrained models. We compared the performance of recent approaches with that of the proposed method (MCMS-BCNN-Attention), which achieved the highest accuracy. The anomaly detection performance on the UMN and PETS2009 datasets was compared with that of state-of-the-art methods and achieved the best AUC results of 0.9953 and 1.00, respectively, with a binary classification.

1. Introduction

One of the key security issues during public events is how to accurately determine activity when people are moving and congregating. Congestion level estimation and anomaly detection are also extremely difficult tasks in video surveillance. With the expansion of human social activities, surveillance has become a very important and difficult task in recent years. In this work, we mainly focused on several events in indoor and outdoor environments, some of which occurred in low-light conditions and contained shadows and large occlusions. An efficient crowd behaviour analysis is required to identify the abnormality of a crowd gathering and to understand its behaviour by analysing public surveillance videos. The analysed information is very useful for crowd management in public spaces based on predefined behaviours.
In recent years, however, surveillance equipment has advanced significantly in order to improve the quality of the obtained spatial information. Event spot analysis is still a difficult task due to the long viewing distance and the low resolution of surveillance videos; the videos tested in this work, at a resolution of roughly 299 × 299 pixels with heavy occlusion, are of comparatively low quality. In principle, surveillance can be performed in real time, or the analysis can be left for later in the case of a special investigation. However, both methods require a person to analyse the video, which can lead to visual fatigue and psychological strain. Thus, the ability to interact with instances and to react quickly to unexpected situations, such as an abnormal event or an abnormal cluster of people, has a great impact on the task of video surveillance.
Technically, an anomaly can be described as an unusual movement of a person or a cluster, or a sudden convergence/divergence in a certain area of interest. In addition, crowd density level analysis measures anomalies for traffic forecasting in surveillance video. Crowd analysis is mainly divided into several subtasks, such as density estimation, abnormal activity detection, crime scene analysis, crowd counting, event understanding and many more. Group activity detection and event understanding need highly refined spatial–temporal model information to understand behaviours. The basic methods for crowd behaviour understanding rely on conventional spatial–temporal features such as handcrafted and convolution kernel features. Figure 1 shows the basic workflow of our proposed approach to understanding crowd behaviour in public environments. The main contribution of the model is an efficient deep learning feature extraction obtained by modifying existing conventional models to understand crowd congestion levels in public areas. The manually annotated congestion levels are described briefly in Table 1. The congestion level in a scene is closely tied to understanding abnormal activity in the crowd.
Figure 1. The network architecture of the proposed model.
Table 1. Bilinear attention network’s parameter information (densenet121).
In this paper, we introduce a model that extracts spatial information to detect basic activity and crowd density from attention-based feature classification with a multiclass CNN. There are several ways to understand the crowd density in a video sequence. In our approach, a sequence was divided into five congestion levels and tested on publicly available datasets described in the experiments and results section. The congestion levels were categorized from low to high as very low (VL), low (L), medium (M), high (H) and very high (VH). The proposed model was developed by fine-tuning a densenet121 [1] and an Efficientnetv2 [2] network and a multicolumn [3] CNN architecture to extract multichannel features from a single image instance.
The bilinear CNN was based on the simultaneous extraction of different object features (with different kernel sizes) from the same instance; at the same time, a multicolumn CNN applied different filters to extract dense features from the same image. The parallel features were gathered before the fully connected layer by combining the outer products of the feature matrices. Dense object features were extracted simultaneously by using various convolution filters to create a density map by combining each channel. The final outcome was trained with a fusion network followed by a fully connected layer to classify the final categories. The total training loss was calculated during training by summing each network's partial loss. The loss function could be optimized with the same gradient descent optimization as backpropagation in training neural networks. The filters with different kernel sizes could effectively extract multiple features with a bilinear attention network to improve the foreground and background information of the scene features at the same time. The conventional multistage- and multicolumn-generated features missed more important features in a given image than the combined attention features of the proposed model. Therefore, a significant classification improvement was achieved, as discussed in the results analysis section. Early stopping was used in the model training strategy to achieve better outcomes, and the fused features experienced faster training convergence than with traditional techniques.

Contribution of the Work

The multistage and multicolumn layers improved spatial information alongside the attention features. In this approach, we introduced several modifications to improve the deep features extracted from spatial information. The process of building the novel bilinear attention feature vector and the fusion network is as follows. The proposed model introduces three main parallel feature extraction paths to improve features at different levels. First, bilinear CNN pooling networks extract different features at different depth levels; bilinear feature extraction models therefore extract richer (deeper) features than normal convolution networks. Our approach used a dual-channel feature extraction model (transfer learning) based on densenet121 and Efficientnetv2 to generate the bilinear pooling matrix for a given image. The generated bilinear feature matrix was used to calculate the attention score of each point by multiplying it with the activation weight vector. Finally, we obtained the attention feature vector by aggregating all column elements together to generate a bilinear attention feature vector. Second, the single-image multicolumn network was modified with a convolution layer and a fully connected layer to generate a feature vector instead of a feature map. The generated density feature vector and the subsampled multistage convolution feature vector were concatenated in parallel with the bilinear attention feature vector. Finally, the two streams' flattened feature vectors were fed into a fully connected fusion network with a softmax classification.
Experiments were conducted at different density levels using benchmarks, including PETS 2009, UMN and UCSD. The training and validation outcomes performed best, as mentioned in the results section, and were also compared with state-of-the-art algorithms.

3. Proposed Work

The proposed model consisted of three main parallel feature extraction stages to improve the spatial feature information of a given image. We used existing crowd density estimation models from state-of-the-art methods, namely, a multicolumn convolution neural network (MCNN) [3] and a multistage convolution neural network (MSCNN) [28]. Both the MCNN's and the MSCNN's internal parameters and convolution layers were modified to achieve optimum performance when training the model. The densenet121 [1] architecture was used as the bilinear convolution for attention feature extraction, acting as the transfer model's feature extractor. The transfer learning [29] method is widely used for classification tasks to reuse an existing pretrained model, by freezing or changing some parameters in the model as well as tuning the final fully connected layer to classify data into different categories. For our purpose, we used a frozen densenet121 architecture as the feature extraction model and later fed its output to the attention model to build an attention feature vector from a bilinear CNN (BCNN) [30] feature matrix.
BCNN models have been used efficiently in various fine-grained image classification tasks in recent years. Basically, the bilinear model operates with parallel convolution streams that extract different features at different levels simultaneously, gathering deep spatial information from an image to generate a pooled bilinear matrix for visual recognition.
For this work, we used the same model (densenet121) for both streams' linear models to build the outer product and the bilinear matrix, and we flattened the feature vectors and passed them to the attention network to generate an attention vector. The two streams' extracted features are represented as $F_a$ and $F_b$:
$F_a \in \mathbb{R}^{W \times H \times M}$ (1)
$F_b \in \mathbb{R}^{W \times H \times N}$ (2)
(where M and N represent the feature dimensions (vector lengths) and W and H represent the width and height of the feature maps.) Reshaping the extracted features $F_a$ and $F_b$, the resultant bilinear feature matrix is represented as X:
$X = F_a^{T} F_b$ (3)
where $X \in \mathbb{R}^{M \times N}$ is the outer product of the two streams' features, consisting of the pairwise interactions of the two feature vectors. The extracted features were then flattened and fed to the attention function to generate a softmax-weighted statistical feature vector. Before the attention operation, the bilinear matrix was passed through a signed square root to improve backpropagation [30]:
$X \leftarrow \mathrm{sign}(X)\,\sqrt{|X|}$ (4)
$X \leftarrow X / \|X\|_2 \quad (\ell_2\ \text{normalization})$ (5)
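To make this step concrete, the following is a minimal PyTorch sketch of the bilinear pooling in Equations (1)–(5), assuming frozen torchvision densenet121 backbones for both streams (Efficientnetv2 could be swapped in for one stream); layer choices and tensor shapes are illustrative rather than the exact configuration reported in Table 1.

```python
import torch
import torch.nn as nn
from torchvision import models

class BilinearPooling(nn.Module):
    """Outer-product (bilinear) pooling of two frozen CNN streams, followed by
    the signed square root and L2 normalization of Equations (1)-(5)."""
    def __init__(self):
        super().__init__()
        # Both streams use a frozen densenet121 feature extractor here;
        # efficientnet_v2_s could be substituted for one or both streams.
        self.stream_a = models.densenet121(weights="DEFAULT").features
        self.stream_b = models.densenet121(weights="DEFAULT").features
        for p in self.parameters():
            p.requires_grad = False           # frozen transfer-learning backbones

    def forward(self, x):
        fa = self.stream_a(x)                 # (B, M, H, W) feature map, M channels
        fb = self.stream_b(x)                 # (B, N, H, W) feature map, N channels
        b, m, h, w = fa.shape
        n = fb.shape[1]
        fa = fa.reshape(b, m, h * w)          # reshape to (B, M, WH)
        fb = fb.reshape(b, n, h * w)          # reshape to (B, N, WH)
        X = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)       # X = Fa^T Fb, (B, M, N)
        X = torch.sign(X) * torch.sqrt(torch.abs(X) + 1e-8)   # signed square root
        X = nn.functional.normalize(X.reshape(b, -1), dim=1)  # L2 normalization
        return X.reshape(b, m, n)
```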

3.1. Dense Block

When performing a conventional CNN convolution operation, a vanishing gradient problem arises due to the depth of the network. To mitigate the vanishing gradient, the feature maps of the preceding layers are concatenated with the current layer's output, which strengthens the gradient flow and significantly reduces the number of network parameters; this design is explained in the densenet121 architecture [1]. EfficientNetV2 [2] is a newer family of convolutional networks that trains faster and has better parameter efficiency than comparable modern models; its architecture was found by searching a space enriched with new operations such as Fused-MBConv.

3.2. Bilinear Attention Feature Vector (Attention over BCNN Vector)

With an activation weight vector $F_r$, the attention score was:
$\mathrm{AttentionScore}\ (att_s) = \mathrm{softmax}(\mathrm{matmul}(X, F_r))$ (6)
After normalizing the attention scores, the bilinear matrix weighted by the attention scores was summed to obtain the attention feature vector:
$\mathrm{BilinearAttentionFeatureVector} = \mathrm{Sum}(X \cdot att_s)$ (7)
The attention feature vector was calculated by multiplying the input BCNN matrix by the attention scores computed over it. The flow diagram and model parameters are shown in Figure 2 and Table 1, which illustrate the main process from the BCNN matrix to the generation of the summed, normalized attention vector. Finally, the generated BCNN attention feature vector was fused with the modified MCNN kernel feature vector and the multistage CNN features. The modified MSCNN and MCNN architectures are described in Table 2 and Table 3, respectively.
Figure 2. Bilinear attention flow diagram.
Table 2. Multistage network parameter information.
Table 3. Multicolumn network parameter information.
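A hedged sketch of the attention step in Equations (6) and (7), assuming the bilinear matrix X from the previous sketch and a learnable activation weight vector $F_r$ with one weight per column of X; the exact shape of $F_r$ is an assumption.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Attention over the bilinear matrix X (Equations (6) and (7))."""
    def __init__(self, n_channels):
        super().__init__()
        # Activation weight vector F_r (assumed: one learnable weight per column of X)
        self.f_r = nn.Parameter(torch.randn(n_channels, 1) * 0.01)

    def forward(self, x):                                 # x: (B, M, N) bilinear matrix
        scores = torch.softmax(x @ self.f_r, dim=1)       # att_s = softmax(matmul(X, F_r))
        weighted = x * scores                             # weight X by its attention scores
        return weighted.sum(dim=1)                        # aggregate rows -> (B, N) attention vector
```

In the proposed pipeline, this vector is then concatenated with the MCNN and MSCNN feature vectors (Equations (8) and (9)).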
The MCNN architecture extracted different kernel features from the same image and fed the pooled features to fully connected layers to reduce dimensionality. The details of the modified MCNN architecture are given in Table 3. The resultant feature vector size was limited to 512 before the fusion with the BCNN attention feature vector.
In the MSCNN model architecture, the same input image was simultaneously fed through a feed-forward CNN with multistage feature aggregation. The proposed model improved the MSCNN for more efficient feature extraction. The modified multistage network architecture is shown in Figure 3. The first stage of the MSCNN consisted of the first and second convolution operations (Conv1, Conv2); the resultant output was then concatenated with the second-stage convolution outputs (Conv3, Conv4). The MSCNN model improved the feature information and reduced vanishing gradient issues in the network.
Figure 3. Modified MSCNN architecture.
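The following is a minimal sketch of this multistage aggregation (first-stage Conv1–Conv2 output concatenated with the second-stage Conv3–Conv4 output and reduced to a feature vector); channel counts and kernel sizes are illustrative, the actual values being those in Table 2.

```python
import torch
import torch.nn as nn

class MSCNNBranch(nn.Module):
    """Two-stage CNN whose stage outputs are concatenated (illustrative channels)."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.stage1 = nn.Sequential(                     # Conv1 + Conv2
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(                     # Conv3 + Conv4
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.fc = nn.Linear((32 + 64) * 4 * 4, out_dim)  # feature vector, not a map

    def forward(self, x):
        s1 = self.stage1(x)
        s2 = self.stage2(s1)
        fused = torch.cat([s1, s2], dim=1)               # concatenate stage outputs
        return self.fc(self.pool(fused).flatten(1))
```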

3.3. Proposed Model Architecture

Both MCNN and MSCNN parallel streams’ feature vectors were concatenated with the BCNN attention vector before reaching the fully connected network. The process of bilinear model feature extraction and the operations of the bilinear attention feature vector generation are explained in Equations (1)–(7). The multicolumn (MC) and multistage (MS) feature vectors were concatenated with the bilinear attention feature vectors as:
$C_{MC} = [\, f_{MCNN},\ f_{BCNN\text{-}att} \,]$ (8)
$C_{MS} = [\, f_{MS},\ f_{BCNN\text{-}att} \,]$ (9)
where C and f represent concatenation and generated feature vectors, respectively. Both concatenated feature vectors have the same dimensions, as shown in Figure 1.
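A small sketch of the two concatenations in Equations (8) and (9), assuming each branch already yields a flattened 512-dimensional feature vector (the dimension is illustrative).

```python
import torch

def fuse_features(f_mcnn, f_ms, f_bcnn_att):
    """Concatenate the MCNN and MSCNN feature vectors with the bilinear
    attention vector (Equations (8) and (9)); both results share one dimension."""
    c_mc = torch.cat([f_mcnn, f_bcnn_att], dim=1)   # C_MC = [f_MCNN, f_BCNN-att]
    c_ms = torch.cat([f_ms, f_bcnn_att], dim=1)     # C_MS = [f_MS, f_BCNN-att]
    return c_mc, c_ms

# Example with illustrative 512-dimensional branch vectors (batch of 8)
f_mcnn, f_ms, f_att = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
c_mc, c_ms = fuse_features(f_mcnn, f_ms, f_att)     # each of shape (8, 1024)
```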

3.4. Fully Connected Network (FCN)

The primary feature extraction models were based on the pretrained densenet121/EfficientNetV2 architectures with the bilinear CNN model (frozen model). The multistage and multicolumn models were therefore trained with separate optimization functions, and we calculated their losses separately. For training, we used the Adam optimization algorithm [31] with a learning rate of 0.001 and a batch size of 40 for both models.
$L_{Total} = L_{MC} + L_{MS}$ (10)
(where L denotes the loss of the network). The feature concatenation and optimization loss calculation are shown in Equations (8)–(10).
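A hedged sketch of the training step implied by Equation (10): the two fusion heads are optimized with Adam (learning rate 0.001) and their cross-entropy losses are summed; the head layout and the use of cross-entropy are assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(mc_head, ms_head, optimizer, c_mc, c_ms, labels):
    """One optimization step: L_total = L_MC + L_MS (Equation (10))."""
    optimizer.zero_grad()
    loss_mc = criterion(mc_head(c_mc), labels)   # loss of the multicolumn stream
    loss_ms = criterion(ms_head(c_ms), labels)   # loss of the multistage stream
    total = loss_mc + loss_ms                    # summed partial losses
    total.backward()
    optimizer.step()
    return total.item()

# Illustrative heads for five density classes and the optimizer settings above
mc_head = nn.Linear(1024, 5)
ms_head = nn.Linear(1024, 5)
optimizer = torch.optim.Adam(
    list(mc_head.parameters()) + list(ms_head.parameters()), lr=0.001)
```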

4. Experiment Results and Discussion

The proposed model was implemented with the PyTorch [32] deep learning library and run on an Nvidia GeForce GTX 1660 GPU (6 GB dedicated memory) and an AMD Ryzen 5 series processor with 16 GB of RAM. The publicly available UCSD [33], PETS2009 [34] and UMN [35] datasets were tested with our proposed model, and the results were compared with several other methods, such as CLBP [36] and MSCNN [28].

4.1. Dataset Preparation

The proposed models were tested on the publicly available crowd datasets UMN, PETS2009 and UCSD. The UMN dataset consists of two indoor crowd-monitoring videos, Plaza1 and Plaza2. PETS2009 provides different anomaly events and crowd density estimation videos from several viewpoints; in this experiment, we used data from one viewpoint with the different difficulty levels defined in the dataset. Finally, the UCSD Ped1 and Ped2 datasets were tested with the proposed model. UCSD contains several anomalous activities at different congestion levels. All frame-level annotation was done manually, and the segmentation into congestion levels (VL—very low, L—low, M—medium, H—high, VH—very high) was inspired by the TIS-MCMS work in [37]. The density levels for UCSD Ped1, UCSD Ped2 and PETS2009 are defined as ($VL < 9$, $8 < L < 17$, $16 < M < 25$, $24 < H < 33$, $VH > 32$), and those for UMN Plaza1 and UMN Plaza2 as ($VL = 0$, $0 < L < 4$, $4 < M < 7$, $6 < H < 11$, $VH > 10$), respectively.
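As an illustration, these annotation ranges can be expressed as a small helper; the handling of boundary counts (for example, a count of 4 on UMN, which the stated ranges leave unspecified) is an assumption.

```python
def density_level(count, dataset="ucsd"):
    """Map a per-frame person count to one of the five congestion levels,
    following the annotation ranges stated above."""
    if dataset in ("ucsd", "pets2009"):
        # VL < 9, 8 < L < 17, 16 < M < 25, 24 < H < 33, VH > 32
        bins = [(8, "VL"), (16, "L"), (24, "M"), (32, "H")]
    else:  # UMN Plaza1 / Plaza2
        # VL = 0, 0 < L < 4, 4 < M < 7, 6 < H < 11, VH > 10
        # (a count of 4 is assigned to M here; this boundary is an assumption)
        bins = [(0, "VL"), (3, "L"), (6, "M"), (10, "H")]
    for upper, label in bins:
        if count <= upper:
            return label
    return "VH"

print(density_level(20, "ucsd"))   # -> "M"
print(density_level(12, "umn"))    # -> "VH"
```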
The final training/testing split was kept at a sixty-to-forty ratio for this work. The difficulty level summarization and the training and testing splits are given in Table 4.
Table 4. Details of datasets’ splits for training and testing.

4.1.1. UCSD Dataset

The UCSD benchmark dataset [33] consists of two different video streams, namely Ped1 and Ped2, with different viewpoints. The two subsets contain 16,000 and 4800 frames, respectively, in the TIFF image format; all images were converted to the JPEG format. The UCSD dataset comprises sparse, occluded and low-resolution image data.

4.1.2. Pets2009 Dataset

The PETS2009 dataset [34] contains three different levels of complexity, namely L1, L2 and L3. Each difficulty level has four different viewpoints with two time segments. In this experiment, we selected the time-segment data of each level for testing. Several samples of the complexity levels of each dataset are shown in Figure 4.
Figure 4. Examples of crowd scene of different density levels.

4.1.3. UMN Dataset

The UMN dataset [35] has a low complexity compared to the PETS2009 and UCSD datasets due to the stable indoor background scenario. The video recordings from UMN Plaza1 and Plaza2 surveillance cameras were prepared to monitor crowd density.

4.2. Model Training and Initialization of Parameters

The proposed model was fine-tuned on five crowd density levels by setting the parameters and hyperparameters. The training and validation batch sizes were set to 40 samples, the learning rate was kept at 0.001 and we used the Adam optimizer. In this approach, the model's training performance was kept at its highest while avoiding overfitting by stopping training with early stopping. The training accuracy/loss and validation accuracy/loss are shown in Figure 5, Figure 6, Figure 7 and Figure 8, respectively. The model with densenet121/EfficientNetV2 converged quickly with a high accuracy rate, as shown in Figure 5, Figure 6, Figure 7 and Figure 8.
Figure 5. Training and validation accuracy of MCMS-BCNN-Attention + densenet121.
Figure 6. Training and validation loss of MCMS-BCNN-Attention + densenet121.
Figure 7. Training and validation accuracy of MCMS-BCNN-Attention + Efficientnetv2.
Figure 8. Training and validation loss of MCMS-BCNN-Attention + Efficientnetv2.
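A minimal sketch of the early-stopping strategy mentioned above (stop when the validation loss stops improving); the patience value is an assumption, as it is not reported here.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience          # assumed patience; not stated in the text
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> stop training

# usage inside an epoch loop:
#   stopper = EarlyStopping()
#   if stopper.step(val_loss): break
```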

4.3. Evaluation Metrics

The performance of the model was evaluated as a multiclass classification scenario. Consequently, there were four different prediction states, described as true positive (TP), true negative (TN), false positive (FP) and false negative (FN). A true positive occurred when the true value of the class was positive and the predicted value was also positive; for example, if the real density level of a frame was medium (M) and the predicted class was also M, this was counted as a TP. Similarly, a true negative occurred when the true value of the class was negative and the predicted value was also negative; for example, the real density level was not M and the model also did not predict M. A false positive occurred when the predicted class was positive but the actual class was negative, and a false negative occurred when the actual class was positive but the predicted class was negative.

4.4. Confusion Matrix

A confusion matrix was used to summarize the results of the classification method. With an unbalanced number of observations in each class and more than two classes in the datasets, classification accuracy alone can be deceptive. Calculating the confusion matrix gives a clearer idea of what the classification model gets right and where it goes wrong. The two-class confusion matrix is shown in Table 5.
Table 5. Confusion Matrix explanation for binary classification.
To measure the performance of the model, we defined a few measures based on the multiclass confusion matrix. The performance was evaluated with the accuracy, precision, recall, $F_1$ score and kappa statistics explained in Equations (11)–(15).
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (11)
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (12)
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$ (13)
$F_1\ \mathrm{Score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (14)
Cohen’s kappa coefficient (k) is [38]:
$k = \dfrac{P_o - P_e}{1 - P_e}$ (15)
where $P_o$ and $P_e$ represent the observed agreement and the expected agreement, respectively. This measurement shows how much better the classifier performs than random agreement, for either a balanced or an unbalanced dataset.
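For illustration, the metrics of Equations (11)–(15) can be computed from the predicted and true labels with scikit-learn; macro averaging for the multiclass precision, recall and F1 is an assumption.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

def evaluate(y_true, y_pred):
    """Accuracy, macro precision/recall/F1 and Cohen's kappa (Equations (11)-(15))."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }

# Example with density-level labels
print(evaluate(["VL", "L", "M", "H", "VH"], ["VL", "L", "M", "M", "VH"]))
```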

5. Results Analysis

5.1. Benchmark Datasets Analysis

The performance comparison with existing approaches (Table 6) shows our model's performance alongside that of other recent state-of-the-art approaches. Our proposed model achieved the best accuracies of 98.62% and 98.95% on the UCSD Ped1 and UCSD Ped2 datasets, respectively, compared with the existing CLBP, MSCNN, densenet121 and Efficientnetv2 methods. The total numbers of training and testing frames were 8040 and 5360, respectively; among the testing frames, there were 66 and 73 misclassified samples for the two datasets, respectively. The resultant confusion matrices are shown in Figure 9 and Figure 10. The proposed model achieved good precision and recall rates compared with the other approaches. Trained on PETS2009 (737 training samples and 492 testing samples), the proposed model achieved an accuracy of 96.97% with a total of 24 misclassified samples. The confusion matrix of the PETS2009 dataset with the proposed model is shown in Figure 11. The model achieved a better recall and precision than the other methods. This dataset had fewer samples and a higher occlusion rate than the other tested datasets. Finally, the proposed model was tested on the UMN dataset and achieved accuracies of 99.10% and 98.38%, with 17 incorrectly classified samples each, for UMN Plaza1 and UMN Plaza2, respectively. The relevant confusion matrices are shown in Figure 12 and Figure 13.
Table 6. Performance comparison with existing approaches.
Figure 9. Confusion matrix heat map of UCSD Ped1 for (a) MCMS-BCNN-Attention+densenet121 and (b) MCMS-BCNN-Attention+Efficientnetv2.
Figure 10. Confusion matrix heat map of UCSD Ped2 for (a) MCMS-BCNN-Attention+densenet121 and (b) MCMS-BCNN-Attention+Efficientnetv2.
Figure 11. Confusion matrix heat map of PETS2009 for (a) MCMS-BCNN-Attention+densenet121 and (b) MCMS-BCNN-Attention+Efficientnetv2.
Figure 12. Confusion matrix heat map of UMN Plaza 1 for (a) MCMS-BCNN-Attention+densenet121 and (b) MCMS-BCNN-Attention+Efficientnetv2.
Figure 13. Confusion matrix heat map of UMN Plaza 2 for (a) MCMS-BCNN-Attention+densenet121 and (b) MCMS-BCNN-Attention+Efficientnetv2.
The comparison of the models' performance in Table 6 clearly shows that the proposed method provided better results than the available models from the literature. In this comparison, we selected different state-of-the-art models alongside our multicolumn multistage approach. The results clearly show that our work achieved the best performance with either densenet121 or Efficientnetv2 compared to the existing methods. The proposed model also achieved the best average kappa value over the five datasets, at 0.9583.

5.2. Abnormal Event Detection and Classification

As mentioned earlier, a crowd density level variation is also considered a crowd anomaly in a sequence. In this second approach, the network was turned into a normal/abnormal classification problem. The same model was tested with the same parameters except for the final FCN layer, which was modified for two-way classification. The results of testing both proposed feature extraction backbones (densenet121 and Efficientnetv2) are described in the sections below. In this approach, the abnormal activities were measured on the UMN and PETS2009 datasets for detecting panic moments. The UMN dataset has three anomaly events (lawn, mall and plaza) with 7738 video frames at a 320 × 240 resolution. The training-to-testing ratio was kept at sixty to forty, as in the previous experiments; however, the final classification network was changed to a binary classification (normal/abnormal).
Experiments were also conducted on samples from the PETS2009 S3 motion flow walking and running video segments shown in Figure 14. The model performance was tested on video sequences 14–16 (walk and run) and 14–33 (gather and evacuate), respectively. The performance was measured by calculating the AUC (area under the curve) [40] of the ROC (receiver operating characteristic) [40] curve. The TPR (true positive rate) and FPR (false positive rate) calculations are given in Equations (16) and (17); the ROC curve was plotted using the TPR against the FPR.
$TPR = \dfrac{TP}{TP + FN}$ (16)
$FPR = \dfrac{FP}{FP + TN}$ (17)
Figure 14. Sample frames of abnormal activity from the UMN and PETS2009 datasets.
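A short sketch of this evaluation using scikit-learn, assuming the abnormal-class probability is used as the score; it computes the ROC curve from the TPR/FPR of Equations (16) and (17) and the corresponding AUC.

```python
from sklearn.metrics import roc_curve, roc_auc_score

def binary_auc(y_true, abnormal_scores):
    """ROC curve (TPR vs. FPR, Equations (16)-(17)) and AUC for normal/abnormal labels.

    y_true: 1 = abnormal frame, 0 = normal frame.
    abnormal_scores: predicted probability (or score) of the abnormal class.
    """
    fpr, tpr, _ = roc_curve(y_true, abnormal_scores)
    return fpr, tpr, roc_auc_score(y_true, abnormal_scores)

# Example with toy labels and scores
fpr, tpr, auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.9])
print(auc)
```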
We evaluated UMN's activities in the lawn, shopping mall and urban plaza (indoor/outdoor) segments against existing approaches. The performance results were compared with those of our model (MCMS-BCNN-Attention), as shown in Table 7, which achieved the best AUC of 1.0. In this setting, separating the actions is extremely difficult when people walk normally and suddenly start running to evacuate. Nevertheless, our proposed architecture with Efficientnetv2-based feature extraction achieved perfect results compared with all the other works. On the PETS2009 dataset, we monitored two activities, walking/running and gathering/evacuating. We maintained the same sixty-to-forty training/testing split on all the original datasets' segments for this experiment. The effectiveness of our approach compared to recent existing works is shown in Table 7; the MCMS-BCNN-Attention model reached an AUC of 1.0, yielding the best performance.
Table 7. Performance comparison of the proposed approach with existing techniques on UMN and PETS2009 datasets.

6. Conclusions

In this study, we introduced a novel BCNN attention network with a densenet121/Efficientnetv2 architecture as a transfer learning model to extract feature vectors and achieve the best accuracy rate. Both the modified multistage and multicolumn models were trained to optimize the classification network. The experiments compared our method with recent state-of-the-art methods and evaluated performance with metrics such as accuracy, precision, recall, F1 score and Cohen's kappa. The proposed feature extraction method significantly improved anomaly and congestion detection, as shown in the comparison in Table 6. The evaluated datasets contained diverse video quality levels and different viewpoints, and the proposed model achieved good performance consistency in all scenarios. Moreover, the detection of abnormal activity (panic moments) also achieved good results; the corresponding performance measures are shown in Table 7. Future dense activity detection models could be enhanced with an autoencoder-based approach to achieve better video activity detection. The clarity of the surveillance video and the distance to the viewpoint, along with multiview aspects, need to be considered for future improvements and better detection results.

Author Contributions

Conceptualization, funding acquisition, methodology, supervision, C.L. and Y.L.; data curation, formal analysis, methodology, validation, visualization, results analysis, writing—original draft, E.M.C.L.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (grant no. 61671397).

Institutional Review Board Statement

Not applicable

Data Availability Statement

Benchmark datasets UCSD Anomaly Detection Dataset [33], PETS 2009 Benchmark Data [34] and Monitoring Human Activity-Action Recognition UMN [35] are publicly available.

Conflicts of Interest

The authors declare that they have no conflict of interest. The manuscript contains contributions from all authors. All authors have approved the final version of the manuscript.

References

  1. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  2. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  3. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
  4. Aldayri, A.; Albattah, W. Taxonomy of Anomaly Detection Techniques in Crowd Scenes. Sensors 2022, 22, 6080. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, T.; Miao, Z.; Chen, Y.; Zhou, Y.; Shan, G.; Snoussi, H. Aed-net: An abnormal event detection network. Engineering 2019, 5, 930–939. [Google Scholar] [CrossRef]
  6. Biswas, S.; Babu, R.V. Anomaly detection via short local trajectories. Neurocomputing 2017, 242, 63–72. [Google Scholar] [CrossRef]
  7. Bera, A.; Kim, S.; Manocha, D. Realtime anomaly detection using trajectory-level crowd behavior learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 27–30 June 2016; pp. 50–57. [Google Scholar]
  8. Maiorano, F.; Petrosino, A. Granular trajectory based anomaly detection for surveillance. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2066–2072. [Google Scholar]
  9. Biswas, S.; Babu, R.V. Short local trajectory based moving anomaly detection. In Proceedings of the 2014 Indian Conference on Computer Vision Graphics and Image Processing, Bangalore, India, 14–18 December 2014; pp. 1–8. [Google Scholar]
  10. Zhao, K.; Liu, B.; Li, W.; Yu, N.; Liu, Z. Anomaly detection and localization: A novel two-phase framework based on trajectory-level characteristics. In Proceedings of the 2018 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
  11. Zhang, X.; Ma, D.; Yu, H.; Huang, Y.; Howell, P.; Stevens, B. Scene perception guided crowd anomaly detection. Neurocomputing 2020, 414, 291–302. [Google Scholar] [CrossRef]
  12. Hao, Y.; Xu, Z.J.; Liu, Y.; Wang, J.; Fan, J.L. Effective crowd anomaly detection through spatio-temporal texture analysis. Int. J. Autom. Comput. 2019, 16, 27–39. [Google Scholar] [CrossRef]
  13. Li, N.; Chang, F. Video anomaly detection and localization via multivariate Gaussian fully convolution adversarial autoencoder. Neurocomputing 2019, 369, 92–105. [Google Scholar] [CrossRef]
  14. Li, X.; Li, W.; Liu, B.; Liu, Q.; Yu, N. Object-oriented anomaly detection in surveillance videos. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1907–1911. [Google Scholar]
  15. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
  16. Lloyd, K.; Rosin, P.L.; Marshall, D.; Moore, S.C. Detecting violent and abnormal crowd activity using temporal analysis of grey level co-occurrence matrix (GLCM) -based texture measures. Mach. Vis. Appl. 2017, 28, 361–371. [Google Scholar] [CrossRef]
  17. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742. [Google Scholar]
  18. Luo, W.; Liu, W.; Gao, S. Remembering history with convolutional lstm for anomaly detection. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME) 2017, Hong Kong, China, 10–14 July 2017; pp. 439–444. [Google Scholar]
  19. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.V. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar]
  20. Wu, C.; Shao, S.; Tunc, C.; Hariri, S. Video anomaly detection using pre-trained deep convolutional neural nets and context mining. In Proceedings of the 2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), Antalya, Turkey, 2–5 November 2020; pp. 1–8. [Google Scholar]
  21. Huang, S.; Huang, D.; Zhou, X. Learning multimodal deep representations for crowd anomaly event detection. Math. Probl. Eng. 2018, 2018, 6323942. [Google Scholar] [CrossRef]
  22. Wang, T.; Qiao, M.; Zhu, A.; Shan, G.; Snoussi, H. Abnormal event detection via the analysis of multi-frame optical flow information. Front. Comput. Sci. 2020, 14, 304–313. [Google Scholar] [CrossRef]
  23. Singh, K.; Rajora, S.; Vishwakarma, D.K.; Tripathi, G.; Kumar, S.; Walia, G.S. Crowd anomaly detection using aggregation of ensembles of fine-tuned convnets. Neurocomputing 2020, 371, 188–198. [Google Scholar] [CrossRef]
  24. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6479–6488. [Google Scholar]
  25. Marsden, M.; McGuinness, K.; Little, S.; O’Connor, N.E. Resnetcrowd: A residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification. In Proceedings of the 2017 14th IEEE international conference on advanced video and signal based surveillance (AVSS), Lecce, Italy, 29 August 2017–1 September 2017; pp. 1–7. [Google Scholar]
  26. Ratre, A. Taylor series based compressive approach and Firefly support vector neural network for tracking and anomaly detection in crowded videos. J. Eng. Res. 2019, 7, 115–137. [Google Scholar]
  27. Feng, Y.; Yuan, Y.; Lu, X. Learning deep event models for crowd anomaly detection. Neurocomputing 2017, 219, 548–556. [Google Scholar] [CrossRef]
  28. Fu, M.; Xu, P.; Li, X.; Liu, Q.; Ye, M.; Zhu, C. Fast crowd density estimation with convolutional neural networks. Eng. Appl. Artif. Intell. 2015, 43, 81–88. [Google Scholar] [CrossRef]
  29. A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning. Available online: https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a (accessed on 1 November 2022).
  30. Lin, T.Y.; Roy Chowdhury, A.; Maji, S. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision 2015, Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar]
  31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  32. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2019. Available online: https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf (accessed on 2 December 2022).
  33. UCSD Anomaly Detection Dataset. Available online: http://www.svcl.ucsd.edu/projects/anomaly/dataset.html (accessed on 7 November 2022).
  34. PETS 2009 Benchmark Data. Available online: http://cs.binghamton.edu/mrldata/pets2009 (accessed on 7 November 2022).
  35. Monitoring Human Activity-Action Recognition. Available online: http://mha.cs.umn.edu/projrecognition.shtml (accessed on 7 November 2022).
  36. Alanazi, A.A.; Bilal, M. Crowd density estimation using novel feature descriptor. arXiv 2019, arXiv:1905.05891. [Google Scholar]
  37. Tripathy, S.K.; Srivastava, R. A real-time two-input stream multi-column multi-stage convolution neural network (TIS-MCMS-CNN) for efficient crowd congestion-level analysis. Multimed. Syst. 2020, 26, 585–605. [Google Scholar] [CrossRef]
  38. Shmueli, B.; Multi-Class Metrics Made Simple, Part III: The Kappa Score (Aka Cohen’s Kappa Coefficient). Medium. Towards Data Science. 2020. Available online: https://towardsdatascience.com/multi-class-metrics-made-simple-the-kappa-score-aka-cohens-kappa-coefficient-bdea137af09c (accessed on 2 December 2022).
  39. Chen, C.; Zhang, B.; Su, H.; Li, W.; Wang, L. Land-use scene classification using multi-scale completed local binary patterns. Signal Image Video Process. 2016, 10, 745–752. [Google Scholar] [CrossRef]
  40. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159. [Google Scholar] [CrossRef]
  41. Wang, T.; Qiao, M.; Chen, Y.; Chen, J.; Zhu, A.; Snoussi, H. Video feature descriptor combining motion and appearance cues with length-invariant characteristics. Optik 2018, 157, 1143–1154. [Google Scholar] [CrossRef]
  42. Cong, Y.; Yuan, J.; Liu, J. Sparse reconstruction cost for abnormal event detection. In Proceedings of the Computer Vision and Pattern Recognition 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3449–3456. [Google Scholar]
  43. Wang, T.; Qiao, M.; Deng, A.; Zhou, Y.; Wang, H.; Lyu, Q.; Snoussi, H. Abnormal event detection based on analysis of movement information of video sequence. Optik 2018, 152, 50–60. [Google Scholar] [CrossRef]
  44. Susan, S.; Hanmandlu, M. Unsupervised detection of nonlinearity in motion using weighted average of non-extensive entropies. Signal Image Video Process. 2015, 9, 511–525. [Google Scholar] [CrossRef]
  45. Zhang, X.; Yang, S.; Tang, Y.Y.; Zhang, W. A thermodynamics-inspired feature for anomaly detection on crowd motions in surveillance videos. Multimed. Tools Appl. 2020, 75, 8799–8826. [Google Scholar] [CrossRef]
  46. Kaltsa, V.; Briassouli, A.; Kompatsiaris, I.; Hadjileontiadis, L.J.; Strintzis, M.G. Swarm intelligence for detecting interesting events in crowded environments. IEEE Trans. Image Process. 2015, 24, 2153–2166. [Google Scholar] [CrossRef] [PubMed]
  47. Xu, Y.; Lu, L.; Xu, Z.; He, J.; Zhou, J.; Zhang, C. Dual-channel CNN for efficient abnormal behavior identification through crowd feature engineering. Mach. Vis. Appl. 2019, 30, 945–958. [Google Scholar] [CrossRef]
  48. Mu, H.; Sun, R.; Yuan, G.; Li, J.; Wang, M. Crowd behavior detection in videos using statistical physics. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Auckland, New Zealand, 7–10 December 2021; pp. 389–397. [Google Scholar]
  49. Ilyas, Z.; Aziz, Z.; Qasim, T.; Bhatti, N.; Hayat, M.F. A hybrid deep network based approach for crowd anomaly detection. Multimed. Tools Appl. 2021, 80, 24053–24067. [Google Scholar] [CrossRef]
  50. Du, Y. An anomaly detection method using deep convolution neural network for vision image of robot. Multimed. Tools Appl. 2020, 79, 9629–9642. [Google Scholar] [CrossRef]
  51. Singh, G.; Kapoor, R.; Khosla, A. Optical flow-based weighted magnitude and direction histograms for the detection of abnormal visual events using combined classifier. Int. J. Cogn. Inform. Nat. Intell. (IJCINI) 2021, 15, 12–30. [Google Scholar] [CrossRef]
  52. Xu, J.; Denman, S.; Fookes, C.; Sridharan, S. Unusual scene detection using distributed behaviour model and sparse representation. In Proceedings of the 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, Beijing, China, 18–21 September 2012; pp. 48–53. [Google Scholar]
  53. Zhu, X.; Liu, J.; Wang, J.; Fu, W.; Lu, H. Weighted interaction force estimation for abnormality detection in crowd scenes. In Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 507–518. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
