Review
Peer-Review Record

A Review of Abnormal Crowd Behavior Recognition Technology Based on Computer Vision

Appl. Sci. 2024, 14(21), 9758; https://doi.org/10.3390/app14219758
by Rongyong Zhao 1, Feng Hua 1,*, Bingyu Wei 1, Cuiling Li 1, Yulong Ma 1, Eric S. W. Wong 2 and Fengnian Liu 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 22 September 2024 / Revised: 11 October 2024 / Accepted: 14 October 2024 / Published: 25 October 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript reviews recent methods that utilize computer vision to recognize abnormal behaviors in crowds. The paper presents the definition of the task, its main challenges, and the current technology employed for the recognition. The authors also provide a comparison of the performance of some methods in terms of evaluation indicators. The paper is generally well-written and presents a comprehensive analysis. However, this reviewer has some minor concerns to be addressed:

1. Table 5 shows the performance comparison of the deep-learning-based models. However, this reviewer suggests addressing the performance of the traditional methods in the table to get a baseline in the comparison. It could highlight the main challenges of the task described in the manuscript.

2. Please fix the references in the fourth column of Table 3 (References), the first column of Table 4 (Name) and the second column of Table 5 (Method).

Author Response

Re: Response to Reviewer 1

To: Applied Science

Manuscript ID: applsci-3246529.R1

Original Article Title: “A Review of Crowd Abnormal Behavior Recognition Technology Based on Computer Vision”

 

 

Dear Reviewer,

 

Thank you for allowing a resubmission of our manuscript and for the opportunity to address your professional comments on our manuscript ID applsci-3246529.

We upload: (a) the point-by-point responses to your professional comments below, and (b) an updated manuscript (PDF main document) with the new changes highlighted in red according to your comments and kind suggestions.

 

 

Thank you very much for your kind consideration!

 

Best regards,

 

Rongyong Zhao

PhD, Associate Professor

CIMS Research Center, Tongji University, Shanghai, China

 

Comments 1

Table 5 shows the performance comparison of the deep-learning-based models. However, this reviewer suggests addressing the performance of the traditional methods in the table to get a baseline in the comparison. It could highlight the main challenges of the task described in the manuscript.

Response: Thank you for this careful checking and kind suggestion!

Considering that a new table (Table 2 in the current version) was added at another reviewer's request, we have reviewed the performance indicators reported in the literature on traditional methods and added these data to Table 6 (corresponding to Table 5 in the previous version), so that readers can compare the performance baseline of traditional and deep learning methods more clearly. This provides further support for selecting an abnormal behavior recognition method. The specific modifications are as follows:

“Table 6 summarizes the performance of traditional methods and deep-learning-based abnormal behavior recognition methods on the UCSD, UMN, ShanghaiTech, and Avenue datasets in recent years. This paper compares evaluation indicators such as AUC and EER. These evaluation data are all cited from research papers published in recent years.

Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 (EER/AUC) | UCSD Ped2 (EER/AUC) | UMN (EER/AUC) | ShanghaiTech (EER/AUC) | Avenue (EER/AUC) | Year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Motion Feature | HAVA+HOF [35] | -- / -- | -- / -- | -- / 99.34 | -- / -- | -- / -- | 2020 |
| Motion Feature | SFA [37] | -- / -- | -- / 96.40 | -- / 96.55 | -- / -- | -- / -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- / -- | -- / 97.20 | 2.80 / 97.20 | 30.70 / 71.50 | -- / -- | 2020 |
| CNN | FCN [52] | -- / -- | 11.00 / -- | -- / -- | -- / -- | -- / -- | 2019 |
| CNN | ABDL [53] | 22.00 / -- | 16.00 / -- | 5.80 / 98.90 | -- / -- | 21.00 / 84.50 | 2020 |
| CNN | TS-CNN [54] | -- / -- | -- / -- | -- / 99.60 | -- / -- | -- / -- | 2020 |
| CNN | DSTCNN [55] | -- / 99.74 | -- / 99.94 | -- / -- | -- / -- | -- / -- | 2020 |
| CNN | LDA-Net [56] | -- / -- | 5.63 / 97.87 | -- / -- | -- / -- | -- / -- | 2020 |
| AE | CAE-UNet [62] | -- / -- | -- / 96.20 | -- / -- | -- / -- | -- / 86.90 | 2019 |
| AE | PMAE [64] | -- / -- | -- / 95.90 | -- / -- | -- / 72.90 | -- / -- | 2023 |
| AE | ISTL [65] | 29.80 / 75.20 | 8.90 / 91.10 | -- / -- | -- / -- | 29.20 / 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 / 94.25 | -- / -- | -- / 99.81 | -- / -- | -- / 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 / 90.50 | 11.50 / 90.70 | -- / -- | -- / -- | 17.60 / 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 / 90.70 | 6.00 / 96.30 | -- / -- | 17.00 / 85.30 | 15.30 / 88.60 | 2021 |
| GAN | D-UNet [74] | -- / 84.70 | -- / 96.30 | -- / -- | -- / 73.00 | -- / 85.10 | 2019 |
| GAN | BMAN [75] | -- / -- | -- / 96.60 | -- / 99.60 | -- / 76.20 | -- / 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- / -- | -- / -- | -- / 99.83 | -- / -- | -- / -- | 2021 |
| LSTM | FCN-LSTM [78] | -- / -- | -- / 98.20 | -- / 93.70 | -- / -- | -- / -- | 2021 |
| LSTM | CNN-LSTM [79] | -- / 94.83 | -- / 96.50 | -- / -- | -- / -- | -- / -- | 2022 |
| SA | SAFA [81] | -- / -- | -- / 96.80 | -- / -- | -- / -- | -- / 87.30 | 2023 |
| SA | SA-AE [82] | -- / -- | -- / 95.69 | -- / -- | -- / -- | -- / 84.10 | 2023 |
| SA | SABiAE [83] | -- / -- | 9.80 / 95.60 | -- / -- | -- / -- | 20.90 / 84.70 | 2022 |
| SA | SA-GAN [84] | -- / -- | -- / -- | -- / -- | -- / 75.70 | -- / 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 / 94.10 | 5.10 / 97.40 | -- / -- | 25.20 / 74.20 | 9.00 / 91.00 | 2024 |
| SA | SA-CNN [86] | -- / -- | -- / -- | -- / 99.29 | -- / -- | -- / -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion-feature methods) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a baseline for the comparison.”
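The frame-level AUC and EER indicators compared in Table 6 can be reproduced from per-frame anomaly scores. A minimal pure-Python sketch (the scores and labels below are illustrative toy data, not values from any of the cited papers):

```python
def roc_auc_eer(scores, labels):
    """Compute frame-level AUC and EER from anomaly scores.

    scores: higher = more anomalous; labels: 1 = abnormal frame, 0 = normal.
    Returns (auc, eer) as fractions in [0, 1].
    """
    # Sweep every distinct score as a threshold to trace the ROC curve.
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    points.append((1.0, 1.0))
    # AUC via the trapezoidal rule over the ROC points.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    # EER: the operating point where FPR equals the miss rate (1 - TPR).
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    eer = (fpr + (1 - tpr)) / 2
    return auc, eer

scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,    1,   0,   0,   1,   0]
auc, eer = roc_auc_eer(scores, labels)  # auc = 0.75, eer = 0.25 here
```

Papers in Table 6 typically report both values as percentages, which is why a strong detector shows a high AUC and a low EER.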

Author action: We update this manuscript with baseline performance indicators of traditional methods [lines 858-864, on pages 24-25].

 

Comments 2

Please fix the references in the fourth column of Table 3 (References), the first column of Table 4 (Name) and the second column of Table 5 (Method).

Response: Thank you for this professional suggestion!

Considering that a new table (Table 2 in the current version) was added at another reviewer's request, we have fixed the references in Table 4, Table 5, and Table 6 (corresponding to Table 3, Table 4, and Table 5 in the previous version, respectively). The specific modifications are as follows:

Table 4. The classification and characteristics of deep learning behavior anomaly recognition methods

| Method | Design Ideas | Advantages | Limitations | References |
| --- | --- | --- | --- | --- |
| CNN | Captures the spatial hierarchical features of the raw data through local receptive fields and pooling operations | Good at extracting local features and combining them into more complex patterns | Lacks time-series processing capability | [42-56] |
| AutoEncoder | An unsupervised learning method that captures the main features of the data by encoding and decoding the raw data | Reduces data dimensionality and extracts key features | Easily overfits the training data | [57-70] |
| GAN | Composed of a generator and a discriminator that play a game against each other | Reconstructs and generates new samples close to real data | Prone to mode collapse and training non-convergence | [71-76] |
| LSTM | Introduces forget, input, and output gates to solve the vanishing- and exploding-gradient problems | Good at processing sequence data of any length | Complex structure; training and inference take a long time | [77-79] |
| Self-Attention | Automatically assigns different attention weights to different parts of the input sequence | Captures global interdependencies | Easily overfits on small datasets | [80-87] |

Table 5. Comparison of crowd anomaly behavior datasets

| Name | Scene Description | Scale | Resolution | Abnormal Behavior | Object | Limitation |
| --- | --- | --- | --- | --- | --- | --- |
| UCSD [88] | Crowd movement on a sidewalk from the perspective of a surveillance camera | Ped1: 14,000 frames, 34 training / 36 test segments; Ped2: 4,560 frames, 12 test segments | Ped1: 238×158; Ped2: 360×240 | Fast moving, reverse driving, riding a bicycle, driving a car, sitting in a wheelchair, skateboarding, etc. | Individual | Low resolution; few types of abnormal behaviors; relatively simple background |
| UMN [89] | Video clips of pedestrian activities in different backgrounds such as campuses, shopping malls, and streets | 8,010 frames, 11 video segments | 320×240 | The crowd suddenly scatters, runs, and gathers | Group | Simple background; limited abnormal types |
| ShanghaiTech [90] | 13 campus-area scenes with complex lighting conditions and camera angles | 317,398 frames, 130 video segments | 856×480 | Crowd gathering, fighting, running, cycling | Group | Abnormal events are repetitive; the annotations contain errors |
| CUHK-Avenue [91] | Video surveillance clips of outdoor public places | 30,625 frames, 16 training / 21 test segments | 640×360 | Pedestrians fighting, throwing objects, running | Individual, vehicle | Single shooting angle; low resolution |

Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 (EER/AUC) | UCSD Ped2 (EER/AUC) | UMN (EER/AUC) | ShanghaiTech (EER/AUC) | Avenue (EER/AUC) | Year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Motion Feature | HAVA+HOF [35] | -- / -- | -- / -- | -- / 99.34 | -- / -- | -- / -- | 2020 |
| Motion Feature | SFA [37] | -- / -- | -- / 96.40 | -- / 96.55 | -- / -- | -- / -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- / -- | -- / 97.20 | 2.80 / 97.20 | 30.70 / 71.50 | -- / -- | 2020 |
| CNN | FCN [52] | -- / -- | 11.00 / -- | -- / -- | -- / -- | -- / -- | 2019 |
| CNN | ABDL [53] | 22.00 / -- | 16.00 / -- | 5.80 / 98.90 | -- / -- | 21.00 / 84.50 | 2020 |
| CNN | TS-CNN [54] | -- / -- | -- / -- | -- / 99.60 | -- / -- | -- / -- | 2020 |
| CNN | DSTCNN [55] | -- / 99.74 | -- / 99.94 | -- / -- | -- / -- | -- / -- | 2020 |
| CNN | LDA-Net [56] | -- / -- | 5.63 / 97.87 | -- / -- | -- / -- | -- / -- | 2020 |
| AE | CAE-UNet [62] | -- / -- | -- / 96.20 | -- / -- | -- / -- | -- / 86.90 | 2019 |
| AE | PMAE [64] | -- / -- | -- / 95.90 | -- / -- | -- / 72.90 | -- / -- | 2023 |
| AE | ISTL [65] | 29.80 / 75.20 | 8.90 / 91.10 | -- / -- | -- / -- | 29.20 / 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 / 94.25 | -- / -- | -- / 99.81 | -- / -- | -- / 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 / 90.50 | 11.50 / 90.70 | -- / -- | -- / -- | 17.60 / 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 / 90.70 | 6.00 / 96.30 | -- / -- | 17.00 / 85.30 | 15.30 / 88.60 | 2021 |
| GAN | D-UNet [74] | -- / 84.70 | -- / 96.30 | -- / -- | -- / 73.00 | -- / 85.10 | 2019 |
| GAN | BMAN [75] | -- / -- | -- / 96.60 | -- / 99.60 | -- / 76.20 | -- / 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- / -- | -- / -- | -- / 99.83 | -- / -- | -- / -- | 2021 |
| LSTM | FCN-LSTM [78] | -- / -- | -- / 98.20 | -- / 93.70 | -- / -- | -- / -- | 2021 |
| LSTM | CNN-LSTM [79] | -- / 94.83 | -- / 96.50 | -- / -- | -- / -- | -- / -- | 2022 |
| SA | SAFA [81] | -- / -- | -- / 96.80 | -- / -- | -- / -- | -- / 87.30 | 2023 |
| SA | SA-AE [82] | -- / -- | -- / 95.69 | -- / -- | -- / -- | -- / 84.10 | 2023 |
| SA | SABiAE [83] | -- / -- | 9.80 / 95.60 | -- / -- | -- / -- | 20.90 / 84.70 | 2022 |
| SA | SA-GAN [84] | -- / -- | -- / -- | -- / -- | -- / 75.70 | -- / 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 / 94.10 | 5.10 / 97.40 | -- / -- | 25.20 / 74.20 | 9.00 / 91.00 | 2024 |
| SA | SA-CNN [86] | -- / -- | -- / -- | -- / 99.29 | -- / -- | -- / -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion-feature methods) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a baseline for the comparison.

 

Author action: We update and fix the references [lines 746-766, on page 21; lines 829-830, on page 23; lines 862-864, on pages 24-25].

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

 

• Introduction. The review is good; the links are correctly formatted.

• Overview of Chapter 2: when enumerating the main problems that need to be solved, it is appropriate to mention the problem of price vs. quality. In particular, from a hardware point of view: the higher the resolution, the more accurate the classification, and a larger area with more objects can be captured. However, the downside is the computing resources needed to run the algorithm in real time. Therefore, in this section, I recommend mentioning the need to determine the optimal parameters for the system. In the section evaluating the effectiveness of each method, provide a table with the recommended optimal parameters for certain areas (the size of the crowd sample that the machine can optimally process) and sensitivity requirements. This will allow the reader to understand the cost of the studied algorithms. For example, the pictures in Fig. 20 are positioned as a crowd, but if there is a real crowd in front of those cameras (not 5-7 people) and a camera with a resolution of 238x158, then most likely the system will not see the problem. Therefore, the authors' assessment of the optimal resolution and sensitivity will also help to estimate the resource intensity of the studied methods.

• Fig. 6: even at maximum magnification, the formulas in the yellow square are hard to see; fix it.

• In Table 3 and Table 5 there is "Error! Reference source not found". It needs to be corrected.

• The various concepts and methods for classifying situations with threatening consequences are well analyzed in the work.

• There is a problem with the conclusions: it is not clear what scientific novelty item 1 carries. That is, if the authors came up with 5 levels of abnormal behavior, then a simple recalculation is not required, but what is the advantage over a division into 2, 3, 4, 6...? Point 2: what are the most noticeable disadvantages of traditional methods? This is a conclusion; there must be an advantage (in what?) or a disadvantage (in what?) of something over something else. Remark: a numerical comparison should always be present in the conclusions, even in a review paper; otherwise the text is a guide that tells us about the availability and application of different methods.

 

I recommend printing after corrections.

Author Response

Re: Response to Reviewer 2

To: Applied Science

Manuscript ID: applsci-3246529.R1

Original Article Title: “A Review of Crowd Abnormal Behavior Recognition Technology Based on Computer Vision”

 

Dear Reviewer,

Thank you for allowing a resubmission of our manuscript and for the opportunity to address your professional comments on our manuscript ID applsci-3246529.

We upload: (a) the point-by-point responses to your professional comments below, and (b) an updated manuscript (PDF main document) with the new changes highlighted in red according to your comments and kind suggestions.

Thank you very much for your kind consideration!

 

Best regards,

 

Rongyong Zhao

PhD, Associate Professor

CIMS Research Center, Tongji University, Shanghai, China

 

Comments 1

Overview of Chapter 2 - When enumerating the main problems that need to be solved, it is appropriate to mention the problem of price-quality. In particular, from a hardware point of view: the higher the resolution, the more accurate the classification, and a larger area with more objects can be captured. However, the downside is computer resources when calculating the algorithm in real-time. Therefore, in this section, I recommend mentioning the need to determine the optimal parameters for the system. In the section evaluating the effectiveness of each method, provide a table with the recommended optimal parameters for certain areas (the size of the crowd sample that the machine can optimally process) and sensitivity requirements. This will allow the reader to understand the cost of the studied algorithms. For example, the pictures in Fig. 20 are positioned as a crowd, but if there is a real crowd in front of those cameras (not 5-7 people) and a camera with a resolution of 238x158, then most likely the system will not see the problem. Therefore, the author's assessment of the optimality of resolution and sensitivity will help to estimate also the resource intensity of the studied methods.

Response: Thank you for your professional analysis on the problem of price-quality! We agree with you very much.

To allow readers to understand the cost of the studied algorithms, we have added a new subsection to Section 3, Recommended parameters for hardware resources, explaining why these parameters are required for an abnormal behavior recognition system. A new table lists the recommended parameters for certain region types: crowd sample sizes, sensitivity levels, IPC resolutions, frame rates, and GPU requirements. This will enable readers to understand the cost of the studied algorithms. The specific modifications are as follows:

“3.3 Recommended parameters for hardware resources

In high-density crowd scenarios, efficient crowd behavior recognition not only faces technical challenges but also imposes performance requirements on hardware resources. From a hardware perspective, higher resolution provides more accurate classification and can capture larger areas and more objects, thereby improving recognition accuracy. However, high-resolution images increase the computing resources required by real-time algorithms and place higher demands on processing capability. Therefore, combining current industry trends and actual needs, best practices for abnormal behavior recognition systems are provided. Table 2 summarizes the recommended camera parameter configurations for specific areas such as train station waiting rooms, subway platforms, popular shopping plazas, and sports venue entrances, aiming to keep the overall cost under control while meeting the performance requirements.

Table 2. The recommended parameters for hardware resources

| Region Type | Crowd Sample | Sensitivity Level | Recommended IPC Resolution | Lens Requirement | Frame Rate (fps) | Recommended GPU | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Train station waiting room | 500-1000 | High | 3840×2160 | Wide-angle to standard focal length | 25-30 | NVIDIA RTX 3060 | High crowd density, requiring high sensitivity to ensure safety |
| Subway platform | 100-200 | Middle | 2560×1440 | Standard focal length | 25-30 | NVIDIA GTX 1080 | Medium-to-high crowd density; needs to balance false positives and false negatives |
| Popular shopping plaza | 800-1200 | High | 3840×2160 | Wide-angle focal length | 20-25 | NVIDIA RTX 3060 | Large open space, allowing a certain margin of error |
| Entrance of sports venue | 200-500 | Middle | 2560×1440 or 1920×1080 | Standard to wide-angle focal length; infrared night vision | 25-30 | NVIDIA RTX 2070 | Extremely high crowd density may occur during special events |

”
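The resolution/cost trade-off behind the recommendations above can be made concrete with a rough back-of-the-envelope calculation. The sketch below (illustrative arithmetic only, not from the manuscript) estimates the raw per-camera pixel throughput for two of the recommended configurations:

```python
def stream_cost(width, height, fps, bytes_per_pixel=3):
    """Rough ingest cost of one uncompressed RGB video stream.

    Returns (megapixels per frame, raw data rate in MB/s). Real deployments
    use compressed streams, so this is an upper-bound illustration only.
    """
    mp = width * height / 1e6
    mb_per_s = width * height * bytes_per_pixel * fps / 1e6
    return mp, mb_per_s

# Two of the configurations recommended in Table 2 (assumed 30 fps):
uhd = stream_cost(3840, 2160, 30)  # train station waiting room, 4K
qhd = stream_cost(2560, 1440, 30)  # subway platform, QHD
```

The 4K stream carries 2.25× the pixels of the QHD stream, which scales the per-frame inference cost of a convolutional detector roughly linearly; this is consistent with recommending a stronger GPU for the 4K regions.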

 

Author action: We update this manuscript with a new subsection [lines 194-207, on page 4].

 

Comments 2

Fig. 6: even at maximum magnification, the formulas in the yellow square are hard to see - fix it.

Response: Thank you for this careful checking.

We have carefully reviewed all figures throughout the manuscript and replaced them with high-resolution versions. Additionally, we have ensured that they retain their quality and do not display noticeable distortion when zoomed in. The specific modifications are as follows:

 

 

Figure 6. The structure of LDA-Net

Author action: We update this manuscript with a high-resolution figure [lines 441-443, on page 11].

Comments 3

In table 3 and table 5 there is "Error! Reference source not found". It needs to be corrected

Response: Thank you for this careful checking.

Considering that a new table (Table 2 in the current version) was added at another reviewer's request, we have fixed the references in Table 4 and Table 6 (corresponding to Table 3 and Table 5 in the previous version). The specific modifications are as follows:

Table 4. The classification and characteristics of deep learning behavior anomaly recognition methods

| Method | Design Ideas | Advantages | Limitations | References |
| --- | --- | --- | --- | --- |
| CNN | Captures the spatial hierarchical features of the raw data through local receptive fields and pooling operations | Good at extracting local features and combining them into more complex patterns | Lacks time-series processing capability | [42-56] |
| AutoEncoder | An unsupervised learning method that captures the main features of the data by encoding and decoding the raw data | Reduces data dimensionality and extracts key features | Easily overfits the training data | [57-70] |
| GAN | Composed of a generator and a discriminator that play a game against each other | Reconstructs and generates new samples close to real data | Prone to mode collapse and training non-convergence | [71-76] |
| LSTM | Introduces forget, input, and output gates to solve the vanishing- and exploding-gradient problems | Good at processing sequence data of any length | Complex structure; training and inference take a long time | [77-79] |
| Self-Attention | Automatically assigns different attention weights to different parts of the input sequence | Captures global interdependencies | Easily overfits on small datasets | [80-87] |

 

Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 (EER/AUC) | UCSD Ped2 (EER/AUC) | UMN (EER/AUC) | ShanghaiTech (EER/AUC) | Avenue (EER/AUC) | Year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Motion Feature | HAVA+HOF [35] | -- / -- | -- / -- | -- / 99.34 | -- / -- | -- / -- | 2020 |
| Motion Feature | SFA [37] | -- / -- | -- / 96.40 | -- / 96.55 | -- / -- | -- / -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- / -- | -- / 97.20 | 2.80 / 97.20 | 30.70 / 71.50 | -- / -- | 2020 |
| CNN | FCN [52] | -- / -- | 11.00 / -- | -- / -- | -- / -- | -- / -- | 2019 |
| CNN | ABDL [53] | 22.00 / -- | 16.00 / -- | 5.80 / 98.90 | -- / -- | 21.00 / 84.50 | 2020 |
| CNN | TS-CNN [54] | -- / -- | -- / -- | -- / 99.60 | -- / -- | -- / -- | 2020 |
| CNN | DSTCNN [55] | -- / 99.74 | -- / 99.94 | -- / -- | -- / -- | -- / -- | 2020 |
| CNN | LDA-Net [56] | -- / -- | 5.63 / 97.87 | -- / -- | -- / -- | -- / -- | 2020 |
| AE | CAE-UNet [62] | -- / -- | -- / 96.20 | -- / -- | -- / -- | -- / 86.90 | 2019 |
| AE | PMAE [64] | -- / -- | -- / 95.90 | -- / -- | -- / 72.90 | -- / -- | 2023 |
| AE | ISTL [65] | 29.80 / 75.20 | 8.90 / 91.10 | -- / -- | -- / -- | 29.20 / 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 / 94.25 | -- / -- | -- / 99.81 | -- / -- | -- / 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 / 90.50 | 11.50 / 90.70 | -- / -- | -- / -- | 17.60 / 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 / 90.70 | 6.00 / 96.30 | -- / -- | 17.00 / 85.30 | 15.30 / 88.60 | 2021 |
| GAN | D-UNet [74] | -- / 84.70 | -- / 96.30 | -- / -- | -- / 73.00 | -- / 85.10 | 2019 |
| GAN | BMAN [75] | -- / -- | -- / 96.60 | -- / 99.60 | -- / 76.20 | -- / 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- / -- | -- / -- | -- / 99.83 | -- / -- | -- / -- | 2021 |
| LSTM | FCN-LSTM [78] | -- / -- | -- / 98.20 | -- / 93.70 | -- / -- | -- / -- | 2021 |
| LSTM | CNN-LSTM [79] | -- / 94.83 | -- / 96.50 | -- / -- | -- / -- | -- / -- | 2022 |
| SA | SAFA [81] | -- / -- | -- / 96.80 | -- / -- | -- / -- | -- / 87.30 | 2023 |
| SA | SA-AE [82] | -- / -- | -- / 95.69 | -- / -- | -- / -- | -- / 84.10 | 2023 |
| SA | SABiAE [83] | -- / -- | 9.80 / 95.60 | -- / -- | -- / -- | 20.90 / 84.70 | 2022 |
| SA | SA-GAN [84] | -- / -- | -- / -- | -- / -- | -- / 75.70 | -- / 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 / 94.10 | 5.10 / 97.40 | -- / -- | 25.20 / 74.20 | 9.00 / 91.00 | 2024 |
| SA | SA-CNN [86] | -- / -- | -- / -- | -- / 99.29 | -- / -- | -- / -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion-feature methods) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a baseline for the comparison.”

Author action: We update and fix the references [lines 746-766, on page 21, lines 862-864, on pages 24-25].

 

Comments 4

There is a problem with the conclusions - it is not clear what scientific novelty item 1 carries: that is, if the authors came up with 5 levels of abnormal behavior, then a simple recalculation is not required, but what is the advantage over division by 2,3,4,6...? Point 2 - what are the most noticeable disadvantages of traditional methods? This is a conclusion - there must be an advantage (in what?) or a disadvantage (in what?) of something over something else. Remark! A numerical comparison should always be present in the conclusions, even in a review paper, otherwise: the text is a guide that tells us about the availability and application of different methods.

Response: Thank you for this professional comment.

We have restructured the conclusions section, comprehensively reviewed the entire text, and analyzed in depth the limitations of traditional methods and the advantages of deep learning methods. Numerical comparisons of the performance indicators of each algorithm are now included in the conclusions, based on Tables 4 and 6. For practical use, we suggest that an appropriate single model or a combination of multiple methods be selected according to specific requirements. Finally, we present an outlook on future research directions. The specific modifications are as follows:

“6. Conclusions

This review conducts a comprehensive and in-depth study of crowd abnormal behavior recognition technology, performing detailed analyses from four dimensions: basic definitions, traditional methods, deep learning, and performance indicators. Through a clear classification of crowd levels, "crowd density" is quantified and divided into five levels. Next, this study analyzes recognition techniques for traditional abnormal behavior based on statistical models, motion features, dynamic models, and cluster discrimination. Although these methods are effective in specific scenarios, they still have limitations in dealing with high-dimensional, nonlinear, and complex dynamic data, such as reliance on manually designed features and limited generalization ability.

With the development of AI technologies, deep learning has become a research hotspot because of its strong automatic feature learning ability and good adaptability to complex patterns. This article discusses in detail the methods based on the Convolutional Neural Network (CNN), Generative Adversarial Network (GAN), Long Short-Term Memory Network (LSTM), Autoencoder (AE), and Self-Attention Mechanism (SA), demonstrating their outstanding performance in abnormal behavior recognition. These methods can extract high-level features from raw data, and effectively distinguish between normal and abnormal behaviors, thereby improving recognition accuracy and robustness.

In conclusion, by comparing the performance indicators of each algorithm model in Table 4 and Table 6, it can be found that deep learning methods show significant advantages in dealing with complex and changeable scenarios. Although each model has its application scope and limitations, combining different methods can achieve better results. For example, algorithms such as S2-VAE, TS-CNN, and SA-CNN have achieved a recognition accuracy of more than 99% on some datasets in Table 6. Therefore, in practical applications, choosing an appropriate single model or combining multiple methods based on specific requirements can effectively enhance the overall performance of an abnormal behavior recognition system and achieve more accurate and reliable detection results.

Future research can focus on improving the fusion of deep learning and multi-modal data, introducing context understanding and situational reasoning, and improving the robustness and adaptive learning ability of models. This can further broaden the recognition of multi-source, multi-dimensional fused data, help construct a self-consistent metaverse virtual-real fusion evolution model, and establish a "perception-prediction-intervention-construction" virtual-real fusion presentation mechanism. Continuous innovation in this field is expected to achieve a transformation from passive monitoring to active prevention and to deliver more intelligent, precise, and automated abnormal behavior recognition systems.”

Author action: We update this manuscript with reconstructed conclusions [lines 890-927, on pages 25-26].

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

 

** Change the sentence "This paper is structured as follows:" to "The rest of this paper is structured as follows:", because the full paper also contains the introduction, where this last paragraph appears.

 

** It is true that the authors give summaries of each traditional method at the end of sub-section 4.1. However, I recommend that the authors highlight the common limits of all 5 methods and draw out the need for new methods.

 

** Add (.) after "Figure 8" in line 456.

 

** Change sentences such as "The expression is as follows" to "The expression is shown in the eq(x)", or something like that.

 

** Please correct this error in Tables 3, 4, and 5: "[Error! Reference source not found.-Error! Reference source not found.]".

 

** In this paper, the authors compare the methods, algorithms, and tools used for crowd abnormal behavior detection. However, they have neglected (did not give any information about) the datasets used in the literature. I recommend adding a section that details and compares the datasets used.

** The authors should also add a short conclusion at the end of the paper.

Author Response

Re: Response to Reviewer 3

To: Applied Science

Manuscript ID: applsci-3246529.R1

Original Article Title: “A Review of Crowd Abnormal Behavior Recognition Technology Based on Computer Vision”

 

Dear Reviewer,

Thank you for allowing a resubmission of our manuscript and for the opportunity to address your professional comments on our manuscript ID applsci-3246529.

We upload: (a) the point-by-point responses to your professional comments below, and (b) an updated manuscript (PDF main document) with the new changes highlighted in red according to your comments and kind suggestions.

Thank you very much for your kind consideration!

 

Best regards,

 

Rongyong Zhao

PhD, Associate Professor

CIMS Research Center, Tongji University, Shanghai, China

 

Comments 1

Change the sentence "This paper is structured as follows:" to "The rest of this paper is structured as follows:", because the full paper also contains the introduction, where this last paragraph appears.

Response: Thank you for this professional suggestion!

We have revised this description in the section of Introduction, as follows:

“The rest of this paper is structured as follows: Section 2 gives the definitions of crowd levels and abnormal behaviors; Section 3 introduces the main challenges faced by the abnormal behavior recognition task in dense crowd scenes; Section 4 comprehensively summarizes abnormal behavior recognition methods along the two dimensions of traditional methods and deep learning, and further introduces the current mainstream software tools; Section 5 introduces the datasets widely used in the field of abnormal behavior detection, both domestically and internationally, and the performance indicators of each algorithm on these datasets; Section 6 summarizes this paper and presents the future development trend of this research field. The article framework is shown in Figure 1.”

Author action: We update this manuscript with a corrected description [line 104, on page 3].

 

Comments 2

It is true that the authors give summaries of each traditional method at the end of sub-section 4.1. However, I recommend that the authors highlight the common limits of all 5 methods and draw out the need for new methods.

Response: Thank you for this kind suggestion!

We have revised the summary of Section 4.1, highlighting the common limitations of all four methods and clarifying the necessity of new methods, as follows:

“In summary, Table 3 summarizes the design idea, advantage, and disadvantage of the above abnormal behavior recognition technologies based on traditional methods. Although these traditional methods can effectively identify abnormal behaviors to a certain extent, a common limitation is that they still rely on manually designed features and artificially set thresholds, have limited adaptability to complex and changeable scenarios, and can be vulnerable to interference from environmental factors. In addition, traditional methods usually cannot automatically learn deep-seated behavior patterns, resulting in limited generalization ability and a tendency toward false positives and false negatives. Therefore, there is an urgent need for new methods with strong automatic feature learning ability, which adapt well to complex patterns and capture subtle differences in crowd behaviors more accurately.”

Author action: We update this manuscript with an improved summary [line 326-336, on page 8].

 

 

 

Comments 3

add (.)after Figure 8 . in the line 456

Response: Thank you for this careful check!

We have added a period (.) after Figure 8, as follows:

  • Methods based on similarity measurement are techniques used to evaluate and quantify the degree of similarity between two data objects, samples, sets, or vectors. In the field of crowd abnormal behavior recognition, abnormal behavior recognition is carried out by comparing the similarity between the original input image and the reconstructed image by the Convolutional Autoencoder (CAE), and the process is shown in Figure 8. The similarity between the original image and the reconstructed image is compared through similarity measurement techniques (such as Euclidean Distance [60], Pearson Correlation Coefficient [61], etc.). If the similarity is low, it indicates that there is abnormal behavior in the original image.
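For illustration, the comparison step described above can be sketched in a few lines of NumPy (the function name and the toy frames below are illustrative assumptions, not from the manuscript):

```python
import numpy as np

def similarity_scores(original, reconstructed):
    """Euclidean distance and Pearson correlation between an input frame
    and its CAE reconstruction; low similarity flags possible anomalies."""
    x = np.asarray(original, dtype=np.float64).ravel()
    y = np.asarray(reconstructed, dtype=np.float64).ravel()
    euclidean = float(np.linalg.norm(x - y))  # Euclidean distance [60]
    pearson = float(np.corrcoef(x, y)[0, 1])  # Pearson correlation [61]
    return euclidean, pearson

rng = np.random.default_rng(0)
frame = rng.random((8, 8))                          # toy "original" frame
recon = frame + 0.01 * rng.standard_normal((8, 8))  # faithful reconstruction
dist, r = similarity_scores(frame, recon)           # small dist, r near 1
```

In practice, a frame would be flagged as abnormal when the distance exceeds (or the correlation falls below) a threshold chosen on normal training data.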

Author action: We update this manuscript with the correct punctuation mark [line 468, on page 12].

 

Comments 4

change these sentences  "The expression is as follows"  by " The expression is shown in the eq(x)" // or something like that

Response: Thank you for the careful check.

We have revised the descriptions of all the formulas throughout this paper, as follows:

“During this process, the Mean Square Error (MSE) [58] is commonly used as the loss function. MSE is a statistical indicator that measures the difference between the predicted value and the true value, and the expression is shown in Eq. (1):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{1}$$

……

  • The forget gate determines the information to be retained. Based on $h_{t-1}$ and $x_t$, the forget gate outputs a number between 0 and 1 for the state $C_{t-1}$: 0 means elimination and 1 means retention. The expression is shown in Eq. (2):

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{2}$$

  • The input gate determines the newly incoming information. It is composed of a sigmoid function and a tanh function, and the product of their outputs is used to update the cell state. The expressions are shown in Eqs. (3), (4) and (5):

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{3}$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{4}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{5}$$

  • The output gate determines the output information. The sigmoid layer determines which parts of the state to output, and the cell state passed through the tanh layer is multiplied by the sigmoid output. The expressions are shown in Eqs. (6) and (7):

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{6}$$

$$h_t = o_t \odot \tanh\left(C_t\right) \tag{7}$$
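For concreteness, one LSTM time step implementing Eqs. (2)–(7) can be sketched in NumPy as follows (the stacked weight layout and the shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step, Eqs. (2)-(7).

    W maps the concatenated [h_{t-1}, x_t] to the four gate pre-activations
    (forget, input, candidate, output) stacked row-wise; b holds the biases.
    """
    h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:h])               # Eq. (2): forget gate
    i_t = sigmoid(z[h:2 * h])           # Eq. (3): input gate
    c_tilde = np.tanh(z[2 * h:3 * h])   # Eq. (4): candidate state
    c_t = f_t * c_prev + i_t * c_tilde  # Eq. (5): cell-state update
    o_t = sigmoid(z[3 * h:4 * h])       # Eq. (6): output gate
    h_t = o_t * np.tanh(c_t)            # Eq. (7): hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = rng.standard_normal((4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h_t, c_t = lstm_step(rng.standard_normal(inputs),
                     np.zeros(hidden), np.zeros(hidden), W, b)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every component of the hidden state is bounded in magnitude by 1.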

……

Stage 1 calculates the similarity between the Query and each Key. Common methods include the vector dot product, cosine similarity, and an MLP network. The expressions are shown in Eqs. (8), (9) and (10):

  • Vector dot product

$$\mathrm{Sim}(Q, K_i) = Q \cdot K_i \tag{8}$$

  • Cosine similarity

$$\mathrm{Sim}(Q, K_i) = \frac{Q \cdot K_i}{\lVert Q \rVert \, \lVert K_i \rVert} \tag{9}$$

  • MLP network

$$\mathrm{Sim}(Q, K_i) = \mathrm{MLP}(Q, K_i) \tag{10}$$

Stage 2 introduces SoftMax for normalization to obtain the probability distribution over all elements. The expression is shown in Eq. (11):

$$a_i = \mathrm{SoftMax}\left(\mathrm{Sim}_i\right) = \frac{e^{\mathrm{Sim}_i}}{\sum_{j=1}^{L_x} e^{\mathrm{Sim}_j}} \tag{11}$$

Stage 3 performs a weighted summation to obtain the Attention value. The expression is shown in Eq. (12):

$$\mathrm{Attention}\left(Q, \mathrm{Source}\right) = \sum_{i=1}^{L_x} a_i \cdot V_i \tag{12}$$
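The three stages can be sketched in NumPy using the dot-product similarity of Eq. (8) (the array shapes are illustrative assumptions):

```python
import numpy as np

def softmax(s):
    """Eq. (11): normalize similarities into a probability distribution."""
    e = np.exp(s - s.max())  # subtract the max for numerical stability
    return e / e.sum()

def attention(query, keys, values):
    sims = keys @ query      # Stage 1, Eq. (8): Sim(Q, K_i) = Q . K_i
    weights = softmax(sims)  # Stage 2, Eq. (11): attention weights a_i
    return weights @ values  # Stage 3, Eq. (12): weighted sum of Values

rng = np.random.default_rng(0)
keys = rng.standard_normal((5, 8))      # 5 Key vectors of dimension 8
values = rng.standard_normal((5, 8))    # matching Value vectors
out = attention(keys[2], keys, values)  # query identical to the 3rd key
```

Swapping cosine similarity (Eq. (9)) or an MLP (Eq. (10)) into Stage 1 leaves Stages 2 and 3 unchanged.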

Author action: We correct descriptions of all the formulas throughout this paper.

 

Comments 5

Please correct this error in the tables 3, 4 and 5 “[Error! Reference source not found.-Error! Reference source not found.]” 

Response: Thank you for this professional suggestion!

Considering that a new table (Table 2 in the current version) was added at the suggestion of another reviewer, we have fixed the references in Table 4, Table 5, and Table 6 (corresponding to Table 3, Table 4, and Table 5 in the previous version, respectively). The specific modifications are as follows:

Table 4. The classification and characteristics of deep-learning-based abnormal behavior recognition methods

| Method | Design Ideas | Advantages | Limitations | References |
|---|---|---|---|---|
| CNN | Captures the spatial hierarchical features of the raw data through local receptive fields and pooling operations | Good at extracting local features and combining them into more complex patterns | Lacks time-series processing capabilities | [42-56] |
| AutoEncoder | An unsupervised learning method that captures the main features of the data by encoding and decoding the raw data | Reduces data dimensionality and extracts key features | Easily overfits the training data | [57-70] |
| GAN | Composed of a generator and a discriminator that play an adversarial game against each other | Reconstructs and generates new samples close to real data | Prone to mode collapse and non-convergence during training | [71-76] |
| LSTM | Introduces forget, input, and output gates to solve the vanishing- and exploding-gradient problems | Good at processing sequence data of any length | Complex structure; training and inference take a long time | [77-79] |
| Self-Attention | Automatically assigns different attention weights to different parts of the input sequence | Captures global interdependencies | Easily overfits on small datasets | [80-87] |

 

Table 5. Comparison of crowd anomaly behavior datasets

| Name | Scene Description | Scale | Resolution | Abnormal Behavior | Object | Limitation |
|---|---|---|---|---|---|---|
| UCSD [88] | Crowd movement on the sidewalk from the perspective of a surveillance camera | Ped1: 14,000 frames (34 training segments, 36 test segments); Ped2: 4,560 frames (12 test segments) | Ped1: 238×158; Ped2: 360×240 | Fast moving, reverse driving, riding a bicycle, driving a car, sitting in a wheelchair, skateboarding, etc. | Individual | Low resolution; few types of abnormal behaviors; relatively simple background |
| UMN [89] | Video clips of pedestrian activities against different backgrounds such as campuses, shopping malls, and streets | 8,010 frames (11 video segments) | 320×240 | The crowd suddenly scatters, runs, and gathers | Group | Simple background; limited abnormal types |
| Shanghai-Tech [90] | 13 campus-area scenes with complex lighting conditions and camera angles | 317,398 frames (130 video segments) | 856×480 | Crowd gathering, fighting, running, cycling | Group | Abnormal events are repetitive; the annotations contain errors |
| CUHK-Avenue [91] | Video surveillance clips of outdoor public places | 30,625 frames (16 training segments, 21 test segments) | 640×360 | Pedestrians fighting, throwing objects, running | Individual, vehicle | Single shooting angle; low resolution |

 

 

Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 EER | UCSD Ped1 AUC | UCSD Ped2 EER | UCSD Ped2 AUC | UMN EER | UMN AUC | ShanghaiTech EER | ShanghaiTech AUC | Avenue EER | Avenue AUC | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Motion Feature | HAVA+HOF [35] | -- | -- | -- | -- | -- | 99.34 | -- | -- | -- | -- | 2020 |
| Motion Feature | SFA [37] | -- | -- | -- | 96.40 | -- | 96.55 | -- | -- | -- | -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- | -- | -- | 97.20 | 2.80 | 97.20 | 30.70 | 71.50 | -- | -- | 2020 |
| CNN | FCN [52] | -- | -- | 11.00 | -- | -- | -- | -- | -- | -- | -- | 2019 |
| CNN | ABDL [53] | 22.00 | -- | 16.00 | -- | 5.80 | 98.90 | -- | -- | 21.00 | 84.50 | 2020 |
| CNN | TS-CNN [54] | -- | -- | -- | -- | -- | 99.60 | -- | -- | -- | -- | 2020 |
| CNN | DSTCNN [55] | -- | 99.74 | -- | 99.94 | -- | -- | -- | -- | -- | -- | 2020 |
| CNN | LDA-Net [56] | -- | -- | 5.63 | 97.87 | -- | -- | -- | -- | -- | -- | 2020 |
| AE | CAE-UNet [62] | -- | -- | -- | 96.20 | -- | -- | -- | -- | -- | 86.90 | 2019 |
| AE | PMAE [64] | -- | -- | -- | 95.90 | -- | -- | -- | 72.90 | -- | -- | 2023 |
| AE | ISTL [65] | 29.80 | 75.20 | 8.90 | 91.10 | -- | -- | -- | -- | 29.20 | 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 | 94.25 | -- | -- | -- | 99.81 | -- | -- | -- | 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 | 90.50 | 11.50 | 90.70 | -- | -- | -- | -- | 17.60 | 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 | 90.70 | 6.00 | 96.30 | -- | -- | 17.00 | 85.30 | 15.30 | 88.60 | 2021 |
| GAN | D-UNet [74] | -- | 84.70 | -- | 96.30 | -- | -- | -- | 73.00 | -- | 85.10 | 2019 |
| GAN | BMAN [75] | -- | -- | -- | 96.60 | -- | 99.60 | -- | 76.20 | -- | 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- | -- | -- | -- | -- | 99.83 | -- | -- | -- | -- | 2021 |
| LSTM | FCN-LSTM [78] | -- | -- | -- | 98.20 | -- | 93.70 | -- | -- | -- | -- | 2021 |
| LSTM | CNN-LSTM [79] | -- | 94.83 | -- | 96.50 | -- | -- | -- | -- | -- | -- | 2022 |
| SA | SAFA [81] | -- | -- | -- | 96.80 | -- | -- | -- | -- | -- | 87.30 | 2023 |
| SA | SA-AE [82] | -- | -- | -- | 95.69 | -- | -- | -- | -- | -- | 84.10 | 2023 |
| SA | SABiAE [83] | -- | -- | 9.80 | 95.60 | -- | -- | -- | -- | 20.90 | 84.70 | 2022 |
| SA | SA-GAN [84] | -- | -- | -- | -- | -- | -- | -- | 75.70 | -- | 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 | 94.10 | 5.10 | 97.40 | -- | -- | 25.20 | 74.20 | 9.00 | 91.00 | 2024 |
| SA | SA-CNN [86] | -- | -- | -- | -- | -- | 99.29 | -- | -- | -- | -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion features) and DBSCAN (cluster discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a baseline for the comparison. ”
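For reference, the frame-level AUC and EER reported above can be computed from per-frame anomaly scores and ground-truth labels; a minimal NumPy sketch (the toy labels and scores below are illustrative, not taken from any of the cited experiments):

```python
import numpy as np

def frame_level_auc_eer(labels, scores):
    """Frame-level AUC and EER from per-frame anomaly scores.

    labels: 1 = abnormal frame, 0 = normal; higher score = more anomalous.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    thresholds = np.sort(np.unique(scores))[::-1]
    tpr = np.array([(scores[labels == 1] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    fpr = np.concatenate([[0.0], fpr])  # start the ROC curve at (0, 0)
    tpr = np.concatenate([[0.0], tpr])
    # Trapezoidal area under the ROC curve.
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    # EER: the FPR at the point where FPR is closest to FNR = 1 - TPR.
    eer = float(fpr[np.argmin(np.abs((1.0 - tpr) - fpr))])
    return auc, eer

labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
auc_val, eer_val = frame_level_auc_eer(labels, scores)
```

A higher AUC and a lower EER both indicate better separation of abnormal from normal frames, which is why the two indicators appear together throughout Table 6.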

 

Author action: We update and fix the references [lines 746-766, on page 21, lines 829-830, on page 23, lines 862-864, on pages 24-25].

 

Comments 6

In this paper authors compare between the used methods, algorithms and tools for crowd abnormal behavior detection. However, they have neglected (didn’t give any information about) the used datasets in the literature. I recommend to add a section where you can detail and compare between the used datasets

Response: Thank you for this professional suggestion!

We elaborated on the datasets used in the literature related to traditional methods and deep learning methods in Section 5.3, and provided detailed explanations and comparisons of each dataset in Section 5.1. Therefore, this part has not been adjusted, as follows:

“ 5.1. Experimental Dataset Analysis

In the field of computer vision, due to the diversity and complexity of real abnormal behaviors in the research of crowd abnormal behavior analysis, there is a lack of high-quality datasets. To overcome this shortage and improve the performance of algorithms, researchers have constructed a series of datasets with different characteristics, focusing on parameters such as duration, size, resolution, etc., and covering various monitoring environments and scenarios, providing an important reference basis for the research of crowd anomaly recognition. Commonly used datasets mainly include UCSD, UMN, Shanghai-Tech, CUHK-Avenue, and other related datasets. Figure 25 shows some abnormal behavior samples in these four abnormal behavior datasets.

5.1.1. UCSD

The UCSD dataset [88] is a dataset for crowd behavior analysis and abnormal detection. The dataset contains two subsets, Ped1 and Ped2. The Ped1 subset includes 34 video clips, recording the pedestrian activities from the campus crosswalk scene; the Ped2 subset provides 12 video clips of the same specification, presenting similar but different pedestrian crossing area activities.

5.1.2. UMN

The UMN dataset [89] is mainly used for the evaluation and research of pedestrian detection and tracking algorithms. This dataset consists of 19 videos, including a series of pedestrian video sequences in complex backgrounds, providing various indoor and outdoor environments such as campuses, shopping malls, streets, and other pedestrian activity video clips in different backgrounds.

5.1.3. ShanghaiTech

The ShanghaiTech dataset [90] is a dataset for video abnormal detection and crowd counting. This dataset has 13 complex scenes, recording the pedestrian flow on the campus. It also contains 130 abnormal events (such as fighting, falling, running, etc.) and more than 270,000 training frames, with a duration ranging from about 30 seconds to 90 seconds.

5.1.4. CUHK-Avenue

The CUHK database [91] is a database about crowd behavior scenes. It includes crowd videos collected from many different environments with different densities and perspective ratios, such as streets, shopping centers, airports, and parks. It consists of traffic datasets and pedestrian datasets. There are a total of 474 video clips in 215 scenes in the dataset.

The details are further presented in Table 5 (shown in full in the response to Comment 5 above).

 

Author action: We update the manuscript with the reconstructed sub-section and the comparison table of crowd anomaly behavior datasets. [lines 791-823, on page 22, lines 829-830, on page 23]

 

Comments 7

Authors should add also a short conclusion at the end of the paper

Response: Thank you for this kind suggestion!

We have restructured the conclusions section: it now overviews the structure of the entire paper, summarizes the limitations of traditional methods and the advantages of deep learning methods, suggests that an appropriate single model or a combination of multiple methods can be selected according to specific requirements, and discusses future research directions. The specific modifications are as follows:

6. Conclusions

This review conducts a comprehensive and in-depth study of crowd abnormal behavior recognition technology, performing detailed analyses from four dimensions: basic definitions, traditional methods, deep learning, and performance indicators. Through a clear classification of crowd levels, "crowd density" is quantified and divided into five levels. Next, this study analyzes traditional abnormal behavior recognition techniques based on statistical models, motion features, dynamic models, and cluster discrimination. Although these methods are effective in specific scenarios, they still have limitations in dealing with high-dimensional, nonlinear, and complex dynamic data, such as reliance on manually designed features and limited generalization ability.

With the development of AI technologies, deep learning has become a research hotspot because of its strong automatic feature learning ability and good adaptability to complex patterns. This article discusses in detail the methods based on the Convolutional Neural Network (CNN), Generative Adversarial Network (GAN), Long Short-Term Memory Network (LSTM), Autoencoder (AE), and Self-Attention Mechanism (SA), demonstrating their outstanding performance in abnormal behavior recognition. These methods can extract high-level features from raw data, and effectively distinguish between normal and abnormal behaviors, thereby improving recognition accuracy and robustness.

In conclusion, by comparing the performance indicators of the algorithm models in Table 4 and Table 6, it can be seen that deep learning methods show significant advantages in dealing with complex and changeable scenarios. Although each model has its own application scope and limitations, combining different methods can achieve better results. For example, algorithms such as S2-VAE, TS-CNN, and SA-CNN achieve a recognition accuracy of more than 99% on some of the datasets in Table 6. Therefore, in practical applications, choosing an appropriate single model, or combining multiple methods, based on specific requirements can effectively enhance the overall performance of an abnormal behavior recognition system and achieve more accurate and reliable detection results.

Future research can focus on improving the fusion of deep learning with multi-modal data, introducing context understanding and situational reasoning, and improving the robustness and adaptive learning ability of models. This can further broaden the recognition of multi-source, multi-dimensional fused data, help construct a self-consistent metaverse virtual-real fusion evolution model, and establish a "perception-prediction-intervention-construction" virtual-real fusion presentation mechanism. Continuous innovation in this field is expected to achieve a transformation from passive monitoring to active prevention and to deliver more intelligent, precise, and automated abnormal behavior recognition systems.”

Author action: We update this manuscript with reconstructed conclusions [line 890-927, on pages 25-26].

Author Response File: Author Response.pdf
