A Review of Abnormal Crowd Behavior Recognition Technology Based on Computer Vision
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript reviews recent methods that utilize computer vision to recognize abnormal behaviors in crowds. The paper presents the definition of the task, its main challenges, and the current technology employed for the recognition. The authors also provide a comparison of the performance of some methods in terms of evaluation indicators. The paper is generally well-written and presents a comprehensive analysis. However, this reviewer has some minor concerns to be addressed:
1. Table 5 shows the performance comparison of the deep-learning-based models. However, this reviewer suggests addressing the performance of the traditional methods in the table to get a baseline in the comparison. It could highlight the main challenges of the task described in the manuscript.
2. Please fix the references in the fourth column of Table 3 (References), the first column of Table 4 (Name) and the second column of Table 5 (Method).
Author Response
Re: Response to Reviewer 1
To: Applied Sciences
Manuscript ID: applsci-3246529.R1
Original Article Title: “A Review of Crowd Abnormal Behavior Recognition Technology Based on Computer Vision”
Dear Reviewer,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address your professional comments on our manuscript ID applsci-3246529.
We upload: (a) the point-by-point responses to your professional comments below, and (b) an updated manuscript with new changes highlighted in red (PDF main document) according to your comments and kind suggestions.
Thank you very much for your kind consideration!
Best regards,
Rongyong Zhao
PhD, Associate Professor
CIMS Research Center, Tongji University, Shanghai, China
Comments 1
Table 5 shows the performance comparison of the deep-learning-based models. However, this reviewer suggests addressing the performance of the traditional methods in the table to get a baseline in the comparison. It could highlight the main challenges of the task described in the manuscript.
Response: Thank you for this careful checking and kind suggestion!
Since a new table (Table 2 in the current version) was added at another reviewer's request, we have reviewed the performance indicators reported in the literature for traditional methods and added these data to Table 6 (corresponding to Table 5 in the previous version), so that readers can more clearly compare the performance baseline of traditional methods against deep learning methods. This also supports the further selection of abnormal behavior recognition methods. The specific modifications are as follows:
“ Table 6 summarizes the performance of traditional and deep learning-based abnormal behavior recognition methods on the UCSD, UMN, ShanghaiTech, and Avenue datasets in recent years, comparing evaluation indicators such as AUC and EER. All evaluation data are cited from research papers published in recent years.

Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 (EER / AUC) | UCSD Ped2 (EER / AUC) | UMN (EER / AUC) | ShanghaiTech (EER / AUC) | Avenue (EER / AUC) | Year |
|---|---|---|---|---|---|---|---|
| Motion Feature | HAVA+HOF [35] | -- / -- | -- / -- | -- / 99.34 | -- / -- | -- / -- | 2020 |
| Motion Feature | SFA [37] | -- / -- | -- / 96.40 | -- / 96.55 | -- / -- | -- / -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- / -- | -- / 97.20 | 2.80 / 97.20 | 30.70 / 71.50 | -- / -- | 2020 |
| CNN | FCN [52] | -- / -- | 11.00 / -- | -- / -- | -- / -- | -- / -- | 2019 |
| CNN | ABDL [53] | 22.00 / -- | 16.00 / -- | 5.80 / 98.90 | -- / -- | 21.00 / 84.50 | 2020 |
| CNN | TS-CNN [54] | -- / -- | -- / -- | -- / 99.60 | -- / -- | -- / -- | 2020 |
| CNN | DSTCNN [55] | -- / 99.74 | -- / 99.94 | -- / -- | -- / -- | -- / -- | 2020 |
| CNN | LDA-Net [56] | -- / -- | 5.63 / 97.87 | -- / -- | -- / -- | -- / -- | 2020 |
| AE | CAE-UNet [62] | -- / -- | -- / 96.20 | -- / -- | -- / -- | -- / 86.90 | 2019 |
| AE | PMAE [64] | -- / -- | -- / 95.90 | -- / -- | -- / 72.90 | -- / -- | 2023 |
| AE | ISTL [65] | 29.80 / 75.20 | 8.90 / 91.10 | -- / -- | -- / -- | 29.20 / 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 / 94.25 | -- / -- | -- / 99.81 | -- / -- | -- / 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 / 90.50 | 11.50 / 90.70 | -- / -- | -- / -- | 17.60 / 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 / 90.70 | 6.00 / 96.30 | -- / -- | 17.00 / 85.30 | 15.30 / 88.60 | 2021 |
| GAN | D-UNet [74] | -- / 84.70 | -- / 96.30 | -- / -- | -- / 73.00 | -- / 85.10 | 2019 |
| GAN | BMAN [75] | -- / -- | -- / 96.60 | -- / 99.60 | -- / 76.20 | -- / 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- / -- | -- / -- | -- / 99.83 | -- / -- | -- / -- | 2021 |
| LSTM | FCN-LSTM [78] | -- / -- | -- / 98.20 | -- / 93.70 | -- / -- | -- / -- | 2021 |
| LSTM | CNN-LSTM [79] | -- / 94.83 | -- / 96.50 | -- / -- | -- / -- | -- / -- | 2022 |
| SA | SAFA [81] | -- / -- | -- / 96.80 | -- / -- | -- / -- | -- / 87.30 | 2023 |
| SA | SA-AE [82] | -- / -- | -- / 95.69 | -- / -- | -- / -- | -- / 84.10 | 2023 |
| SA | SABiAE [83] | -- / -- | 9.80 / 95.60 | -- / -- | -- / -- | 20.90 / 84.70 | 2022 |
| SA | SA-GAN [84] | -- / -- | -- / -- | -- / -- | -- / 75.70 | -- / 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 / 94.10 | 5.10 / 97.40 | -- / -- | 25.20 / 74.20 | 9.00 / 91.00 | 2024 |
| SA | SA-CNN [86] | -- / -- | -- / -- | -- / 99.29 | -- / -- | -- / -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion features) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a traditional-method baseline for the comparison.”
Author action: We update this manuscript with baseline performance indicators of traditional methods [lines 858-864, on pages 24-25].
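For readers who want to reproduce the frame-level indicators in Table 6, both metrics can be computed directly from per-frame anomaly scores and ground-truth labels. The pure-Python sketch below is illustrative only (it is not code from the paper): AUC via the rank-sum formulation, and EER as the error rate where the false positive and false negative rates meet.

```python
def frame_level_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a random abnormal frame scores higher than a random normal
    frame, with ties counting one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def frame_level_eer(scores, labels):
    """EER: sweep decision thresholds and report the average error rate
    at the point where false positive rate and false negative rate
    are closest to equal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)   # normal frames flagged
        fnr = sum(p < t for p in pos) / len(pos)    # abnormal frames missed
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

A perfectly separating detector yields AUC 1.0 and EER 0.0, matching the intuition behind the best entries in Table 6.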
Comments 2
Please fix the references in the fourth column of Table 3 (References), the first column of Table 4 (Name) and the second column of Table 5 (Method).
Response: Thank you for this professional suggestion!
Since a new table (Table 2 in the current version) was added at another reviewer's request, we have fixed the references in Table 4, Table 5, and Table 6 (corresponding to Table 3, Table 4, and Table 5 in the previous version, respectively). The specific modifications are as follows:
“ Table 4. The classification and characteristics of deep learning behavior anomaly recognition methods

| Method | Design Ideas | Advantages | Limitations | References |
|---|---|---|---|---|
| CNN | Captures the spatial hierarchy of features in the raw data through local receptive fields and pooling operations | Good at extracting local features and combining them into more complex patterns | Lacks temporal-sequence processing capability | [42-56] |
| AutoEncoder | An unsupervised learning method that captures the main features of the data by encoding and decoding it | Reduces data dimensionality and extracts key features | Prone to overfitting the training data | [57-70] |
| GAN | Composed of a generator and a discriminator that play an adversarial game | Reconstructs and generates new samples close to real data | Prone to mode collapse and training non-convergence | [71-76] |
| LSTM | Introduces forget, input, and output gates to alleviate vanishing and exploding gradients | Good at processing sequences of arbitrary length | Complex structure; training and inference are slow | [77-79] |
| Self-Attention | Automatically assigns different attention weights to different parts of the input sequence | Captures global interdependencies | Prone to overfitting on small datasets | [80-87] |
”
“ Table 5. Comparison of crowd anomaly behavior datasets

| Name | Scene Description | Scale | Resolution | Abnormal Behavior | Object | Limitation |
|---|---|---|---|---|---|---|
| UCSD [88] | Crowd movement on sidewalks from the perspective of a surveillance camera | Ped1: 14,000 frames, 34 training / 36 test segments; Ped2: 4,560 frames, 12 test segments | Ped1: 238×158; Ped2: 360×240 | Fast movement, walking against the flow, cycling, driving a car, sitting in a wheelchair, skateboarding, etc. | Individual | Low resolution; few types of abnormal behavior; relatively simple background |
| UMN [89] | Video clips of pedestrian activity in different settings such as campuses, shopping malls, and streets | 8,010 frames, 11 video segments | 320×240 | Crowds suddenly scattering, running, and gathering | Group | Simple background; limited abnormal types |
| ShanghaiTech [90] | 13 campus scenes with complex lighting conditions and camera angles | 317,398 frames, 130 video segments | 856×480 | Crowd gathering, fighting, running, cycling | Group | Abnormal events are repetitive; annotations contain errors |
| CUHK Avenue [91] | Video surveillance clips of outdoor public places | 30,625 frames, 16 training / 21 test segments | 640×360 | Pedestrians fighting, throwing objects, running | Individual, vehicle | Single shooting angle; low resolution |
”
“ Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 (EER / AUC) | UCSD Ped2 (EER / AUC) | UMN (EER / AUC) | ShanghaiTech (EER / AUC) | Avenue (EER / AUC) | Year |
|---|---|---|---|---|---|---|---|
| Motion Feature | HAVA+HOF [35] | -- / -- | -- / -- | -- / 99.34 | -- / -- | -- / -- | 2020 |
| Motion Feature | SFA [37] | -- / -- | -- / 96.40 | -- / 96.55 | -- / -- | -- / -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- / -- | -- / 97.20 | 2.80 / 97.20 | 30.70 / 71.50 | -- / -- | 2020 |
| CNN | FCN [52] | -- / -- | 11.00 / -- | -- / -- | -- / -- | -- / -- | 2019 |
| CNN | ABDL [53] | 22.00 / -- | 16.00 / -- | 5.80 / 98.90 | -- / -- | 21.00 / 84.50 | 2020 |
| CNN | TS-CNN [54] | -- / -- | -- / -- | -- / 99.60 | -- / -- | -- / -- | 2020 |
| CNN | DSTCNN [55] | -- / 99.74 | -- / 99.94 | -- / -- | -- / -- | -- / -- | 2020 |
| CNN | LDA-Net [56] | -- / -- | 5.63 / 97.87 | -- / -- | -- / -- | -- / -- | 2020 |
| AE | CAE-UNet [62] | -- / -- | -- / 96.20 | -- / -- | -- / -- | -- / 86.90 | 2019 |
| AE | PMAE [64] | -- / -- | -- / 95.90 | -- / -- | -- / 72.90 | -- / -- | 2023 |
| AE | ISTL [65] | 29.80 / 75.20 | 8.90 / 91.10 | -- / -- | -- / -- | 29.20 / 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 / 94.25 | -- / -- | -- / 99.81 | -- / -- | -- / 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 / 90.50 | 11.50 / 90.70 | -- / -- | -- / -- | 17.60 / 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 / 90.70 | 6.00 / 96.30 | -- / -- | 17.00 / 85.30 | 15.30 / 88.60 | 2021 |
| GAN | D-UNet [74] | -- / 84.70 | -- / 96.30 | -- / -- | -- / 73.00 | -- / 85.10 | 2019 |
| GAN | BMAN [75] | -- / -- | -- / 96.60 | -- / 99.60 | -- / 76.20 | -- / 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- / -- | -- / -- | -- / 99.83 | -- / -- | -- / -- | 2021 |
| LSTM | FCN-LSTM [78] | -- / -- | -- / 98.20 | -- / 93.70 | -- / -- | -- / -- | 2021 |
| LSTM | CNN-LSTM [79] | -- / 94.83 | -- / 96.50 | -- / -- | -- / -- | -- / -- | 2022 |
| SA | SAFA [81] | -- / -- | -- / 96.80 | -- / -- | -- / -- | -- / 87.30 | 2023 |
| SA | SA-AE [82] | -- / -- | -- / 95.69 | -- / -- | -- / -- | -- / 84.10 | 2023 |
| SA | SABiAE [83] | -- / -- | 9.80 / 95.60 | -- / -- | -- / -- | 20.90 / 84.70 | 2022 |
| SA | SA-GAN [84] | -- / -- | -- / -- | -- / -- | -- / 75.70 | -- / 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 / 94.10 | 5.10 / 97.40 | -- / -- | 25.20 / 74.20 | 9.00 / 91.00 | 2024 |
| SA | SA-CNN [86] | -- / -- | -- / -- | -- / 99.29 | -- / -- | -- / -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion features) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a traditional-method baseline for the comparison.
”
Author action: We update and fix the references [lines 746-766, on page 21,lines 829-830, on page 23, lines 862-864, on pages 24-25].
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
· Introduction. The review is good, the links are correctly formatted.
· Overview of Chapter 2 - When enumerating the main problems that need to be solved, it is appropriate to mention the problem of price-quality. In particular, from a hardware point of view: the higher the resolution, the more accurate the classification, and a larger area with more objects can be captured. However, the downside is computer resources when calculating the algorithm in real-time. Therefore, in this section, I recommend mentioning the need to determine the optimal parameters for the system. In the section evaluating the effectiveness of each method, provide a table with the recommended optimal parameters for certain areas (the size of the crowd sample that the machine can optimally process) and sensitivity requirements. This will allow the reader to understand the cost of the studied algorithms. For example, the pictures in Fig. 20 are positioned as a crowd, but if there is a real crowd in front of those cameras (not 5-7 people) and a camera with a resolution of 238x158, then most likely the system will not see the problem. Therefore, the author's assessment of the optimality of resolution and sensitivity will help to estimate also the resource intensity of the studied methods.
· Fig. 6: even at maximum magnification, the formulas in the yellow square are hard to see - fix it.
· In table 3 and table 5 there is "Error! Reference source not found". It needs to be corrected
· Various concepts and methods of solving the problem of classification of situations with threatening consequences are well analyzed in the work.
· There is a problem with the conclusions - it is not clear what scientific novelty item 1 carries: that is, if the authors came up with 5 levels of abnormal behavior, then a simple recalculation is not required, but what is the advantage over division by 2,3,4,6...? Point 2 - what are the most noticeable disadvantages of traditional methods? This is a conclusion - there must be an advantage (in what?) or a disadvantage (in what?) of something over something else. Remark! A numerical comparison should always be present in the conclusions, even in a review paper, otherwise: the text is a guide that tells us about the availability and application of different methods.
I recommend printing after corrections.
Author Response
Re: Response to Reviewer 2
To: Applied Sciences
Manuscript ID: applsci-3246529.R1
Original Article Title: “A Review of Crowd Abnormal Behavior Recognition Technology Based on Computer Vision”
Dear Reviewer,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address your professional comments on our manuscript ID applsci-3246529.
We upload: (a) the point-by-point responses to your professional comments below, and (b) an updated manuscript with new changes highlighted in red (PDF main document) according to your comments and kind suggestions.
Thank you very much for your kind consideration!
Best regards,
Rongyong Zhao
PhD, Associate Professor
CIMS Research Center, Tongji University, Shanghai, China
Comments 1
Overview of Chapter 2 - When enumerating the main problems that need to be solved, it is appropriate to mention the problem of price-quality. In particular, from a hardware point of view: the higher the resolution, the more accurate the classification, and a larger area with more objects can be captured. However, the downside is computer resources when calculating the algorithm in real-time. Therefore, in this section, I recommend mentioning the need to determine the optimal parameters for the system. In the section evaluating the effectiveness of each method, provide a table with the recommended optimal parameters for certain areas (the size of the crowd sample that the machine can optimally process) and sensitivity requirements. This will allow the reader to understand the cost of the studied algorithms. For example, the pictures in Fig. 20 are positioned as a crowd, but if there is a real crowd in front of those cameras (not 5-7 people) and a camera with a resolution of 238x158, then most likely the system will not see the problem. Therefore, the author's assessment of the optimality of resolution and sensitivity will help to estimate also the resource intensity of the studied methods.
Response: Thank you for your professional analysis of the price-quality problem! We fully agree.
To allow readers to understand the cost of the studied algorithms, we have added a new subsection in Section 3 (Recommended parameters for hardware resources) explaining why the parameters of an abnormal behavior recognition system must be chosen with hardware cost in mind. A new table lists the recommended parameters for several region types: crowd sample sizes, sensitivity levels, IPC resolutions, frame rates, and GPU requirements. This will enable readers to understand the cost of the studied algorithms. The specific modifications are as follows:
“3.3 Recommended parameters for hardware resources
In high-density crowd scenarios, efficient crowd behavior recognition faces not only technical challenges but also hardware resource requirements. From a hardware perspective, higher resolution enables more accurate classification and captures larger areas with more objects, thereby improving recognition accuracy. However, high-resolution images increase the computational resources required by real-time algorithms and place higher demands on processing capability. Therefore, combining current industry trends with actual needs, we provide best practices for abnormal behavior recognition systems. Table 2 summarizes the recommended camera parameter configurations for specific areas such as train station waiting rooms, subway platforms, popular shopping plazas, and sports venue entrances, aiming to keep overall costs under control while meeting performance requirements.
Table 2. The recommended parameters for hardware resources

| Region type | Crowd sample | Sensitivity level | Recommended IPC resolution | Lens requirement | Frame rate (fps) | Recommended GPU | Remarks |
|---|---|---|---|---|---|---|---|
| Train station waiting room | 500-1000 | High | 3840×2160 | Wide-angle to standard focal length | 25-30 | NVIDIA RTX 3060 | High crowd density; a high sensitivity level is required to ensure safety |
| Subway platform | 100-200 | Middle | 2560×1440 | Standard focal length | 25-30 | NVIDIA GTX 1080 | Medium-to-high crowd density; needs to balance false positives and false negatives |
| Popular shopping plaza | 800-1200 | High | 3840×2160 | Wide-angle focal length | 20-25 | NVIDIA RTX 3060 | Large open space; a certain error range is acceptable |
| Entrance of sports venue | 200-500 | Middle | 2560×1440 / 1920×1080 | Standard to wide-angle focal length; infrared night vision | 25-30 | NVIDIA RTX 2070 | Extremely high crowd density is possible during special events |
”
Author action: We update this manuscript with new sub-section [lines 194-207, on page 4].
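The resolution/frame-rate cost trade-off behind Table 2 is easy to quantify: the raw pixel throughput a camera feeds into the recognition pipeline grows with both resolution and frame rate. The sketch below uses the configurations from Table 2; the throughput formula is only a first-order proxy for real-time compute load, not a claim from the paper.

```python
def pixel_throughput(width, height, fps):
    # Pixels the pipeline must ingest per second: a first-order
    # proxy for the real-time compute load of one camera stream.
    return width * height * fps

# Configurations from Table 2 (train station waiting room vs. subway platform).
uhd = pixel_throughput(3840, 2160, 30)   # 4K at 30 fps
qhd = pixel_throughput(2560, 1440, 30)   # 1440p at 30 fps

# At the same frame rate, 4K carries 2.25x the pixel load of 1440p,
# which is why the 4K regions in Table 2 are paired with a stronger GPU.
ratio = uhd / qhd
```

This kind of back-of-the-envelope estimate lets a reader translate the recommended camera parameters into relative hardware cost before benchmarking any specific model.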
Comments 2
Fig. 6: even at maximum magnification, the formulas in the yellow square are hard to see - fix it.
Response: Thank you for this careful checking.
We have carefully reviewed all figures throughout the manuscript and replaced them with high-resolution versions. Additionally, we have ensured that they retain their quality and do not display noticeable distortion when zoomed in. The specific modifications are as follows:
Figure 6. The structure of LDA-Net
Author action: We update this manuscript with a high-resolution figure [lines 441-443, on page 11].
Comments 3
In table 3 and table 5 there is "Error! Reference source not found". It needs to be corrected
Response: Thank you for this careful checking.
Since a new table (Table 2 in the current version) was added at another reviewer's request, we have fixed the references in Table 4 and Table 6 (corresponding to Table 3 and Table 5 in the previous version). The specific modifications are as follows:
“
Table 4. The classification and characteristics of deep learning behavior anomaly recognition methods

| Method | Design Ideas | Advantages | Limitations | References |
|---|---|---|---|---|
| CNN | Captures the spatial hierarchy of features in the raw data through local receptive fields and pooling operations | Good at extracting local features and combining them into more complex patterns | Lacks temporal-sequence processing capability | [42-56] |
| AutoEncoder | An unsupervised learning method that captures the main features of the data by encoding and decoding it | Reduces data dimensionality and extracts key features | Prone to overfitting the training data | [57-70] |
| GAN | Composed of a generator and a discriminator that play an adversarial game | Reconstructs and generates new samples close to real data | Prone to mode collapse and training non-convergence | [71-76] |
| LSTM | Introduces forget, input, and output gates to alleviate vanishing and exploding gradients | Good at processing sequences of arbitrary length | Complex structure; training and inference are slow | [77-79] |
| Self-Attention | Automatically assigns different attention weights to different parts of the input sequence | Captures global interdependencies | Prone to overfitting on small datasets | [80-87] |
”
“
Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %)

| Classification | Method | UCSD Ped1 (EER / AUC) | UCSD Ped2 (EER / AUC) | UMN (EER / AUC) | ShanghaiTech (EER / AUC) | Avenue (EER / AUC) | Year |
|---|---|---|---|---|---|---|---|
| Motion Feature | HAVA+HOF [35] | -- / -- | -- / -- | -- / 99.34 | -- / -- | -- / -- | 2020 |
| Motion Feature | SFA [37] | -- / -- | -- / 96.40 | -- / 96.55 | -- / -- | -- / -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- / -- | -- / 97.20 | 2.80 / 97.20 | 30.70 / 71.50 | -- / -- | 2020 |
| CNN | FCN [52] | -- / -- | 11.00 / -- | -- / -- | -- / -- | -- / -- | 2019 |
| CNN | ABDL [53] | 22.00 / -- | 16.00 / -- | 5.80 / 98.90 | -- / -- | 21.00 / 84.50 | 2020 |
| CNN | TS-CNN [54] | -- / -- | -- / -- | -- / 99.60 | -- / -- | -- / -- | 2020 |
| CNN | DSTCNN [55] | -- / 99.74 | -- / 99.94 | -- / -- | -- / -- | -- / -- | 2020 |
| CNN | LDA-Net [56] | -- / -- | 5.63 / 97.87 | -- / -- | -- / -- | -- / -- | 2020 |
| AE | CAE-UNet [62] | -- / -- | -- / 96.20 | -- / -- | -- / -- | -- / 86.90 | 2019 |
| AE | PMAE [64] | -- / -- | -- / 95.90 | -- / -- | -- / 72.90 | -- / -- | 2023 |
| AE | ISTL [65] | 29.80 / 75.20 | 8.90 / 91.10 | -- / -- | -- / -- | 29.20 / 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 / 94.25 | -- / -- | -- / 99.81 | -- / -- | -- / 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 / 90.50 | 11.50 / 90.70 | -- / -- | -- / -- | 17.60 / 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 / 90.70 | 6.00 / 96.30 | -- / -- | 17.00 / 85.30 | 15.30 / 88.60 | 2021 |
| GAN | D-UNet [74] | -- / 84.70 | -- / 96.30 | -- / -- | -- / 73.00 | -- / 85.10 | 2019 |
| GAN | BMAN [75] | -- / -- | -- / 96.60 | -- / 99.60 | -- / 76.20 | -- / 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- / -- | -- / -- | -- / 99.83 | -- / -- | -- / -- | 2021 |
| LSTM | FCN-LSTM [78] | -- / -- | -- / 98.20 | -- / 93.70 | -- / -- | -- / -- | 2021 |
| LSTM | CNN-LSTM [79] | -- / 94.83 | -- / 96.50 | -- / -- | -- / -- | -- / -- | 2022 |
| SA | SAFA [81] | -- / -- | -- / 96.80 | -- / -- | -- / -- | -- / 87.30 | 2023 |
| SA | SA-AE [82] | -- / -- | -- / 95.69 | -- / -- | -- / -- | -- / 84.10 | 2023 |
| SA | SABiAE [83] | -- / -- | 9.80 / 95.60 | -- / -- | -- / -- | 20.90 / 84.70 | 2022 |
| SA | SA-GAN [84] | -- / -- | -- / -- | -- / -- | -- / 75.70 | -- / 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 / 94.10 | 5.10 / 97.40 | -- / -- | 25.20 / 74.20 | 9.00 / 91.00 | 2024 |
| SA | SA-CNN [86] | -- / -- | -- / -- | -- / 99.29 | -- / -- | -- / -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion features) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a traditional-method baseline for the comparison. ”
Author action: We update and fix the references [lines 746-766, on page 21, lines 862-864, on pages 24-25].
Comments 4
There is a problem with the conclusions - it is not clear what scientific novelty item 1 carries: that is, if the authors came up with 5 levels of abnormal behavior, then a simple recalculation is not required, but what is the advantage over division by 2,3,4,6...? Point 2 - what are the most noticeable disadvantages of traditional methods? This is a conclusion - there must be an advantage (in what?) or a disadvantage (in what?) of something over something else. Remark! A numerical comparison should always be present in the conclusions, even in a review paper, otherwise: the text is a guide that tells us about the availability and application of different methods.
Response: Thank you for this professional comment.
We have restructured the conclusions section, providing a comprehensive overview of the entire text and analyzing in depth the limitations of traditional methods and the advantages of deep learning methods. Numerical comparisons of the performance indicators of each algorithm, based on Tables 4 and 6, are now included in the conclusions. For practical use, we suggest that an appropriate single model, or a combination of methods, be selected according to specific requirements. Finally, we present an outlook on future research directions. The specific modifications are as follows:
“ 6. Conclusions
This review conducts a comprehensive and in-depth study of crowd abnormal behavior recognition technology, with detailed analyses along four dimensions: basic definitions, traditional methods, deep learning, and performance indicators. Through a clear classification of crowd levels, "crowd density" is quantified and divided into five levels. This study then analyzes traditional abnormal behavior recognition techniques based on statistical models, motion features, dynamic models, and cluster discrimination. Although these methods are effective in specific scenarios, they remain limited when dealing with high-dimensional, nonlinear, and dynamically complex data, owing to their reliance on manually designed features and their limited generalization ability.
With the development of AI technologies, deep learning has become a research hotspot because of its strong automatic feature learning ability and good adaptability to complex patterns. This article discusses in detail methods based on the Convolutional Neural Network (CNN), Generative Adversarial Network (GAN), Long Short-Term Memory network (LSTM), Autoencoder (AE), and Self-Attention mechanism (SA), demonstrating their outstanding performance in abnormal behavior recognition. These methods extract high-level features from raw data and effectively distinguish normal from abnormal behaviors, thereby improving recognition accuracy and robustness.
In conclusion, comparing the performance indicators of the algorithm models in Table 4 and Table 6 shows that deep learning methods have significant advantages in complex and changeable scenarios. Although each model has its own scope of application and limitations, combining different methods can achieve better results. For example, algorithms such as S2-VAE, TS-CNN, and SA-CNN achieve a recognition accuracy of more than 99% on some of the datasets in Table 6. Therefore, in practical applications, choosing an appropriate single model, or combining multiple methods according to specific requirements, can effectively enhance the overall performance of an abnormal behavior recognition system and achieve more accurate and reliable detection results.
Future research can focus on improving the fusion of deep learning with multi-modal data, introducing context understanding and situational reasoning, and improving the robustness and adaptive learning ability of models. This can further broaden the recognition of multi-source, multi-dimensional fused data, support the construction of a self-consistent metaverse virtual-real fusion evolution model, and establish a "perception-prediction-intervention-construction" virtual-real fusion presentation mechanism. Continuous innovation in this field is expected to shift crowd monitoring from passive surveillance to active prevention and to deliver more intelligent, precise, and automated abnormal behavior recognition systems.”
Author action: We update this manuscript with reconstructed conclusions [lines 890-927, on pages 25-26].
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
** Change this sentence "This paper is structured as follows: " by " The rest of this paper is structured as follows: ". Because the full paper is contains also the introduction 'where you write this last paragraph
** It is true that authors give the summaries of each traditional method at the end of the sub-section 4.1. However, I recommend that authors highlight the common limits of all 5 methods and drown the need of new methods
** add (.)after Figure 8 . in the line 456
** change these sentences "The expression is as follows" by " The expression is shown in the eq(x)" // or something like that
** Please correct this error in the tables 3, 4 and 5 “[Error! Reference source not found.-Error! Reference source not found.]”
** In this paper authors compare between the used methods, algorithms and tools for crowd abnormal behavior detection. However, they have neglected (didn’t give any information about) the used datasets in the literature. I recommend to add a section where you can details and compare between the used datasets
** Authors should add also a short conclusion at the end of the paper
Author Response
Re: Response to Reviewer 3
To: Applied Sciences
Manuscript ID: applsci-3246529.R1
Original Article Title: “A Review of Crowd Abnormal Behavior Recognition Technology Based on Computer Vision”
Dear Reviewer,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address your professional comments on our manuscript ID applsci-3246529.
We upload: (a) the point-by-point responses to your professional comments below, and (b) an updated manuscript with new changes highlighted in red (PDF main document) according to your comments and kind suggestions.
Thank you very much for your kind consideration!
Best regards,
Rongyong Zhao
PhD, Associate Professor
CIMS Research Center, Tongji University, Shanghai, China
Comments 1
Change this sentence "This paper is structured as follows: " by " The rest of this paper is structured as follows: ". Because the full paper contains also the introduction 'where you write this last paragraph.
Response: Thank you for this professional suggestion!
We have revised this description in the section of Introduction, as follows:
“ The rest of this paper is structured as follows: Section 2 gives the definitions of crowd levels and abnormal behaviors; Section 3 introduces the main challenges faced by the abnormal behavior recognition task in the dense crowd scene; Section 4 comprehensively summarizes the methods of abnormal behavior recognition from the two dimensions of traditional methods and deep learning; and further introduces the current mainstream software tools; Section 5 introduces the datasets widely used in the field of abnormal behavior detection at home and abroad and the performance indicators of each algorithm on these datasets. Section 6 summarizes this paper and presents the future development trend of this research field. The article framework is shown in Figure 1.”
Author action: We update this manuscript with a corrected description [line 104, on page 3].
Comments 2
It is true that authors give the summaries of each traditional method at the end of the sub-section 4.1. However, I recommend that authors highlight the common limits of all 5 methods and drown the need of new methods
Response: Thank you for this kind suggestion!
We have revised the summary of Section 4.1, highlighting the common limitations of all four methods and clarifying the necessity of new methods, as follows:
“ In summary, Table 3 summarizes the design ideas, advantages, and disadvantages of the above traditional abnormal behavior recognition techniques. Although these traditional methods can identify abnormal behaviors effectively to a certain extent, they share a common limitation: they rely on manually designed features and manually set thresholds, adapt poorly to complex and changeable scenarios, and are vulnerable to interference from environmental factors. In addition, traditional methods usually cannot automatically learn deep behavior patterns, which limits their generalization ability and makes them prone to false positives and false negatives. Therefore, new methods are urgently needed: methods with strong automatic feature learning ability that adapt well to complex patterns and capture subtle differences in crowd behavior more accurately. ”
Author action: We update this manuscript with an improved summary [line 326-336, on page 8].
Comments 3
add a period (.) after "Figure 8" in line 456
Response: Thank you for this careful check!
We have added a period (.) after "Figure 8", as follows:
“
- Methods based on similarity measurement are techniques used to evaluate and quantify the degree of similarity between two data objects, samples, sets, or vectors. In the field of crowd abnormal behavior recognition, anomalies are identified by comparing the similarity between the original input image and the image reconstructed by the Convolutional Autoencoder (CAE); the process is shown in Figure 8. The similarity between the original image and the reconstructed image is evaluated with similarity measurement techniques (such as the Euclidean Distance [60], the Pearson Correlation Coefficient [61], etc.). If the similarity is low, it indicates that there is abnormal behavior in the original image.
”
Author action: We update this manuscript with the correct punctuation mark [line 468, on page 12].
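To make the similarity-measurement step described in the excerpt concrete, the scoring logic can be sketched in plain Python as follows. This is a minimal illustration, not the cited implementations: the 8×8 frame size, the random pixel data, and the two reconstructions are hypothetical.

```python
import math
import random

def euclidean_distance(original, reconstructed):
    """Euclidean distance between an input frame and its CAE reconstruction,
    with frames flattened to lists of pixel intensities. A large distance
    (low similarity) flags a potentially abnormal frame."""
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(original, reconstructed)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two flattened frames."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 8x8 frames: a faithful reconstruction stays close to the input,
# while a failed reconstruction (as expected for an abnormal frame) does not.
random.seed(0)
frame = [random.random() for _ in range(64)]
good_recon = [p + 0.01 for p in frame]
bad_recon = [random.random() for _ in range(64)]
```

A frame whose distance exceeds (or whose correlation falls below) a chosen threshold would then be reported as abnormal; the threshold itself is dataset-specific and is not prescribed by the manuscript.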
Comments 4
change sentences such as "The expression is as follows" to "The expression is shown in Eq. (x)" // or something like that
Response: Thank you for the careful check.
We have revised the descriptions of all the formulas throughout this paper, as follows:
“ During this process, the Mean Square Error (MSE) [58] is commonly used as the loss function. MSE is a statistical indicator that measures the difference between the predicted value and the true value, and its expression is shown in Eq. (1):

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$  (1)
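As a side illustration for the reader, Eq. (1) corresponds directly to a few lines of plain Python; the sample values below are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean Square Error: the average squared difference between the
    true values and the predicted (here: reconstructed) values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Hypothetical targets and predictions: per-element errors of 0, 0.5, and 1.0.
loss = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])  # (0 + 0.25 + 1) / 3
```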
……
- The forget gate determines which information is retained. Based on $h_{t-1}$ and $x_t$, the forget gate outputs a number between 0 and 1 for the state $C_{t-1}$, where 0 means elimination and 1 means retention. The expression is shown in Eq. (2):

$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$  (2)
- The input gate determines which newly incoming information is admitted. It is composed of a sigmoid function and a tanh function, and the product of their outputs is used to update the state. The expressions are shown in Eqs. (3)-(5):

$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$  (3)

$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$  (4)

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$  (5)
- The output gate determines the output information. The sigmoid layer determines which parts of the state are output, and the state passed through the tanh layer is multiplied by the output of the sigmoid layer. The expressions are shown in Eqs. (6) and (7):

$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$  (6)

$h_t = o_t \odot \tanh\left(C_t\right)$  (7)
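The gate updates referenced as Eqs. (2)-(7) can be traced with a scalar toy sketch. This is an illustration only: the weights are hypothetical placeholders rather than trained parameters, and real LSTM cells use vector states and matrix-valued weights.

```python
import math

def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM cell step with scalar states.

    `w` maps each gate name ("f", "i", "c", "o") to a hypothetical
    (weight for h_prev, weight for x_t, bias) triple.
    """
    # Forget gate: how much of the old state C_{t-1} to keep (0..1).
    f_t = sigmoid(w["f"][0] * h_prev + w["f"][1] * x_t + w["f"][2])
    # Input gate and candidate state: which new information enters.
    i_t = sigmoid(w["i"][0] * h_prev + w["i"][1] * x_t + w["i"][2])
    c_hat = math.tanh(w["c"][0] * h_prev + w["c"][1] * x_t + w["c"][2])
    # State update: blend of the retained old state and the new candidate.
    c_t = f_t * c_prev + i_t * c_hat
    # Output gate and hidden output.
    o_t = sigmoid(w["o"][0] * h_prev + w["o"][1] * x_t + w["o"][2])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hypothetical weights: every gate uses (0.5, 0.5, 0.0).
weights = {g: (0.5, 0.5, 0.0) for g in ("f", "i", "c", "o")}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, w=weights)
```

With zero input and zero previous state the cell stays at zero, which is one quick way to check that the gating plumbing is wired correctly.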
……
Stage 1 calculates the similarity between the Query and each Key. Common methods include the vector dot product, cosine similarity, and an MLP network. The expressions are shown in Eqs. (8)-(10):

- Vector dot product

$\mathrm{Sim}\left(Q, K_i\right) = Q \cdot K_i$  (8)

- Cosine similarity

$\mathrm{Sim}\left(Q, K_i\right) = \dfrac{Q \cdot K_i}{\lVert Q \rVert \, \lVert K_i \rVert}$  (9)

- MLP network

$\mathrm{Sim}\left(Q, K_i\right) = \mathrm{MLP}\left(Q, K_i\right)$  (10)
Stage 2 introduces SoftMax for normalization, yielding the probability distribution over all elements. The expression is shown in Eq. (11):

$a_i = \mathrm{SoftMax}\left(\mathrm{Sim}_i\right) = \dfrac{e^{\mathrm{Sim}_i}}{\sum_{j=1}^{L_x} e^{\mathrm{Sim}_j}}$  (11)
Stage 3 performs a weighted summation to obtain the Attention value. The expression is shown in Eq. (12):

$\mathrm{Attention}\left(Q, K, V\right) = \sum_{i=1}^{L_x} a_i \cdot V_i$  (12)
”
Author action: We correct descriptions of all the formulas throughout this paper.
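The three attention stages quoted above (similarity, SoftMax normalization, weighted summation) can be sketched end-to-end in plain Python. The query, key, and value vectors below are hypothetical toy data, and only the dot-product similarity of stage 1 is implemented.

```python
import math

def dot_product_attention(query, keys, values):
    """Attention computed in the three stages described in the text.

    `query` is a vector; `keys` and `values` are lists of vectors.
    """
    # Stage 1: similarity between the Query and each Key (vector dot product).
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Stage 2: SoftMax over the similarities (max subtracted for stability).
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    attn_weights = [e / total for e in exps]
    # Stage 3: weighted summation of the Values gives the Attention value.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(attn_weights, values)) for d in range(dim)]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = dot_product_attention(query, keys, values)  # leans toward the first value
```

Because the query matches the first key more closely, the first value dominates the weighted sum, which is exactly the selective-focus behavior the self-attention mechanism is meant to provide.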
Comments 5
Please correct this error in the tables 3, 4 and 5 “[Error! Reference source not found.-Error! Reference source not found.]”
Response: Thank you for this professional suggestion!
Considering that a new table (Table 2 in the current version) was added at the suggestion of another reviewer, we have fixed the references in Table 4, Table 5, and Table 6 (corresponding to Table 3, Table 4, and Table 5 in the previous version, respectively). The specific modifications are as follows:
“
Table 4. The classification and characteristics of deep-learning-based abnormal behavior recognition methods.

| Method | Design Idea | Advantages | Limitations | References |
|---|---|---|---|---|
| CNN | Captures the spatial hierarchical features of the raw data through local receptive fields and pooling operations | Good at extracting local features and combining them into more complex patterns | Lacks time-series processing capability | [42-56] |
| AutoEncoder | An unsupervised learning method that captures the main features of the data by encoding and decoding the raw data | Reduces data dimensionality and extracts key features | Prone to overfitting the training data | [57-70] |
| GAN | Composed of a generator and a discriminator trained adversarially against each other | Reconstructs and generates new samples close to real data | Prone to mode collapse and non-convergent training | [71-76] |
| LSTM | Introduces forget, input, and output gates to mitigate vanishing and exploding gradients | Good at processing sequence data of any length | Complex structure; training and inference take a long time | [77-79] |
| Self-Attention | Automatically assigns different attention weights to different parts of the input sequence | Captures global interdependencies | Easily overfits on small datasets | [80-87] |
”
“
Table 5. Comparison of crowd anomaly behavior datasets.

| Name | Scene Description | Scale | Resolution | Abnormal Behavior | Object | Limitations |
|---|---|---|---|---|---|---|
| UCSD [88] | Crowd movement on sidewalks from the perspective of a surveillance camera | Ped1: 14,000 frames; 34 training segments, 36 test segments. Ped2: 4,560 frames; 12 test segments | Ped1: 238×158. Ped2: 360×240 | Fast movement, reverse driving, riding a bicycle, driving a car, sitting in a wheelchair, skateboarding, etc. | Individual | Low resolution; few types of abnormal behavior; relatively simple background |
| UMN [89] | Video clips of pedestrian activities against different backgrounds such as campuses, shopping malls, and streets | 8,010 frames; 11 video segments | 320×240 | The crowd suddenly scatters, runs, or gathers | Group | Simple background; limited abnormal types |
| Shanghai-Tech [90] | 13 campus-area scenes with complex lighting conditions and camera angles | 317,398 frames; 130 video segments | 856×480 | Crowd gathering, fighting, running, cycling | Group | Abnormal events are repetitive; the annotations contain errors |
| CUHK-Avenue [91] | Video surveillance clips of outdoor public places | 30,625 frames; 16 training segments, 21 test segments | 640×360 | Pedestrians fighting, throwing objects, running | Individual, vehicle | Single shooting angle; low resolution |
”
“
Table 6. Comparison of abnormal behavior recognition experiments (frame-level EER / AUC, %).

| Classification | Method | Ped1 EER | Ped1 AUC | Ped2 EER | Ped2 AUC | UMN EER | UMN AUC | ShanghaiTech EER | ShanghaiTech AUC | Avenue EER | Avenue AUC | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Motion Feature | HAVA+HOF [35] | -- | -- | -- | -- | -- | 99.34 | -- | -- | -- | -- | 2020 |
| Motion Feature | SFA [37] | -- | -- | -- | 96.40 | -- | 96.55 | -- | -- | -- | -- | 2022 |
| Clustering Discrimination | DBSCAN [41] | -- | -- | -- | 97.20 | 2.80 | 97.20 | 30.70 | 71.50 | -- | -- | 2020 |
| CNN | FCN [52] | -- | -- | 11.00 | -- | -- | -- | -- | -- | -- | -- | 2019 |
| CNN | ABDL [53] | 22.00 | -- | 16.00 | -- | 5.80 | 98.90 | -- | -- | 21.00 | 84.50 | 2020 |
| CNN | TS-CNN [54] | -- | -- | -- | -- | -- | 99.60 | -- | -- | -- | -- | 2020 |
| CNN | DSTCNN [55] | -- | 99.74 | -- | 99.94 | -- | -- | -- | -- | -- | -- | 2020 |
| CNN | LDA-Net [56] | -- | -- | 5.63 | 97.87 | -- | -- | -- | -- | -- | -- | 2020 |
| AE | CAE-UNet [62] | -- | -- | -- | 96.20 | -- | -- | -- | -- | -- | 86.90 | 2019 |
| AE | PMAE [64] | -- | -- | -- | 95.90 | -- | -- | -- | 72.90 | -- | -- | 2023 |
| AE | ISTL [65] | 29.80 | 75.20 | 8.90 | 91.10 | -- | -- | -- | -- | 29.20 | 76.80 | 2019 |
| AE | S2-VAE [70] | 14.30 | 94.25 | -- | -- | -- | 99.81 | -- | -- | -- | 87.60 | 2019 |
| GAN | Ada-Net [72] | 11.90 | 90.50 | 11.50 | 90.70 | -- | -- | -- | -- | 17.60 | 89.20 | 2019 |
| GAN | NM-GAN [73] | 15.00 | 90.70 | 6.00 | 96.30 | -- | -- | 17.00 | 85.30 | 15.30 | 88.60 | 2021 |
| GAN | D-UNet [74] | -- | 84.70 | -- | 96.30 | -- | -- | -- | 73.00 | -- | 85.10 | 2019 |
| GAN | BMAN [75] | -- | -- | -- | 96.60 | -- | 99.60 | -- | 76.20 | -- | 90.00 | 2019 |
| LSTM | FocalLoss-LSTM [77] | -- | -- | -- | -- | -- | 99.83 | -- | -- | -- | -- | 2021 |
| LSTM | FCN-LSTM [78] | -- | -- | -- | 98.20 | -- | 93.70 | -- | -- | -- | -- | 2021 |
| LSTM | CNN-LSTM [79] | -- | 94.83 | -- | 96.50 | -- | -- | -- | -- | -- | -- | 2022 |
| SA | SAFA [81] | -- | -- | -- | 96.80 | -- | -- | -- | -- | -- | 87.30 | 2023 |
| SA | SA-AE [82] | -- | -- | -- | 95.69 | -- | -- | -- | -- | -- | 84.10 | 2023 |
| SA | SABiAE [83] | -- | -- | 9.80 | 95.60 | -- | -- | -- | -- | 20.90 | 84.70 | 2022 |
| SA | SA-GAN [84] | -- | -- | -- | -- | -- | -- | -- | 75.70 | -- | 89.20 | 2021 |
| SA | A2D-GAN [85] | 9.70 | 94.10 | 5.10 | 97.40 | -- | -- | 25.20 | 74.20 | 9.00 | 91.00 | 2024 |
| SA | SA-CNN [86] | -- | -- | -- | -- | -- | 99.29 | -- | -- | -- | -- | 2023 |

Note: the performance indicators of HAVA+HOF and SFA (motion features) and of DBSCAN (clustering discrimination) on the UCSD Ped2, UMN, and ShanghaiTech datasets are included to provide a traditional-method baseline for the comparison. ”
Author action: We update and fix the references [lines 746-766, on page 21, lines 829-830, on page 23, lines 862-864, on pages 24-25].
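Since Table 6 reports frame-level AUC and EER, it may help to recall how both are derived from per-frame anomaly scores. The sketch below uses hypothetical scores and labels; the AUC is computed with the trapezoidal rule, and the EER is approximated as the point on the discrete ROC curve where the false-positive rate is closest to the false-negative rate.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    `scores` are per-frame anomaly scores (higher = more abnormal) and
    `labels` are ground-truth flags (1 = abnormal frame).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def eer(points):
    """Equal Error Rate: FPR at the point where FPR is closest to the
    false-negative rate (1 - TPR); an approximation on a discrete curve."""
    return min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))[0]

# Hypothetical per-frame scores and ground truth for eight frames.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
```

On this toy data the sketch yields an AUC of 0.875 and an EER of 0.25; published results such as those in Table 6 are obtained the same way, only over full test sets.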
Comments 6
In this paper authors compare between the used methods, algorithms and tools for crowd abnormal behavior detection. However, they have neglected (didn’t give any information about) the used datasets in the literature. I recommend to add a section where you can detail and compare between the used datasets
Response: Thank you for this professional suggestion!
We have elaborated on the datasets used in the literature on traditional methods and deep learning methods in Section 5.3, and provided detailed explanations and comparisons of each dataset in Section 5.1. Therefore, this part required no further adjustment; the content is as follows:
“ 5.1. Experimental Dataset Analysis
In the field of computer vision, due to the diversity and complexity of real abnormal behaviors in the research of crowd abnormal behavior analysis, there is a lack of high-quality datasets. To overcome this shortage and improve the performance of algorithms, researchers have constructed a series of datasets with different characteristics, focusing on parameters such as duration, size, resolution, etc., and covering various monitoring environments and scenarios, providing an important reference basis for the research of crowd anomaly recognition. Commonly used datasets mainly include UCSD, UMN, Shanghai-Tech, CUHK-Avenue, and other related datasets. Figure 25 shows some abnormal behavior samples in these four abnormal behavior datasets.
5.1.1. UCSD
The UCSD dataset [88] is a dataset for crowd behavior analysis and abnormal detection. The dataset contains two subsets, Ped1 and Ped2. The Ped1 subset includes 34 video clips, recording the pedestrian activities from the campus crosswalk scene; the Ped2 subset provides 12 video clips of the same specification, presenting similar but different pedestrian crossing area activities.
5.1.2. UMN
The UMN dataset [89] is mainly used for the evaluation and research of pedestrian detection and tracking algorithms. This dataset consists of 19 videos containing pedestrian sequences in complex backgrounds, covering various indoor and outdoor environments such as campuses, shopping malls, and streets.
5.1.3. ShanghaiTech
The ShanghaiTech dataset [90] is a dataset for video abnormal detection and crowd counting. This dataset has 13 complex scenes, recording the pedestrian flow on the campus. It also contains 130 abnormal events (such as fighting, falling, running, etc.) and more than 270,000 training frames, with a duration ranging from about 30 seconds to 90 seconds.
5.1.4. CUHK-Avenue
The CUHK database [91] is a database about crowd behavior scenes. It includes crowd videos collected from many different environments with different densities and perspective ratios, such as streets, shopping centers, airports, and parks. It consists of traffic datasets and pedestrian datasets. There are a total of 474 video clips in 215 scenes in the dataset.
”
The details are further presented in Table 5 as follows:
“
Table 5. Comparison of crowd anomaly behavior datasets.

| Name | Scene Description | Scale | Resolution | Abnormal Behavior | Object | Limitations |
|---|---|---|---|---|---|---|
| UCSD [88] | Crowd movement on sidewalks from the perspective of a surveillance camera | Ped1: 14,000 frames; 34 training segments, 36 test segments. Ped2: 4,560 frames; 12 test segments | Ped1: 238×158. Ped2: 360×240 | Fast movement, reverse driving, riding a bicycle, driving a car, sitting in a wheelchair, skateboarding, etc. | Individual | Low resolution; few types of abnormal behavior; relatively simple background |
| UMN [89] | Video clips of pedestrian activities against different backgrounds such as campuses, shopping malls, and streets | 8,010 frames; 11 video segments | 320×240 | The crowd suddenly scatters, runs, or gathers | Group | Simple background; limited abnormal types |
| Shanghai-Tech [90] | 13 campus-area scenes with complex lighting conditions and camera angles | 317,398 frames; 130 video segments | 856×480 | Crowd gathering, fighting, running, cycling | Group | Abnormal events are repetitive; the annotations contain errors |
| CUHK-Avenue [91] | Video surveillance clips of outdoor public places | 30,625 frames; 16 training segments, 21 test segments | 640×360 | Pedestrians fighting, throwing objects, running | Individual, vehicle | Single shooting angle; low resolution |
”
Author action: We update the manuscript with the reconstructed sub-section and the comparison table of crowd anomaly behavior datasets [lines 791-823, on page 22; lines 829-830, on page 23].
Comments 7
Authors should add also a short conclusion at the end of the paper
Response: Thank you for this kind suggestion!
We have restructured the Conclusions section: it now gives a comprehensive overview of the paper's structure, summarizes the limitations of traditional methods and the advantages of deep learning methods, and suggests that an appropriate single model or a combination of multiple methods be selected according to specific requirements. Finally, future research directions are discussed. The specific modifications are as follows:
“ 6. Conclusions
This review conducts a comprehensive and in-depth study of crowd abnormal behavior recognition technology, performing detailed analyses along four dimensions: basic definitions, traditional methods, deep learning, and performance indicators. Through a clear classification of crowd levels, "crowd density" is quantified and divided into five levels. Next, this study analyzes traditional abnormal behavior recognition techniques based on statistical models, motion features, dynamic models, and cluster discrimination. Although these methods are effective in specific scenarios, they still have limitations in dealing with high-dimensional, nonlinear, and complex dynamic data, such as reliance on manually designed features and limited generalization ability.
With the development of AI technologies, deep learning has become a research hotspot because of its strong automatic feature learning ability and good adaptability to complex patterns. This article discusses in detail the methods based on the Convolutional Neural Network (CNN), Generative Adversarial Network (GAN), Long Short-Term Memory Network (LSTM), Autoencoder (AE), and Self-Attention Mechanism (SA), demonstrating their outstanding performance in abnormal behavior recognition. These methods can extract high-level features from raw data, and effectively distinguish between normal and abnormal behaviors, thereby improving recognition accuracy and robustness.
In conclusion, by comparing the performance indicators of the algorithm models in Table 4 and Table 6, it can be found that deep learning methods show significant advantages in dealing with complex and changeable scenarios. Although each model has its own application scope and limitations, combining different methods can achieve better results. For example, algorithms such as S2-VAE, TS-CNN, and SA-CNN have achieved a recognition accuracy of more than 99% on some of the datasets in Table 6. Therefore, in practical applications, choosing an appropriate single model or combining multiple methods based on specific requirements can effectively enhance the overall performance of an abnormal behavior recognition system and achieve more accurate and reliable detection results.
Future research can focus on improving the fusion of deep learning with multi-modal data, introducing context understanding and situational reasoning, and improving the robustness and adaptive learning ability of models. This can further broaden recognition based on multi-source, multi-dimensional data fusion, help construct a self-consistent metaverse virtual-real fusion evolution model, and establish a "perception-prediction-intervention-construction" virtual-real fusion presentation mechanism. Continuous innovation in this field is expected to achieve a transformation from passive monitoring to active prevention and to deliver more intelligent, precise, and automated abnormal behavior recognition systems.”
Author action: We update this manuscript with reconstructed conclusions [line 890-927, on pages 25-26].
Author Response File: Author Response.pdf