Detecting Abnormal Behavior Events and Gatherings in Public Spaces Using Deep Learning: A Review
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This article provides a systematic review of research on using deep learning techniques to detect abnormal behavior events and crowd gatherings in public spaces. The topic holds significant importance for public safety, and the integration of computer vision and AI methods to enhance the intelligence of surveillance systems is of considerable research value. However, the paper has the following issues that need to be addressed:
- The authors review the problem of detecting dangerous crowd gatherings through video analysis. However, there is a lack of clear definition for "dangerous crowd gatherings." In real-life scenarios, gatherings such as markets or large-scale events are common—how can dangerous gatherings be distinguished from normal ones? Additionally, how to evaluate the danger level of a gathering is another issue worth considering.
- Section 2.3 discusses inclusiveness and exclusiveness, but its purpose is unclear. The statistical description of the collected literature could be appropriately condensed.
- In Section 4.2, the title "Can IA recognise What a Gathering of People Is?"—is "IA" a typo? In Section 4.6, the title "Ghaterings" should be corrected to "Gatherings." The authors are advised to carefully proofread the entire manuscript to correct spelling errors and refine wording.
- At the end of the Introduction, it states that "Sections 5 and 6 present different computer vision techniques and conclusions," whereas in reality, Section 4 already covers the discussion of Q1–Q6, and Section 6 is the conclusion. The authors should adjust the section descriptions to ensure strict consistency with the actual content. Additionally, some parts appear repetitive (e.g., the beginning of Section 4.6 repeats some content from Section 4.3 regarding the exclusion of factors like weapons). Such redundancy should be streamlined or merged.
- In addressing Q5 (regarding method performance), the authors primarily provide qualitative analysis and an overview of methodological characteristics. To strengthen persuasiveness, more quantitative comparisons should be included. Consider emphasizing these results in the main text and summarizing them in tables to help readers better understand the efficacy of existing methods.
- The authors summarize the core methods and contributions of the 26 screened papers and compare the similarities and differences among them based on research questions. However, these comparisons remain largely qualitative, lacking deeper quantitative analysis or unified benchmarking. The depth of analysis is somewhat insufficient, and further elaboration is recommended.
- The authors are encouraged to summarize the remaining challenges in the current research field and supplement potential future research directions.
Author Response
- The authors review the problem of detecting dangerous crowd gatherings through video analysis. However, there is a lack of clear definition for "dangerous crowd gatherings." In real-life scenarios, gatherings such as markets or large-scale events are common—how can dangerous gatherings be distinguished from normal ones? Additionally, how to evaluate the danger level of a gathering is another issue worth considering.
The following paragraphs have been added to the Introduction section to clarify the terms.
“A key factor that helps characterize the problem is that crowd influx tends to be influenced by business hours, the day of the week, urban dynamics, and broader social factors such as sporting or political events. While daily and weekly patterns often exhibit regularity, they are frequently disrupted by unforeseen occurrences, including demonstrations, celebrations, or accidents.
Anomaly detection in a time series, which enables the distinction between hazardous and non-hazardous crowd formations, involves identifying those elements that significantly deviate from their expected statistical behavior, whether based on probabilistic models or temporal regularity.
To assess the level of risk posed by a concentration, adaptive thresholding techniques—accounting for historical variability and cyclical oscillations in the data—can reduce false alarms while maintaining sensitivity to genuine threats, thus supporting timely and effective responses.”
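To make the adaptive-thresholding idea above concrete, a minimal sketch is given below (illustrative only, not taken from the manuscript): an hourly crowd count is flagged as anomalous when it deviates from a rolling estimate of its expected statistical behavior by more than a variability-dependent margin. Variable names and parameter values are placeholder assumptions.

```python
# Minimal sketch (assumption: hourly crowd counts from a single camera).
# An observation is flagged when it deviates from the rolling mean by more than
# k rolling standard deviations, so the threshold tracks historical variability.
import numpy as np

def adaptive_threshold_anomalies(counts, window=48, k=3.0):
    counts = np.asarray(counts, dtype=float)
    anomalies = []
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        mean, std = history.mean(), history.std()
        if std > 0 and abs(counts[t] - mean) > k * std:
            anomalies.append(t)
    return anomalies

# Seven days of synthetic hourly counts with a daily oscillation and one abnormal spike.
rng = np.random.default_rng(1)
base = 40 + 25 * np.sin(np.linspace(0, 14 * np.pi, 7 * 24))
counts = base + rng.normal(0, 4, base.size)
counts[120] += 150  # sudden abnormal concentration of people
print("Flagged time steps:", adaptive_threshold_anomalies(counts))
```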
- Section 2.3 discusses inclusiveness and exclusiveness, but its purpose is unclear. The statistical description of the collected literature could be appropriately condensed.
Keeping Sections 2.3 and 2.4 separate may lead to ambiguity regarding their respective purposes. To enhance clarity for the reader, both sections have therefore been combined into one. The first paragraph now reads as follows:
This review focuses on identifying crowd formations and behavioral anomalies using machine learning and object tracking techniques. A total of 698 publications, between 2016 and 2025, were retrieved using the search strings described in Section 2.2. After applying the inclusion and exclusion criteria detailed in Table 2, we selected the studies most relevant to our research objectives and the questions outlined in Section 2.1.
- In Section 4.2, the title "Can IA recognise What a Gathering of People Is?"—is "IA" a typo? In Section 4.6, the title "Ghaterings" should be corrected to "Gatherings." The authors are advised to carefully proofread the entire manuscript to correct spelling errors and refine wording.
To avoid confusion or ambiguity, the current title has been replaced with the following: “Can a gathering of people be recognized through computer vision methods?”. The word "ghaterings" has been corrected to "gatherings." The entire manuscript has been proofread to correct spelling errors and refine the wording.
- At the end of the Introduction, it states that "Sections 5 and 6 present different computer vision techniques and conclusions," whereas in reality, Section 4 already covers the discussion of Q1–Q6, and Section 6 is the conclusion. The authors should adjust the section descriptions to ensure strict consistency with the actual content. Additionally, some parts appear repetitive (e.g., the beginning of Section 4.6 repeats some content from Section 4.3 regarding the exclusion of factors like weapons). Such redundancy should be streamlined or merged.
To avoid redundancy and ensure greater consistency across the discussions, it is proposed to replace the first paragraph of Section 4.6, Q6: “These studies are useful when analyzing actions in a gathering of people, but they do not consider the combination of other determining factors, nor whether a bladed weapon or blunt object can be found in the crowd. In these cases, it is necessary to use deep learning to detect other important factors which are not taken in account in the previously mentioned papers.” with the following: “As noted in Section 4.2, Question 3, traditional approaches often fail to account for factors such as the presence of dangerous objects within crowds. Therefore, it becomes essential to complement these models with object detection and context-aware analysis techniques based on deep learning.”
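For illustration only (not part of the manuscript), the sketch below shows how such an object-detection complement could be wired around a crowd-analysis pipeline. It assumes the ultralytics YOLO package and a COCO-pretrained checkpoint; COCO only provides proxy classes such as "knife", so a real deployment would require a model trained on weapon-specific data. The file name, checkpoint, and thresholds are placeholders.

```python
# Illustrative sketch: flag video frames containing a potentially dangerous object.
# Assumes the ultralytics package and a COCO-pretrained YOLO checkpoint;
# weapon-specific detection would need a model trained on dedicated data.
import cv2
from ultralytics import YOLO

SUSPICIOUS_CLASSES = {"knife", "baseball bat", "scissors"}  # COCO labels used as a stand-in

model = YOLO("yolov8n.pt")  # placeholder checkpoint

def frame_has_suspicious_object(frame, conf_threshold=0.5):
    """Return True if the detector finds any suspicious class above the confidence threshold."""
    results = model(frame, verbose=False)[0]
    for box in results.boxes:
        label = model.names[int(box.cls)]
        if label in SUSPICIOUS_CLASSES and float(box.conf) >= conf_threshold:
            return True
    return False

cap = cv2.VideoCapture("surveillance.mp4")  # placeholder input path
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_has_suspicious_object(frame):
        print("Potentially dangerous object detected in frame")
cap.release()
```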
- In addressing Q5 (regarding method performance), the authors primarily provide qualitative analysis and an overview of methodological characteristics. To strengthen persuasiveness, more quantitative comparisons should be included. Consider emphasizing these results in the main text and summarizing them in tables to help readers better understand the efficacy of existing methods.
It is proposed to highlight in bold the quantitative performance values reported in the main text so that they can be easily located and compared. Additionally, it is suggested to add the following table at the end of Q5:
| Method | Application | Metric | Value | Reference |
| --- | --- | --- | --- | --- |
| Optimized deep maxout | Anomaly detection | Accuracy | 97.28% | [32] |
| Enhanced SlowFast | Detection of 5 behaviors in public spaces | Processing speed (FPS) | 40.5 FPS | [31] |
| YOLOv5 + Twilio API | Overcrowding detection | Real-time response (latency) | Not defined | [34] |
| ACSAM | Anomaly detection in dense crowds | Accuracy improvement vs. previous methods | >12% | [33] |
| PublicVision (Swin Transformer + encrypted VPN) | Crowd behavior classification | Global accuracy; mean Average Precision (mAP); inference latency | 89.76%; 93.3%; ~20 frames per inference | [30] |
- The authors summarize the core methods and contributions of the 26 screened papers and compare the similarities and differences among them based on research questions. However, these comparisons remain largely qualitative, lacking deeper quantitative analysis or unified benchmarking. The depth of analysis is somewhat insufficient, and further elaboration is recommended.
Grouping the situations according to the classification in Table 3 is proposed as a solution.
- The authors are encouraged to summarize the remaining challenges in the current research field and supplement potential future research directions.
In the last sentence: “By leveraging the capabilities of mobile devices such as smartphones and static devices like surveillance cameras, it becomes possible to enhance the coverage and effectiveness of gathering detection systems.” the following could be added: “… It is not only possible to improve the coverage and effectiveness of detection systems, but also to leverage recent time series forecasting methods (such as LSTM, ARIMA, N-BEATS, and TFT) together with the previously described technologies, to predict in advance the onset of a crowd formation in a public space. By establishing dynamic thresholds based on past prediction errors, these methods can help determine whether a given concentration of people should be considered normal or anomalous.”
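As a concrete illustration of the forecasting-plus-dynamic-threshold idea above (a sketch, not part of the manuscript), the example below substitutes a simple seasonal-naive forecast for the heavier LSTM/ARIMA/N-BEATS/TFT models and derives the anomaly threshold from the distribution of past prediction errors. Parameter values and the synthetic series are placeholder assumptions.

```python
# Sketch: flag anomalous crowd counts using forecast errors and a dynamic threshold.
# A seasonal-naive forecast (value observed one period earlier) stands in for
# heavier forecasters such as LSTM, ARIMA, N-BEATS, or TFT.
import numpy as np

def detect_anomalies(counts, period=24, warmup=72, error_quantile=0.99, margin=2.0):
    counts = np.asarray(counts, dtype=float)
    past_errors, flagged = [], []
    for t in range(period, len(counts)):
        prediction = counts[t - period]          # seasonal-naive prediction
        error = abs(counts[t] - prediction)
        if t >= warmup and past_errors:
            # Dynamic threshold: safety margin over the historical error quantile.
            threshold = margin * np.quantile(past_errors, error_quantile)
            if error > threshold:
                flagged.append(t)
        past_errors.append(error)
    return flagged

# Synthetic hourly counts over one week; a spike on the last day simulates a crowd formation.
rng = np.random.default_rng(0)
daily_pattern = 50 + 30 * np.sin(np.linspace(0, 2 * np.pi, 24))
series = np.tile(daily_pattern, 7) + rng.normal(0, 3, 7 * 24)
series[-5] += 120  # injected anomalous gathering
print("Anomalous time steps:", detect_anomalies(series))
```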
Reviewer 2 Report
Comments and Suggestions for Authors
Here are a few remarks that are helpful to enhance the article's quality:
1. Make Important Definitions Clear Early on in the text, the phrase "abnormal behaviour" is used a lot but isn't given a clear meaning. It would be easier for readers to grasp the scope if the definition were explicit and operational (e.g., particular acts like running, violence, or weapon usage).
2. Enhance the Research Gap: Although the research notes limitations in "prior detection" of abnormalities, it might highlight uniqueness by clearly contrasting the suggested prediction approaches with current post-event methods.
3. Balance Technical and Practical Insights: Although the deep learning techniques are thoroughly described, the discussion would be more grounded if it included a brief paragraph on actual deployment issues (such as computing costs and privacy issues).
4. Interpret Key Concepts: For readers who are not experts, figures such as model structures or sample anomalies (such as calm versus violent gatherings) might enhance the content.
5. Critique Dataset Limitations: The UCF-Crime and PETS2009 datasets are helpful, but the technique section might be strengthened with a phrase addressing their biases (such as cultural settings and camera angles).
6. Future Work Specificity: "Mobile and static devices" for future work are mentioned in passing in the end. Clearer guidance would be provided by expanding this into two or three specific paths (e.g., edge computing for real-time analysis).
Author Response
- Make Important Definitions Clear Early on in the text, the phrase "abnormal behaviour" is used a lot but isn't given a clear meaning. It would be easier for readers to grasp the scope if the definition were explicit and operational (e.g., particular acts like running, violence, or weapon usage).
To provide greater clarity to the reader regarding the expression "abnormal behaviour", in line 55, after “.. spaces [2]”, we include the following sentence: As will be discussed throughout this paper, "abnormal behavior" refers to situations that deviate from typical crowd dynamics, such as individuals suddenly running in the opposite direction, the outbreak of a physical altercation, or the public display of weapons.
- Enhance the Research Gap: Although the research notes limitations in "prior detection" of abnormalities, it might highlight uniqueness by clearly contrasting the suggested prediction approaches with current post-event methods.
The following clarification, added in lines 64 and 65, contrasts the suggested prediction approaches with existing post-event methods: Prior detection methods aim to classify events that are already occurring, whereas only a few studies focus on anticipating the formation of an anomalous event.
- Balance Technical and Practical Insights: Although the deep learning techniques are thoroughly described, the discussion would be more grounded if it included a brief paragraph on actual deployment issues (such as computing costs and privacy issues).
We have included the following paragraph at the beginning of Chapter 5 to enhance understanding: “In addition to the theoretical framework described above, several real-world challenges must be addressed when deploying such systems. First, the computational cost of deep learning models must be considered, both during the training phase (due to high GPU requirements and large volumes of data) and during real-time inference, which can be problematic in resource-constrained environments given the associated energy and financial demands. Moreover, depending on the applicable legislation in the deployment area, the use of image and video data may raise ethical and legal concerns regarding the processing of personal data, its compliance with data protection regulations, and the need for secure communication protocols to ensure information privacy.”
- Interpret Key Concepts: For readers who are not experts, figures such as model structures or sample anomalies (such as calm versus violent gatherings) might enhance the content.
To clarify for the reader the types of anomalous situations commonly addressed in the literature, two examples of such situations are described: Example 1, Movement in the Opposite Direction to the Crowd Flow, and Example 2, Sudden Transition from Normal to Violent Behavior.
- Critique Dataset Limitations: The UCF-Crime and PETS2009 datasets are helpful, but the technique section might be strengthened with a phrase addressing their biases (such as cultural settings and camera angles).
The following paragraph is introduced before Table 6, in line 373:
It is important to note that both the UCF-Crime and PETS2009 datasets contain incidents recorded in specific cultural and urban contexts, and that their camera angles may not reflect the diversity of real-world surveillance environments. These biases can hinder the generalizability of the models to other geographic regions or alternative camera configurations, making it advisable to complement these datasets with more heterogeneous data sources in order to enhance the robustness and reliability of the approach.
- Future Work Specificity: "Mobile and static devices" for future work are mentioned in passing in the end. Clearer guidance would be provided by expanding this into two or three specific paths (e.g., edge computing for real-time analysis).
To comply with this recommendation, it is proposed to expand the conclusions section with the following paragraph:
In addition to integration with both mobile and static devices, further exploration of 5G connectivity could enable the design of hybrid architectures that combine high-resolution cameras with cloud-based processing over 5G networks, ensuring scalability and robustness against environmental variability. Likewise, edge computing models could be implemented to perform real-time analysis directly on the cameras, thereby reducing latency and minimizing reliance on continuous connectivity.
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents a systematic review of deep learning approaches for detecting abnormal behaviors and gatherings in public spaces to enhance security. It highlights the current state of the field and identifies existing gaps, particularly the need to address multiple simultaneous anomalies in crowded environments.
However, the manuscript requires several improvements to meet the standards for publication. Below are detailed comments and suggestions:
- Line 52–55: The sentence “…ensuring a higher degree of general satisfaction among the population” seems vague and out of context in a technical review. The authors should consider rephrasing or removing it unless they can clearly explain and support the concept of “general satisfaction” in the context of anomaly detection in public spaces.
- Lines 58–60 and Line 122: There is an inconsistency in the number of selected papers. The introduction (line 59) mentions 26 papers, whereas line 122 refers to 35. The authors must clarify the correct number and ensure consistency throughout the manuscript. Additionally, the phrase “thereby fulfilling the criteria for quality assessment” is ambiguous. The authors should explicitly define what these criteria are and how they were applied.
- Figure 2: This figure should represent data from the selected papers (26 or 35), not from the initial pool before screening. Presenting only the final selected papers would better align with the goals of a systematic review.
- Table 3: It is unclear whether the keyword counts refer to the actual keywords provided by the authors of the selected papers (which seems unlikely due to the high frequencies) or to the frequency with which the search keywords appear in those papers. The authors should clearly explain their methodology for this analysis.
- Figure 3, Table 4, and Lines 177–179: The categorization presented in Figure 3 and Table 4—Crowd aggregation, Abnormal crowd aggregation, and Critical abnormal crowd aggregation—is not clearly aligned with the textual description in lines 177–179. In those lines, the authors describe the third category as involving additional risk factors, such as the detection of concealed weapons or extreme anomalies. However, it is unclear whether this description is meant to define what they label as Critical abnormal crowd aggregation. To avoid confusion, the authors should explicitly clarify how the textual description corresponds to the categories used in the figures and tables, and ensure consistent terminology throughout.
- Figure 4: This figure is not mentioned or discussed in the text. It should either be properly introduced and integrated into the discussion or removed.
- Discussion Section: The discussion is rather superficial. A more in-depth analysis of trends, limitations of current approaches, and future research directions would significantly enhance the paper’s value and impact.
Author Response
- Line 52-55: The sentence “…ensuring a higher degree of general satisfaction among the population” seems vague and out of context in a technical review. The authors should consider rephrasing or removing it unless they can clearly explain and support the concept of “general satisfaction” in the context of anomaly detection in public spaces.
To avoid confusing the reader regarding the concepts explained, it is proposed to remove the phrase “and ensuring a higher degree of general satisfaction among the population”.
- Lines 58–60 and Line 122: There is an inconsistency in the number of selected papers. The introduction (line 59) mentions 26 papers, whereas line 122 refers to 35. The authors must clarify the correct number and ensure consistency throughout the manuscript. Additionally, the phrase “thereby fulfilling the criteria for quality assessment” is ambiguous. The authors should explicitly define what these criteria are and how they were applied.
The numbering has been corrected (26 is replaced by 35). Additionally, to avoid ambiguities, the expression “thereby fulfilling the criteria for quality assessment” has been removed.
- Figure 2. This figure should represent data from the selected papers (26 or 35), not from the initial pool before screening. Presenting only the final selected papers would better align with the goals of a systematic review.
It was decided to highlight in bold the most significant words from the 35 selected papers, with a brief paragraph explaining this feature.
- Table 3. It is unclear whether the keyword counts refer to the actual keywords provided by the authors of the selected papers (which seems unlikely due to the high frequencies) or to the frequency with which the search keywords appear in those papers. The authors should clearly explain their methodology for this analysis.
The methodology used in the search involved counting the frequency with which the terms appeared in the full text of each document, thereby ensuring an objective measure of their prevalence within the analyzed literature.
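For transparency, the counting procedure could be reproduced with a short script along the following lines (a sketch only; the directory name and term list are placeholder assumptions, not taken from the review):

```python
# Illustrative sketch: count how often each search term appears in the full text
# of each selected paper. The folder path and term list are placeholders.
import re
from collections import Counter
from pathlib import Path

TERMS = ["crowd", "anomaly", "deep learning", "gathering"]  # example search terms

def term_frequencies(text, terms):
    """Case-insensitive whole-phrase counts of each term in a document's text."""
    text = text.lower()
    return Counter({t: len(re.findall(r"\b" + re.escape(t) + r"\b", text)) for t in terms})

totals = Counter()
for path in Path("papers_fulltext").glob("*.txt"):  # one plain-text file per selected paper
    totals += term_frequencies(path.read_text(encoding="utf-8", errors="ignore"), TERMS)

for term, count in totals.most_common():
    print(f"{term}: {count}")
```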
- Figure 3, Table 4, and lines 177–179. The categorization presented in Figure 3 and Table 4—Crowd aggregation, Abnormal crowd aggregation, and Critical abnormal crowd aggregation—is not clearly aligned with the textual description in lines 177–179. In those lines, the authors describe the third category as involving additional risk factors, such as the detection of concealed weapons or extreme anomalies. However, it is unclear whether this description is meant to define what they label as Critical abnormal crowd aggregation. To avoid confusion, the authors should explicitly clarify how the textual description corresponds to the categories used in the figures and tables and ensure consistent terminology throughout.
To enhance understanding of the different categories, the following clarification will be included in line 190: “Abnormal crowd aggregation refers to situations where behaviours deviate from the norm but may not involve immediate danger (for example, sudden changes in crowd density or unexpected movements that do not necessarily indicate violence). In contrast, critical abnormal crowd aggregation encompasses high-risk scenarios, such as the detection of concealed weapons or overtly violent behaviours that could pose a direct threat to public safety”.
- Figure 4. This figure is not mentioned or discussed in the text. It should either be properly introduced and integrated into the discussion or removed.
To integrate the image into the discussion, since it had not been previously referenced, it is proposed to add the expression in italics:
“Table 5 presents some of the significant datasets commonly employed in the papers retrieved in Section 2.2, and Figure 4 shows four examples of the contents of some of the datasets listed in Table 5”.
- Discussion section. The discussion is rather superficial. A more in-depth analysis of trends, limitations of current approaches, and future research directions would significantly enhance the paper’s value and impact.
This information has already been corrected and expanded upon in a previous revision (Reviewer 2).
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The author has revised the manuscript based on the feedback. As a review paper, it requires in-depth analysis in addition to summarizing existing research findings.
Author Response
The author has revised the manuscript based on the feedback. As a review paper, it requires in-depth analysis in addition to summarizing existing research findings.
We sincerely thank the reviewer for their careful assessment and constructive feedback. We fully agree that a review paper must go beyond summarizing existing works and should include in-depth technical and critical analysis.
In the revised manuscript, we have made a significant effort to integrate such analysis throughout the paper. We present a comparative discussion of the most relevant deep learning architectures used in abnormal behavior detection, highlighting their respective strengths, limitations, and suitability for different real-world surveillance contexts. We also examine how dataset characteristics—such as diversity of environments, camera perspectives, annotation granularity, and environmental conditions—influence the generalization capacity and robustness of AI models in this field.
Moreover, the discussion incorporates reflections on key aspects such as the impact of camera network planning and physical infrastructure on detection reliability. The manuscript also addresses the need for new types of annotated datasets to capture the complexity and variability of real-world scenarios, especially when dealing with rare or context-dependent anomalies.
Finally, we explore emerging research directions, including hybrid models that combine classical methods with deep learning, self-supervised learning strategies, and the integration of semantic reasoning via vision-language models.
We hope these clarifications show that the manuscript provides not only a comprehensive synthesis of existing literature, but also a meaningful critical perspective on the methodological and technological challenges shaping future work in this domain. We remain open to expanding or deepening the analysis in any specific area the reviewer deems appropriate.