Review

Detecting Abnormal Behavior Events and Gatherings in Public Spaces Using Deep Learning: A Review

by Rafael Rodrigo-Guillen, Nahuel Garcia-D’Urso, Higinio Mora-Mora and Jorge Azorin-Lopez *
Department of Computer Science and Technology, University of Alicante, 03690 San Vicente del Raspeig, Alicante, Spain
* Author to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2025, 14(4), 69; https://doi.org/10.3390/jsan14040069
Submission received: 22 April 2025 / Revised: 26 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025

Abstract

Public security is a crucial aspect of maintaining social order. Although crime rates in Western cultures may be considered socially acceptable, it is important to continually improve security measures to prevent potential risks. With the advancements in artificial intelligence methods, particularly in deep learning and computer vision, it has become possible to detect abnormal event patterns in groups of people. This paper presents a review of the deep learning techniques employed for identifying gatherings of people and detecting anomalous events to enhance public security. Some of the open research areas are identified, including the lack of works addressing multiple cases of anomalies in large concentrations of people, which leaves open an important avenue for future scientific work.

1. Introduction

Public security is a paramount concern in modern societies. While crime rates may be considered socially acceptable, ensuring the safety and well-being of individuals and communities remains a pressing need. In recent years, there has been a growing recognition that traditional approaches to public security may not be sufficient in addressing emerging threats and challenges [1].
One such challenge is the rise of new forms of violent and anti-social behavior, often exhibited by multicultural youth gangs. These gangs, comprising individuals from diverse backgrounds, engage in coordinated acts of physical brutality using weapons such as knives and blunt objects. Their activities pose a significant risk not only to rival gangs but also to innocent bystanders, increasing the urgency to find proactive solutions to mitigate potential harm.
Additionally, the changing dynamics of public demonstrations have further underscored the need for advanced security measures. Some Western countries are considering allowing spontaneous demonstrations without the obligation of prior notification to authorities. While this allows for greater freedom of expression, it also introduces potential risks. Spontaneous gatherings with illegal intentions can result in injuries to the general public and property damage, necessitating innovative approaches to identify and prevent such incidents.
Given these pressing challenges, there is a critical need to explore advanced technological solutions that can enhance public security measures. Artificial intelligence (AI) methods, particularly in the realms of deep learning and computer vision, offer promising avenues for detecting abnormal behavior patterns and gatherings in public spaces [2]. By harnessing the power of AI, it becomes possible to detect and predict potential security risks, allowing for timely interventions and improved public safety.
To understand the current capabilities and limitations of AI systems in this domain, it is essential to trace the historical evolution of video-based anomaly detection technologies. The automatic detection of anomalous events in video has evolved significantly over the past two decades. In the early 2000s, most approaches relied on traditional computer vision techniques with hand-crafted features and statistical models to describe normal behavior [3]. Techniques such as Gaussian mixture models for background subtraction or trajectory-based clustering using SVMs were common [4]. These classical systems were based on explicit assumptions and expert-defined rules, often combining motion vectors or texture patterns with machine learning methods such as k-NN [5], HMMs [6], random forests [7], or shallow neural networks [8]. Around the mid-2010s, a paradigm shift occurred with the advent of deep learning. Enabled by powerful hardware and larger video datasets, researchers began using deep neural networks to automatically learn features from data. Early architectures such as autoencoders and 2D CNNs modeled normal video patterns through reconstruction or prediction, triggering anomaly alerts when errors spiked [9,10,11].
Shortly thereafter, more specialized video architectures emerged. Temporal modeling was introduced through RNNs and LSTMs, as well as ConvLSTM and 3D CNNs like C3D or I3D, which learned spatiotemporal features directly from video [12]. Generative models such as GANs and VAEs allowed for the synthesis of future frames or latent scene representations, detecting anomalies through generation discrepancies [13]. More recently, the 2020s have seen a push toward even more generalizable and semantically rich approaches. Hybrid models now combine classical insights with deep learning, while self-supervised learning uses pretext tasks like frame ordering to learn useful representations [14]. Vision–language models (VLMs) have also entered the field, integrating textual context to better distinguish between benign unusual behavior and true security threats [15,16].
Existing reviews [17,18,19,20] in the field of computer vision have extensively explored the monitoring of individual or group behavior. However, much of this literature concentrates on detecting anomalous events after they have occurred or, at best, while they are still unfolding, rather than on the more challenging task of detecting them beforehand.
Considering the identified gaps in prior research on public safety risk detection, our research aims to investigate the potential of AI to provide highly accurate predictive tools for anticipating imminent anomalous events. Our interest lies in methods that go beyond existing detection techniques for anomalous events in large gatherings of people, enabling the anticipation of future events and ensuring a higher degree of general satisfaction among the population.
This paper aims to address these pressing concerns by conducting a literature review of AI methods, specifically in deep learning and computer vision, that are capable of detecting gatherings of people and abnormal events. The review identifies 35 papers on mass congregations of people and the analysis of event patterns that fulfill the quality assessment criteria. Additionally, a comprehensive examination of the data obtained from the selected studies is conducted, leading to the formulation of recommendations and guidelines to support future work in this research domain.
The remainder of the paper is organized as follows: Section 2 describes the steps used to select the primary studies of interest for the analysis. Section 3 presents the results of the searches and their quality assessment. Section 4 addresses the questions presented in Section 2.2. Section 5 discusses the different computer vision techniques, and Section 6 presents the conclusions.

2. Research Methodology

2.1. Prior Research

Regarding the identification of potential anomalous events with significant consequences in densely populated areas, our review of the literature revealed no conclusive systematic reviews. While the consulted sources primarily concentrate on detecting and monitoring groups or congregations of individuals, we did encounter noteworthy research that can serve as an initial exemplification, such as learning to predict crowd flows [21], monitoring of subjects and their relationships among many people [22], and identification of abnormal behavioral patterns in crowds [22].

2.2. Research Goals

In this phase, the questions that motivated the research were defined, together with the methods applied to locate and focus the information, such as the databases consulted and the keywords used. Six questions were considered, as presented in Table 1, where the column labeled “route” indicates the path followed for each question:
We mainly used the databases of Scopus, Web of Science, IEEE Xplore Digital Library, and Google Scholar for the collection of related papers, and depending on the platform, the search focused on the title, abstract, or keywords.
The search criteria combined terms such as citizen security, computer vision, and deep learning using the Boolean operators AND and OR. The strings used in the searches were as follows:
  • TI=(crowd gathering) OR TI=(spontaneous demonstration) AND TI=(crowd aggregation) AND TI=(abnormal aggregation) AND PY=(2017–2025);
  • TITLE-ABS-KEY ((crowd OR gathering) AND ( (movement AND patterns) OR (weapon AND detection) ) AND analysis AND (human OR person OR people)) AND PUBYEAR > 2016;
  • TITLE-ABS-KEY (“public safety”) AND TITLE-ABS-KEY (“computer vision”) AND TITLE-ABS-KEY (“deep learning”) AND PUBYEAR > 2016.
Because the growing number of smart cities deploy sensors that capture almost every kind of image, from traffic control to crowd counting, we focused on the results most directly related to public security: spontaneous gatherings of people in unusual areas (crowd aggregation evolution or abnormal crowd aggregation), the detection of carried objects such as bladed weapons, and abnormal behaviors such as running at unexpected times or in unexpected ways, falls, and similar events. Hence, once the results of the metasearch engines were obtained, a screening was carried out to keep only the publications relevant to our research, applying the exclusion/inclusion criteria presented in Section 2.3.

2.3. Inclusion and Exclusion Criteria

The outcomes of this literature review relied on the identification of crowd formations in which machine learning and object tracking techniques were employed to extract pertinent information about behaviors or patterns that deviate from commonly established norms. Table 2 shows the inclusion and exclusion criteria applied to filter the results obtained with the metasearch engines.

2.4. Selection Results and Data Extraction

The search strings in Section 2.2 returned 698 publications between 2017 and 2025. Once these 698 results related to the core of our research had been identified, the criteria shown in Table 2 were applied to review the ones that most closely matched the final objective and that could answer the questions raised in Section 2.2.
The first exclusion criterion applied (after removing duplicates) eliminated all results unrelated to the area of computer science, which left 481 publications. The review then focused on the titles of the remaining papers, which resulted in a further 261 being discarded, leaving 210. Then, we conducted a detailed review of the conclusions, summaries, and results. Following this final filtering step, 35 documents were considered the closest to our search. Figure 1 shows the papers discarded during the filtering process.
The selected papers highlight the recognition of events such as recognizing interpersonal distance and people density, predicting behavior based on human actions [23], detection of small objects at a low altitude, learning to predict crowd flows [21], early prediction of anomalies in surveillance cameras [24], vision-based systems for security and protection through machine learning techniques [5], learning from anomalous events in crowds [25], learning to prevent crime [26], monitoring of subjects and relationships with many people [22], identification of abnormal behavioral patterns in crowds [22], crowd flow predictions [27], detection of anomalies in both critical and noncritical crowded environments [28,29], and detection of knives or bladed weapons [30,31].
The 35 selected publications share characteristics that could contribute to a definitive solution and to the analysis of the state of the art. They cover many different aspects of citizen security, including detecting massive concentrations of people, counting the number of participating entities, detecting behavior patterns, locating objects within a mass concentration of people, and determining whether a detected object poses a social danger (knives, firearms, etc.) or is not relevant to public security.

2.5. Temporal Distribution of Publications

Regarding the dates of the publications found, Figure 2 shows the number of publications since 2017 concerning crowd analysis, movement patterns, and weapon detection, with 2023 showing a particularly sharp increase. This date range was chosen because interest in this type of technology began to grow in 2017, and it also limits the number of publications covered in this paper. Upon examining the graph, it becomes evident that there is growing interest in the matter.

2.6. Most Representative Keywords

To verify the alignment between the search results and the intended objective of our research, Table 3 presents a compilation of keywords that appeared across all revised papers, with the 35 selected studies after applying filters highlighted in bold. It is evident, as anticipated, that six keywords—detection, video, datasets, crowd, training, and prediction—emerge as the most frequently recurring terms. This finding substantiates the research interest in leveraging computer vision to enable intelligent analysis of crowd gatherings.

3. Findings and Quality Assessment

The obtained papers were thoroughly examined to facilitate their categorization into distinct groups, utilizing both quantitative and qualitative data. Quantitative data pertain to the information derived from experimental procedures, while qualitative data encompass insights drawn from the authors’ conclusions. Based on the conducted analysis, it was determined that the papers could be classified into three interrelated categories. Despite their interconnectedness, these categories exhibit notable distinctions that form a crucial criterion for the accurate assessment of quality.
The initial category can be referred to as “crowd aggregation” or “crowd gathering,” encompassing studies centered around the identification and analysis of typical crowd behaviors. Another category, termed “abnormal aggregation”, specifically focuses on detecting and studying unusual behaviors within crowds. Within the latter category, a further subdivision exists, known as “critical abnormal aggregation”, which pertains to the identification and examination of highly significant abnormal crowd behaviors. These categorizations correspond to the research questions outlined in Table 1.
By conducting this quality assessment, we can classify the primary studies and ascertain the quantity of papers positioned at the forefront of knowledge relevant to our research.
The first category encompasses studies dedicated to examining various aspects of crowd formations, such as people counting, peer relationships, crowd analysis, and distance estimation. In contrast, the second category delves deeper into abnormal behavioral patterns that deviate from what is conventionally considered “normal”. Lastly, the third category introduces an additional risk factor, focusing on studies involving the detection of concealed weapons within crowds or the identification of potentially violent or extreme anomalies.
Figure 3 displays a graph presenting the distribution of studies across the different categories based on their extracted and classified data. It illustrates the percentage composition of each category. Additionally, Table 4 provides a categorized reference list of the papers.
The identification and categorization of primary studies highlight the balance between research focused on the analysis of incident-free crowd gatherings (42%) and studies investigating anomalous situations (46%). Notably, only a minority of studies (12%) concentrate on critical anomalies, suggesting that the field of hazardous event detection is still in its nascent stage of development. Therefore, there is considerable value in expanding knowledge and understanding of such critical situations.

4. Analysis of the Results

In this section, a discussion about the questions raised in Section 2 of this paper is presented. These questions serve as important guiding points for our research and explore the key aspects related to the detection of abnormal events in large gatherings of people using artificial intelligence methods. By addressing these questions, we aim to gain a deeper understanding of the existing knowledge and explore potential solutions to the identified research gaps. Through a comprehensive analysis and synthesis of the available literature, this section aims to provide valuable insights and contribute to the advancement of the field of public security through AI-based anomaly detection techniques.

4.1. Q1. What Is Considered a Gathering of People?

It is important to establish the concept of a gathering as the initial foundation for the research. Because the existing literature provides no clear definition, we take a minimum group size of 20 individuals as our starting point for anomaly detection. Results that provide a head count and indicate a gathering of more than that number (or a similar one) of individuals are therefore useful. In addition, it is important to be able to track individuals and the interactions between them. Elias et al. [22] present real-time videos capturing individuals walking in densely populated areas such as squares, shopping malls, and train stations. The study aims to understand both crowd behavior and the distinction between individual elements and group entities. This is achieved by enclosing each subject within a frame, characterizing them, and analyzing their behavior based on proximity to others in consecutive video frames. Additionally, they employ the deep high-resolution representation learning neural network [40] to extract information from body poses and joints, enabling the identification of relationships and human interactions. In a similar line, Fitwi et al. [35] estimate the interpersonal distance between entities in the dataset and measure the density of the nearest crowd using E-SEC. This approach considers the interpersonal distance between dynamic objects, the occupied area of the crowd, and its density. The dataset was captured using CCTV cameras, and the YOLOv3 algorithm was applied for detection and tracking purposes.
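To make the notions of interpersonal distance and local density concrete, the sketch below computes approximate ground-plane distances between detected people from their bounding boxes and derives a crude density value. It is an illustration only, not the E-SEC method of Fitwi et al.; the pixel-to-meter calibration factor and the box array are assumptions supplied by the caller.

```python
import numpy as np

def interpersonal_stats(boxes_xyxy, px_per_meter=50.0):
    """Rough interpersonal-distance and density estimates from person boxes.

    boxes_xyxy: array of shape (N, 4) with [x1, y1, x2, y2] per detected person.
    px_per_meter: assumed calibration factor (a real system would use a
    homography or camera calibration instead of a single scalar).
    """
    boxes = np.asarray(boxes_xyxy, dtype=float)
    if len(boxes) < 2:
        return None
    # Approximate each person's ground position by the bottom-center of the box.
    feet = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0, boxes[:, 3]], axis=1)
    diff = feet[:, None, :] - feet[None, :, :]
    dist_m = np.sqrt((diff ** 2).sum(-1)) / px_per_meter
    pairwise = dist_m[np.triu_indices(len(boxes), k=1)]
    # Crude density: people per square meter of the area spanned by the detections.
    width_m = np.ptp(feet[:, 0]) / px_per_meter
    height_m = np.ptp(feet[:, 1]) / px_per_meter
    area_m2 = max(width_m * height_m, 1e-6)
    return {
        "min_distance_m": float(pairwise.min()),
        "mean_distance_m": float(pairwise.mean()),
        "density_ppl_per_m2": len(boxes) / area_m2,
    }
```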
It is important to determine which of a multitude of people may correspond to a family nucleus or a couple, as well as the interactions between them and the rest of the group. Understanding the concept of group gatherings, tracking individual elements in the dataset, and analyzing interactions between entities (including differentiation between family and non-family units) are crucial for accurately detecting spontaneous crowd flows that occur within a short timeframe. Moreover, the significance of people congregating can vary based on external factors beyond control, such as holidays that encourage participation in demonstrations or rainy days when individuals prefer to stay indoors. Therefore, predicting crowd behavior using multisource data (including vacation schedules, public holidays, and traffic data) becomes imperative. In [21], the ST-B-ResNet method leverages multisource data and employs a BRBM data reconstruction mechanism for crowd flow analysis.

4.2. Q2. Can AI Recognize What a Gathering of People Is?

Deep learning can be employed to detect gatherings of people by utilizing trained neural networks that can recognize specific patterns in images or videos associated with such groups or situations. This process entails utilizing a dataset comprising images or videos showcasing instances of people congregating, followed by training a neural network that learns from this dataset and can subsequently identify common characteristics of these gatherings.
One possible strategy involves utilizing object detection algorithms like YOLO, which have the ability to detect and track objects in video sequences. By training a model using YOLO on a collection of image and video datasets containing people congregations, the model can learn to identify specific patterns within these groups. For instance, the algorithm can be trained to detect clusters of people in close proximity or individuals looking toward a shared point of interest.
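A minimal sketch of this strategy is shown below. It assumes the open-source ultralytics YOLO package and a generic pretrained COCO model (both are assumptions for illustration, not the specific detectors used in the reviewed papers), counts the persons detected in each frame, and applies the 20-person heuristic discussed in Section 4.1.

```python
import cv2
from ultralytics import YOLO  # assumed dependency; any person detector would work

model = YOLO("yolov8n.pt")    # generic pretrained COCO weights; class 0 is "person"
GATHERING_THRESHOLD = 20      # minimum group size used as a starting point in Section 4.1

def count_people(frame, min_conf=0.4):
    """Return the number of persons detected in a single video frame."""
    result = model(frame, verbose=False)[0]
    return sum(1 for b in result.boxes
               if int(b.cls) == 0 and float(b.conf) >= min_conf)

cap = cv2.VideoCapture("street_camera.mp4")  # hypothetical input video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    n = count_people(frame)
    if n >= GATHERING_THRESHOLD:
        print(f"Possible gathering detected: {n} people in frame")
cap.release()
```

In a complete system, the per-frame count would be smoothed over time and combined with tracking before raising an alert.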
An alternative method involves employing a convolutional neural network (CNN) to analyze images or videos depicting gatherings of people, enabling it to learn and identify behavioral patterns based on the visual features extracted from these visual inputs. For instance, the CNN can be trained to recognize distinct body shapes, sizes, and the spatial relationships among individuals within the scenes.
Upon training the network, it serves as a foundation for detecting people gatherings in new videos or images that were not included in the training datasets. This detection capability holds potential for numerous applications, including crowd behavior management, CCTV surveillance, and tracking, as well as social distancing monitoring.

4.3. Q3. How Can We Detect with Videos the Difference Between a Peaceful and a Non-Peaceful Gathering?

For those occasions when a gathering may lead to violent actions, the system should be able to predict and even anticipate them. Once the number of participating individuals has been established, the next question is whether the behavior of that crowd corresponds to something that could be called normal or whether, on the contrary, actions outside the norm are evident. The study of the actions of people in the foreground allows a learning model to be established [29] that helps determine whether a behavior is anomalous, highlighting those behaviors that produce a prediction error. The work of Nauman and Shoaib [38] mainly focuses on detecting anomalies within a crowd of people, fusing a Mask R-CNN convolutional network with ResNet-101 as the main architecture. These studies are useful when analyzing actions in a gathering of people, but they do not consider the combination of other determining factors, nor whether a bladed weapon or a blunt object can be found in the crowd. In these cases, it is necessary to use deep learning to detect other important factors that are not taken into account in the previously mentioned papers. Recent methods such as MSMC-Net utilize multiscale spatiotemporal attention to detect crowd-level anomalies such as turbulence and counterflow, enabling real-time risk scoring without tracking individuals [49].
Table 5 presents several significant datasets commonly used in the papers from Section 2.2. To clarify the types of anomalous situations commonly addressed in the literature, two representative examples are described below:
  • Example 1: Movement in the Opposite Direction to the Crowd Flow: A typical anomaly in video surveillance scenarios involves detecting an individual walking in the opposite direction to the prevailing movement of the crowd. While this behavior does not necessarily imply violence, it does constitute a significant deviation from the usual collective dynamics. Such anomalies may indicate potential risk situations, such as attempted escapes, suspicious behavior, or the possible onset of an incident.
  • Example 2: Sudden Transition from Normal to Violent Behavior: Another paradigmatic situation involves a group of people who, initially, are behaving in a completely peaceful or routine manner. However, at a given moment, two individuals begin to fight or engage in physical aggression. These events highlight the importance of automated systems being able not only to identify atypical patterns but also to detect, at an early stage, the transition between normal behaviors and scenarios that may become dangerous.
Both examples underscore the need for artificial intelligence-based solutions capable of efficiently anticipating or detecting both subtle deviations in group behavior and the emergence of imminent risk situations for public safety.
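Relating to Example 1 above, one deliberately simplified way to flag movement against the dominant crowd flow is to compare per-region optical-flow directions with the global average direction. The sketch below uses OpenCV's Farnebäck dense optical flow; the grid size, the static-cell cutoff, and the angular threshold are arbitrary assumptions for illustration.

```python
import cv2
import numpy as np

def counterflow_regions(prev_gray, curr_gray, grid=8, angle_thresh_deg=120.0):
    """Return grid cells whose average motion opposes the dominant crowd flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    global_dir = flow.reshape(-1, 2).mean(axis=0)          # dominant scene motion
    global_angle = np.arctan2(global_dir[1], global_dir[0])
    anomalies = []
    for gy in range(grid):
        for gx in range(grid):
            cell = flow[gy * h // grid:(gy + 1) * h // grid,
                        gx * w // grid:(gx + 1) * w // grid]
            v = cell.reshape(-1, 2).mean(axis=0)
            if np.linalg.norm(v) < 0.5:                    # skip nearly static cells
                continue
            angle = np.arctan2(v[1], v[0])
            # Wrapped angular difference to the dominant direction, in degrees.
            diff = np.degrees(abs(np.arctan2(np.sin(angle - global_angle),
                                             np.cos(angle - global_angle))))
            if diff > angle_thresh_deg:                    # roughly opposite direction
                anomalies.append((gx, gy))
    return anomalies
```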

4.4. Q4. What Are the Different Types of Deep Learning Networks and Architectures Used for Abnormal Event and Gathering Detection?

Deep learning can be employed to detect anomalous events in crowd gatherings by training neural networks to recognize patterns that deviate from normal behavior. To achieve this differentiation, two distinct datasets are utilized: one representing “normal behavior” and the other representing “anomalous behavior.” The neural network is trained using these datasets to accurately identify and distinguish such deviations. For example, the ACSAM model uses CNNs with abnormality-aware training layers designed for dense crowds, improving accuracy by over 12% compared to previous deep learning methods [46].
One potential approach is to employ anomaly detection algorithms, such as autoencoders or GANs, to recognize patterns that diverge from the anticipated behavior of a group captured in video datasets. For instance, an autoencoder can be trained using a dataset of regular occurrences (e.g., individuals walking in a public area) and subsequently leverage this training to detect events that deviate from the norm (e.g., an individual running or a sudden convergence of a group towards a specific location). Similarly, GANs can also be utilized to generate images depicting regular occurrences and subsequently identify images that deviate from the anticipated pattern.
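As an illustration of the autoencoder-based strategy described above, the following PyTorch sketch defines a small convolutional autoencoder intended to be trained on frames assumed to contain only normal activity; at inference time, frames with a high reconstruction error are treated as anomalous. The architecture, the placeholder data loader, and the calibrated threshold are all assumptions made here for illustration, not any specific published model.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Tiny convolutional autoencoder for grayscale surveillance frames."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, frame):
    """Per-frame reconstruction error; higher values suggest abnormal content."""
    model.eval()
    with torch.no_grad():
        return torch.mean((model(frame) - frame) ** 2).item()

model = FrameAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
# Training on normal footage only (hypothetical DataLoader of (B, 1, H, W) tensors):
# for batch in normal_frames_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(batch), batch)
#     loss.backward()
#     optimizer.step()
# At inference, flag frames where anomaly_score(model, frame) > calibrated_threshold.
```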
Another approach involves employing a neural network, specifically a CNN, to classify events as either normal or abnormal. This method entails training the CNN using a labeled dataset of images or videos that indicate the normality of each event. Subsequently, the trained CNN can be applied to analyze new images or videos and classify them accordingly based on these two models.
In both instances, continuous training with new data enables the neural network to undergo refinements, enabling it to adapt to evolving behavior patterns and enhance its predictive capabilities in detecting anomalies. The insights gained from this detection process hold applicability in diverse domains, including public safety and the management of crowds within public or private spaces.
A notable innovation is the PublicVision system, which integrates a Swin Transformer to classify crowd behavior by size and violence level, while securing surveillance data using encrypted VPNs [43].

4.5. Q5. What Are the Performance and Accuracy of These Methods on Public Security Datasets?

The performance and accuracy of the previously discussed CNNs and autoencoders can vary due to several factors, including the dataset’s size and quality, the complexity of the addressed problem, the chosen model architecture, and the evaluation metrics employed to gauge performance. Although CNNs and GANs are frequently employed in the domain of public safety, each exhibits distinct strengths. CNNs excel in image segmentation, classification, and object detection, making them highly effective in such tasks. On the other hand, GANs are predominantly utilized for synthetic image generation and enhancing low-resolution images. CNNs and GANs have demonstrated significant advancements in performance and accuracy when applied to public safety datasets. GANs have been successfully employed to generate highly realistic facial images, contributing to the advancement of facial recognition systems.
Autoencoders have found extensive utility in a range of public safety applications, including anomaly detection, surveillance, and image and video analysis. Nonetheless, the performance and accuracy of autoencoders when applied to public safety datasets can fluctuate based on factors such as the specific application, dataset quality and size, and the design of the autoencoder model.
Autoencoders have demonstrated encouraging outcomes across diverse public safety applications. Particularly in anomaly detection, they have been employed to detect and characterize aberrant behavior or events in surveillance videos or sensor data. The accuracy of the anomaly detection system hinges on the quality of the input data as well as the design of the autoencoder.
Autoencoders have found utility in various image and video analysis tasks, including compression, reconstruction, and super-resolution. They possess the capability to acquire a compressed representation of the input data through training and subsequently employ this representation to reconstruct the original image or video.
Nevertheless, it is crucial to acknowledge that the performance and accuracy of autoencoders in public safety datasets can be impacted by ethical and legal considerations associated with the utilization of these technologies.
Specifically in anomaly detection, deep models have been used to identify and characterize aberrant behaviors or events. One such model, based on an Optimized Deep Maxout architecture, achieved 97.28% accuracy in anomaly detection using evolutionary optimization techniques [45]. Another, an enhanced SlowFast model with attention and dropout mechanisms, detects five public-space behaviors at up to 40.5 FPS, meeting real-time requirements [44].
A practical implementation using YOLOv5 combined with Twilio API enabled the real-time detection of overcrowding and alert dispatch in educational and public facilities, demonstrating strong applicability in live environments [47].

4.6. Q6. What Key Elements of the Datasets Can Provide Evidence of the Outcome of the Gathering, and Which Datasets Are Most Commonly Used to Monitor Gatherings Using Deep Learning?

To efficiently conduct computer vision-based monitoring of a crowd, it is essential to utilize datasets that offer ample information and learning capabilities, enabling us to achieve conclusive outcomes in our monitoring objectives.
Deep CNNs have been trained to estimate kinetic energy and entropy from simulated evacuations, capturing crowd macro-behavioral shifts useful for anomaly detection [42]. Group-level feature learning has also enabled crowd behavior reconstruction without tracking individuals, enhancing flexibility and realism in simulated datasets [59]. Hybrid models combining deep learning with enhanced social force simulations enable realistic crowd behavior modeling with fewer collisions and better handling of high-density scenarios [50].
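To ground the idea of macroscopic crowd quantities, the snippet below computes a per-frame kinetic-energy proxy and a Shannon entropy over movement directions from tracked per-person velocities. These particular definitions (unit mass, an eight-bin direction histogram) are simplifying assumptions made here for illustration, not the formulations used in the cited works.

```python
import numpy as np

def crowd_macroscopic_quantities(velocities, n_bins=8):
    """velocities: array (N, 2) of per-person velocity vectors in one frame.

    Returns a kinetic-energy proxy (unit mass assumed) and the Shannon entropy
    of the movement-direction histogram: coarse descriptors of how energetic
    and how disordered the crowd motion is.
    """
    v = np.asarray(velocities, dtype=float)
    kinetic_energy = 0.5 * np.sum(v[:, 0] ** 2 + v[:, 1] ** 2)
    angles = np.arctan2(v[:, 1], v[:, 0])
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    direction_entropy = float(-np.sum(p * np.log(p)))
    return float(kinetic_energy), direction_entropy
```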
Table 6 presents some of the significant datasets commonly employed in the papers retrieved from Section 2.2.

5. Discussion

Following the analysis in Section 4 of the questions raised in Section 2, a concise table was derived featuring selected papers that highlight distinct technologies, datasets, and objectives. Table 7 provides a summarized overview of the various computer vision techniques employed.
Upon acquiring these data, the identification of a solution becomes feasible, leading to the proposal of three potential solutions to address our questions on gatherings.
The problem definition focuses on “what is happening” in an agglomeration of people confronted with a given event, and on whether there is a relationship between the observed actions and that event. The system should behave much like a human operator who, after observing a certain behavior on a monitor, is able to realize that something anomalous is happening. Table 8 presents the accuracy results of the selected papers in different scenarios.
The current literature contains other methods capable of detecting anomalies in a video sequence with high precision. These methods can, for example, detect if a vehicle is in an area designated for pedestrians only or if people are running in typically calm areas. However, these methods lack the ability to interpret whether there is violence or the potential for an altercation in the area. Table 9 presents a summary of the leading state-of-the-art techniques for detecting anomalous events in images and videos.
In addition to algorithmic performance, the reliability of abnormal behavior detection systems also depends on the physical configuration of the surveillance infrastructure. Camera planning is thus a critical factor in ensuring the effectiveness of AI-based monitoring systems. A strategic deployment and configuration of a camera network maximizes the visual coverage of urban areas, intersections, and public squares, minimizing blind spots and optimizing the capture of unusual behaviors. Indeed, camera planning is considered the first step when designing surveillance networks for public safety applications [76]. Networked cameras, when interconnected and often centrally managed, can collaborate to detect anomalous events more robustly than isolated units. A well-planned camera network facilitates continuous tracking of individuals or crowds across different zones, enhancing the detection of suspicious activities or unauthorized gatherings from multiple viewpoints [77,78].
A careful network design must consider environmental characteristics (urban versus rural, open versus narrow spaces) and the specific surveillance objectives. Recent research highlights that coverage requirements and optimization algorithms for camera placement must be tailored to the type of scene being monitored—for example, busy urban areas require different strategies than rural settings [79]. In deep learning-based anomaly detection systems, it becomes essential to plan the location, angle, and range of cameras to ensure that relevant events are not missed. Proper multi-camera coverage not only increases the likelihood of capturing unusual activity but also reduces false positives by corroborating incidents from different perspectives [80,81].
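As a concrete illustration of coverage-driven camera planning, the sketch below applies a standard greedy set-cover heuristic: given candidate camera positions with precomputed visibility sets over a grid of points of interest (both supplied by the user and assumed here), it repeatedly selects the camera that covers the most still-uncovered points. This is a generic heuristic for illustration, not one of the planning algorithms proposed in the cited works.

```python
def greedy_camera_placement(visibility, targets, max_cameras):
    """Choose camera positions that cover as many target cells as possible.

    visibility: dict mapping candidate camera id -> set of target cells it sees.
    targets: set of all target cells that should be observed.
    max_cameras: installation budget.
    """
    uncovered = set(targets)
    chosen = []
    for _ in range(max_cameras):
        if not uncovered:
            break
        # Greedy step: take the candidate covering the most uncovered cells.
        best = max(visibility, key=lambda cam: len(visibility[cam] & uncovered))
        gain = visibility[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    coverage = 1.0 - len(uncovered) / max(len(targets), 1)
    return chosen, coverage

# Hypothetical example with three candidate positions over six grid cells:
# cams = {"c1": {1, 2, 3}, "c2": {3, 4}, "c3": {4, 5, 6}}
# greedy_camera_placement(cams, {1, 2, 3, 4, 5, 6}, max_cameras=2)
# -> (["c1", "c3"], 1.0)
```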
Modern surveillance systems are further enhanced by smart cameras equipped with embedded analytics that can operate in coordination [82]. These edge-computing cameras are capable of detecting objects and behavioral patterns in real time and communicating with other nodes in the network. The recent literature underscores the value of such intelligent multi-camera systems, from multi-camera surveillance reviews to scene-aware planning strategies adapted to different geographic contexts [83,84]. Optimal camera planning—supported by networked cameras and computer vision algorithms—forms a critical technological foundation for reliable anomaly detection in public environments. This integration enhances not only accuracy but also the responsiveness of the entire surveillance system, ensuring a more effective early warning and mitigation framework.
Another important aspect to discuss is that, despite recent progress, deep learning-based systems for abnormal event detection still face significant limitations due to the lack of representative datasets. Most current models are trained on traditional datasets such as UCSD, CUHK Avenue, or UCF-Crime, which do not adequately reflect the diversity of real-world surveillance scenarios. There is a growing need for datasets that encompass multiple environments (e.g., streets, transit stations, and campuses), varied camera perspectives (e.g., fixed, mobile, and PTZ), and different environmental conditions. Recent studies emphasize that the quality and diversity of datasets directly influence model performance and that the absence of comprehensive training data remains a critical barrier to effective anomaly detection [85,86]. To address this, new benchmarks such as the NWPU Campus dataset have been proposed, offering extensive multi-scenario coverage, diverse viewpoints, and a rich taxonomy of anomalous and normal behaviors, significantly advancing model generalization capabilities [87]. Additionally, enhancing dataset annotations has become essential [88]. While many existing datasets adopt unsupervised or weakly supervised formulations, future datasets should provide richer labels, such as temporal markers, spatial localization, and anomaly type classification (e.g., violence, accidents, unauthorized gatherings, fires, or structural failures). Including both human-related and environmental anomalies would help reflect the full spectrum of real-world public safety scenarios.
In recent years, the field of abnormal behavior detection in video using deep learning has seen significant growth across multiple countries. According to a literature review restricted to peer-reviewed articles published in indexed journals and top-tier international conferences, China has emerged as a global leader in this domain, with a high volume of publications focused on anomaly detection in surveillance footage, supported by institutions like the Chinese Academy of Sciences, Shanghai Jiao Tong University, Northwestern Polytechnical University, and Tsinghua University. The United States follows closely, with prominent contributions from universities such as the University of Central Florida, Carnegie Mellon University, and the University of California system, particularly in integrating spatiotemporal modeling and generative learning approaches. Other countries with increasing research output include India, which has ranked second in the number of publications in recent years, with active participation from the Indian Institutes of Technology. South Korea also stands out, especially through Sejong University and Yonsei University.
European institutions also play a central role in advancing this field. Universities like the University of Amsterdam, the Technical University of Munich, and ETH Zurich have published influential work in unsupervised learning, vision–language integration, and multi-view anomaly detection. In the United Kingdom, the University of Oxford and Imperial College London have contributed methods focused on real-time detection and privacy-aware video analytics. Overall, research efforts are geographically diverse but consistently emphasize deep learning-based methods, including hybrid and self-supervised approaches, tailored to real-world surveillance scenarios.
Finally, scientific advancements in video anomaly detection have rapidly translated into practical applications within the electronic security industry. Several companies now offer intelligent surveillance cameras equipped with embedded deep learning analytics capable of autonomously detecting abnormal behaviors. A notable example is Avigilon (https://docs.avigilon.com, accessed on 29 June 2025) (Motorola Solutions), which has introduced Unusual Motion Detection (UMD)—a system that continuously learns the typical activity in a monitored scene without relying on predefined rules. By highlighting unusual movements, the system allows security operators to quickly review large volumes of footage and focus on potentially critical incidents. Axis Communications (https://www.axis.com, accessed on 29 June 2025) has also integrated deep learning capabilities into its latest camera models. With dedicated deep learning processing units (DLPUs), Axis cameras can detect anomalies directly at the edge without the need for centralized servers, enabling scalable, real-time surveillance with reduced infrastructure demands.
In the global market, leading Chinese companies such as Hikvision (https://www.hikvision.com/, accessed on 29 June 2025) and Dahua (https://www.dahuasecurity.com, accessed on 29 June 2025) have incorporated AI-driven behavioral analytics into their product lines. Hikvision, for instance, offers real-time abnormal pattern detection through its HikCentral platform, particularly suited for high-traffic environments like airports or stadiums. These systems can automatically raise alerts upon detecting unusual crowd dynamics, such as sudden gatherings or erratic movement. In parallel, software-focused companies like Scylla AI (https://www.scylla.ai, accessed on 29 June 2025) have emerged, offering plug-and-play solutions that enhance existing camera systems with real-time behavior recognition. Their platforms can identify a range of abnormal events—including violence, theft, or vandalism—across multiple video streams simultaneously. These developments demonstrate how deep learning is empowering surveillance systems to move from passive recording to proactive, intelligent monitoring, aligning commercial tools with the cutting-edge techniques explored in academic research.

6. Conclusions

This paper provides a review of methods and techniques that use computer vision and machine learning to determine a gathering of people and analyze abnormal events. The findings reveal that while there are various approaches in the literature, deep learning-based systems are gaining prominence due to their ability to learn complex patterns and achieve higher accuracy in detecting gatherings and analyzing behavior. While some works address the problem individually using classical computer vision methods, the adoption of deep learning techniques offers significant advancements in detecting and understanding gatherings of people. Deep learning models, with their capacity to automatically extract relevant features from large-scale datasets, have shown superior performance in detecting abnormal events and behaviors.

However, despite the progress made, there is a notable gap in the literature when it comes to a comprehensive approach that addresses the entire pipeline of gathering detection. While some studies focus on detecting normal behavior in gatherings using one-class learning systems, there is still a need for further research on the comprehensive detection and combination of abnormalities. This entails not only identifying anomalous behaviors but also classifying them and understanding their implications for public security.

Furthermore, the utilization of different types of devices, including both mobile and static ones, for gathering detection presents an important avenue for future scientific work. By leveraging the capabilities of mobile devices such as smartphones and static devices like surveillance cameras, it becomes possible to enhance the coverage and effectiveness of gathering detection systems.

Author Contributions

Conceptualization, H.M.-M. and J.A.-L.; methodology, H.M.-M. and J.A.-L.; formal analysis, R.R.-G. and N.G.-D.; investigation, R.R.-G. and N.G.-D.; writing—original draft preparation, R.R.-G. and N.G.-D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Haghani, M.; Coughlan, M.; Crabb, B.; Dierickx, A.; Feliciani, C.; van Gelder, R.; Geoerg, P.; Hocaoglu, N.; Laws, S.; Lovreglio, R.; et al. A roadmap for the future of crowd safety research and practice: Introducing the Swiss Cheese Model of Crowd Safety and the imperative of a Vision Zero target. Saf. Sci. 2023, 168, 106292. [Google Scholar] [CrossRef]
  2. Kuppusamy, P.; Bharathi, V. Human abnormal behavior detection using CNNs in crowded and uncrowded surveillance – A survey. Meas. Sens. 2022, 24, 100510. [Google Scholar] [CrossRef]
  3. Azorin-Lopez, J.; Saval-Calvo, M.; Fuster-Guillo, A.; Garcia-Rodriguez, J.; Mora-Mora, H. Constrained self-organizing feature map to preserve feature extraction topology. Neural Comput. Appl. 2017, 28, 439–459. [Google Scholar] [CrossRef]
  4. Azorin-Lopez, J.; Saval-Calvo, M.; Fuster-Guillo, A.; Garcia-Rodriguez, J.; Cazorla, M.; Signes-Pont, M.T. Group activity description and recognition based on trajectory analysis and neural networks. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 1585–1592. [Google Scholar]
  5. Azorín-López, J.; Saval-Calvo, M.; Fuster-Guilló, A.; García-Rodríguez, J. Human behaviour recognition based on trajectory analysis using neural networks. In Proceedings of the The 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–7. [Google Scholar]
  6. Wang, L.; Zhou, Y.; Li, R.; Ding, L. A fusion of a deep neural network and a hidden Markov model to recognize the multiclass abnormal behavior of elderly people. Knowl.-Based Syst. 2022, 252, 109351. [Google Scholar] [CrossRef]
  7. Abdalla, M.; Javed, S.; Radi, M.A.; Ulhaq, A.; Werghi, N. Video Anomaly Detection in 10 Years: A Survey and Outlook. arXiv 2024, arXiv:2405.19387. [Google Scholar]
  8. Azorin-Lopez, J.; Saval-Calvo, M.; Fuster-Guillo, A.; Garcia-Rodriguez, J. A novel prediction method for early recognition of global human behaviour in image sequences. Neural Process. Lett. 2016, 43, 363–387. [Google Scholar] [CrossRef]
  9. Hu, X.; Lian, J.; Zhang, D.; Gao, X.; Jiang, L.; Chen, W. Video anomaly detection based on 3D convolutional auto-encoder. Signal Image Video Process. 2022, 16, 1885–1893. [Google Scholar] [CrossRef]
  10. Borja-Borja, L.F.; Azorin-Lopez, J.; Saval-Calvo, M.; Fuster-Guillo, A.; Sebban, M. Architecture for automatic recognition of group activities using local motions and context. IEEE Access 2022, 10, 79874–79889. [Google Scholar] [CrossRef]
  11. Borja-Borja, L.F.; Azorin-Lopez, J.; Saval-Calvo, M.; Fuster-Guillo, A. Deep learning architecture for group activity recognition using description of local motions. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  12. Wang, L.; Tan, H.; Zhou, F.; Zuo, W.; Sun, P. Unsupervised anomaly video detection via a double-flow ConvLSTM variational autoencoder. IEEE Access 2022, 10, 44278–44289. [Google Scholar] [CrossRef]
  13. Feng, X.; Song, D.; Chen, Y.; Chen, Z.; Ni, J.; Chen, H. Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 5546–5554. [Google Scholar]
  14. Yu, G.; Wang, S.; Cai, Z.; Zhu, E.; Xu, C.; Yin, J.; Kloft, M. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 583–591. [Google Scholar]
  15. Sun, S.; Hua, J.; Feng, J.; Wei, D.; Lai, B.; Gong, X. TDSD: Text-driven scene-decoupled weakly supervised video anomaly detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 5055–5064. [Google Scholar]
  16. Chen, W.; Ma, K.T.; Yew, Z.J.; Hur, M.; Khoo, D.A.A. TEVAD: Improved video anomaly detection with captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5549–5559. [Google Scholar]
  17. Bendali-Braham, M.; Weber, J.; Forestier, G.; Idoumghar, L.; Muller, P.A. Recent trends in crowd analysis: A review. Mach. Learn. Appl. 2021, 4, 100023. [Google Scholar] [CrossRef]
  18. Ansari, M.A.; Singh, D.K. Human detection techniques for real time surveillance: A comprehensive survey. Multimed. Tools Appl. 2021, 80, 8759–8808. [Google Scholar] [CrossRef]
  19. Tyagi, B.; Nigam, S.; Singh, R. A Review of Deep Learning Techniques for Crowd Behavior Analysis. Arch. Comput. Methods Eng. 2022, 29, 5427–5455. [Google Scholar] [CrossRef]
  20. Bhuiyan, M.R.; Abdullah, J.; Hashim, N.; Farid, F.A. Video analytics using deep learning for crowd analysis: A review. Multimed. Tools Appl. 2022, 81, 27895–27922. [Google Scholar] [CrossRef]
  21. Zhai, Z.; Liu, P.; Zhao, L.; Qian, J.; Cheng, B. An efficiency-enhanced deep learning model for citywide crowd flows prediction. Int. J. Mach. Learn. Cybern. 2021, 12, 1879–1891. [Google Scholar] [CrossRef]
  22. Elias, P.; Macko, M.; Sedmidubsky, J.; Zezula, P. Tracking subjects and detecting relationships in crowded city videos. Multimed. Tools Appl. 2022, 83, 15339–15361. [Google Scholar] [CrossRef]
  23. Kim, S.; Hwang, S.; Hong, S.H. Identifying shoplifting behaviors and inferring behavior intention based on human action detection and sequence analysis. Adv. Eng. Inform. 2021, 50, 101399. [Google Scholar] [CrossRef]
  24. Emad, M.; Ishack, M.; Ahmed, M.; Osama, M.; Salah, M.; Khoriba, G. Early-Anomaly Prediction in Surveillance Cameras for Security Applications. In Proceedings of the 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference, MIUCC 2021, Cairo, Egypt, 26–27 May 2021; pp. 124–128. [Google Scholar] [CrossRef]
  25. Lin, W.; Gao, J.; Wang, Q.; Li, X. Learning to detect anomaly events in crowd scenes from synthetic data. Neurocomputing 2021, 436, 248–259. [Google Scholar] [CrossRef]
  26. Zheng, Z.; Xia, Y.; Chen, X.; Yao, J. Security alert: Generalized deep multi-view representation learning for crime forecasting. Comput. Intell. 2022, 39, 4–17. [Google Scholar] [CrossRef]
  27. Guo, H.; Zhang, D.; Jiang, L.; Poon, K.W.; Lu, K. ASTCN: An Attentive Spatial-Temporal Convolutional Network for Flow Prediction. IEEE Internet Things J. 2022, 9, 3215–3225. [Google Scholar] [CrossRef]
  28. Khaire, P.; Kumar, P. A semi-supervised deep learning based video anomaly detection framework using RGB-D for surveillance of real-world critical environments. Forensic Sci. Int. Digit. Investig. 2022, 40, 301346. [Google Scholar] [CrossRef]
  29. Fang, J.; Zhang, X.; Yang, B.; Chen, S.; Li, B. An Attention-based U-Net Network for Anomaly Detection in Crowded Scenes. In Proceedings of the 2022 IEEE 14th International Conference on Computer Research and Development, ICCRD 2022, Shenzhen, China, 7–9 January 2022; pp. 202–206. [Google Scholar] [CrossRef]
  30. Galab, M.K.; Taha, A.; Zayed, H.H. Adaptive Technique for Brightness Enhancement of Automated Knife Detection in Surveillance Video with Deep Learning. Arab. J. Sci. Eng. 2021, 46, 4049–4058. [Google Scholar] [CrossRef]
  31. Fernandez-Carrobles, M.M.; Deniz, O.; Maroto, F. Gun and Knife Detection Based on Faster R-CNN for Video Surveillance. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2019; Volume 11868, LNCS. [Google Scholar] [CrossRef]
  32. Boltes, M.; Schumann, J.; Salden, D. Gathering of data under laboratory conditions for the deep analysis of pedestrian dynamics in crowds. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017, Lecce, Italy, 29 August–1 September 2017. [Google Scholar] [CrossRef]
  33. Yang, D.S.; Liu, C.Y.; Liao, W.H.; Ruan, S.J. Crowd gathering and commotion detection based on the stillness and motion model. Multimed. Tools Appl. 2020, 79, 19435–19449. [Google Scholar] [CrossRef]
  34. Alqaysi, H.H.; Sasi, S. Detection of Abnormal behavior in Dynamic Crowded Gatherings. In Proceedings of the 2013 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 23–25 October 2013. [Google Scholar] [CrossRef]
  35. Fitwi, A.; Chen, Y.; Sun, H.; Harrod, R. Estimating interpersonal distance and crowd density with a single-edge camera. Computers 2021, 10, 143. [Google Scholar] [CrossRef]
  36. Zhou, X.; Wang, X.; Brown, G.; Wang, C.; Chin, P. Mixed Spatio-Temporal Neural Networks on Real-time Prediction of Crimes. In Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Pasadena, CA, USA, 13–16 December 2021; pp. 1749–1754. [Google Scholar] [CrossRef]
  37. Huang, Z.; Wang, P.; Zhang, F.; Gao, J.; Schich, M. A mobility network approach to identify and anticipate large crowd gatherings. Transp. Res. Part B Methodol. 2018, 114, 147–170. [Google Scholar] [CrossRef]
  38. Nauman, M.A.; Shoaib, M. Identification of anomalous behavioral patterns in crowd scenes. Comput. Mater. Contin. 2022, 71, 925–939. [Google Scholar] [CrossRef]
  39. Yang, C.L.; Wu, T.H.; Lai, S.H. Moving-Object-Aware Anomaly Detection in Surveillance Videos. In Proceedings of the 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021. [Google Scholar] [CrossRef]
  40. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  41. Sultani, W.; Chen, C.; Shah, M. Real-World Anomaly Detection in Surveillance Videos. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  42. Zhou, S.; Shi, R.; Wang, L. Extracting macroscopic quantities in crowd behaviour with deep learning. Phys. Scr. 2024, 99, 065213. [Google Scholar] [CrossRef]
  43. Qaraqe, M.K.; Elzein, A.; Basaran, E.; Yang, Y.; Varghese, E.B.; Costandi, W.; Rizk, J.; Alam, N. PublicVision: A Secure Smart Surveillance System for Crowd Behavior Recognition. IEEE Access 2024, 12, 26474–26491. [Google Scholar] [CrossRef]
  44. Peng, Y.; Hao, H.; Zhou, T.; Han, B.; Yin, W. Research on the Detection Algorithm for Abnormal Crowd Behaviors Based on an Enhanced SlowFast Model. In Proceedings of the 2024 9th International Conference on Computer and Communication Systems (ICCCS), Xi’an, China, 19–22 April 2024; pp. 67–72. [Google Scholar] [CrossRef]
  45. Chaudhary, R.; Kumar, M. Optimized deep maxout for crowd anomaly detection: A hybrid optimization-based model. Network 2024, 36, 148–173. [Google Scholar] [CrossRef] [PubMed]
  46. Wu, Y.; Qiu, L.; Wang, J.; Feng, S. The use of convolutional neural networks for abnormal behavior recognition in crowd scenes. Inf. Process. Manag. 2025, 62, 103880. [Google Scholar] [CrossRef]
  47. Jadhav, N.; Rangdale, S.; Solav, S.; Gayake, N.; Nanware, S.; Ranjan, N.M. Utilizing YOLO V5 and deep learning Approach to detect and manage crowds through advanced computational methodologies and Twilio Programmable Messaging API. In Proceedings of the 2024 OPJU International Technology Conference (OTCON), Raigarh, India, 5–7 June 2024; pp. 1–5. [Google Scholar] [CrossRef]
  48. Rajitha, B. Intelligent Vision-Based Systems for Public Safety and Protection via Machine Learning Techniques; IGI Global Scientific Publishing: Hershey, PA, USA, 2021. [Google Scholar] [CrossRef]
  49. Luo, L.; Xie, S.; Yin, H.; Peng, C.; Ong, Y. Detecting and Quantifying Crowd-Level Abnormal Behaviors in Crowd Events. IEEE Trans. Inf. Forensics Secur. 2024, 19, 6810–6823. [Google Scholar] [CrossRef]
  50. Yan, D.; Ding, G.; Huang, K.; Bai, C.; He, L.; Zhang, L. Enhanced Crowd Dynamics Simulation with Deep Learning and Improved Social Force Model. Electronics 2024, 13, 934. [Google Scholar] [CrossRef]
  51. Sultani, W.; Chen, C.; Shah, M. Real-world Anomaly Detection in Surveillance Videos. arXiv 2019, arXiv:1801.04264. [Google Scholar] [CrossRef]
  52. Landi, F.; Snoek, C.G.; Cucchiara, R. Anomaly Locality in Video Surveillance. arXiv 2019, arXiv:1901.10364. [Google Scholar]
  53. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar]
  54. Ferryman, J.; Shahrokni, A. PETS2009: Dataset and challenge. In Proceedings of the 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Snowbird, UT, USA, 7–9 December 2009; pp. 1–6. [Google Scholar] [CrossRef]
  55. Mehran, R.; Oyama, A.; Shah, M. Abnormal crowd behavior detection using social force model. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 935–942. [Google Scholar] [CrossRef]
  56. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar] [CrossRef]
  57. Chan, A.B.; Morrow, M.; Vasconcelos, N. Analysis of Crowded Scenes using Holistic Properties. In Proceedings of the Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2009), Miami, FL, USA, 7–12 December 2009. [Google Scholar]
  58. Cheng, M.; Cai, K.; Li, M. RWF-2000: An Open Large Scale Video Database for Violence Detection. arXiv 2020, arXiv:1911.05913. [Google Scholar] [CrossRef]
  59. Lu, Z.; Guan, X.; Yan, D.; Li, Y.; Huang, T. Crowd behavior reconstruction with deep group feature learning. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024; pp. 5699–5706. [Google Scholar] [CrossRef]
  60. Yahuarcani, I.O.; Diaz, J.E.G.; Satalaya, A.M.N.; Noriega, A.A.D.; Cachique, F.X.L.; Llaja, L.A.S.; Pezo, A.R.; Rojas, A.E.L. Recognition of violent actions on streets in urban spaces using Machine Learning in the context of the Covid-19 pandemic. In Proceedings of the 2021 International Conference on Electrical, Computer and Energy Technologies (ICECET), Cape Town, South Africa, 9–10 December 2021. [Google Scholar] [CrossRef]
  61. Zhou, Y.; Qu, Y.; Xu, X.; Shen, F.; Song, J.; Shen, H. BatchNorm-based Weakly Supervised Video Anomaly Detection. arXiv 2023, arXiv:2311.15367. [Google Scholar] [CrossRef]
  62. Liu, W.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Diversity-Measurable Anomaly Detection. arXiv 2023, arXiv:2303.05047. [Google Scholar] [CrossRef]
  63. Xiao, J.; Liu, T.; Ji, G. Divide and Conquer in Video Anomaly Detection: A Comprehensive Review and New Approach. arXiv 2023, arXiv:2309.14622. [Google Scholar] [CrossRef]
  64. Hachiuma, R.; Sato, F.; Sekii, T. Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling. arXiv 2023, arXiv:2303.15270. [Google Scholar] [CrossRef]
  65. Chen, Y.; Liu, Z.; Zhang, B.; Fok, W.; Qi, X.; Wu, Y.C. MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection. arXiv 2022, arXiv:2211.15098. [Google Scholar] [CrossRef]
  66. Naji, Y.; Setkov, A.; Loesch, A.; Gouiffès, M.; Audigier, R. Spatio-temporal predictive tasks for abnormal event detection in videos. arXiv 2023, arXiv:2210.15741. [Google Scholar] [CrossRef]
  67. Reiss, T.; Hoshen, Y. Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection. arXiv 2022, arXiv:2212.00789. [Google Scholar] [CrossRef]
  68. Mohammadi, H.; Nazerfard, E. Video Violence Recognition and Localization Using a Semi-Supervised Hard Attention Model. arXiv 2022, arXiv:2202.02212. [Google Scholar] [CrossRef]
  69. Wu, J.C.; Hsieh, H.Y.; Chen, D.J.; Fuh, C.S.; Liu, T.L. Self-Supervised Sparse Representation for Video Anomaly Detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
  70. Georgescu, M.I.; Ionescu, R.; Khan, F.S.; Popescu, M.; Shah, M. A Background-Agnostic Framework with Adversarial Training for Abnormal Event Detection in Video. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4505–4523. [Google Scholar] [CrossRef]
  71. Hirschorn, O.; Avidan, S. Normalizing Flows for Human Pose Anomaly Detection. arXiv 2023, arXiv:2211.10946. [Google Scholar] [CrossRef]
  72. Garcia-Cobo, G.; SanMiguel, J.C. Human skeletons and change detection for efficient violence detection in surveillance videos. Comput. Vis. Image Underst. 2023, 233, 103739. [Google Scholar] [CrossRef]
  73. Lv, H.; Zhou, C.; Cui, Z.; Xu, C.; Li, Y.; Yang, J. Localizing Anomalies From Weakly-Labeled Videos. IEEE Trans. Image Process. 2021, 30, 4505–4515. [Google Scholar] [CrossRef] [PubMed]
  74. Wang, G.; Wang, Y.; Qin, J.; Zhang, D.; Bao, X.; Huang, D. Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles. arXiv 2022, arXiv:2207.10172. [Google Scholar] [CrossRef]
  75. Islam, Z.; Rukonuzzaman, M.; Ahmed, R.; Kabir, M.H.; Farazi, M. Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021. [Google Scholar] [CrossRef]
  76. Su, Q.; Shu, L.; Hancke, G.P.; Huang, K.; Nurellari, E.; Zhao, Q.; Choudhury, N.; Hazarika, A. Camera planning for physical safety of outdoor electronic devices: Perspective and analysis. IEEE/CAA J. Autom. Sin. 2025. [Google Scholar] [CrossRef]
  77. Pereira, S.S.; Maia, J.E.B. MC-MIL: Video surveillance anomaly detection with multi-instance learning and multiple overlapped cameras. Neural Comput. Appl. 2024, 36, 10527–10543. [Google Scholar] [CrossRef]
  78. Xie, Z.; Ni, Z.; Yang, W.; Zhang, Y.; Chen, Y.; Zhang, Y.; Ma, X. A robust online multi-camera people tracking system with geometric consistency and state-aware re-id correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024. [Google Scholar] [CrossRef]
  79. Wu, H.; Zeng, Q.; Guo, C.; Zhao, T.; Chen, C.W. Target-Aware Camera Placement for Large-Scale Video Surveillance. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 13338–13348. [Google Scholar] [CrossRef]
  80. Jakob, P.; Madan, M.; Schmid-Schirling, T.; Valada, A. Multi-perspective anomaly detection. Sensors 2021, 21, 5311. [Google Scholar] [CrossRef] [PubMed]
  81. Hsieh, Y.H.; Kao, C.C.; Lai, C.H.; Lin, K.P.; Yang, S.Y.; Yuan, S.M. Low-FPS Multi-Object Multi-Camera Tracking via Deep Learning. Electronics 2025, 14, 1373. [Google Scholar] [CrossRef]
  82. Cob-Parro, A.C.; Losada-Gutiérrez, C.; Marrón-Romera, M.; Gardel-Vicente, A.; Bravo-Muñoz, I. Smart video surveillance system based on edge computing. Sensors 2021, 21, 2958. [Google Scholar] [CrossRef] [PubMed]
  83. Dharan, A.M.; Mukhopadhyay, D. A comprehensive survey on machine learning techniques to mobilize multi-camera network for smart surveillance. Innov. Syst. Softw. Eng. 2025, 21, 313–332. [Google Scholar] [CrossRef]
  84. Li, C.; Li, J.; Xie, Y.; Nie, J.; Yang, T.; Lu, Z. Multi-camera joint spatial self-organization for intelligent interconnection surveillance. Eng. Appl. Artif. Intell. 2022, 107, 104533. [Google Scholar] [CrossRef]
  85. Zhu, L.; Wang, L.; Raj, A.; Gedeon, T.; Chen, C. Advancing video anomaly detection: A concise review and a new dataset. arXiv 2024, arXiv:2402.04857. [Google Scholar]
  86. Wang, Y.; Zhao, Y.; Huo, Y.; Lu, Y. Multimodal anomaly detection in complex environments using video and audio fusion. Sci. Rep. 2025, 15, 1–22. [Google Scholar] [CrossRef]
  87. Cao, C.; Lu, Y.; Wang, P.; Zhang, Y. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20392–20401. [Google Scholar]
  88. Ramachandra, B.; Jones, M. Street scene: A new dataset and evaluation protocol for video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2569–2578. [Google Scholar]
Figure 1. Investigations discarded after filtering.
Figure 2. Number of publications over time.
Figure 3. Distribution of publications over time.
Table 1. Research questions.
Question | Route
Q1. What is considered to be a gathering of people? | The foundational concept at the outset of this research is the notion of a gathering. Valuable outcomes can be achieved by accurately determining the head count and confirming the presence of a group surpassing a specified threshold (a minimal counting sketch follows this table). It is also crucial to establish the capability of tracking individuals and their interactions.
Q2. Can AI recognize what a gathering of people is? | Deep learning techniques combined with well-trained models play a crucial role in real-time crowd scene monitoring. By employing AI techniques and leveraging carefully curated datasets alongside mathematical models, organized congregations of individuals can be distinguished from mere streams of people occupying the same vicinity.
Q3. How can we detect with videos the difference between a peaceful and a nonpeaceful gathering? | When a gathering may lead to violent actions, the system should be able to predict and even anticipate them. Once the number of participating individuals has been established, the system must determine whether the crowd's behavior can be considered normal or whether, on the contrary, actions outside the norm are evident.
Q4. What are the different types of deep learning networks and architectures used for abnormal event and gathering detection? | This question is addressed by examining several prevalent learning mechanisms frequently employed in anomaly detection, including autoencoders, GANs, and diverse variations of CNNs.
Q5. What are the performance and accuracy of these methods on public security datasets? | The performance and accuracy of the CNNs and autoencoders examined in the preceding question depend on several factors: the size and quality of the datasets, the complexity of the problem under consideration, the architectural design of the models, and the choice of evaluation metrics. Consequently, our research focuses on identifying the situations in which one AI technique is more advantageous or suitable than another.
Q6. What key elements of the datasets can provide evidence of the outcome of the gathering, and which datasets are most commonly used to monitor gatherings using deep learning? | Studies are needed that analyze actions within a gathering of people while also considering other determining factors, such as whether a bladed weapon or blunt object is present in the crowd. In these cases, deep learning is needed to detect these additional factors.
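As a complement to Q1, the following minimal sketch illustrates how a per-frame head count, obtained for example from a person detector or a crowd-counting network, could be thresholded and required to persist over several frames before an interval is reported as a gathering. This is our own illustration rather than a method from the reviewed studies; the function name and the threshold values are assumptions chosen for clarity.

```python
# Illustrative sketch (not from the reviewed works): flag a "gathering" when the
# per-frame head count exceeds a chosen threshold for a sustained number of
# consecutive frames. The thresholds below are arbitrary example values.
from typing import List, Tuple


def detect_gatherings(head_counts: List[int],
                      min_people: int = 15,
                      min_frames: int = 30) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) intervals where a gathering is sustained."""
    intervals = []
    start = None
    for frame, count in enumerate(head_counts):
        if count >= min_people:
            if start is None:
                start = frame
        else:
            if start is not None and frame - start >= min_frames:
                intervals.append((start, frame - 1))
            start = None
    # Close an interval that runs to the end of the sequence.
    if start is not None and len(head_counts) - start >= min_frames:
        intervals.append((start, len(head_counts) - 1))
    return intervals


if __name__ == "__main__":
    # Counts could come from a person detector or a crowd-counting network.
    counts = [3, 5, 18, 22, 25, 24, 26, 30, 4, 2]
    print(detect_gatherings(counts, min_people=15, min_frames=3))  # [(2, 7)]
```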
Table 2. Inclusive and exclusive criteria.
Decisions | Related to
Inclusive decisions | Related to people detection; related to the detection of objects; related to public safety; related to behavioral patterns
Exclusive decisions | Not related to deep learning techniques; not related to camera analysis (computer vision); not related to engineering (computer science)
Table 3. Inclusive and exclusive criteria (keyword filters).
Keyword (count):
detection (903), video (569), datasets (534), crowd (512)
training (443), prediction (442), network (430), object (349)
flow (264), anomaly (259), camera (204), crime (201)
ratio (197), abnormal (186), performance (176), count (167)
people (160), convolutional (152), framework (144), public (136)
distance (128), tracking (114), density (108), recognition (108)
deep learning (90), architecture (89), target (86), behavior (80)
pattern (76), anomalous (75)
Table 4. Categorized primary studies.
Category | Primary studies
Crowd aggregation | [21,22,25,27,28,32,33,34,35,36,37]
Abnormal crowd aggregation | [20,23,24,25,28,30,31,38,39,40,41,42,43,44,45,46,47]
Critical abnormal aggregation | [26,39,48,49,50]
Table 5. Significant datasets for crowd analysis.
Dataset | Type | No. of videos/images | Method | Model
UCF-Crime [51] | Surveillance | 2047 | Abnormal event detection | Autoencoders, CNN, 3D-VAE, transfer learning
UCF-Crime2Local [52] | Surveillance | 13,662 | Abnormal event detection | CNN, spatiotemporal autoencoder, transfer learning
CrowdHuman [53] | Crowd | 15,000 images (5000 annotated) | Pedestrian detection, multi-person tracking | Faster R-CNN, RetinaNet, YOLOv3, Cascade R-CNN
PETS2009 [54] | Surveillance | 24,259 | Object tracking | Multiple Object Tracking (MOT), R-CNN, DeepSORT
UMN Dataset [55] | Surveillance | 1200 | Object detection | Faster R-CNN, YOLOv3, SSD, Mask R-CNN
ShanghaiTech [56] | Crowd | 1119 | Crowd counting (see the density-map sketch after this table) | CNNs, DCNN, MCNN, FCN
UCSD [57] | Pedestrian | Over 2000 | Anomaly detection | CNN, VAEs, GANs, transfer learning
RWF-2000 [58] | Surveillance | 2000 | Abnormal event detection | LSTM, CNN, reinforcement learning
PublicVision [43] | Surveillance crowd | 1413 videos | Behavior and violence-level recognition | Swin Transformer
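For the crowd-counting entries in Table 5 (e.g., ShanghaiTech with MCNN-style models [56]), the usual formulation is density-map regression: the network predicts a per-pixel density map and the estimated count is its sum. The sketch below shows only this data flow with a toy, untrained convolutional head; it is not the MCNN architecture, and the count printed from random weights is meaningless.

```python
# Minimal sketch of density-map crowd counting: the model outputs a single-channel
# density map and the crowd count is obtained by summing it. The tiny network is an
# illustrative stand-in, not a model from the reviewed literature.
import torch
import torch.nn as nn


class TinyDensityHead(nn.Module):
    """Toy regressor mapping an RGB frame to a single-channel density map."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv2d(16, 8, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=1),  # density map output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)


if __name__ == "__main__":
    model = TinyDensityHead()          # untrained: only the data flow is shown
    frame = torch.rand(1, 3, 384, 512)  # dummy surveillance frame
    with torch.no_grad():
        density = model(frame)          # shape: (1, 1, 384, 512)
    estimated_count = density.sum().item()  # count = integral of the density map
    print(f"Estimated people in frame: {estimated_count:.1f}")
```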
Table 6. Reported performance metrics of selected crowd analysis methods (a timing sketch follows this table).
Method | Application | Metric | Value | Reference
Optimized deep maxout | Anomaly detection | Accuracy | 97.28% | [45]
Enhanced SlowFast | Detection of 5 behaviors in public spaces | Processing speed (FPS) | 40.5 FPS | [44]
YOLOv5 + Twilio API | Overcrowding detection | Real-time response (latency) | Not defined | [47]
ACSAM | Anomaly detection in dense crowds | Accuracy improvement over previous methods | >12% | [46]
PublicVision (Swin Transformer + encrypted VPN) | Crowd behavior classification | Global accuracy; mean average precision (mAP); inference latency | 89.76%; 93.3%; ∼20 frames/inference | [43]
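Throughput and latency figures such as those in Table 6 are usually obtained by timing repeated forward passes over fixed-size clips after a short warm-up. The sketch below times a small placeholder 3D-convolutional model; it is not the Enhanced SlowFast [44] or PublicVision [43] implementation, and the measured value depends entirely on the hardware used.

```python
# Hedged sketch of how frames-per-second throughput is commonly measured:
# average wall-clock time of repeated forward passes on a fixed-size clip.
# The model below is a toy placeholder, not a method from the review.
import time
import torch
import torch.nn as nn

# Placeholder "video model": 3D convolution over (batch, channels, frames, H, W).
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 2),  # e.g., normal vs. abnormal
).eval()

clip = torch.rand(1, 3, 16, 224, 224)  # one 16-frame RGB clip

with torch.no_grad():
    for _ in range(3):                 # warm-up passes
        model(clip)
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    elapsed = time.perf_counter() - start

fps = runs * clip.shape[2] / elapsed   # video frames processed per second
print(f"Approximate throughput: {fps:.1f} FPS")
```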
Table 7. Summary of different techniques in video analysis.
Study | Objective | Model and method | Tool | Dataset | Site
[33] | Determine the stillness state in every different place and situation | RNN; stillness level model and the leaky bucket model (LBM) | Not defined | PETS2009 | Outdoor and indoor
[34] | Generate an alarm for the security personnel to take appropriate actions | DADCG algorithm; MHI and optical flow techniques (see the flow sketch after this table) | CVST—Matlab R2012a | YouTube | Outdoor and indoor
[25] | Reduce over-fitting caused by scarcity of data | CNN: C3D, GAN, N3D, ResNet | SHADE | UCSD, UCF-Crime, UMN | Outdoor
[26] | Develop a unified model to predict crime, leveraging latent relationships from social media | Smote-TomekLinks iterative, 1D-CNN | Not defined | Twitter, 4-month city-wide dataset | Outdoor
[28] | Detection of anomalies in real-world surveillance sites | CNN MobileNet | Not defined | RGB+D, UCF-Crime2Local | Outdoor and indoor
[60] | Detect violent actions on streets in urban spaces | CNN | A conventional laptop, VLC tool | Created by the authors | Outdoor and indoor
[31] | Develop two novel weapon detectors applying deep learning | Faster R-CNN, RPN | Not defined | COCO 2017, Gupta dataset, Open Images dataset | Outdoor and indoor
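Several of the works summarized in Table 7 rely on motion cues such as optical flow and motion history images (e.g., [34]). The sketch below is an illustration rather than the authors' pipeline: it computes dense Farneback optical flow between consecutive frames with OpenCV and flags frames whose mean motion magnitude exceeds an arbitrary threshold. The video path and the threshold are assumptions.

```python
# Illustrative motion-cue sketch: dense optical flow magnitude as a simple
# indicator of sudden crowd movement. The input path and the threshold are
# placeholder assumptions.
import cv2
import numpy as np

cap = cv2.VideoCapture("crowd_clip.mp4")  # hypothetical input video path
ok, prev = cap.read()
if not ok:
    raise SystemExit("Could not read the input video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow field: one (dx, dy) vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    if magnitude.mean() > 2.0:  # illustrative threshold for "high motion"
        print(f"High crowd motion at frame {frame_idx}")
    prev_gray = gray
    frame_idx += 1

cap.release()
```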
Table 8. Quantitative results of crowd analysis models across various categories.
Category | Sub-category | Accuracy
Crowd aggregation | Liquor law violation | 59.8% [26]
Crowd aggregation | Narcotics | 76.8% [26]
Crowd aggregation | Prostitution | 86.9% [26]
Crowd aggregation | Deceptive practice | 71.0% [26]
Crowd aggregation | Grappling | 95.0% [60]
Crowd aggregation | Arrest | 63.5% [25]
Crowd aggregation | Chase | 59.1% [25]
Crowd aggregation | Run | 63.2% [25]
Abnormal crowd aggregation | Robbery | 77.3% [26]
Abnormal crowd aggregation | Theft | 74.7% [26]
Abnormal crowd aggregation | Assault | 75.1% [26]
Abnormal crowd aggregation | Punching | 92.0% [60]
Abnormal crowd aggregation | Kicking | 93.0% [60]
Abnormal crowd aggregation | Fight | 55.1% [25]
Abnormal crowd aggregation | Knife detection | 85.44% [31]
Critical abnormal aggregation | Homicide | 70.0% [26]
Critical abnormal aggregation | Gun detection | 46.68% [31]
Critical abnormal aggregation | Shoot | 83.2% [25]
Critical abnormal aggregation | Strangulation | 90.0% [60]
Critical abnormal aggregation | Kidnapping | 52.1% [26]
Critical abnormal aggregation | Arson | 54.6% [26]
Critical abnormal aggregation | Weapons violation | 77.3% [26]
Table 9. Performance of anomaly detection methods across various datasets. UCF-Crime, UCSD, and ShanghaiTech results are reported as AUC (%); RWF-2000 results are reported as accuracy (%). A sketch of the frame-level AUC computation follows this table.
Dataset | Method and result
UCF-Crime | [61] 87.24; [65] 86.98; [69] 85.99; [73] 85.38
UCSD | [62] 99.7; [66] 98.9; [70] 98.7
ShanghaiTech | [63] 87.72; [67] 85.94; [71] 85.9; [74] 84.3
RWF-2000 | [64] 93.4; [68] 90.4; [72] 90.25; [75] 89.75
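The AUC values in Table 9 are typically computed at the frame level: each frame receives an anomaly score that is compared against a binary ground-truth label through the ROC curve, whereas RWF-2000 reports clip-level classification accuracy. The snippet below is a minimal illustration with synthetic scores and labels; it does not reproduce any of the cited results.

```python
# Minimal illustration of the frame-level AUC metric used in Table 9.
# The labels and scores below are synthetic placeholders, not dataset values.
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = anomalous frame, 0 = normal frame (ground-truth annotation).
labels = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0])
# Per-frame anomaly scores produced by a detector (higher = more anomalous).
scores = np.array([0.05, 0.10, 0.20, 0.80, 0.75, 0.60, 0.15, 0.30, 0.90, 0.10])

auc_percent = 100.0 * roc_auc_score(labels, scores)
print(f"Frame-level AUC: {auc_percent:.2f}%")
```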