Deep Learning for Abnormal Human Behavior Detection in Surveillance Videos—A Survey

Abstract: Detecting abnormal human behaviors in surveillance videos is crucial for various domains, including security and public safety. Many successful detection techniques based on deep learning models have been introduced. However, the scarcity of labeled abnormal behavior data poses significant challenges for developing effective detection systems. This paper presents a comprehensive survey of deep learning techniques for detecting abnormal human behaviors in surveillance video streams. We categorize the existing techniques into three approaches: unsupervised, partially supervised, and fully supervised. Each approach is examined in terms of its underlying conceptual framework, strengths, and drawbacks. Additionally, we provide an extensive comparison of these approaches using popular datasets frequently used in prior research, highlighting their performance across different scenarios. We summarize the advantages and disadvantages of each approach for abnormal human behavior detection. We also discuss open research issues identified through our survey, including enhancing robustness to environmental variations through diverse datasets and formulating strategies for contextual abnormal behavior detection. Finally, we outline potential directions for future development to pave the way for more effective abnormal behavior detection systems.


Introduction
Abnormal human behavior detection involves identifying unusual behavior or state transitions in a targeted subject. Behavior deviating from the norm is deemed abnormal [1]. In surveillance video monitoring, video footage from static cameras is analyzed for such behaviors [2][3][4][5][6]. The field of abnormal behavior detection primarily focuses on security and public safety, promoting the well-being of society [7,8]. Surveillance video provides valuable visual information within a defined field of view for detecting abnormal human behaviors [9][10][11].
Several survey papers have compared abnormal behavior detection techniques using various evaluation metrics, such as accuracy, equal error rate (EER), and area under the curve (AUC), meaning the area under the receiver operating characteristic (ROC) curve [57][58][59][60][61][62][63]. These metrics are crucial for assessing the effectiveness of different approaches in identifying abnormal behaviors. In recent years, abnormal behavior prediction analysis has utilized a variety of deep learning algorithms [64]. Deep learning techniques for abnormal human behavior detection can be classified into three approaches: unsupervised, partially supervised, and fully supervised learning [65]. Specifically, the weakly supervised and semi-supervised learning paradigms are referred to as partially supervised learning [66]. The scarcity of abnormal human behavior data poses a significant challenge [67]. Hence, unsupervised and partially supervised detection approaches serve as alternatives to address the data scarcity issue [68][69][70]. In these approaches, the model learns normal behavior patterns from the input data during the training phase [59]. The model then detects abnormal behaviors by identifying deviations from the learned patterns or by comparing new data against the identified normal behavior clusters. However, not all unsupervised learning schemes are equally effective in detecting abnormalities in images or videos. Two commonly used unsupervised learning approaches for abnormal human behavior detection are reconstruction-based [71] and generative detection [72] methods.
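The AUC and EER metrics mentioned above can both be derived from per-frame anomaly scores and ground-truth labels via a single threshold sweep. The following sketch is our own illustration (not code from any surveyed work) of how the two metrics relate:

```python
import numpy as np

def auc_and_eer(scores, labels):
    """ROC-AUC and equal error rate (EER) from per-frame anomaly scores
    (higher = more anomalous) and binary labels (1 = abnormal frame)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    tpr, fpr = [0.0], [0.0]
    # Sweep every distinct score as a detection threshold, high to low.
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr.append((pred & (labels == 1)).sum() / pos)
        fpr.append((pred & (labels == 0)).sum() / neg)
    tpr, fpr = np.array(tpr + [1.0]), np.array(fpr + [1.0])
    # Area under the ROC curve by the trapezoidal rule.
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)
    # EER: the operating point where the false positive rate and the
    # false negative rate are (approximately) equal.
    fnr = 1.0 - tpr
    i = np.argmin(np.abs(fnr - fpr))
    return auc, (fnr[i] + fpr[i]) / 2.0

# Toy example: the three abnormal frames mostly receive higher scores.
auc, eer = auc_and_eer([0.1, 0.2, 0.3, 0.9, 0.8, 0.25], [0, 0, 0, 1, 1, 1])
```

Because AUC integrates over all thresholds while EER picks one balanced operating point, the two metrics can rank methods differently, which is why surveys typically report both.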
In summary, this study provides a comprehensive examination of methods across the three deep-learning-based detection approaches, addressing evident gaps in earlier surveys such as limited coverage of research from the past five years, the scarcity of abnormal human behavior data, and limited performance comparisons on popular datasets from prior research.

Literature Review Methodology
The survey was conducted on the literature published in major journals. Their online platforms were explored to locate the latest articles regarding deep learning techniques for abnormal human behavior detection in surveillance videos. To ensure comprehensiveness, several reputable academic websites were consulted, including Web of Science, IEEE Xplore, Google Scholar, Science Direct, Scopus, ACM, and MDPI. Figure 1 illustrates the trend in the number of papers focusing on deep learning for abnormal human behavior detection over the past five years, from 2019 to 2023. With this upward trend, the topic of abnormal human behavior detection research using deep learning is becoming increasingly prominent, as evidenced by the substantial number of research publications in this field, particularly in 2023.
Electronics 2024, 13, x FOR PEER REVIEW

Several combinations of keywords were used, including "deep learning", "unsupervised learning", "weakly-supervised learning", "semi-supervised learning", "anomaly detection", and "video surveillance", to search for relevant articles. In total, 1284 results were obtained from several academic journal search engines. Among these, 416 results were from Science Direct, 388 from Google Scholar, 93 from IEEE Xplore, and 63 from Scopus. The remaining results were obtained from the Web of Science, ACM, and MDPI search engines. Figure 2 illustrates the distribution of related published papers on abnormal human behavior detection by search engine. From the collected set of articles, all those published before 2019 were removed. Additionally, priority was given to journal articles. The remaining articles were screened to exclude those not related to abnormal behavior detection, particularly those that do not focus on video surveillance and static cameras. Then, the remaining papers were categorized into unsupervised, partially supervised, and fully supervised detection approaches. Finally, the screening process was completed with 97 papers, including 13 survey papers from related works. Approximately 90% of the selected papers are from the past four years, since 2020. Only 9% of the selected papers were published in 2019.

Contributions of the Paper
The objective of this paper is to assess the strengths and weaknesses of various abnormal behavior detection techniques within the domain of deep learning. Additionally, it provides an overview of recent breakthroughs from studies conducted over the past five years. Each piece of research is meticulously analyzed, including the abnormal behavior datasets utilized, the performance evaluation results of the models in terms of AUC or accuracy, and an examination of the pros and cons associated with each prior research endeavor. This paper places particular emphasis on deep learning techniques, thereby narrowing the focus compared to earlier review papers. The key contributions of this paper are as follows:

1. Categorizing deep learning techniques for abnormal human behavior detection into three main detection approaches: unsupervised, partially supervised, and fully supervised.
2. Discussing the strengths and drawbacks of each learning scheme for training a deep learning model for abnormal human behavior detection.
3. Conducting a comprehensive comparison of the performances of deep-learning-based abnormal human behavior detection techniques on popular benchmarking datasets.
4. Exploring open research issues in the field of abnormal human behavior detection in surveillance videos.

Organization of the Paper
This paper is organized as follows: Section 2 explains the types of abnormal human behavior detection and the prior research. Section 3 presents popular abnormal behavior datasets utilized in prior research works. Section 4 surveys the deep learning techniques for abnormal human behavior detection in surveillance videos, categorizing them into unsupervised, partially supervised, and fully supervised approaches. Section 5 discusses the open research issues of the current deep learning techniques. Finally, Section 6 presents the conclusion of this survey paper. Figure 3 illustrates the organizational structure of this survey to facilitate navigation through the paper.

Abnormal Human Behavior Detection
Abnormal human behaviors (AHB) involve observing actions of human-like entities and identifying unusual patterns in behavior that deviate from the norm. These behaviors are labeled as 'abnormal' because they diverge from typical environmental contexts [73]. The rapid detection of abnormal behavior is crucial in real-time settings, particularly in environments where public safety is paramount [74]. AHB detection poses challenges due to the dynamic visual characteristics influenced by environmental conditions and the nature of abnormal actions [75].

Types of Abnormal Behaviors
In the process of abnormal human behavior detection, certain strategies prioritize detection time. Abnormal human behaviors are generally classified into two types: short-term and long-term abnormal behaviors. Further elaboration on these classifications is provided in the subsequent subsections.


Short-Term Abnormal Behaviors
Short-term abnormal human behavior refers to behaviors that deviate from the norm and can be identified by analyzing a relatively short duration of video frames. Decisions regarding such behaviors can be made immediately as their consequences become apparent in real time [76]. Examples of short-term abnormal behaviors include fires, running, falling, crowding, throwing objects, fighting, trespassing, and moving in opposite directions.
Researchers have explored the early detection of burning fires in real-time scenarios [12,77,78]. Fires, whether resulting from accidents or intentional human acts, are classified as short-term abnormal behavior, as they can be detected from a single-frame view of fire. The act of human running can be considered abnormal behavior in environments where most individuals typically walk [17]. Numerous studies have explored the detection of running behavior as a significant aspect [13][14][15][16]. Visual differentiation between walking and running can be achieved using a single image [79]. Hence, running can be classified as short-term abnormal behavior. A fall condition occurs when a person loses body balance and ends up in an unstable position [80]. Research has been conducted on fall detection, establishing it as one of the classifications for short-term abnormal behavior detection [18][19][20][21][22]. A crowd is defined as a condition where two or more people are closely grouped within a single frame [23]. Detecting crowds is crucial, as it often signifies an abnormal event [24][25][26][27][28].
Throwing objects involves the act of hurling potentially dangerous or harmful items [31]. Numerous studies on detecting this behavior emphasize the risk posed by the throwing of prohibited items [29,30]. Identifying suspicious objects in the air from a single frame classifies this behavior as short-term abnormal. Physical altercations involving two or more individuals with the potential for injury are categorized as fights [32]. Detection of these fights has been undertaken by several researchers to mitigate potential impacts [33][34][35]. Similar to the preceding category, fighting behavior falls under short-term abnormal behavior due to its detectability within a few frames. The high occurrence of accidents resulting from trespassing in restricted areas, such as railway lines, necessitates proactive measures [37]. Detecting human presence in restricted areas from a single frame allows for the identification of breaches in designated zones [36][38][39][40][41]. Thus, trespassing is classified as short-term abnormal behavior. Moving in the opposite direction, typically observed when an individual walks against the flow within a crowd, can be identified with only a small number of frames [42]. Therefore, this behavior is categorized as short-term abnormal.

Long-Term Abnormal Behaviors
Long-term abnormal human behavior refers to persistent patterns of unusual behavior observed over an extended period. Unlike short-term abnormal behavior, it requires prolonged observation to discern significant deviations from expected behavioral patterns. These behaviors may unfold gradually over time, necessitating continuous monitoring and analysis to understand their full impact. Examples of long-term abnormal behaviors include loitering, leaving bags unattended, and prolonged absence of movement in specific areas.
Loitering, where individuals aimlessly linger in crowded areas, can pose threats to public safety [48]. Detection of such behavior occurs when an individual follows others without any apparent purpose for an extended period [43][44][45][46][47]. Therefore, loitering is classified as long-term abnormal behavior. An unattended bag refers to a situation where the owner intentionally leaves it behind for a certain period [49]. If the bag contains potentially dangerous items, swift action must be taken [50][51][52][53]. The time required to detect an unattended bag places it in the category of long-term abnormal behavior. The lack of human movement detected by the camera can also signal abnormal behavior. This phenomenon is referred to as unusual inactivity or stationary movement detection. The absence of human movement in a specific area raises concerns, whether it involves a group of individuals [55,56] or an elderly person [54]. Detecting inactivity requires a longer time to reach a conclusive decision. Therefore, this category is classified as long-term abnormal behavior.

Prior Research on Abnormal Behavior Recognition
Deep learning offers advantages as it requires minimal hand engineering, especially with the increasing availability of computing power and data [81]. Techniques utilizing fully supervised learning frameworks in deep learning can achieve high accuracy but demand substantial amounts of data and computational resources [82,83]. Several studies employ convolutional neural network (CNN), long short-term memory (LSTM), and gated recurrent unit (GRU) architectures to analyze abnormal human behavior spatially and temporally [84][85][86][87][88][89][90][91][92][93][94]. However, a significant obstacle in deep learning is the limited availability of data, often referred to as data scarcity [95].
The advancement of abnormal human behavior detection is hindered by issues related to data scarcity [96]. Detecting abnormal human behavior that has not been previously defined in the training data, especially given the wide variety of behaviors, poses significant challenges [97]. Furthermore, many instances of abnormal human behavior are context-dependent, with behaviors considered abnormal in one setting being normal in another [98]. This phenomenon arises from the model's incapacity to capture intrinsic uncertainty and often leads to decreased efficiency in the event recognition phase, particularly due to data scarcity [99,100]. To address these challenges, an important research question emerges: how can models be effectively trained to detect abnormal human behavior with limited labeled data, considering the diversity and context-dependency of behaviors? To mitigate the impact of data scarcity, the model is expected to learn from unlabeled data by identifying relationships and patterns through unsupervised and partially supervised learning paradigms [101,102].
There have been several previous works surveying abnormal human behavior detection systems. Patrikar and Parate [60] provide a survey of image-based detection systems for abnormal behaviors in video surveillance. They also survey edge-computing-based abnormal detection and divide the explanations into two main parts: learning and modeling algorithms. However, their focus is mainly on the application of edge computing, lacking a thorough exploration in the context of machine learning. Myagmar-Ochir and Kim [101] survey video surveillance systems (VSS) for smart city applications, but their explanation of unsupervised learning methods is incomplete. Duong, Le, and Hoang [98] survey vision-based human activity recognition and describe popular databases commonly used. They also present data processing and feature engineering. However, their survey does not include quantitative metric-based comparisons between the results of each prior study. Choudhry et al. [59] comprehensively explain the challenges of machine learning techniques for VSS and divide them into three categories: supervised, semi-supervised, and unsupervised. However, their scope does not focus on image-based detection. The previous surveys rarely cover research published in 2023. Moreover, there is a need for surveys that discuss the scarcity of abnormal human behavior data, as well as challenges and future applications. Table 1 summarizes the previous surveys.

Datasets
Several datasets have been widely used by researchers to benchmark related research. This section addresses the question, "What are the datasets used by prior research in abnormal human behavior detection?" The University of California, San Diego (UCSD) anomaly dataset consists of 70 pieces of video footage captured from an elevated perspective to monitor pedestrian walkways [171]. Abnormal events captured in this dataset include the presence of non-pedestrian entities in the walkways and anomalous pedestrian motion patterns. The UCSD anomaly dataset comprises two sets of videos, Ped1 and Ped2. Ped1 contains footage of people walking towards and away from the camera, with various perspective distortions, along with humans identified as abnormalities. Ped2 includes scenes with pedestrian movement parallel to the camera plane from a top angle, where non-human objects are considered abnormal.
The ShanghaiTech (ST) Campus dataset comprises 13 scenes with complex lighting conditions and camera angles from all sides [172]. It contains 130 abnormal events and over 270,000 training frames across a total of 437 videos for training and testing. The University of Central Florida (UCF)-Crime dataset consists of 1900 videos totaling 128 h, covering 13 anomalies in real-world environments, including fighting, vandalism, and robbery [173]. The Avenue dataset includes 37 videos, with 16 training video clips and 21 testing video clips. Filmed on the Chinese University of Hong Kong (CUHK) campus, it comprises 30,652 frames, evenly split between training and testing, and features 14 unusual incidents such as people running, loitering, and throwing objects [174]. The University of Minnesota (UMN) dataset encompasses 11 different abnormal scenarios with 3 scenes indoors and outdoors, totaling 22 videos for training and testing [175]. The Performance Evaluation of Tracking and Surveillance (PETS) dataset, recorded at the Whiteknights Campus, University of Reading, UK, captures abnormal behaviors including people counting, density estimation, person tracking, flow analysis, and event recognition [176]. The Subway dataset features two videos totaling two hours, comprising exit gate and entry gate videos containing 209 and 150 frames, respectively. It includes 19 types of unusual events such as walking in the wrong direction, loitering, and wandering near exits [177]. The UBI-Fights dataset, generated by Universidade da Beira Interior in 2020, focuses specifically on fighting events. It consists of an 80-h video dataset labeled at the frame level, including 216 videos of fighting events and others depicting daily life [178].
The Live Videos (LV) dataset consists of 30 videos featuring various abnormal scenes, including 14 different abnormal events, with a total duration of 3.93 h [179]. The Surveillance Fight dataset consists of 300 videos, divided into fight and non-fight sequences taken from movies [180]. The Hockey Fight dataset consists of 1000 video clips from hockey games, manually labeled as fight or non-fight [181]. The Violent Flows dataset consists of 246 videos taken from YouTube and de-interlaced as audio video interleave (AVI) files [182]. The Traffic Anomaly Dataset (TAD) consists of 500 videos totaling 25 h, featuring abnormal actions such as vehicle accidents, illegal turns, illegal occupations, retrograde motion, pedestrians on the road, road spills, and more [169]. The Universidad Panamericana Fall (UP-Fall) dataset includes 11 activities, as well as five different types of human falls, such as falling forward, backward, and sideways using hands or knees [183]. The Atomic Visual Actions (AVA) dataset annotates 80 atomic visual actions for 437 video clips of human actions [184]. The Multiple Camera Fall (MCF) dataset was taken from eight cameras with different angles, capturing normal daily activities and simulated falls [185]. The University of Rzeszow Fall (UR-Fall) dataset contains 70 videos, categorized into 30 fall videos and 40 daily living videos [186]. The VOC2007 dataset includes 9963 images, consisting of 24,640 annotated objects such as humans, animals, vehicles, and more [187]. The Penn-Fudan dataset contains 170 images with 345 labeled pedestrians, with 96 images from the University of Pennsylvania and 74 from Fudan University [188]. The UCF-50 dataset consists of 50 actions with a minimum of 100 videos for each category, taken from YouTube [189]. Extending the UCF-50 dataset, the UCF-101 dataset contains 101 classes with a total of 13,320 clips [190].
The OpenImages dataset includes 600 object classes with a total of 3.68 million bounding boxes attached [191]. The Carnegie Mellon University (CMU) graphics lab dataset consists of 11 videos with a total of 2477 frames, 1268 of which depict abnormal actions [192]. The University of Texas (UT)-Interaction dataset contains videos of continuous human interactions, divided into six classes: shake-hands, point, hug, push, kick, and punch [193]. The Peliculas Movies (PEL) dataset includes 368 frames, of which 268 are fight frames taken from movies [194]. The Web Dataset (WED) consists of 1280 frames comprising 12 sequences of normal crowd scenes, such as walking and running, and 8 abnormal scenes, including escape panics, protesters clashing, and crowd fighting [195]. The Human Motion DataBase-51 (HMDB51) includes 51 action categories and contains, in total, around 7000 manually annotated clips from YouTube [196]. Kinetics-600 is a large human action dataset with 480,000 video clips categorized into 600 action classes [197].
The YouTube Action dataset includes 1640 videos, categorized into 11 classes collected from YouTube [198].
The unsupervised methods, utilizing reconstruction-based techniques like AE, VAE, and CAE, demonstrate robustness in learning from unlabeled data, achieving notable AUC scores such as 0.984 on Ped2 by Wang et al. in 2023 [108] and 0.988 on Ped1 by Ganokratanaa et al. in 2022 [140]. These approaches capitalize on identifying patterns and anomalies without extensive labeled data, effectively addressing the challenge of data scarcity. Partially supervised methods, including semi-supervised and weakly supervised approaches, also show promising results, with AUC scores such as 0.945 on Ped1 by Sikdar and Chowdhury in 2020 [150], leveraging a combination of labeled and unlabeled data. Among these, state-of-the-art models continue to advance, exemplified by recent studies achieving high AUC scores through innovative techniques in abnormal behavior detection tasks. Table 2 shows the composition and features of some popular abnormal human behavior datasets. Figure 4 visually illustrates sample abnormal behavior images from each dataset.


Deep Learning Techniques for Abnormal Human Behavior Detection

Unsupervised Approach
Unsupervised detection refers to techniques that identify patterns, anomalies, or structures in data without the need for labeled examples. In the context of abnormal human behavior detection, this approach is particularly valuable because obtaining labels for various abnormal behaviors is often challenging and inefficient [199]. Two popular methods within this category are reconstruction-based and generative techniques. Reconstruction-based detection models learn patterns from input images, while generative detection models attempt to generate an artificial image resembling what they have learned.
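To make the reconstruction principle concrete, the sketch below is a minimal illustration of ours, using PCA as a linear stand-in for the deep auto-encoders surveyed in this section and synthetic vectors in place of video frames. The model is fitted on normal samples only and scores test samples by reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: samples near a 2-D subspace of a 20-D space,
# standing in for features of normal-behavior frames.
basis = rng.normal(size=(2, 20))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 20))

# Linear "auto-encoder" fitted by PCA on normal data only:
# encoder = projection onto the top-2 principal directions,
# decoder = the transpose of that projection.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
comps = vt[:2]

def anomaly_score(x):
    """Reconstruction error: large when x lies off the learned normal subspace."""
    z = (x - mean) @ comps.T          # encode into the latent space
    x_hat = z @ comps + mean          # decode back to the input space
    return np.sum((x - x_hat) ** 2, axis=-1)

normal_test = rng.normal(size=(50, 2)) @ basis      # follows the normal pattern
abnormal_test = rng.normal(size=(50, 20))           # off the normal subspace

err_n = anomaly_score(normal_test)
err_a = anomaly_score(abnormal_test)
# Normal samples reconstruct almost perfectly; abnormal ones incur large
# errors, so thresholding the score separates the two populations.
```

In the deep models surveyed below, the linear encoder/decoder pair is replaced by convolutional or recurrent networks, but the scoring logic is the same: threshold the reconstruction error of a model trained on normal data alone.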

Reconstruction-Based Detection
Reconstruction-based methods model the normal data distribution on the principle that the model is trained using only normal data. Anomalous data are then assigned high reconstruction errors by the model [200]. During the inference phase, if a test image is abnormal, the model struggles to reconstruct it. Reconstruction-based detection methods include auto-encoders (AE), variational auto-encoders (VAE), and convolutional auto-encoders (CAE). Auto-encoders are neural networks that learn input data and attempt to reconstruct new images based on previously learned patterns. Generally, AEs consist of two structures: an encoder and a decoder. The objective is to minimize the reconstruction error, enabling the model to more accurately reconstruct images based on learned data. Table 3 summarizes the strengths and drawbacks of reconstruction-based methods for AHB detection with AUC scores on the Ped1, Ped2, and CUHK datasets.

Several recent studies have utilized auto-encoders for AHB detection. Wang et al. [108] and Sampath and Kumar [111] proposed spatio-temporal AEs, achieving AUC values over 0.98 for detecting abnormal behavior on the UCSD Ped1 and Ped2 datasets. However, the spatio-temporal AE is unable to fully utilize and understand the implicit video information, especially when using a single-modality camera. To overcome these drawbacks, as illustrated in Figure 5, Liu et al. [109] developed AHB detection using an object-centric scene inference network (AUC 0.917 on CUHK), while Li et al. [110] utilized skeleton features to avoid the manual specification of normal data, achieving an AUC of 0.883 on the CUHK dataset. Both methods present better results compared with Wang et al. [108], who reported an AUC score of 0.861 on the CUHK dataset. Unfortunately, while the skeleton features accurately identify normal walking, they may miss detecting some instances of abnormal pedestrian brawling. Therefore, Yan et al. [9] introduced clustering and scoring system approaches using AE to better distinguish abnormal human behaviors. Wang et al. [103] also introduced a self-supervised framework known as the abnormal event detection network, comprising a principal component analysis (PCA) network and kernel principal component analysis. The framework achieved an outstanding AUC score of 0.997 using the UMN dataset. However, the method still depends on certain hyperparameters. Additionally, foreground detection may inadvertently remove incorrect objects, resulting in false negative issues.

VAEs are often conflated with traditional auto-encoders, despite being distinct entities. These models diverge in their mathematical formulations and objectives. A VAE operates as a probabilistic generative model, requiring a neural network comprising an encoder and a decoder. The encoder adjusts the parameters of the variational distribution, while the decoder maps from the latent space to the input space. VAEs are integral components of probabilistic graphical models and variational Bayesian methods [199,200].
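The reconstruction-error principle described above can be sketched with a linear (PCA-based) auto-encoder trained only on normal data: a test sample whose reconstruction error exceeds a threshold calibrated on normal data is flagged as abnormal. The toy 2-D data, the single-component encoder, and the 99th-percentile threshold below are illustrative assumptions, not the pipeline of any cited work.

```python
import numpy as np

def fit_pca_autoencoder(X_normal, k):
    # Learn a linear "encoder/decoder" (top-k principal components) from normal data only.
    mu = X_normal.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_normal - mu, full_matrices=False)
    W = Vt[:k]  # encoder: x -> W @ (x - mu); decoder: z -> W.T @ z + mu
    return mu, W

def reconstruction_error(X, mu, W):
    Z = (X - mu) @ W.T     # encode
    X_hat = Z @ W + mu     # decode
    return np.square(X - X_hat).sum(axis=1)

rng = np.random.default_rng(0)
# Synthetic "normal" data: large variance along one axis, small along the other.
normal = rng.normal(0, 1, (500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
mu, W = fit_pca_autoencoder(normal, k=1)
tau = np.percentile(reconstruction_error(normal, mu, W), 99)  # threshold from normal data

anomaly = np.array([[0.0, 5.0]])  # deviates along the direction the model cannot reconstruct
print(reconstruction_error(anomaly, mu, W)[0] > tau)
```

A deep auto-encoder replaces the linear map with neural encoder/decoder networks, but the scoring rule is the same: high reconstruction error implies abnormality.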
Works on AHB detection using VAE were conducted by Wang et al. [112] to detect crowd scenes, which proved very challenging due to the complexity of the frames. To address this issue, Yan et al. [114] and Wang et al. [118] generated a probability score using a double-flow VAE to differentiate abnormal behavior. However, the AUC score was not as high as when using two separate VAEs, as illustrated in Figure 6. Hence, several studies also employ temporal schemes to predict AHB scenes [115,117], resulting in significant improvements in AUC scores up to 0.961 on the Ped2 dataset. This indicates that the use of temporal schemes can substantially enhance the performance of AHB detection compared to methods relying solely on static features. The most recent research on AHB detection using the VAE method was conducted by Liu et al. [120]. They proposed stochastic video normality networks to learn various patterns of normal events in temporal, spatial, and spatiotemporal dimensions. The concept involves encoding past frames into a posterior distribution, from which latent variables are sampled using a VAE to predict future frames. The AUC results of this network reach 0.984 using the Ped2 dataset and 0.907 using the CUHK dataset. However, the performance of this network relies on hyperparameter settings for optimal AHB detection. Therefore, Cho et al. [116] introduced an implicit two-path auto-encoder and distribution modeling of normal features based on a normalizing flow model in an unsupervised manner for AHB detection. The achieved AUC score is impressive, reaching 0.992 using the Ped2 dataset and 0.880 using the CUHK dataset, indicating high performance. However, distinguishing between normal and abnormal scenes becomes challenging due to the similarity in appearance and motion between pedestrians and walking patterns. Consequently, the VAE and normalizing flow model struggle to differentiate between normal and abnormal behaviors.
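The VAE machinery that distinguishes these models from plain auto-encoders can be illustrated in a few lines: the encoder outputs the parameters of a variational distribution, a latent variable is sampled via the reparameterization trick, and a KL term regularizes the latent space toward a standard normal prior. The stand-in encoder and constants below are hypothetical, not taken from any surveyed model.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(x):
    # Stand-in encoder: in a real VAE, mu and log_var are neural-network outputs.
    mu = 0.5 * x
    log_var = np.full_like(x, -1.0)
    return mu, log_var

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(q(z|x) || N(0, I)): the regularizer that distinguishes a VAE from a plain AE.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

x = np.array([1.0, -2.0])
mu, log_var = encode(x)
z = reparameterize(mu, log_var)
print(z.shape, kl_to_standard_normal(mu, log_var) >= 0.0)
```

Training minimizes the reconstruction loss plus this KL term (the negative ELBO); at test time, a high negative ELBO serves as the anomaly score.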
Basic auto-encoders (AE), including VAE, do not consider the two-dimensional structure of an image. Therefore, a solution is required from an unsupervised learning paradigm that can evenly distribute weights for each area in the image. The convolutional auto-encoder is designed to preserve the spatial locality of the input image, which is then passed on to the reconstruction stage. Subsequently, reconstruction is carried out based on a linear combination of image patches using latent code [201]. Max-pooling is performed to ensure filter selectivity as an activation function across overlapping subregions. This prevents reliance on any single weight generated by multiple areas in the image. During the reconstruction phase, the sparse latent code further reduces the average filter contribution to the decoding phase of each pixel, resulting in filters with high generalization [202].
Bahrami et al. [124] achieved an AUC score of 0.975 on the Ped2 dataset for frame-level detection using a spatiotemporal approach. However, the training time increases due to complex, larger spatiotemporal parameters. To overcome the challenge of preserving spatial information in the deep layers, Kommanduri and Ghorai [128] designed a bi-residual convolutional auto-encoder that is end-to-end trainable and introduces long-short projection skip connections. Additionally, Taghinezhad and Yazdi [129] proposed a novel multi-scale multi-path network architecture for AHB detection based on frame prediction. These two recent studies successfully achieved AUC values above 0.976 using the Ped2 dataset, as illustrated in Figure 7. However, further research is needed on visual similarity, occlusions, and noise to achieve significant improvements in refined abnormality scores.
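The frame-level AUC scores compared throughout this section have a simple probabilistic reading: the probability that a randomly chosen abnormal frame receives a higher anomaly score than a randomly chosen normal frame. A minimal sketch, with hypothetical per-frame scores and labels:

```python
import numpy as np

def auc(scores, labels):
    # Frame-level AUC: P(score of a random abnormal frame > score of a random normal frame),
    # counting ties as one half.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])  # hypothetical anomaly scores per frame
labels = np.array([1, 1, 0, 1, 0, 0])              # 1 = abnormal frame
print(round(auc(scores, labels), 3))  # 0.889
```

An AUC of 1.0 means every abnormal frame outscores every normal frame; 0.5 is chance level, which is why scores such as 0.975 or 0.992 on Ped2 indicate near-perfect ranking.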
In summary, while AE-based methods, such as those by Wang et al. [103] and Hu et al. [104], demonstrate solid performance on the Ped1 and Ped2 datasets, they often rely heavily on hyperparameters and struggle with false negatives and object removal. VAEs, such as those proposed by Cho et al. [116] and Huang et al. [117], show excellent results on Ped2 but are highly sensitive to hyperparameter settings and face challenges in distinguishing visually similar abnormal scenes. CAE models, exemplified by Chu et al. [121] and Duman and Erdem [122], effectively detect anomalies using spatiotemporal features but still fall short compared to fully supervised methods, especially in handling distant activities and complex scenes. Additionally, models across all types suffer from issues such as high computational costs, the need for extensive hyperparameter tuning, and difficulties in generalizing to different datasets. These findings underscore the necessity for further optimization to enhance model robustness, efficiency, and applicability to real-world scenarios.


Generative Detection
The artificial image is generated from a learned distribution pattern, and its similarity to the original image is assessed [203]. The difference between the original and fake images is used to detect whether abnormal human behavior is present in the captured frame. Since no labels are created, this approach remains within the unsupervised learning category. In generative adversarial networks (GANs), a random seed introduces some noise to the initial random image. Subsequently, the generator layers attempt to produce fake examples. The objective in this scenario is to generate the best normal image possible. Then, the real normal image serves as the second input to the discriminator layers. These layers aim to distinguish between the real normal image and the fake image generated by the generator. The weights of both the discriminator and generator models are updated using the backpropagation method. This process iterates until the maximum number of training epochs, as previously specified, is reached.
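The adversarial objective driving this loop can be sketched with a toy one-dimensional logistic discriminator: the discriminator is penalized for misclassifying real and generated samples, while the generator is penalized when its samples are easily recognized as fake. The stand-in discriminator, weights, and Gaussian "frames" below are illustrative assumptions, not the CNNs of the surveyed GAN detectors.

```python
import numpy as np

rng = np.random.default_rng(2)

def discriminator(x, w, b):
    # Logistic "real vs. fake" score; a stand-in for the CNN discriminator.
    return 1.0 / (1.0 + np.exp(-(x * w + b)))

def gan_losses(real, fake, w, b):
    d_real = discriminator(real, w, b)
    d_fake = discriminator(fake, w, b)
    d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))  # discriminator objective
    g_loss = -np.mean(np.log(d_fake))                          # non-saturating generator objective
    return d_loss, g_loss

real = rng.normal(4.0, 1.0, 64)  # "normal" data the generator should imitate
fake = rng.normal(0.0, 1.0, 64)  # early generator output from random seeds, still far off
d_loss, g_loss = gan_losses(real, fake, w=1.0, b=-2.0)
print(d_loss < g_loss)  # easy-to-spot fakes: low discriminator loss, high generator loss
```

Training alternates gradient steps on these two losses; at test time, the detectors above score a frame by how poorly it fits the learned normal distribution.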
The most recent research on AHB detection using GANs was conducted by Li et al. [148] and Huang et al. [147,204], achieving AUC scores above 0.968 using the Ped2 dataset. However, this recent research requires the computation of a large number of parameters. As the number of input frames increases, detection speed decreases, and there is difficulty in determining skip intervals for large foreground motion amplitudes in video anomaly detection. Ganokratanaa et al. [132,140] proposed a novel unsupervised spatiotemporal anomaly detection and localization method for surveillance videos using GANs. The AUC scores reached 0.996 on the UMN dataset, demonstrating near-perfect performance. However, the model may face difficulties in distinguishing similar abnormal events from normal patterns. Table 4 shows other works utilizing the generative detection approach with AUC scores on the Ped1, Ped2, CUHK, and ST datasets. When conducting comparisons, it is essential to ensure fairness by comparing prior research on the same dataset. As illustrated in Figure 8, results using the ST dataset tend to be lower than those from other datasets. Interestingly, Aslam et al. [143] proposed an end-to-end trainable two-stream attention-based approach that achieved an AUC score of 0.869 on the ST dataset and 0.894 on the CUHK dataset, the best results on these datasets compared to other studies. This is because, during the inference stage, only the reconstruction branch is considered for computing the regularity score, while the prediction branch is utilized for better feature learning through the GAN. These results highlight the need for further research combining generative and reconstruction-based detection to achieve more optimal outcomes.


Partially Supervised Approach
Most reconstructive or generative approaches solely utilize normal samples, potentially resulting in a high false positive rate, particularly in real-world scenarios [154]. Many individuals commonly associate model training using labeled data with supervised learning, while training with unlabeled data is often termed unsupervised learning. However, real-world situations often lack sufficient data for comprehensive training due to the high cost and time-consuming nature of full labeling [204].
Partially supervised learning occurs when both labeled and unlabeled data are available. Therefore, the question arises: how do partially supervised learning techniques leverage unlabeled data to improve model performance when trained on limited labeled data? Labeled data act as anchor points for the training and prediction phases with the unlabeled data [149]. In partially supervised detection, two main schemes emerge: semi-supervised detection and weakly supervised detection.
Table 5 provides a summary of partially supervised detection research. Semi-supervised approaches, exemplified by Sikdar and Chowdhury [150] and Wu et al. [137], achieve high AUC scores (up to 0.989 on Ped2) through adaptive training and re-learning schemes but encounter challenges with sparse datasets, local descriptor construction, and dependencies on baseline models. In contrast, weakly supervised methods, such as those by Ullah et al. [162] and Chen et al. [164], leverage weakly labeled data, achieving high performance (up to 0.984 on Ped2) across diverse environments. However, they face challenges such as high false alarm rates, occlusion issues, and significant computational demands, particularly with transformer-based models. Therefore, while semi-supervised methods excel in data-specific performance, weakly supervised techniques offer broader applicability at the cost of increased complexity.

Semi-supervised learning offers a method of learning the underlying structure of data using both labeled and unlabeled data [204]. It falls within the partially supervised detection approach, which is commonly encountered in real-world scenarios due to the limited availability of labeled data [207]. The semi-supervised detection scheme emphasizes augmenting limited labeled data with unlabeled data [208]. The model is initially trained with labeled data to understand underlying patterns. It then uses predictions on unlabeled data to create pseudo-labels. The model is subsequently retrained with this combination of labeled and pseudo-labeled data, improving generalization. Techniques such as consistency regularization and graph-based methods are also used to ensure the model produces consistent predictions and propagates label information from labeled data to unlabeled data [209].
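The train/pseudo-label/retrain cycle just described can be sketched with a deliberately simple nearest-centroid classifier; the toy 2-D clusters, the distance-based confidence, and the 20th-percentile filter are illustrative assumptions, not the models used in the cited studies.

```python
import numpy as np

def fit_centroids(X, y):
    # Nearest-centroid "model": one prototype per class.
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(X, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), -d.min(axis=1)  # predicted label and a crude confidence

rng = np.random.default_rng(3)
X_lab = np.array([[0.0, 0.0], [4.0, 4.0]])  # one labeled example per class
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

centroids = fit_centroids(X_lab, y_lab)        # 1) train on the labeled data
pseudo, conf = predict(X_unl, centroids)       # 2) pseudo-label the unlabeled data
keep = conf > np.percentile(conf, 20)          # 3) keep only confident pseudo-labels
centroids = fit_centroids(                     # 4) retrain on labeled + pseudo-labeled data
    np.vstack([X_lab, X_unl[keep]]),
    np.concatenate([y_lab, pseudo[keep]]))
print(predict(np.array([[3.8, 4.1]]), centroids)[0][0])
```

The same loop applies with a deep network in place of the centroids: only pseudo-labels above a confidence threshold are recycled, since noisy pseudo-labels degrade the retrained model.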
Sikdar and Chowdhury [150] introduced an adaptive training-less method for anomaly detection. The model identifies abnormal behavior without pre-training, dynamically adjusting certain model parameters during runtime. Achieving an AUC of 0.992 on the UMN dataset, the method has shown promising results. However, a slight performance lag was observed, attributed to the sparse nature of the dataset and challenges in constructing local descriptors. Singh et al. [151] proposed an algorithm for suspicious event detection based on direction and magnitude using a semi-supervised scheme, achieving an impressive AUC score of 0.999 on the UMN dataset, indicating near-perfect performance. However, the method's suitability for real-time applications is limited due to the time-consuming optical flow calculation required for each frame. Wu et al. [137] further enhanced this baseline approach with a semi-supervised re-learning scheme. They constructed a new training set by selectively extracting training instances from the original testing set, increasing the AUC score from 0.858 to 0.885 on the UCSD Ped1 dataset. Nevertheless, the model's performance remains closely tied to that of the baseline deep model, and occasional failure cases may still occur.
The research gaps in semi-supervised detection include developing methodologies that effectively handle sparse datasets and improve the accuracy of local descriptors. Exploring innovative semi-supervised and adaptive learning strategies could enhance the adaptability and robustness of anomaly detection models across different datasets and environments. Closing these gaps is essential for advancing anomaly detection systems in practical applications.

Weakly Supervised Detection
Weakly supervised learning aims to generate predictions with high information content [204]. Unlike the semi-supervised scheme, the weakly supervised scheme focuses on enhancing detection results with limited labeled data [208]. In weakly supervised detection, the model uses labels that are less precise, coarser, or noisier than fully supervised labels. Techniques such as multiple instance learning, where groups of instances are labeled rather than individual instances [210], and expectation-maximization algorithms, which estimate the most likely labels and optimize model parameters, are used [211]. Regularization and adjustment of the loss function help the model deal with noise in the labels. This scheme addresses the challenge of missing training data without requiring extensive object annotation [212].
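The multiple instance learning idea mentioned above is easy to state concretely for video: each video is a bag of segments carrying only a video-level label, a bag is scored by its highest-scoring segment, and a ranking loss pushes abnormal bags above normal ones. The per-segment scores and the unit margin below are hypothetical illustrations.

```python
import numpy as np

def bag_score(instance_scores):
    # Multiple instance learning: a bag (video) is scored by its most anomalous segment,
    # since only the bag-level label ("contains an anomaly somewhere") is known.
    return np.max(instance_scores)

# Hypothetical per-segment anomaly scores for one abnormal and one normal video.
abnormal_video = np.array([0.1, 0.2, 0.9, 0.3])  # only segment 2 contains the anomaly
normal_video = np.array([0.1, 0.2, 0.15, 0.1])

# The ranking objective pushes max(abnormal bag) above max(normal bag) by a margin of 1.
margin_loss = max(0.0, 1.0 - bag_score(abnormal_video) + bag_score(normal_video))
print(round(margin_loss, 2))  # 0.3
```

This is why video-level labels suffice: the max operator lets the gradient concentrate on the segments most likely to contain the anomaly, without any segment-level annotation.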
Recent works in weakly supervised AHB detection have utilized both video-level data [162-164,167] and image-level data [165,166]. These works have demonstrated promising results, achieving AUC scores above 0.940 using the ST dataset. Temporal features have been a primary focus in these works to better distinguish abnormal human behavior. However, drawbacks of video-level weakly supervised AHB detection include the increased computational resources required by transformers due to model parameter variation, limited interpretability of results, potential degradation of local discriminative representations in lengthy videos, and challenges in detecting anomalies in low-resolution videos. Additionally, image-level weakly supervised AHB detection faces several drawbacks, including high training time costs and a pseudo-label generator that lacks robustness.
Therefore, Ullah et al. [156] introduced a dual-stream CNN framework for detecting anomalous events in surveillance and non-surveillance environments. The first scheme employs a two-dimensional CNN as an auto-encoder for visual feature extraction and further utilizes temporal relations. Subsequently, three-dimensional features are extracted and integrated into two-dimensional spatiotemporal features for accurate detection. The achieved AUC scores are high, nearly 0.990 on the Violent Flow and Hockey Fight datasets, indicating excellent performance. As illustrated in Figure 9, this work also achieves the highest result compared to others, reaching 0.858 using UCF-Crime. However, some extremely complex video sequences were mispredicted, contributing to the failure cases of the proposed model. This finding suggests that combining reconstruction-based and weakly supervised approaches could be a promising avenue for future research to integrate temporal and spatiotemporal features.


Fully Supervised Approach
Fully supervised learning is a paradigm where the model is trained to produce accurate detection outputs based on input data [59]. Therefore, most research employing fully supervised learning schemes uses accuracy as the primary metric to assess model performance [213]. Typically, the model captures local features using CNN layers and then incorporates them into an LSTM layer to learn temporal relationships between features [93].
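The CNN-then-recurrent pattern can be sketched with a single hand-written recurrent cell (a GRU is shown, as the surveyed works use both LSTM and GRU): per-frame feature vectors, standing in for CNN outputs, are folded one by one into a hidden state that summarizes the clip. The dimensions and random weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, W):
    # One GRU step: gates decide how much temporal context to keep for each frame.
    z = sigmoid(W["z"] @ np.concatenate([h, x]))           # update gate
    r = sigmoid(W["r"] @ np.concatenate([h, x]))           # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h, x]))  # candidate state
    return (1 - z) * h + z * h_tilde

dim_h, dim_x = 8, 16
W = {k: rng.normal(0, 0.1, (dim_h, dim_h + dim_x)) for k in ("z", "r", "h")}

h = np.zeros(dim_h)
for frame_feature in rng.normal(0, 1, (30, dim_x)):  # stand-in for per-frame CNN features
    h = gru_step(h, frame_feature, W)
print(h.shape)  # final temporal summary, fed to a classification head
```

In the real systems, the final hidden state (or the sequence of states) is passed to a softmax layer that emits behavior-class probabilities per clip.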
Recent works in fully supervised AHB detection focus on specific tasks such as fall detection [84,85] and detecting suspicious activity at automated teller machines [94]. However, the effectiveness of the approach depends on factors such as image quality, camera position, and the presence of subjects. Moreover, fall detection encounters challenges in scenarios involving actions like crouching and sitting. Recent research addresses these challenges by combining CNN with LSTM and GRU to learn temporal features within video sequences [91,92]. Some frameworks also utilize the Kalman filter, dual-stream CNN, dual-attentional CNN (DA-CNN), and Bi-GRU. These studies use various datasets, including UP-Fall, AVA, UR-Fall, MCF, VOC2007, Penn-Fudan, OpenImages, CMU, UT-Interaction, PEL, Hockey Fight, WED, Ped1, Ped2, HMDB51, UCF-50, UCF-101, Kinetics-600, and YouTube Action. However, there are challenges in detecting motion on edge devices, and the model may generate non-zero probabilities for certain action classes. An outstanding result was achieved by Ahn et al. [89], who developed a vision-based factory safety monitoring system to detect human presence on assembly lines. They utilized YOLOv3 as a base model and employed the OpenImages dataset. The system achieves a precision of 0.999 and a recall of 0.964 across 24 detection classes. However, concerns were raised regarding detection quality due to lens distortion issues.
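For reference, the precision and recall figures reported in supervised detection work reduce to simple ratios over true positives, false positives, and false negatives. The counts below are hypothetical, chosen only to reproduce values of the same magnitude as those reported; they are not the actual counts behind Ahn et al.'s results.

```python
def precision_recall(tp, fp, fn):
    # Detection metrics commonly reported by fully supervised systems.
    precision = tp / (tp + fp)  # of all detections fired, how many were correct
    recall = tp / (tp + fn)     # of all true events, how many were detected
    return precision, recall

# Hypothetical counts for illustration only.
p, r = precision_recall(tp=964, fp=1, fn=36)
print(round(p, 3), round(r, 3))  # 0.999 0.964
```

High precision with lower recall, as here, means the system rarely raises false alarms but misses some true events, a trade-off tuned via the detection threshold.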
The research gaps in fully supervised detection encompass several challenges: effectiveness depends on factors such as image quality, camera position, and subject presence. Additionally, optimizing hyperparameters for optimal results proves challenging. Lens distortion issues contribute to reduced detection accuracy compared to state-of-the-art methods. Furthermore, higher computational requirements increase costs, while motion detection on edge devices may result in non-zero probabilities for certain action classes. Table 6 summarizes the strengths and drawbacks of fully supervised AHB detection research.

Summary: Advantages and Disadvantages
Offering a definitive answer to the question "Which abnormal human behavior detection technique is most suitable for a specific application?" may not always be practical. Therefore, this section explains the advantages and disadvantages of each deep learning technique. In the reconstruction-based approach, the auto-encoder method is commonly employed for image dimension reduction. This learning process is subsequently utilized to compute the loss function within the network, which is used to identify abnormal behavior within the input data. However, the distribution of these data presents a significant challenge when implementing models in heterogeneous environments. Additionally, the effectiveness of the auto-encoder method is contingent upon data quality. Table 7 summarizes the advantages and disadvantages of each deep learning technique.
Methods have been devised to address generalization issues arising from data distribution using VAEs. Techniques such as regularization and probabilistic formulation are incorporated into VAE methods, enabling their application in detecting abnormal behavior across more heterogeneous environments. However, VAEs may struggle to identify abnormal behavior occurring within specific time frames. Moreover, probabilistic calculations throughout the image can lead to small object sizes and slightly disrupted pixel localization. A specialized strategy is required to fully leverage the potential of VAEs in detecting abnormal behavior in surveillance videos.
Within the reconstruction-based approaches, convolutional auto-encoders are employed to maximize detection for each pixel in the image. CAEs offer benefits due to their generalization capabilities, enabling weight distribution across all areas of the input image. This feature allows the model to localize and identify abnormal human behaviors within specific regions. However, convolutional networks often require augmentation with other methods to optimize output. Therefore, techniques like max-pooling and deconvolutional layers from a supervised learning framework are used to assist in distributing weight dependencies for each pixel in the image.
Generative detection approaches excel in recognizing environments unseen during the learning phase. This advantage can be leveraged by refining the generator module to improve the quality of training images, mitigate noise in images, and distinguish between foreground and background objects to streamline detection targets. However, careful consideration is required regarding the model's priorities: whether to emphasize the model's adaptability to objects or to prioritize computational time and resources. This prioritization is crucial as it influences training paradigms and the balance between the generator and discriminator. By understanding the model's requirements, training objectives become more focused, enabling the mitigation of several drawbacks associated with generative detection approaches in identifying abnormal human behaviors.

In partially supervised detection, a semi-supervised detection scheme offers an alternative path to training detection models by concentrating on developing training data. By leveraging a small amount of labeled data, the focus shifts to instructing unlabeled data to serve as new references for subsequent model training. Through a pseudo-labeling scheme, additional labeled data can be generated with minimal supervision. This scheme finds diverse applications, particularly in adapting models to the latest data reflecting evolving real-world environments. Notably, it significantly reduces labeling time, minimizing human effort. However, it is crucial to acknowledge that if the reference data used for labeling are noisy or of poor quality, subsequent processes become more challenging. For endeavors aiming to augment unlabeled data and enhance dataset quality, the semi-supervised detection scheme is recommended.
Weakly supervised detection is another scheme of partially supervised detection. This scheme focuses on training the model to yield higher-quality information. The emphasis lies in utilizing a small amount of data while still producing an accurate model. Much research has adapted layers and weighting strategies to enable models to learn patterns from sparsely labeled datasets and generalize from them. There are various advantages, such as quicker detection of video sequences and abnormal human behavior. It is essential to note that hyperparameter tuning is crucial here. Recent advancements offer promising strategies to optimize model performance by tuning fewer hyperparameters, known as visual tuning [214]. Visual tuning holds significant potential for growth through the application of self-supervised learning, which enables models to learn from vast amounts of unlabeled data. This approach reduces the need for extensive labeled datasets and improves the model's ability to generalize across different abnormal behavior scenarios. Additionally, the input data must be of high quality. Weakly supervised detection is utilized when the focus and ultimate goal are on maximizing model output to make the model more informative and accurate.
Training a model to detect abnormal human behavior using fully supervised learning is highly beneficial if the behavior category to be detected has been determined. Detection using CNNs is feasible only for short-term detection, i.e., detecting abnormal behavior at the frame level. CNNs are less efficient for detecting behavior that requires a certain period to determine its abnormality, known as long-term abnormal behavior. Some research incorporates LSTM and GRU to assess behavior temporally [87,92,93]. Unfortunately, data on abnormal human behaviors are scarce and expensive. Therefore, the fully supervised scheme necessitates more human effort solely to determine normal and abnormal data in the model training phase. Additionally, this scheme consumes significant computing resources and time due to the diverse nature of abnormal human behaviors.

Open Research Issues
This section discusses open research issues with deep learning techniques for abnormal human behavior detection in surveillance videos. Table 8 outlines the open research issues for each deep learning approach.
For unsupervised detection, particularly within the reconstruction-based approach, managing the variability of the data distribution is a significant challenge, as it changes with the detection environment of the target object. Addressing this requires developing strategies to handle high levels of environmental variability in the data, potentially opening new research directions. Additionally, accurately detecting AHB within specific temporal frames remains difficult; developing coherent temporal models is therefore crucial to prioritize these temporal AHB detections effectively. The VAE method faces issues with pixel localization accuracy, necessitating alternative mechanisms to improve AHB detection precision. This challenge warrants further exploration as one of many open research questions aimed at creating a more sophisticated AHB detector. Concerning generative detection, addressing issues such as exploding gradients is of significant importance. This phenomenon causes abrupt changes in weight values during training, disrupting the learning process, particularly because model performance heavily relies on the training data [141]. It presents a significant challenge, as the model needs to generalize efficiently and effectively to new data. Establishing mechanisms that improve model learning while preserving the model's benefits is the core principle for overcoming these challenges. As the model processes more data, the demand for computing resources increases, raising the question of whether current resources can meet the demand for AHB detection. This research problem arises from the need to model and allocate adequate computing resources. Concerning complexity reduction, several questions arise. Close integration between the generator and discriminator is necessary, as their inability to work in unison leads to poor model performance [119]. Conversely, managing complexity is equally important to maintain the pace of the detection process.
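The reconstruction-based principle underlying these challenges can be shown compactly: a model fitted only to normal data reconstructs normal inputs well, so a high reconstruction error signals a potential anomaly. The sketch below uses PCA as a stand-in linear autoencoder on synthetic feature vectors; the data, subspace dimension, and threshold rule are all illustrative assumptions, not a specific method from the surveyed literature.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """PCA as a linear autoencoder: keep the top-k components of normal data."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]  # data mean and k x d encoding matrix

def reconstruction_error(x, mu, V):
    z = V @ (x - mu)        # encode into the normal subspace
    x_hat = mu + V.T @ z    # decode back to input space
    return float(np.linalg.norm(x - x_hat))

rng = np.random.default_rng(0)
# Synthetic "normal" features lie near a 3-D subspace of a 10-D space.
basis = rng.normal(size=(3, 10))
normal = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 10))

mu, V = fit_linear_autoencoder(normal, k=3)
# Calibrate the decision threshold on normal training data only.
threshold = max(reconstruction_error(x, mu, V) for x in normal)

anomaly = 5.0 * rng.normal(size=10)  # a sample far from the normal subspace
is_abnormal = reconstruction_error(anomaly, mu, V) > threshold
```

Deep reconstruction-based detectors replace the linear encoder/decoder with a convolutional AE or VAE, but the scoring rule is the same: error above a threshold calibrated on normal data flags the frame, which is exactly where environmental variability and pixel localization issues bite.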
The predominant challenge in partially supervised detection is limited interpretability. In this context, labels produced either automatically or manually may not always be accurate, casting doubt on the validity of the model's interpretation. Therefore, special attention should be given to developing new approaches that better predict abnormal human behavior. Noisy anchor data are also challenging, raising the research question of how to implement a robust preprocessing phase. Additionally, pixel-level AHB detection can become inaccurate due to suboptimal performance [166]: the extraction technique may fail to capture important information at the pixel level, omitting details vital for AHB detection. Hence, the classifier should employ an adequate approach to integrating the detected object's pixel information.
Another challenge associated with fully supervised detection is its requirement for high computational resources, which calls for effective lightweight models. Research on lightweight AHB detection brings many benefits, especially rapid inference, faster decision-making, and minimal computational consumption. However, detecting long-term AHB with fully supervised learning remains challenging. An open research issue along this path is therefore the use of LSTM and GRU networks to make long-term AHB detection feasible.

Conclusions
The detection of abnormal human behavior in video surveillance systems is a crucial task, yet datasets demonstrating such behavior are scarce and costly to collect. Hence, developing strategies that effectively utilize the available abnormal data alongside accurate models becomes imperative. To address this challenge, we present a comprehensive survey of deep learning techniques for abnormal human behavior detection in surveillance videos. This survey begins by defining abnormal human behaviors and categorizing detection techniques into three approaches: unsupervised, partially supervised, and fully supervised. Each approach is extensively described, including its strengths and drawbacks. Additionally, we conduct a comparative analysis of prior research findings on popular benchmarking datasets. In unsupervised detection, the reconstruction-based approach excels at reducing image dimensionality and localizing AHB relative to normal data. However, it struggles with environmental diversity and inaccurate timeframe detection, often due to pixel localization issues. Generative detection approaches are adept at identifying AHB in unfamiliar scenarios and addressing data shortages, yet they face challenges such as exploding gradients, high complexity, and significant computational demands. Partially supervised detection mitigates data scarcity by enhancing limited labeled data with minimal supervision. However, it grapples with noisy labeled data, limited interpretability, and suboptimal AHB detection at the pixel level. Fully supervised detection, while suitable for predefined AHB detection classes, is resource-intensive for labeling and less effective in long-term detection; its data scarcity also poses a trade-off between comprehensiveness and efficacy. Finally, we discuss several open research issues in AHB detection, including handling data with high environmental variation, optimizing temporal AHB detection, tackling exploding gradients, and reducing computational resource usage. Through investigation of these potential research issues, we aim to drive progress in this field, ultimately bringing greater benefits to video surveillance systems in the future.

Figure 1. Trend in the number of publications on deep learning for abnormal human behavior detection over the past five years (2019-2023).

Figure 2. Distribution of related published papers on abnormal human behavior detection by search engines.

Figure 3. The organizational structure of the survey.

Figure 4. Sample abnormal behavior images from each dataset listed in Table 2.

Figure 5. Reconstruction-based AHB detection results using AE on the CUHK dataset.

Figure 6. Reconstruction-based AHB detection results using VAE on the CUHK dataset.

Figure 7. Reconstruction-based AHB detection using CAE on the Ped2 and CUHK datasets.

Figure 9. Weakly supervised AHB detection results on the UCF-Crime and ST datasets.

Table 1. Summary of abnormal human behavior surveys in video surveillance.

Table 2. A summary of popular abnormal human behavior datasets.

Table 3. Strengths and Drawbacks of Reconstruction-based Methods.

Table 4. Strengths and Drawbacks of Generative Methods using GANs.

Table 5. Strengths and Drawbacks of Partially Supervised Approach.

Table 6. Strengths and Drawbacks of Fully Supervised Approach.

Table 8. Open research issues with deep learning techniques for AHB detection in surveillance videos.