Hybrid Classifiers for Spatio-Temporal Abnormal Behavior Detection, Tracking, and Recognition in Massive Hajj Crowds

Abstract: Individual abnormal behaviors vary depending on crowd sizes, contexts, and scenes. Challenges such as partial occlusions, blurring, a large number of abnormal behaviors, and camera viewing occur in large-scale crowds when detecting, tracking, and recognizing individuals with abnormalities. In this paper, our contribution is two-fold. First, we introduce an annotated and labeled large-scale crowd abnormal behavior Hajj dataset, HAJJv2. Second, we propose two methods of hybrid convolutional neural networks (CNNs) and random forests (RFs) to detect and recognize spatio-temporal abnormal behaviors in small and large-scale crowd videos. In small-scale crowd videos, a ResNet-50 pre-trained CNN model is fine-tuned to verify whether every frame is normal or abnormal in the spatial domain. If anomalous behaviors are observed, a motion-based individual detection method based on the magnitudes and orientations of Horn–Schunck optical flow is proposed to locate and track individuals with abnormal behaviors. A Kalman filter is employed in large-scale crowd videos to predict and track the detected individuals in the subsequent frames. Then, means and variances as statistical features are computed and fed to the RF classifier to classify individuals with abnormal behaviors in the temporal domain. In large-scale crowds, we fine-tune the ResNet-50 model using a YOLOv2 object detection technique to detect individuals with abnormal behaviors in the spatial domain. The proposed method achieves average areas under the curve (AUCs) of 99.76% and 93.71% on two public benchmark small-scale crowd datasets, UMN and UCSD, respectively, while the large-scale crowd method achieves 76.08% average AUC using the HAJJv2 dataset. Our method outperforms state-of-the-art methods using the small-scale crowd datasets with a margin of 1.66%, 6.06%, and 2.85% on UMN, UCSD Ped1, and UCSD Ped2, respectively. It also produces an acceptable result in large-scale crowds.


Introduction
Abnormal behavior detection in videos has received considerable attention. This research area has been widely examined in the past two decades due to its importance and challenging nature in the computer vision domain. Generally, abnormal behavior is described as an unusual act of an individual in an event, such as running, walking in the opposite direction, or jumping. Individual abnormal behaviors can be perceived differently in different contexts and scenes. Therefore, the definition of abnormal behaviors may vary from one place or scenario to another. Similarly, the density and the number of individuals in the crowd often vary significantly, which can result in small or large crowds according to the context of the scene. A small-scale crowd often contains tens of individuals gathering or moving in the same location, while a large-scale crowd contains hundreds or thousands of individuals in the same place. Therefore, the large-scale crowd scene may raise many challenges as a result of many individuals moving to or gathering in one location at the same time. In large-scale crowds, challenges such as partial and full occlusions, blurring, a large number of abnormal behaviors, and low scaling usually occur when detecting, tracking, and recognizing abnormal behaviors. As a result, detecting, tracking, and recognizing anomalous actions in large crowds is difficult, whereas performing comparable tasks in small crowds is easier.
To ensure safety in public places, many studies have tackled the problem of abnormality detection in crowd scenes. These studies have exploited a wide range of trajectory features [1][2][3][4][5][6], dense motion features [7][8][9][10], spatial-temporal features [11][12][13], or deep learning-based features and optimization techniques for anomaly recognition [14][15][16][17][18]. Most of the developed methods perform binary frame-level anomaly detection. Several studies considered locating anomalies in crowd surveillance videos [17][18][19][20][21][22], and less attention was paid to multi-class anomalies [23]. This study proposes a hybrid model that first identifies anomalies at the frame level and then locates and classifies crowd anomalies into one of multiple classes. Distinguishing between different types of abnormal behaviors (e.g., running and walking against the crowd) raises many challenges that are worth researching. Existing methods in the field are also often evaluated on datasets with low-to-moderate crowd density levels. In this research, we evaluate the proposed methods on benchmark datasets with both moderate and very high crowd density levels.
Hajj is an annual religious pilgrimage that takes place in Makkah, Saudi Arabia. It is considered a large-scale event because it regularly attracts over two million pilgrims from various countries and continents who congregate in one location. The diversity and cultural differences of pilgrims reduce our ability to understand their abnormal behaviors. However, we define, annotate, and label a set of abnormal behaviors based on the context of the Hajj. The definition of abnormal behaviors has been studied thoroughly in this research and is associated with the causes of potential obstacles or dangers to large-scale crowd flows. This analysis aims to help automate the detection, tracking, and recognition of abnormal behaviors in large-scale crowds using surveillance cameras to ensure pilgrims' safety in a smooth flow during Hajj. It also helps security authorities and decision-makers visualize and anticipate potential risks.
Our work is inspired by the power of convolutional neural networks (CNNs) and transfer learning in many computer vision tasks [24][25][26]. In addition to the success of CNNs, the work is also motivated by the success of random forests (RFs) in the classification of unstructured data [27]. The contributions of this work are summarized as follows:
• We introduce a manually annotated and labeled large-scale crowd abnormal behaviors dataset for Hajj, HAJJv2;
• We propose two methods of hybrid CNN and RF classifiers to detect, track, and recognize spatio-temporal abnormal behaviors in small-scale and large-scale crowd videos;
• We evaluate the first proposed method on two common benchmark small-scale crowd video datasets, UMN and UCSD, against the currently published methods. Then, we evaluate the second proposed method on the HAJJv2 dataset and compare it with the previously existing method.
The remainder of this paper is organized as follows. We provide a literature review for abnormal behavior detection and recognition in Section 2. In Section 3, we briefly describe the abnormal behavior HAJJv2 dataset. Then, we present our proposed methods to detect, track, and recognize spatio-temporal abnormal behaviors in small and large crowd videos in Section 4. Experimental implementation, results, and evaluation are provided in Section 5. Then, a discussion of the experimental evaluations, limitations, and challenges is provided in Section 6. Finally, we conclude our work and present some future directions in Section 7.

Related Work
Many methods have been proposed to detect abnormal behaviors in crowds in the past two decades. In this section, we review the most recent related work. Current abnormal behavior detection and recognition methods can be briefly overviewed at two crowd scales as follows:
• Small-scale crowds: Many recent studies have proposed and evaluated their methods on common public small-scale crowd benchmark datasets, including UMN and UCSD [10][28][29][30][31][32][33].
Piciarelli et al. [6] introduced a normal model by clustering the trajectories of moving objects for anomaly detection. Then, Mehran et al. [28] proposed an optical flow-based social force model to detect abnormal behaviors. A grid of particles was computed over the frames. Then, a bag-of-words method was applied to classify normal and abnormal behaviors. Following the work of Mehran et al. [28], Mahadevan et al. [29] applied learned mixtures of dynamic textures based on optical flow with salient location identification to detect abnormalities in the spatial domain. In the temporal domain, the learned mixtures of dynamic textures based on optical flow with negative log-likelihood were applied to detect abnormalities. Then, Cong et al. [32] applied a sparse reconstruction cost and a dictionary to measure normal and abnormal behaviors. After that, Zhang et al. [10] introduced a social attribute-aware force model. Using an online fusion algorithm, the social attribute-aware force maps are computed. Then, global abnormal events are detected with a bag-of-words representation and local abnormal events with an abnormal map. Later, Hasan et al. [30] learned semi-supervised spatio-temporal local hand-crafted features on a convolutional autoencoder to detect abnormal patterns. Histograms of oriented gradients and histograms of optical flows were used to extract the spatio-temporal features from raw video frames to feed the convolutional autoencoder for classification. Fradi et al. [13] applied local feature tracking to describe the movements of the crowd. They represented the crowd as an evolving graph. To analyze the crowd scene for an abnormal event, mid-level features are extracted from the graph. Colque et al. [7] used the histograms of magnitude, orientation, and entropy of the optical flow with the nearest neighbor search algorithm to detect the anomalies. In the training phase, they stored the histograms of each moving object as normal patterns. In the testing phase, they used the nearest neighbor search to find normal patterns to decide the abnormality. Coşar et al. [5] employed trajectory features and motion features. They used a bag-of-words representation to describe the actions. Then, they applied a clustering algorithm to perform abnormal detection in an unsupervised manner. Following [5,7,13,30], Tudor Ionescu et al. [31] used a sliding window technique to obtain partial video frames. The motion and appearance features were extracted from the frames and fed to a linear binary classifier to detect normality and abnormality in behaviors.
• Large-scale crowds: First, Solmaz et al. [34] introduced a linear approximation using a Jacobian matrix to identify large-scale crowd abnormal behaviors. An optical flow and particle advection were used. Then, Wang et al. [35] clustered crowd feature maps to analyze motion patterns. Following [35], Alqaysi and Sasi [36] applied motion history images and segmented optical flow to extract features. Then, a histogram was used for the motion direction and magnitude to detect crowd abnormal behaviors. Later, Zou et al. [37] detected large-scale crowd motions and trajectories using tracklet association. Similar to [37], Bera et al. [2] computed abnormal behavior trajectories using Bayesian learning techniques. Then, Pennisi et al. [38] segmented the extracted features to detect crowd abnormal behaviors. In recent years, Fradi et al. [13] and Wu et al. [39] worked on analyzing large-scale crowd properties using visual feature descriptors. Then, Luo et al. [42] proposed a large-scale crowd motion framework for abnormal behavior detection. However, they focused on a crowd level rather than an individual level in their study. Finally, Miao et al. [40,41] leveraged unmanned aerial vehicles, airborne LiDAR, and computer vision technologies to continuously analyze individual abnormal behaviors in large-scale crowds. However, existing methods are only confined to detecting and analyzing large-scale crowds as a mass. To the best of our knowledge, no existing works have detected individuals' abnormal behaviors in large-scale crowds, with the exception of the work presented in [33]. In comparison with the recent work in [33], the proposed methods do not require generating individual abnormal behavior images. Compared to the work in [33], the proposed method achieves better accuracy using the HAJJv1 dataset.

HAJJv2 Dataset
The HAJJv2 dataset is introduced due to the imbalance of training examples in each class and the absence of many annotations and labeling for individuals with abnormal behaviors in the HAJJv1 dataset [33]. The HAJJv2 dataset consists of nine manually collected videos from the annual Hajj religious event. All the videos are stored with an mp4 extension. The collected videos include individuals' abnormal behaviors in massive crowds. The videos are captured from different scenes and places in the wild during the Hajj event. Five videos are captured in the "Massaa" scene while other videos are captured in "Jamarat", "Arafat", and "Tawaf". These videos were recorded using high-resolution cameras. Then, the videos are cropped and split into training and testing sets. Each set contains nine short videos. Each video in the training set lasts for 25 s, while each video in the testing set lasts for 20 s.
In these videos, individuals' abnormal behaviors include standing, sitting, sleeping, running, moving in opposite or different crowd directions, and non-pedestrian entities such as cars and wheelchairs. These behaviors can be potentially dangerous for large-scale crowd flows. Figure 1 shows examples of these abnormal behaviors in the HAJJv2 dataset. The dataset statistics are provided in Table 1. As seen in the table, the dataset is imbalanced. The sitting class has the largest number of training and testing examples, while the running class has the smallest number of examples in the training and testing sets.
Individuals' anomalous behaviors in the videos are manually annotated and labeled for the training and testing sets. The annotations and labeling are stored in two CSV files. The training CSV file contains 170,772 annotated and labeled individuals' abnormal behaviors, while the testing CSV file contains 129,769 annotated and labeled individuals' abnormal behaviors. A comparison of existing public abnormal behavior datasets and the abnormal behavior HAJJv2 dataset is shown in Table 2. The videos and HAJJv2 dataset are publicly available for research and non-commercial use only. The videos and HAJJv2 annotations and labeling files can be downloaded from https://github.com/KAU-Smart-Crowd/HAJJv2_dataset, accessed on 10 February 2023.

Proposed Methods
In this section, we present the details of our proposed methods and algorithms. First, we show the individual abnormal behavior detection and recognition pipeline and algorithm for small-scale crowds. Then, similarly, a pipeline and an algorithm for detecting and recognizing abnormal behaviors are presented for large-scale crowds. Figure 2 shows the detection and recognition methodology pipelines for abnormal behaviors in small-scale and large-scale crowds. Figure 2a shows the pipeline for detecting and recognizing abnormal behaviors in small-scale crowd videos. The pipeline consists of spatial and temporal domains and hybrid classifiers. The spatial domain includes a pre-trained CNN classifier which focuses on classifying and detecting the abnormal behaviors generally on a frame level. On the other hand, the temporal domain includes the RF classifier that aims to classify and recognize individuals' behaviors at an object level within the frames.

Individual Abnormal Behavior Detection, Tracking, and Recognition in Small-Scale Crowds
Spatial domain: Training a specialized deep model from scratch requires a vast amount of data, a significant amount of resources, and a long training time. Transfer learning overcomes these challenges by utilizing pre-trained deep learning models that have been trained on a significant amount of labeled data and using the previously optimized weights to perform other predictive tasks. Due to the lack of sufficient abnormal training datasets, we utilize transfer learning in the spatial domain. We fine-tune the pre-trained model, ResNet-50 [26], to detect abnormalities at the frame level. Deeper networks are capable of extracting more complex feature patterns; however, they may cause a degradation problem, which degrades the detection performance. ResNet generally uses a deep residual learning framework to solve the degradation problem. This gives the advantage of using a deep neural network to extract the complex feature patterns in the spatial domain. Therefore, we use ResNet-50 in our experiments.
ResNet-50 consists of 49 convolutional layers as a feature extractor, followed by average pooling and a fully connected layer as a classifier. Fine-tuning a pre-trained model modifies its previous weights so that they work with a new classification task. The classification layers of the pre-trained model are replaced by a fully connected layer and an output layer whose number of outputs equals the number of classes. Anomaly detection is a binary classification problem, so the classifier is trained on normal and abnormal frames. We therefore fine-tune ResNet-50 as a binary classifier using video frames from small-scale crowd datasets. We replace the last layer with a fully connected layer that maps 2048 units into 128, followed by an output layer that maps 128 units into 2 units, representing the normal and abnormal probabilities. Since the ResNet-50 model processes inputs with a size of 224 × 224 × 3, we resize the frames to this input size. Training proceeds with feed-forward and back-propagation passes that update the weights until the error converges. Figure 2a shows the detected normal and abnormal frames resulting from the ResNet-50 classifier. The detected normal frames appear in green, while the detected abnormal frames appear in red.
Temporal domain: We use the optical flow to detect the anomalies at the pixel level. By analyzing the optical flow, we can observe crowd movements, instantaneous velocities, orientations, and magnitudes. These low-level features are used to recognize individuals' behaviors. After detecting the anomalies at the $i$-th frame ($F_i$), the optical flow of this frame ($O_i$) is computed using the Horn-Schunck method [44]. Then, magnitude ($m_i$) and orientation ($r_i$) features are automatically extracted from the optical flow.
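As a minimal sketch of the magnitude and orientation extraction described above, the following assumes that an optical flow estimator (Horn-Schunck in the paper, not implemented here) has already produced horizontal and vertical flow components for each pixel. The function name and the nested-list representation are illustrative assumptions, not part of the original implementation.

```python
import math


def flow_magnitude_orientation(u, v):
    """Return per-pixel magnitude and orientation (radians) of a flow field.

    u, v: 2-D lists holding the horizontal and vertical flow components.
    Both lists are assumed to have the same shape.
    """
    magnitude = [
        [math.hypot(ux, vy) for ux, vy in zip(row_u, row_v)]
        for row_u, row_v in zip(u, v)
    ]
    orientation = [
        [math.atan2(vy, ux) for ux, vy in zip(row_u, row_v)]
        for row_u, row_v in zip(u, v)
    ]
    return magnitude, orientation
```

The magnitude is the Euclidean norm of the flow vector and the orientation is its angle measured from the horizontal axis.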
In small-scale crowd videos, binary magnitude-based masks using a threshold ($T$) are created to localize and track individuals within the frame (i.e., $\{I_i^j\}_{j=1}^{z}$, where $I_i^j$ is the $j$-th individual in the $i$-th frame). Figure 3 shows the proposed binary magnitude-based mask using the small-scale crowd datasets. After extracting the magnitude and orientation features, the statistical features including means ($\mu$) and variances ($\sigma^2$) are computed. They are computed over the $p$ pixels of the area that represents individual $j$, for both the magnitudes ($m_j$) and the orientations ($r_j$). These statistical features are then fed to the RF classifier for training to classify and recognize individual temporal abnormal behaviors. Algorithm 1 shows the computational steps of the proposed method in the small-scale crowds. The algorithm runs in $O(n^2)$ time in the worst case.
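The masking and feature steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: individuals are taken to be 4-connected regions of the thresholded magnitude map, the variance is the population variance (division by $p$), and orientations are averaged as plain numbers. The function names are hypothetical.

```python
from collections import deque
from statistics import fmean, pvariance


def magnitude_mask(magnitude, threshold):
    """Binary mask: True where the flow magnitude exceeds the threshold T."""
    return [[m > threshold for m in row] for row in magnitude]


def connected_regions(mask):
    """Group True pixels into 4-connected regions (assumed to be individuals)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def region_features(region, magnitude, orientation):
    """Mean and population variance of magnitude and orientation in a region."""
    mags = [magnitude[y][x] for y, x in region]
    oris = [orientation[y][x] for y, x in region]
    return fmean(mags), pvariance(mags), fmean(oris), pvariance(oris)
```

Each region yields a four-value feature vector, which is the kind of input the text describes feeding to the RF classifier.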

Algorithm 1:
A hybrid CNN and RF algorithm for spatio-temporal small-scale crowd abnormal behavior detection, tracking, and recognition in a video.

Input: Video frame sequences $\{f_1, \ldots, f_n\}$.
Output: Abnormal behavior frames and objects.
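The control flow described in the prose of this subsection can be sketched as below. The body of Algorithm 1 is not reproduced here, so this is only a hedged outline: every callable is a hypothetical placeholder for a component named in the text (the frame classifier, the optical flow estimator, the mask-based localization, the statistical features, and the RF classifier), and skipping the first frame because it has no predecessor is an assumption.

```python
def detect_abnormal_behaviors(frames, is_abnormal_frame, optical_flow,
                              locate_individuals, extract_features,
                              classify_individual):
    """Outline of the small-scale pipeline with injected components.

    frames: sequence of video frames.
    is_abnormal_frame(frame) -> bool: spatial-domain frame classifier.
    optical_flow(previous, current): flow between consecutive frames.
    locate_individuals(flow): regions found from the magnitude mask.
    extract_features(flow, region): statistical features of one region.
    classify_individual(features): temporal-domain label for one region.
    """
    results = []
    previous = None
    for index, frame in enumerate(frames):
        # Only frames flagged as abnormal are analyzed at the object level.
        if previous is not None and is_abnormal_frame(frame):
            flow = optical_flow(previous, frame)
            for region in locate_individuals(flow):
                label = classify_individual(extract_features(flow, region))
                results.append((index, region, label))
        previous = frame
    return results
```

Passing the components as arguments keeps the outline independent of any particular CNN, flow estimator, or classifier library.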

Individual Abnormal Behavior Detection and Recognition in Large-Scale Crowds
Figure 2b depicts the pipeline for detecting and recognizing abnormal behaviors in large-scale crowded scenes in the HAJJv2 dataset. Similar to the method presented for small-scale crowded scenes, the method for large-scale crowded scenes also consists of hybrid classifiers. The first classifier is accountable for detecting the abnormal behavior frames in the spatial domain, while the second classifier is employed for recognizing the individuals' abnormal behaviors in the temporal domain.
Spatial domain: Similar to the previous detection method for small-scale crowds, we fine-tune another pre-trained CNN model, ResNet-50. We train ResNet-50 as a one-class classifier using all abnormal behaviors in the training set of the HAJJv2 dataset. The main goal of the ResNet-50 model is to detect only individuals' abnormal behaviors in frames, if they exist. To address the problem of overlapping white areas that arises when a large number of individuals and a large number of partial occlusions occur, the YOLOv2 [45] technique is employed to locate individuals with abnormal behaviors in the spatial domain. We use the back-propagation algorithm to update the weights of the ResNet-50 model until the error converges.
Temporal domain: After detecting all individuals with abnormal behaviors, we compute the Horn-Schunck optical flow ($O_i$) on the detected individuals. This approach is different from the previous method applied for small-scale crowds. We extract the $m_i$ and $r_i$ features from the resulting optical flow. To track individuals with abnormal behaviors, a Kalman filter [46] is used directly with the YOLOv2 detector. The Kalman filter predicts individuals' locations in the next frames. We avoid using our binary magnitude-based masks since they mainly produce overlapping contiguous groups of white pixels due to heavy partial occlusions in large-scale crowded scenes.
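To illustrate how a Kalman filter can predict a detected individual's location in the next frame, the sketch below filters the centre of a bounding box with one constant-velocity filter per image axis. The motion model, the noise variances, the initial covariance, and all names are assumptions chosen for illustration; the text does not specify the configuration used in the experiments.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one axis; state = (position, velocity)."""

    def __init__(self, position, process_var=0.01, measurement_var=1.0):
        self.pos, self.vel = float(position), 0.0
        # Covariance [[p00, p01], [p10, p11]]; velocity starts very uncertain.
        self.p00, self.p01, self.p10, self.p11 = 1.0, 0.0, 0.0, 100.0
        self.q, self.r = process_var, measurement_var

    def predict(self, dt=1.0):
        """Advance the state by dt and return the predicted position."""
        self.pos += self.vel * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, measured_position):
        """Correct the state with a position measurement (H = [1, 0])."""
        residual = measured_position - self.pos
        s = self.p00 + self.r  # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P <- (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


def predict_next_centre(centres):
    """Filter a sequence of (x, y) box centres and predict the next centre."""
    fx, fy = AxisKalman(centres[0][0]), AxisKalman(centres[0][1])
    for x, y in centres[1:]:
        fx.predict()
        fy.predict()
        fx.update(x)
        fy.update(y)
    return fx.predict(), fy.predict()
```

In a tracker, the predicted centre would be compared with the detector's boxes in the next frame to assign each detection to an existing track.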
After detecting individuals with abnormal behaviors and extracting their magnitude and orientation features, we compute the means ($\mu$) and variances ($\sigma^2$) for each individual, similar to the small-scale crowd method. Then, another RF classifier is used as a multi-class classifier to classify and recognize all individuals with abnormal behaviors.
Algorithm 2 shows the sequence of steps of our method for large-scale crowd videos. Similar to Algorithm 1, Algorithm 2 also runs in $O(n^2)$ time in the worst case.

Experiments
In this section, we first provide details of the implementation of the proposed methods. Second, we briefly describe the benchmark datasets used in the experiments. Third, we show the results of our abnormal behavior detection and recognition qualitative and quantitative experiments. Then, we compare the results with the existing and the most recent methods for abnormal behavior detection in small and large crowds.

Implementation
We implemented the proposed methods in MATLAB R2020b. The ResNet-50 and the RF models were trained on an NVIDIA Tesla V100S GPU server with 32 GB of RAM.

Datasets
In this section, we use the most common public benchmark datasets, namely the UMN [43], UCSD [29], HAJJv1 [33], and HAJJv2 datasets, to evaluate the proposed method on small-scale and large-scale crowds. HAJJv2 is described in Section 3. The UMN and UCSD datasets are briefly described as follows:
• The University of Minnesota (UMN) dataset. The UMN dataset is a small-scale crowd dataset that contains three different unrealistic scenes. Two scenes were recorded outdoors, while one was recorded indoors. Each UMN scene starts with a normal activity followed by an abnormal behavior. Walking, for example, is considered a normal activity, while running is an abnormal one. The frame resolution in UMN scenes is 320 × 240 pixels. The abnormal frames contain a short description at the top of the frames. Thus, we apply a pre-processing technique on the frames to remove the pixels that contain these descriptions to avoid biases in training and testing the model in the experiment. Figure 4 illustrates an example of UMN's frames. The training and testing splits are not explicitly specified. Moreover, the annotations are only available at the frame level. Due to these ambiguities, we use 70% of the frames for training and the rest for testing. To address the lack of pixel-level annotations, we consider all objects in the abnormal frames as abnormal individuals and all objects in the normal frames as normal individuals. The UMN scenes are evaluated separately since they have illumination and background variations.
• The University of California, San Diego (UCSD) dataset. The UCSD dataset is also a small-scale crowd dataset that consists of two subsets, namely Pedestrian 1 (Ped1) and Pedestrian 2 (Ped2). The dataset contains clips from independent static cameras viewing pedestrian walkways. It includes abnormal behaviors such as bicycles, cars, carts, skateboards, and wheelchairs as non-pedestrian objects. Ped1 contains 34 normal behavior videos and 16 abnormal behavior videos. Each video contains 200 frames with a resolution of 238 × 158 pixels. Ped2 contains 16 normal behavior videos and 12 abnormal behavior videos. The videos have different numbers of frames with a resolution of 360 × 240 pixels. Both temporal and spatial annotations are provided. Thus, the UCSD dataset is appropriate for locating and tracking abnormal objects in small-scale crowds. In our experiment, we use both normal and abnormal videos for training and testing. Figure 5 illustrates some examples from Ped1 and Ped2 frames.

Experimental Settings and Hyperparameters
For both small-scale and large-scale crowd experiments, different configurations are evaluated to determine the most effective approach, the details of which are described in the following.
Small-scale crowds: Different pre-trained CNN models such as ResNet-50, VGG-16, VGG-19, AlexNet, and SqueezeNet were examined in the spatial domain as part of the proposed method. According to our preliminary experiments, the ResNet-50 model achieves better performance on small-scale crowd datasets. We fine-tune the ResNet-50 model using the Adam optimizer [47] with a learning rate of 0.0001 for 15 epochs and 128 normal and abnormal frames per batch of each dataset.
Several optical flow estimation methods, namely the Lucas-Kanade derivative of Gaussian, Lucas-Kanade, Farneback, and Horn-Schunck methods, were examined. The Horn-Schunck [44] method is selected since it provides magnitude and orientation features to create a binary magnitude-based mask to localize and track individuals. The means and variances are computed using these features to classify the individuals' abnormal behaviors in small-scale crowds.
In addition to using different pre-trained CNN models and optical flow estimators, different classifiers are examined, such as the linear classifier, decision tree, and RF with cross-validation. The RF classifier is selected since it achieves better results compared to the other classifiers.
Large-scale crowds: The ResNet-50 and SqueezeNet pre-trained CNN models are used as the base network of the YOLOv2 object detection technique. We initialize the weights on ImageNet [48]. Then, we fine-tune the model with a stochastic gradient descent (SGD) [49] optimizer for 20 epochs with a learning rate of 0.001 and a mini-batch of eight frames.
Similar to the small-scale crowd experiment, the RF classifier is also accountable for the recognition of abnormal behavior in the temporal domain. Unlike in the small-scale crowd experiment, we train the RF classifier using only the statistical features of the detected individuals with abnormal behaviors.

Effectiveness Evaluation
We evaluate the proposed methods in both the spatial and temporal domains. In the spatial domain, we consider the accuracy, precision, recall, F1 score, and area under the curve (AUC) metrics as performance measures. The accuracy, precision, and recall metrics are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as follows:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}. \]
The receiver operating characteristic (ROC) curve [50] is a plot of the true positive rate (TPR) against the false positive rate (FPR). The ROC curve represents the change of TPR and FPR over different thresholds. Thus, it is a powerful tool to evaluate a classifier. However, it is difficult to compare different classifiers directly from their ROC curves. Therefore, the AUC is used to compute the area under the ROC curve and compare the performance of the classifiers. The AUC scores range from zero to one. Stronger classifiers have higher AUC scores.
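The metrics named above can be computed as in the following sketch. The F1 score is taken as the harmonic mean of precision and recall, which is its standard definition, and the AUC is obtained with the trapezoidal rule over the ROC points. The function names are illustrative, and `roc_auc` assumes that both classes are present in the labels.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, precision, recall, f1


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, starting at the origin
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every sample that shares this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    # Trapezoidal integration of TPR over FPR.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because one of the four positive-negative pairs is ranked in the wrong order.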
Small-scale crowds: Table 3 shows the frame detection quantitative results in the spatial domain using the small-scale crowd datasets. The ResNet-50 classifier achieves 99.76% and 93.71% of average AUCs among the scenes on the UMN and UCSD datasets, respectively. Table 4 shows the quantitative results in the temporal domain using the RF classifier on small-scale crowd datasets. Figure 6a,b illustrate the ROC curves of our experiments using the ResNet-50 and the RF classifiers, respectively, on UMN and UCSD datasets. In Figure 7, samples of our qualitative results using the UMN and UCSD datasets are shown. One can notice that the proposed method detects and recognizes the abnormality correctly in the datasets' testing samples.
Table 5 compares the proposed method with the existing methods in [28,32,33] and shows that it yields better results on the UMN dataset. Table 6 reports a performance comparison of the proposed method with the existing methods [28][29][30][31][33][51] using the UCSD dataset. The proposed method achieves higher AUCs on both the UCSD Ped1 and Ped2 scenes.

Large-scale crowds: We evaluate our method in the spatial and temporal domains on large-scale crowds using the HAJJv1 and HAJJv2 datasets. The spatial domain is evaluated using two criteria, track assignment and intersection over union (IOU). The results of track assignment are computed using Kalman filter assignment results for each detected object. Nevertheless, it is not important whether the detected pixels match most of the labeled pixels exactly. Thus, we use the IOU to evaluate the YOLOv2 detector. The IOU is a powerful evaluation metric to evaluate the detection of objects, as it is commonly used in the computer vision community. It computes the overlap ratio between the ground-truth and detected boxes. Then, using a 50% threshold of the overlapping boxes, we compute the TP, FP, and FN. The accuracy of the YOLOv2 is computed at the pixel level. A pixel is considered a TN if no TP, FP, and FN pixels are detected by the detector. It is observable that the accuracy cannot report the performance well. Since YOLOv2 does not detect any anomalies at most of the frames' pixels, and since the majority of the frames' pixels do not contain abnormal behaviors, the TN number is increased. This affects the accuracy computation and neglects the values of TP, FP, and FN. Tables 7 and 8 show our quantitative results in the spatial and temporal domains using the HAJJv2 dataset. Our fine-tuned pre-trained ResNet-50 model with the YOLOv2 detector achieves 91.77%, 92.47%, 27.99%, and 36.05% of average accuracy, average precision, average recall, and average F1 score, respectively, using the track assignment technique. Meanwhile, the same model achieves 92.72%, 31.68%, 16.49%, and 20.62% of average accuracy, average precision, average recall, and average F1 score, respectively, using the IOU technique. In the temporal domain, the RF classifier achieves 75.18% of AUC for abnormal behavior recognition using the HAJJv2 dataset. 
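The IOU criterion described above can be sketched as follows. The box format `(x, y, width, height)`, the greedy one-to-one matching, and the use of "greater than or equal to" at the 50% threshold are assumptions made for illustration; the text only states that a 50% overlap threshold is used to count TP, FP, and FN.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def count_detections(ground_truth, detections, threshold=0.5):
    """Return (TP, FP, FN) using greedy matching at the given IOU threshold."""
    unmatched = list(ground_truth)
    tp = 0
    for detection in detections:
        # Pick the still-unmatched ground-truth box with the largest overlap.
        best = max(unmatched, key=lambda gt: iou(detection, gt), default=None)
        if best is not None and iou(detection, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(detections) - tp  # detections without a matching label
    fn = len(unmatched)        # labels that no detection matched
    return tp, fp, fn
```

Precision, recall, and the F1 score then follow from these counts without needing the true-negative pixels that dominate the accuracy.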
Figure 8 shows the qualitative results for the proposed method on the HAJJv2 dataset. Figure 6c shows the ROC curves for the RF classifier. A quantitative comparison with the work in [33] using the HAJJv1 dataset is also provided in Table 9.

Table 6. Comparison of AUCs with existing methods on the UCSD dataset.
Method                Ped1      Ped2
[28]                  67.5%     55.6%
SF+MPPCA [29]         68.8%     61.3%
MDT [29]              81.8%     82.9%
Conv-AE [30]          75.0%     85.0%
Stacked RNN [52]      N/A       92.2%
Unmasking [31]        68.4%     82.2%
Alafif et al. [33]    82.81%    95.7%
Ours                  88.87%    98.55%

Discussion
The proposed methods detect and recognize individuals with abnormal behaviors in both small-scale and large-scale crowd videos. The results show that the small-scale crowd method achieves strong performance in comparison with the state-of-the-art techniques. Although the small-scale method outperforms other existing techniques, its performance on the UCSD Ped1 dataset is noticeably lower than on the other small-scale scenes. Several factors contributed to this, including the low resolution of the frames, the camera viewing angle, the shadows cast by trees, and the low illumination. In the large-scale crowds, we still have not achieved an excellent performance using the HAJJv2 dataset since the videos in the dataset are very challenging to analyze. The challenges arise from far-away camera views as well as heavy partial and full occlusions with a significant number of individuals. Figure 8b,c shows some of the challenges in the Tawaf and Jamarat scenes, which are considered the hardest scenes for classifying individuals with abnormal behaviors. Because these scenes contain a large number of individuals moving in one spot with heavy partial occlusions and far camera views, much more human attention and focus were required when annotating and labeling the abnormal behaviors. On the other hand, the easiest scenes for the annotators and labelers are the Masaa scenes, since these videos are captured from a close camera view and have a moderate number of partial occlusions. These factors therefore affect the performance of the abnormal behavior detection and recognition classifiers. Much more future work is required to better detect and recognize individuals with abnormal behaviors in large-scale and massive crowds.

Conclusions
In this research work, we first introduced the annotated and labeled large-scale crowd abnormal behavior dataset, HAJJv2. Second, we proposed two methods of hybrid CNNs and RFs to detect and recognize spatio-temporal abnormal behaviors in small-scale and large-scale crowd videos. In the small-scale crowd videos, a ResNet-50 pre-trained CNN model was fine-tuned to verify every frame, determining whether it is normal or abnormal in the spatial domain. If abnormal behaviors were found, a motion-based individual detection using the magnitude and orientation features of Horn-Schunck optical flow was employed to create a binary magnitude-based mask to localize and track individuals with abnormal behaviors. In large-scale crowd videos, a Kalman filter was employed to predict and track the detected individuals in the next frame. Then, means and variances as statistical features were computed and fed to the RF classifier to classify individuals with abnormal behaviors in the temporal domain. In the large-scale crowd videos, we fine-tuned the ResNet-50 model using the YOLOv2 object detection technique to detect individuals with abnormal behaviors in the spatial domain. The proposed method in a small-scale crowd achieved 99.76% and 93.71% average AUCs on the UMN and UCSD datasets, respectively, while the method in a large-scale crowd achieved 76.08% average AUC on the HAJJv2 dataset. Our method outperformed state-of-the-art methods using the small-scale crowd datasets with a margin of 1.67%, 6.06%, and 2.85% on the UMN, UCSD Ped1, and UCSD Ped2 datasets, respectively. It also achieved a satisfactory result for large crowds.
Still, a significant amount of work is needed to increase the effectiveness of abnormal behavior detection and recognition in large-scale crowded scenes due to their challenges. The majority of current research only uses small-scale crowded scenes in which abnormal behaviors can be easily extracted and classified. In the future, our work will be more focused on large-scale crowds. We will incorporate an attention mechanism and fusion strategies to enhance the performance. This work can potentially help researchers study and apply it in different contexts of crowded scenes, such as in airports, stadiums, and marathons. It can also be used in the manufacturing industry to inspect and detect abnormal behaviors of defective manufactured goods and products on a production line [53]. Examples and features of the products' unusual behaviors are required to be collected, extracted, and learned by a classifier to achieve high performance.