Bus Violence: An Open Benchmark for Video Violence Detection on Public Transport

The automatic detection of violent actions in public places through video analysis is difficult because the employed Artificial Intelligence-based techniques often suffer from generalization problems. Indeed, these algorithms hinge on large quantities of annotated data and usually experience a drastic drop in performance when used in scenarios never seen during the supervised learning phase. In this paper, we introduce and publicly release the Bus Violence benchmark, the first large-scale collection of video clips for violence detection on public transport, where several actors simulated violent actions inside a moving bus under changing conditions, such as background and lighting. Moreover, we conduct a performance analysis of several state-of-the-art video violence detectors, pre-trained on general violence detection databases, on this newly established use case. The moderate performances they achieve reveal the difficulties these popular methods have in generalizing, indicating the need for this new collection of labeled data, which is beneficial for specializing them in this new scenario.


Introduction
The ubiquity of video surveillance cameras in modern cities and the significant growth of Artificial Intelligence (AI) provide new opportunities for developing functional smart Computer Vision-based applications and services for citizens, primarily based on deep learning solutions. Indeed, on the one hand, we are witnessing an increasing demand for video surveillance systems in public places to ensure security in different urban areas, such as streets, banks, or railway stations. On the other hand, it has become impossible or too expensive to manually monitor this massive amount of video data in real time: problems such as a lack of personnel and slow response arise, leading to the strong demand for automated systems.
In this context, many smart applications, ranging from crowd counting [1,2] and people tracking [3,4] to pedestrian detection [5,6], re-identification [7], or even facial reconstruction [8], have been proposed and are nowadays widely employed worldwide, helping to prevent many criminal activities by exploiting AI systems that automatically analyze this deluge of visual data, extracting relevant information. In this work, we focus on the specific task of violence detection in videos, a subset of human action recognition that aims to detect violent behaviors in video data. Although this task is crucial to investigate the harmful abnormal contents from the vast amounts of surveillance video data, it is relatively unexplored compared to common action recognition.
One of the potential places in which an automatic violence detection system should be deployed is public transport, such as buses, trains, etc. However, evaluating existing approaches (or creating new ones) in this scenario is difficult due to the lack of labeled data. Although some annotated datasets for video violence detection in general contexts already exist, the same cannot be said for public transport environments. To fill this gap, in this work, we introduce a benchmark specifically designed for this scenario. We collected and publicly released [9] a large-scale dataset gathered from multiple cameras located inside a moving bus, where several people simulated violent actions, such as stealing an object from another person, fighting between passengers, etc. Our dataset, named Bus Violence, contains 1400 video clips manually annotated as having (or not) violent scenes. To the best of our knowledge, it is the first dataset entirely located on public transport and is one of the biggest benchmarks for video violence detection in the literature. The main difference compared to the other existing databases lies in the dynamic background: the violent actions are recorded while the bus is moving, which results in continuously changing illumination (in contrast to the static background of other datasets), making violence detection much more challenging.
In this paper, we first introduce the dataset and describe the data collection and annotation processes. Then, we present an in-depth experimental analysis of the performance of several state-of-the-art video violence detectors in this newly established scenario, serving as baselines. Specifically, we employ our Bus Violence dataset as a testing ground for evaluating the generalization capabilities of some of the most popular deep learning-based architectures suitable for video violence detection, pre-trained over the general violence detection databases present in the literature. Indeed, the Domain Shift problem, i.e., the domain gap between the train and the test data distributions, is one of the most critical concerns affecting deep learning techniques, and it has become paramount to measure the performance of these algorithms against scenarios never seen during the supervised learning phase. We hope this benchmark and the obtained results may become a reference point for the scientific community concerning violence detection in videos captured from public transport.
Summarizing, the contributions of this paper are three-fold:
• We introduce and publicly release [9] the Bus Violence dataset, a new collection of data for video violence detection on public transport;
• We test the generalization capabilities over this newly established scenario by employing some state-of-the-art video violence detectors pre-trained over existing general-purpose violence detection data;
• We demonstrate through extensive experimentation that the probed architectures struggle to generalize to this very specific yet critical real-world scenario, suggesting that this new collection of labeled data could be beneficial to foster research toward more generalizable deep learning methods able to also deal with very specific situations.
The rest of the paper is structured as follows. Section 2 reviews the related work on the existing datasets and methods for video violence detection. Section 3 describes the Bus Violence dataset. The performance analysis of several popular video violence detection techniques on this newly introduced benchmark is presented in Section 4. Finally, we conclude the paper with Section 5, suggesting some insights on future directions. The evaluation code and all other resources for reproducing the results are available at https://ciampluca.github.io/bus_violence_dataset/ (accessed on 20 September 2022).

Related Work
Several annotated datasets have been released in the last few years to support the supervised learning of modern video human action detectors based on deep neural networks. One of the biggest is the Kinetics 400/600/700 family [10-12], named after the number of covered human action classes, which include person-person interactions and single-person behaviors. This benchmark consists of about 650,000 high-quality clips lasting around 10 s each. Other options are HMDB51 [13], which consists of nearly 7000 videos covering 51 action classes, and UCF-101 [14], made up of 101 action classes over 13k clips and 27 h of video data. In contrast, datasets containing only abnormal actions (such as fights, robberies, or shootings) were introduced with the UCF-Crime benchmark [15], a large-scale dataset of 1900 real-world surveillance videos for anomaly detection.
However, in the literature, there are only a few benchmarks suitable for the video violence detection task, which consists of binary classifying clips as containing (or not) any actions considered to be violent. In [16], the authors introduced two video benchmarks for violence detection, namely the Hockey Fight and the Movies Fight datasets. The former comprises 1000 fight and non-fight clips captured from ice hockey games; its main drawback is the lack of diversity, because all the videos are captured in a single scene. The latter consists of 200 clips extracted from short movies, a number that is insufficient nowadays. Another dataset, named Violent-Flows, was presented in [17]. It consists of about 250 video clips of violent/non-violent behaviors in general contexts; its main peculiarity is its overcrowded scenes, although the image quality is low. Moreover, in [18], the NTU CCTV-Fights dataset was introduced, which covers 1000 videos of real-world fights coming from CCTV or mobile cameras.
More recently, the authors of [19,20] proposed the AIRTLab dataset, a small collection of 350 video clips labeled as "non-violent" and "violent," where the non-violent actions include behaviors such as hugs and claps that can cause false positives in the violence detection task. Furthermore, the Surveillance Camera Fight dataset was presented in [21]. It consists of 300 videos in total, 150 of which depict fight sequences and 150 non-fight scenes, recorded from several surveillance cameras located in public spaces. Moreover, the RWF-2000 [22] and the Real-Life Violence Situations [23] datasets consist of videos gathered from public surveillance cameras. In both collections, the authors gathered 2000 video clips: half of them include violent behaviors, while the others depict non-violent activities. All these benchmarks share the characteristic of having a still background because the clips are captured from fixed surveillance cameras. We summarize the statistics of all the above-described databases in Table 1. Table 1. Summary of the most popular existing datasets in the literature. We report the task for which they are used, together with the number of classes and videos that characterize them.

The Bus Violence Dataset
Our Bus Violence dataset [9] aims to overcome the lack of significant public datasets for human violence detection on public transport, such as buses or trains. Previously published benchmarks mainly present actions recorded in stable conditions by urban surveillance cameras mounted in fixed positions, e.g., on buildings, street lamps, etc. In contrast, recordings on public transport vary in several ways: (1) the background outside the windows changes constantly due to the vehicle's movement, (2) the movement itself is dynamic and can be slow or fast, and (3) there are many illumination changes due to different weather conditions and the position of the vehicle. For these reasons, the proposed Bus Violence benchmark consists of data recorded in dynamic conditions (during general bus movement). In the following, we detail the data collection and curation processes.

Data Collection
The videos were acquired in a three-hour window during the day, during which the bus kept driving and stopping around a closed depot. The participants entered and exited the bus while performing predefined actions. Specifically, the unwanted situations (treated as violent actions) comprised fights between passengers, kicking and tearing pieces of equipment, and tearing out or stealing an object from another person (robbery). An important aspect is the diversity of people: ten actors took part in the recordings and changed their clothes at different times to ensure a reliable variety of situations. In addition, thanks to the conditions in the closed depot, it was possible to obtain different lighting conditions, for example, driving in the sun, parking in a very shaded place, etc.
The test system recorded videos from three cameras at 25 FPS in .mp4 format (H.264). The recording system, which we installed manually, comprised two cameras located in the corners of the bus (with resolutions of 960 × 540 and 352 × 288 px, respectively) and one fisheye camera in the middle (1280 × 960 px). In total, we recorded three hours of video: one hour dedicated to actions considered violent and two hours to non-violent situations.

Data Curation
After the acquisition, the collected videos were manually checked and split. Specifically, we divided all the videos into shorter clips, ranging from 16 to a maximum of 48 frames, each capturing exactly one action (either violent or non-violent). This served to avoid single shots containing both violent and non-violent actions, which may be confusing for video-level violence detection models. Then, these resulting videos were filtered and annotated. In particular, the ones not containing a violent action were classified as non-violent situations; in these clips, passengers were just sitting, standing, or walking inside the bus. In more detail, we adopted a two-stage manual labeling procedure. In the first stage, three human annotators performed a preliminary classification of the videos into the two classes (violence/no violence). Then, in the second stage, two additional independent experts conducted further analysis, filtering out wrongly labeled samples. To obtain more reliable labels, we decided not to leverage automatic labeling tools, which would have required further manual verification anyway.
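The clip-splitting step described above can be sketched as follows; this is a minimal illustration under the stated 16-48 frame bounds, and the function name and segment handling are our own assumptions, not the authors' actual tooling.

```python
# Illustrative sketch: cut one annotated action segment (a list of frames)
# into clips of at most MAX_LEN frames, dropping leftovers shorter than
# MIN_LEN, so each clip covers a single violent or non-violent action.
MIN_LEN, MAX_LEN = 16, 48  # frame bounds used by the dataset

def split_segment(frames, min_len=MIN_LEN, max_len=MAX_LEN):
    clips = []
    for start in range(0, len(frames), max_len):
        clip = frames[start:start + max_len]
        if len(clip) >= min_len:  # discard too-short remainders
            clips.append(clip)
    return clips
```

For example, a 100-frame segment would yield two 48-frame clips, with the 4-frame remainder discarded.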
After the above-described operations, the non-violence class contained more videos than the violence class. Therefore, we undersampled the non-violence samples by randomly discarding videos to perfectly balance the dataset. In the end, the final curated dataset contains 1400 videos, evenly divided into the two classes. In each class, we obtained almost the same number of videos for each of the three different resolutions. Specifically, we obtained 212 violence and 240 non-violence clips for the 1280 × 960 px resolution, 222 violence and 210 non-violence for the 960 × 540 px resolution, and 266 violence and 250 non-violence for the 352 × 288 px resolution. We placed them in two separate folders, each containing 700 .mp4 video files encoded in the H.264 format. We report the final statistics of the resulting dataset in Table 2.
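The balancing step can be sketched as a simple random undersampling over two lists of clip identifiers; the function below is illustrative and assumes a fixed seed for reproducibility, which the paper does not specify.

```python
import random

def balance_by_undersampling(violent, non_violent, seed=0):
    """Randomly discard samples from the larger class so that both
    classes end up with the same number of clips."""
    rng = random.Random(seed)
    n = min(len(violent), len(non_violent))
    return rng.sample(violent, n), rng.sample(non_violent, n)
```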
In Figures 1 and 2, we show some samples from the final curated dataset of the violence and non-violence classes, respectively.

Performance Analysis
In this section, we evaluate several deep learning-based video violence detectors present in the literature on our Bus Violence benchmark. Following the primary use case for this dataset explained in Section 1, we employ it as a test benchmark (although in this work we exploited the whole dataset as a test benchmark, in [9], we provide training and test splits for researchers interested in also using our data for training purposes) to understand how well the considered methods, pre-trained over existing general violence detection datasets, can generalize to this very specific yet challenging scenario.

Considered Methods
We selected some of the most popular methods coming from human action recognition, adapting them to our task, and some of the most representative techniques specific to video violence detection. We briefly summarize them below. We refer the reader to the papers describing the specific architectures for more details.
Human action recognition methods aim to classify videos into several classes, relying on the human actions that occur in them. Because actions can be formulated as spatiotemporal objects, many architectures that extend 2D image models to the spatiotemporal domain have been introduced in the literature. Here, we considered the ResNet 3D network [24], which handles both spatial and temporal dimensions using 3DConv layers [25], and the ResNet 2+1D architecture [24], which instead decomposes the convolutions into separate 2D spatial and 1D temporal filters [26]. Furthermore, we took into account SlowFast [27], a two-pathway model where the first pathway is designed to capture the semantic information that can be given by images or a few sparse frames operating at low frame rates, while the other is responsible for capturing rapidly changing motion by operating at a fast refreshing speed. Finally, we exploited the Video Swin Transformer [28], a model that relies on the recently introduced Transformer attention modules for processing image feature maps. Specifically, it extends the efficient shifted-window Transformers proposed for image processing [29] to the temporal axis, obtaining a good efficiency-effectiveness trade-off.
On the other hand, video violence detection methods aim at binary classifying videos to predict whether they contain (or not) any actions considered to be violent. In this work, we exploited the architecture proposed in [30], consisting of a series of convolutional layers for spatial feature extraction, followed by a Convolutional Long Short-Term Memory (ConvLSTM) [31] for encoding the frame-level changes. Furthermore, we also considered the network in [32], a variant of [30], where a spatiotemporal encoder built on a standard convolutional backbone for feature extraction is combined with the Bidirectional Convolutional LSTM (BiConvLSTM) architecture for extracting the long-term movement information present in the clips.
Although most of these techniques employ the raw RGB video stream as the input, we also probed these architectures by feeding them with the so-called frame-difference video stream, i.e., the difference between adjacent frames. Frame differences serve as an efficient alternative to the computationally expensive optical flow. Several previous works [30,32,33] have shown them to be effective, as they encourage the model to encode the temporal changes between adjacent frames, improving the capture of motion information.
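The frame-difference stream can be computed with a simple temporal difference over the clip tensor; this NumPy sketch is illustrative and assumes a (T, H, W, C) frame layout.

```python
import numpy as np

def frame_difference(clip):
    """clip: (T, H, W, C) array of RGB frames.
    Returns the (T-1, H, W, C) stream of adjacent-frame differences,
    a cheap proxy for optical flow that highlights moving regions."""
    clip = clip.astype(np.float32)  # avoid uint8 wrap-around on subtraction
    return clip[1:] - clip[:-1]
```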

Experimental Setting
We exploited three general violence detection datasets to train the above methods: Surveillance Camera Fight [21], Real-Life Violence Situations [23], and RWF-2000 [22], already mentioned in Section 2 and summarized in Table 1. Surveillance Camera Fight contains 300 videos, while both Real-Life Violence Situations and RWF-2000 contain 2000 videos. All these datasets are perfectly balanced with respect to the number of violent and non-violent shots. The scenes they capture, recorded from fixed security cameras, cover very heterogeneous, everyday-life violent and non-violent actions. Therefore, they are the best candidate datasets available in the literature for training deep neural networks to recognize general violent actions. Other widely used datasets, such as Hockey Fight [16] or Movies Fight [16], do not contain sufficiently diverse violence scenarios transferable to public transport settings, and we therefore excluded them from our analysis.
Concerning the action recognition models, we replaced the final classification head with a binary classification layer, outputting the probability that the given video contains (or does not contain) violent actions. To obtain a fair comparison among all the considered methods, we employed their original implementations in PyTorch if present, and we re-implemented them otherwise. Moreover, when available, we started from the models pre-trained on Kinetics-400, the common dataset used for training general action recognition models.
Following previous works, we used Accuracy to measure the performance of the considered methods, defined as Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively. For a more in-depth comparison of the obtained results, we also considered as metrics the F1-score, the false alarm rate, and the missing alarm rate, defined as F1 = 2 · (Precision · Recall) / (Precision + Recall), false alarm = FP / (FP + TN), and missing alarm = FN / (FN + TP), where Precision and Recall are defined as TP / (TP + FP) and TP / (TP + FN), respectively. Finally, to also account for the probabilities of the detections, we employed the Area Under the Receiver Operating Characteristic curve (ROC AUC), computed as the area under the curve plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings, where TPR = Recall = TP / (TP + FN) and FPR = FP / (TN + FP). To obtain reliable statistics on the final metrics, we employed the following evaluation protocol. For each of the three considered training datasets, we randomly varied the training and validation subsets five times, picking the best model in terms of accuracy and testing it over the full Bus Violence benchmark. We then report the mean and the standard deviation of these five runs.
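The metrics above can be computed directly from the confusion-matrix counts; the sketch below assumes the standard definitions of false-alarm and missing-alarm rates given in this section.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, F1-score, false alarm rate, and missing alarm
    rate from the confusion-matrix counts, following the definitions
    Accuracy = (TP+TN)/(TP+TN+FP+FN), false alarm = FP/(FP+TN),
    missing alarm = FN/(FN+TP)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "false_alarm": fp / (fp + tn),
        "missing_alarm": fn / (fn + tp),
    }
```

For instance, with TP = TN = 40 and FP = FN = 10, all four metrics coincide at 0.8, 0.8, 0.2, and 0.2, respectively.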

Results and Discussion
We report the results obtained with the three general violence detection training datasets in Tables 3-5. Considering pre-training on the Surveillance Camera Fight dataset, the best-performing model is SlowFast, followed by the Video Swin Transformer. On the other hand, regarding the Real-Life Violence Situations dataset in Table 4, the best model is the ResNet 3D network, followed by SlowFast. Finally, concerning the RWF-2000 benchmark (Table 5), the most accurate models are the ResNet 2+1D, SlowFast, and Video Swin Transformer architectures. However, overall, all the considered models exhibit moderate performance, indicating their difficulty in generalizing their video classification abilities to the new, challenging scenario represented by our Bus Violence dataset.
An important observation can be made concerning false alarms and missing alarms. Specifically, while all the considered methods generally obtained very good results on the first metric, they struggled with the latter. Missing alarms are especially critical in this use-case scenario, since they correspond to violent actions that happened but were not detected; this therefore represents a major limitation of all the probed state-of-the-art violence detection systems. The main cause of this problem is the high number of false negatives, which also affects the Recall and, consequently, the F1-score, another evaluation metric that is particularly problematic for all the considered methods. In Figure 3, we report samples of a true positive, true negative, false positive, and false negative. Another point worthy of note is that the majority of the best-performing methods come from the human action recognition task. We deem that they generalize better to unseen scenarios because they are pre-trained on the Kinetics-400 dataset, from which they learned stronger features that help the network classify videos also in this specific use case.
Finally, we report in Figure 4 the ROC curves of the three best-performing models, i.e., SlowFast, ResNet 3D, and Video Swin Transformer, considering both the color and frame-difference inputs. Specifically, we plotted the curves for all three employed pre-training datasets. The dataset providing the best generalization capabilities over our Bus Violence benchmark turned out to be the Surveillance Camera Fight dataset, followed by RWF-2000. However, as already highlighted, no architecture shines when tested against our challenging scenario.

Conclusions and Future Directions
In this paper, we proposed and made freely available a novel dataset, called Bus Violence, which collects shots from surveillance cameras inside a moving bus, where several actors simulated both violent and non-violent actions. It is the first collection of videos depicting violent scenes on public transport, characterized by peculiar challenges, such as changing backgrounds due to the bus movement and illumination changes due to the varying positions of the vehicle. This dataset has been proposed as a benchmark for testing current state-of-the-art violence detection and action detection networks in challenging public transport scenarios. This research is motivated by the fact that public transport is highly exposed to many violent or criminal situations, and their automatic detection could help promptly alert local authorities. However, it is known that state-of-the-art deep learning methods cannot generalize well to unseen scenarios due to the Domain Shift problem, and specific data are needed to train architectures to work correctly on the target scenarios.
In our work, we evaluated many state-of-the-art video-based architectures by training them on widely used violence datasets (Surveillance Camera Fight, Real-Life Violence Situations, and RWF-2000), and then testing them on the collected Bus Violence benchmark. The performed experiments showed that even very recent networks, such as the Video Swin Transformer, could not generalize to an acceptable degree, probably due to the changing lighting and environmental conditions, as well as difficult camera angles and low-quality images. The CNN-based approaches seem to obtain the best results, though still at a level too unsatisfactory to make such systems reliable in real-world applications.
From our findings, we can conclude that the probed architectures cannot generalize to conceptually similar yet visually different scenarios. Therefore, we hope that the provided dataset will serve as a benchmark for training and/or evaluating novel architectures able to also generalize to these particular yet critical real-world situations. In this regard, we claim that domain-adaptation techniques are the key to obtaining features not biased to a specific target scenario [34,35]. Furthermore, we hope that the rising research in unsupervised and self-supervised video understanding [36,37] can be a good direction for acquiring high-level knowledge directly from pixels, without any manual or automatic labeling. This would pave the way toward plug-and-play smart cameras capable of learning about the specific scenario once deployed in the real world.
Finally, we also plan to use the acquired dataset for other relevant tasks on public transport, such as left-object detection and people counting, and to extend the collected videos to include other critical scenarios, such as unexpected emergencies-heart or panic attacks, that could be misclassified as some violent actions.
Funding: This work was supported by: European Union funds awarded to Blees Sp. z o.o. under grant POIR.01.01.01-00-0952/20-00 "Development of a system for analysing vision data captured by public transport vehicles interior monitoring, aimed at detecting undesirable situations/behaviours and passenger counting (including their classification by age group) and the objects they carry"; the EC H2020 project "AI4media: a Centre of Excellence delivering next generation AI Research and Training at the service of Media, Society and Democracy" under GA 951911; the research project (RAU-6, 2020) and projects for young scientists of the Silesian University of Technology (Gliwice, Poland); and the research project INAROS (INtelligenza ARtificiale per il mOnitoraggio e Supporto agli anziani), Tuscany POR FSE CUP B53D21008060008. The publication was supported under the Excellence Initiative-Research University program implemented at the Silesian University of Technology, year 2022.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.