The Role of Deep Learning Models in the Detection of Anti-Social Behaviours towards Women in Public Transport from Surveillance Videos: A Scoping Review

Increasing women’s active participation in economic, educational, and social spheres requires ensuring safe public transport environments. This study investigates the potential of machine learning-based models in addressing behaviours impacting the safety perception of women commuters. Specifically, we conduct a comprehensive review of the existing literature concerning the utilisation of deep learning models for identifying anti-social behaviours in public spaces. Employing a scoping review methodology, our study synthesises the current landscape, highlighting both the advantages and challenges associated with the automated detection of such behaviours. Additionally, we assess available video and audio datasets suitable for training detection algorithms in this context. The findings not only shed light on the feasibility of leveraging deep learning for recognising anti-social behaviours but also provide critical insights for researchers, developers, and transport operators. Our work aims to facilitate future studies focused on the development and implementation of deep learning models, enhancing safety for all passengers in public transportation systems.


Introduction
Public transport services are enablers of economic development. They provide individuals with access to medical facilities, workplaces, schools, and shops. However, public transport is experienced in a gendered way; mobilities shape how violence against women is perpetrated and how women experience it in public transport [1,2]. Global-, regional- and local-level studies have shown that women experience various forms of violence in public transport [3][4][5]. The types of violence experienced range from demeaning remarks and threats or intimidation to robbery. The fear of violence in public spaces has the potential to affect the behaviour of passengers to the point of avoiding travelling [6].
Violent or anti-social behaviour can take many forms; there are many different definitions for anti-social behaviour, and definitions vary across contexts, social norms, and values [7]. In [8], for example, anti-social behaviour is defined as a psychological term for actions that fall outside of the realm of what is considered normative in a particular society or culture. In the present study, we use the terms anti-social and abnormal interchangeably and follow the definition of anti-social behaviour adopted by the Western Australia Police, which defines anti-social behaviour as any 'behaviour that disturbs, annoys or interferes with a person's ability to go about their lawful business' [9].
Anti-social behaviour towards women in public transport is relevant to transport operators since they are concerned about improving the travel experience and ensuring customers' safety. This matter is also pertinent to governments and businesses, as closing the gender gap in economic participation could lead to organisations yielding better economic performance and outcomes [10][11][12].
Although video and audio surveillance technologies have the potential to address this issue, these technologies are inefficient for numerous reasons, such as the need to check the video frame-by-frame [13]. An intelligent detection system may help reduce operational issues, decreasing the transport operators' efforts and costs. Previous studies have shown that technologies based on machine learning models surpass traditional technologies in anti-social behaviour detection areas [14][15][16][17][18]. However, could machine learning-based detection systems be a useful and efficient technology to deal with behaviours that may influence women's perceptions of safety in public transport? To answer this question, this paper proposes a systematic scoping review that aims to understand the theoretical and practical benefits and limits of using deep learning-based technologies to detect, from surveillance data, anti-social behaviours associated with a threat to female commuters. Our goal is to support not only researchers and developers but also transport operators in the development and implementation of deep learning models for the detection of anti-social behaviours towards women in public transport. The scoping review will provide a timely summary of the literature describing the benefits of deep learning-based detection technologies, challenges in the automated detection of anti-social behaviours, and considerations for existing video and audio datasets that can be used to train such algorithms.
The rest of the paper is organised as follows. Section 2 presents the research methodology based on the scoping review approach adopted in this study. Section 3 describes the results of the scoping review, whereas Section 4 discusses the scoping review results obtained. Finally, Section 5 concludes the work.

Materials and Methods
The goal of the present study is to explore the benefits and limitations of using deep learning-based technologies to recognise anti-social behaviours threatening female commuters from audio and/or video surveillance data. For that, we adopted a scoping review methodology and followed the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR) [19]. Scoping reviews are used to introduce a wide overview of the evidence relevant to a topic and are valuable when examining areas that are emerging, clarifying key ideas, and identifying gaps [20]. To undertake a scoping review, we used the methodological framework proposed in [21] and refined in [22][23][24], which includes the following steps: (a) identification of the research questions, (b) identification of studies relevant to the research questions, (c) selection of studies for inclusion, (d) charting of information and data within the included studies, and (e) summarising and reporting of the review results. These steps as well as their sub-steps and outcomes are illustrated in Figure 1 and described in the next sections.

Identification of the Research Questions
The first step of the scoping review methodology consisted of identifying the research questions, which help shape and carry out the scoping review. The research questions that the present scoping review aims to answer are as follows.

• What are the benefits of deep learning models in the detection of anti-social behaviours in public spaces?
• What deep learning models have been used to detect anti-social behaviours in public spaces?
• What audio and/or video datasets are relevant to the detection of anti-social behaviours towards women commuters?
• What are the challenges in the use of deep learning models for the detection of anti-social behaviour(s) in public transport?

Identification of Studies Relevant to the Research Questions
The second step of the scoping review methodology is the identification of studies that are pertinent to the identified research questions. The goal of this step is to identify relevant studies in the field of anomaly detection by conducting a comprehensive literature search. For that, this step includes two activities (see Figure 1): (a) development of a search strategy and (b) creation of inclusion and exclusion criteria. These activities are described as follows.

Development of a Search Strategy
The search strategy specifies the academic reference and citation databases and the search terms to be utilised for finding pertinent articles.
We adopted a broad definition of keywords for search terms, as recommended in [21], to obtain a broad coverage of the available literature. The search terms used were: "anomaly", "unusual", "abnormal", "anomalous", "violent", "harass", "harassment", "sexual", "unsafe", "panic", "anti-social", "threatening", "violation" OR "violations" AND "behaviour", "behavior", "motion" OR "activity" AND "detection" OR "recognition" AND "audio", "video", "surveillance", "sound" OR "speech" AND "public transport", "public transportation", "transport", "transportation" OR "public spaces" AND "deep learning", "machine learning" OR "neural network". It is important to note that the words "woman" and "women" were not included as search terms. This was done in order to find examples of research focusing on similar safety issues. By broadening the focus at this search stage, it was possible to identify, as the discussion of the results shows (see Section 4), automated detection technologies that have not been used to detect anti-social behaviours perceived by women, but that are potentially suitable candidates.
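The structure of the search string can be made explicit with a short sketch. It assumes that the comma-separated terms within each group are alternatives (OR) and that the groups are combined with AND; the function name `build_query` and the parenthesised output format are illustrative choices, not the syntax of any particular database.

```python
# Term groups as listed in the search strategy. Assumption: terms inside a
# group are alternatives (OR) and the groups are combined with AND.
TERM_GROUPS = [
    ["anomaly", "unusual", "abnormal", "anomalous", "violent", "harass",
     "harassment", "sexual", "unsafe", "panic", "anti-social", "threatening",
     "violation", "violations"],
    ["behaviour", "behavior", "motion", "activity"],
    ["detection", "recognition"],
    ["audio", "video", "surveillance", "sound", "speech"],
    ["public transport", "public transportation", "transport",
     "transportation", "public spaces"],
    ["deep learning", "machine learning", "neural network"],
]


def build_query(groups):
    """Join the terms of each group with OR, then join the groups with AND."""
    clauses = []
    for terms in groups:
        alternatives = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({alternatives})")
    return " AND ".join(clauses)


query = build_query(TERM_GROUPS)
print(query)
```

Individual databases differ in their query syntax, so a string built this way would likely need adapting before use.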

Creation of Inclusion and Exclusion Criteria
After the development of a search strategy, we created inclusion and exclusion criteria, which are displayed in Table 1, to ensure the relevance of the selected papers. It was decided that all publications in English should be included, and no restrictions should be placed on the publication dates and study design. As the focus of this scoping review is on technologies for the detection of anti-social behaviours of perpetrators, only publications that fall within the anomaly detection research area should be included. Other anomaly detection research areas, such as traffic anomalies and environmental anomalies, are concerned with anti-social behaviours that either are not relevant, e.g., facial sleepiness expressions [25], or do not influence women's perception of safety in public transport, e.g., an individual abandoning an object [26]. Finally, only publications that considered anti-social behaviours that affect women's perception of safety were included in the sample. Some of the sampled publications mentioned what anti-social behaviour could be identified by the proposed algorithm; others did not. However, based on the audio or video dataset adopted for training and/or assessing the performance of the proposed algorithm, we inferred what anti-social behaviour(s) could be detected by the introduced technology.
To decide what detected behaviour was of relevance to women commuters' safety perspective, we followed the definition of anti-social behaviour from the Western Australia Police, which was presented in Section 1.

Criterion | Inclusion criteria | Exclusion criteria | Justification
Language | English | Non-English language | English is the official language of the study's team

Selection of Studies for Inclusion
The third step of the scoping review methodological framework is the selection of relevant articles for inclusion in the article sample. A PRISMA flow diagram illustrating the procedures in this step is presented in Figure 2. These procedures ensure that the review is comprehensive, relevant, and unbiased.
The studies identified through the search strategy (see Section 2.3) formed the first sample of articles (see Figure 1), which contained 92 papers. Once the first sample was obtained, we proceeded with its refinement. Accordingly, two screening processes were undertaken in this step, and the inclusion and exclusion criteria guided these processes.
To manage and organise them, the publications in the first sample were imported into Covidence, a web-based systematic review software that enables multiple reviewers to work more efficiently through the steps of a systematic review [27]. When the first sample was imported into the software, 16 duplicated publications were identified and then excluded, leaving 76 publications. The first screening process consisted of two authors independently reviewing each title and abstract to exclude publications that did not adhere to the inclusion criteria. Papers that passed through the first screening process formed the second article sample (n = 28) and were then read in full by two authors independently to assess, based on the inclusion criteria, whether each paper should be included in the third article sample. The third sample included 10 manuscripts.
Conflicts at each stage of the screening processes were resolved through researcher discussion until agreement was reached. Data extraction was completed in Covidence using a data extraction template. The information extracted is presented in Table 2.
After we acquired the third sample of articles, we manually identified 72 additional potentially relevant references from the third sample. These potentially pertinent papers also went through the first and second screening processes, and those that moved through the two processes (n = 23) were added to the third sample, resulting in the final sample of articles, which contained 33 manuscripts.
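The counts reported in this step can be checked with simple arithmetic. The sketch below only restates the figures given in the text; the variable names are illustrative, and the numbers of excluded records are derived by subtraction rather than reported by the study.

```python
# Counts as reported in the selection step.
identified = 92                # records returned by the database search
duplicates = 16                # duplicates flagged on import
after_deduplication = identified - duplicates

after_title_abstract = 28      # passed the first screening process
after_full_text = 10           # passed the second screening process

reference_candidates = 72      # identified manually from the third sample
reference_included = 23        # passed both screening processes

final_sample = after_full_text + reference_included

# Derived (not reported) exclusion counts at each stage.
excluded_title_abstract = after_deduplication - after_title_abstract
excluded_full_text = after_title_abstract - after_full_text
excluded_references = reference_candidates - reference_included

print(after_deduplication, final_sample)
```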

Charting of Information within the Included Studies
The fourth step of the scoping review methodological framework is the charting of the information within the articles included in the final sample by summarising each paper, including the details extracted using the data extraction template (see Table 2). A summary of the sampled articles is available in Appendix A.
Among the studies in the sample, 10 are surveys or review manuscripts, while the remaining 24 are applied research in which a new or improved automated detection technology is presented. Of the applied research publications, 4 studied acoustic surveillance and 20 video surveillance. All sample articles fall within the research area of anomaly detection in public spaces. From the article sample, one can presume that audio/video anomaly detection in public settings has been researched since 2007, with an increased interest in the topic from 2018 and the largest number of articles (10) recorded in 2019. However, despite the increased gender-based vulnerabilities experienced by women in public places [6], in accordance with our article sample, it seems that women's perception of safety has not been given consideration in the research area of anomaly detection. Although every applied research publication in the sample introduced an algorithm to identify one or more behaviours that could impact women's safety perspective in public transport, none of the technologies was proposed with the aim of detecting perpetrators' behaviours towards women commuters. With respect to the training datasets, some of the applied research studies adopted their own datasets, which were designed based on real data, and others utilised well-known datasets. As highlighted in Section 2.4, we used the dataset adopted in an applied research manuscript to infer what anti-social behaviours could be detected by the proposed technology when that was not specified in the study.

Summarising and Reporting the Review Results
The last step of the scoping review is summarising and reporting the review results. The results of the present scoping review are presented and discussed in the next section.

Synthesis of the Literature
Following the questions in the data extraction template (see Table 2), we undertook a synthesis of the literature on the automated detection of anti-social behaviours in public spaces based on the identified research questions (see Section 2.1). These questions are addressed in detail as follows.
The major potential benefit of deep learning-based detection technologies presented in the literature is to improve the level of safety of passengers [7,8,13,14,28]. The primary aim of audio and/or video surveillance systems in public transport is to assist in preventing crime and terrorist activities. Detecting anti-social activities at an early stage, and even predicting these activities before they happen, is of great value in preventing serious incidents from occurring [28].
The traditional procedure for the task of capturing the images and audio of possible transgressions from surveillance systems is to appoint human controllers (e.g., security guards) whose job is to analyse large volumes of repetitive and monotonous images from multiple cameras [16,29]. These exhausting human efforts make it very difficult for the controller to always remain vigilant, which might lead to abnormal events going unnoticed [15][16][17][18][29]. Moreover, if the station workers are off at night when an incident occurs, a belated response may have dangerous results. If automated detection technologies are utilised, authorities can respond to abnormal and violent behaviours sooner, improving convenience and both the perceived and the actual level of security [16,30]. Nonetheless, none of the sampled publications has validated that passengers felt safer because an automated detection technology was in place.
Another potential benefit of deep learning-based detection technologies is that they might be used for monitoring people who fall on train tracks and/or when platforms/carriages are too crowded, etc. Automated detection technologies can also be cost-effective. These technologies can reduce the costs associated with the manual monitoring incurred by transport operators [15,18,31,32,33]. Additionally, deep learning detection models usually only require infrastructure that is already available in public spaces, that is, a set of closed-circuit television (CCTV) cameras [34].

What Deep Learning Models Have Been Used to Detect Anti-Social Behaviours in Public Transport?
The deep learning model most frequently proposed in the 24 applied research articles in the sample was the convolutional neural network (CNN), followed by the Gaussian mixture model (GMM). Nine sampled applied research manuscripts introduced a CNN-based model [15,29,31,33,35,36,37,38,39], whereas seven applied research articles presented a GMM-based model [40][41][42][43]. These models are briefly detailed as follows.

CNN-Based Deep Learning Models Sampled
Ramachandran and Palivela [36] proposed a framework aimed at detecting suspicious human behaviour in surveillance videos and distinguishing it from normal activities. The model utilised a CNN to extract features from optical flow slices and pre-trained the activities based on real-time data. These learned features were used to predict the type of activity and classified using a multi-class support vector machine (MSVM). The CNN is a type of deep learning model that is highly effective in analysing and processing structured grid-like data, such as images or sequential data. It is widely used for various tasks in computer vision, including image classification, object detection, and image segmentation. CNNs are inspired by the visual cortex of animals and exploit the spatial relationships present in the input data. They consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layers apply filters to the input data, capturing local patterns and features. The pooling layers downsample the feature maps, reducing the spatial dimensions and extracting the most relevant information. The fully connected layers combine the high-level features and make the final predictions [44]. The system was designed for public places and security-sensitive environments. The performance of the system was evaluated using standard datasets and achieved 95% accuracy. The model involved pre-processing the input videos, extracting motion patterns, resizing and inputting them into the CNN for feature learning, and then using MSVM for classification. The proposed approach demonstrated high performance and accuracy, outperforming other classifiers like KNN and random forest models, while reducing false alarms in detecting suspicious behaviour.
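To make the roles of the convolutional and pooling layers concrete, the following minimal sketch applies one hand-picked filter to a toy image in plain Python. It is an illustration of the generic operations only, not the network used in [36]; the image, the kernel, and all function names are assumptions chosen for the example.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete windows at the border are dropped."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Toy 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

features = max_pool(relu(conv2d(image, kernel)))
print(features)
```

In a trained CNN the filter values are learned from data rather than chosen by hand, and many filters are stacked over several layers before the fully connected layers produce a prediction.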
Proano et al. [35] addressed the challenge of detecting abnormal behaviours in surveillance videos and proposed an approach using computer vision and pattern recognition. They fully labelled a dataset of 16,853 videos, dividing them into segments and labelling each segment as normal or abnormal. They then utilised a generic 3D convolutional neural network (C3D) to extract feature vectors from the segments and trained a multilayer perceptron (MLP) for classification. The MLP is a type of artificial neural network (ANN) that consists of multiple layers of interconnected nodes, called neurons. It is a feedforward neural network, where information flows only in one direction, from the input layer through the hidden layers to the output layer. The MLP is widely used for various tasks such as classification, regression, and pattern recognition. Each neuron in the MLP receives inputs from the previous layer, applies an activation function to the weighted sum of its inputs, and produces an output. The activation functions introduce non-linearities to the network, enabling it to learn and represent complex relationships in the data. The MLP learns by adjusting the weights and biases of its neurons through a process called back-propagation. It uses an optimisation algorithm, such as gradient descent, to iteratively minimise the difference between the predicted output and the desired output, known as the loss or error function [45]. The contribution of the paper included the labelled dataset, improved results compared to baseline research with an area under the curve (AUC) of 0.863, and the demonstration that a segment-labelled dataset enhances the classifier performance. The approach achieved promising results in abnormal behaviour detection and validated its performance through tenfold cross-validation. The proposed model consisted of segment labelling, feature extraction using the C3D network, and classification with the MLP.
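The forward pass of an MLP, as described above, can be sketched in a few lines. This is a generic illustration with hand-set weights, not the classifier trained in [35]; the two-feature input, the layer sizes, and the parameter names are assumptions, and training by back-propagation is omitted.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias, for every neuron in the layer."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


def relu(values):
    """Non-linear activation applied to the hidden layer."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash the output to (0, 1) so it can be read as an abnormality score."""
    return 1.0 / (1.0 + math.exp(-value))


def mlp_forward(features, params):
    """Input layer -> one hidden layer -> single output neuron."""
    hidden = relu(dense(features, params["w1"], params["b1"]))
    logit = dense(hidden, params["w2"], params["b2"])[0]
    return sigmoid(logit)


# Hypothetical, hand-set parameters; a real model would learn these.
params = {
    "w1": [[0.5, -0.5], [1.0, 1.0]],
    "b1": [0.0, -1.0],
    "w2": [[1.0, -1.0]],
    "b2": [0.0],
}

score = mlp_forward([1.0, 1.0], params)
print(round(score, 4))
```

During training, gradient descent would adjust the entries of `params` to reduce the loss between such scores and the segment labels.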
Landi et al. [37] introduced a new approach to anomaly detection in surveillance videos, specifically focusing on real-world anomalies like burglaries and assaults. The proposed model considered the impact of locality by analysing spatiotemporal tubes instead of whole-frame video segments. The authors enriched existing surveillance videos with spatial and temporal annotations, creating the first dataset for anomaly detection with bounding box supervision in both the train and test sets, called UCFCrime2Local. The experimental results demonstrated that a network trained with spatiotemporal tubes outperforms a similar model trained with whole-frame videos. The study also revealed the robustness of the locality concept to different errors during the tube extraction phase at test time. Additionally, the model was capable of providing spatiotemporal proposals for unseen surveillance videos, expanding the dataset without requiring further human labelling. The contributions of the paper included the novel approach to anomaly detection using action tubes, the development of a trainable model for handling different locations within a video segment, and the creation of the UCFCrime2Local dataset with bounding box supervision. The experiments emphasised the significance of locality, the model's robustness, and its ability to provide weak annotations for new videos, highlighting the potential of considering locality in anomaly detection.
Liu and Ma [39] addressed the background-bias phenomenon in anomaly detection using deep neural networks. The paper investigated whether deep networks focus on learning the background rather than the essence of anomalies and proposed a solution to alleviate this bias. The researchers conducted experiments and verified that deep networks tend to rely on the background rather than the anomaly patterns. To tackle this problem, they introduced an end-to-end framework with a novel region loss that guided the network to focus on the anomalous region. They also incorporated meta learning to prevent overfitting and improve generalisation. The largest anomaly detection dataset was re-annotated with bounding boxes, and extensive experiments demonstrated the effectiveness of the proposed approach in reducing the influence of the background and outperforming other methods.
Majhi et al. [15] presented a two-stream CNN architecture for anomalous event detection in surveillance videos. The proposed model utilised both normal and anomalous videos to improve performance. The two-stream CNN is a deep learning architecture that combines spatial and temporal information for tasks such as action recognition in videos. It consists of two separate CNN streams: one that processes spatial information from individual frames and another that analyses temporal information from sequences of frames. The spatial stream is designed to capture static visual appearance and local spatial patterns within individual frames. It takes an individual frame as input and applies convolutional and pooling layers to learn hierarchical representations of the visual content. The temporal stream, on the other hand, focuses on modelling motion information and capturing temporal dynamics over a sequence of frames. It takes a series of frames as input and processes them through 3D convolutional and pooling layers, which capture spatiotemporal patterns and motion information. The outputs from the spatial and temporal streams are fused or combined at a later stage, often through concatenation or element-wise operations, to obtain a unified representation that incorporates both spatial and temporal information. This combined representation is then used for making predictions or classifications. The two-stream CNN architecture enables the model to effectively leverage both static appearance and temporal dynamics, providing enhanced performance for tasks that involve analysing videos [46,47]. A database pre-processing technique was introduced to capture spatial and temporal information for each second of a video, which was then fed as input to the two-stream CNN architecture. The model was evaluated using the UCF-crime dataset and achieved a superior classification accuracy compared to other state-of-the-art techniques. The proposed method achieved 88.74% accuracy on the UCF-crime dataset with a frame processing time of 346 milliseconds. The contributions of the paper included the database pre-processing technique and the efficient two-stream CNN architecture for real-world anomalous event detection. The study suggested the potential for future enhancement of system performance using other deep learning models.
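The fusion stage of a two-stream architecture can be illustrated with a short sketch. It shows two generic options mentioned above, concatenation of feature vectors and an element-wise average of class probabilities; it does not reproduce the fusion used in [15], and the function names and example values are assumptions.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_concat(spatial_features, temporal_features):
    """Fusion by concatenation: one longer vector for a later classifier."""
    return list(spatial_features) + list(temporal_features)


def fuse_average(spatial_logits, temporal_logits):
    """Fusion by element-wise averaging of the per-stream class probabilities."""
    spatial_probs = softmax(spatial_logits)
    temporal_probs = softmax(temporal_logits)
    return [(s + t) / 2.0 for s, t in zip(spatial_probs, temporal_probs)]


# Hypothetical outputs for the classes (normal, anomalous).
spatial_logits = [2.0, 0.5]    # appearance alone suggests "normal"
temporal_logits = [0.1, 1.2]   # motion suggests "anomalous"

fused = fuse_average(spatial_logits, temporal_logits)
print(fused)
```

Concatenation leaves the weighting of the two streams to a subsequent trainable layer, whereas averaging treats both streams as equally reliable.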
Affonso et al. [29] presented a method for detecting anomalies, specifically assaults, in the public transportation environment using CNN and CCTV cameras. The challenges in this task included equipment standardisation, low image quality, poor camera positioning, and an imbalanced dataset. The proposed method utilised CNN architectures to classify CCTV images and determine the probability of a robbery occurrence. The method achieved promising results despite the limitations of low-quality images and the absence of temporal information analysis. Future work involved exploring the impact of temporal information on the precision of the proposed method.
Ullah et al. [31] presented an efficient deep learning-based framework for intelligent anomaly detection in surveillance networks. The proposed model utilised a pre-trained CNN to extract spatiotemporal features from a sequence of frames, followed by a multilayer bi-directional long short-term memory (BD-LSTM) model for classification of anomalous and normal events. The BD-LSTM is a variant of the LSTM recurrent neural network (RNN) architecture. It is designed to effectively capture both past and future dependencies in sequential data by processing the input sequence in both forward and backward directions simultaneously. LSTMs are a type of RNN that excel at handling sequential data due to their ability to capture long-term dependencies and mitigate the vanishing gradient problem. BD-LSTMs extend the capabilities of traditional LSTMs by incorporating information from both the past and the future, making them suitable for tasks such as sequence labelling, sentiment analysis, and machine translation. In a BD-LSTM, the input sequence is processed in two separate LSTM layers: one in the forward direction and the other in the backward direction. Each LSTM layer consists of memory cells and gates that regulate the flow of information. By combining the outputs of the forward and backward LSTMs, the BD-LSTM captures a comprehensive representation of the input sequence, taking into account both the past and future context [48]. Extensive experiments on benchmark datasets demonstrated the effectiveness of the framework, achieving a significant increase in accuracy compared to state-of-the-art methods. The contributions of the work included the development of an adaptable framework for real-world surveillance scenarios, the utilisation of deep CNN features combined with BD-LSTM for improved accuracy, and the demonstration of real-time anomaly detection with a reduced model size and processing time.
Dileep et al. [33] introduced a real-time suspicious human activity recognition system using CNN and 2D pose estimation. The system extracted skeletal images from video frames using pose estimation to identify human poses, which were then classified using a pre-trained CNN to determine suspicious activities like trespassing or falls. The system can generate alerts through various means to prevent unusual activities and can be applied in public places, homes, hospitals, and other surveillance areas. The proposed model combined pose estimation and CNN architecture to recognise suspicious activities accurately. The model focused on fall detection and trespassing activities and was tested on a custom dataset.
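The bidirectional processing described above can be sketched with a deliberately tiny LSTM that has a single hidden unit and scalar inputs. This is a didactic simplification, not the multilayer model of [31]; the parameter names and values are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def lstm_pass(sequence, p):
    """Run a one-unit LSTM over a scalar sequence; return the hidden state per step."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
        f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
        o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
        g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate memory
        c = f * c + i * g            # update the memory cell
        h = o * math.tanh(c)         # expose part of the memory as output
        outputs.append(h)
    return outputs


def bidirectional_lstm(sequence, forward_params, backward_params):
    """Pair, for every time step, the forward state with the backward state."""
    forward = lstm_pass(sequence, forward_params)
    backward = lstm_pass(sequence[::-1], backward_params)[::-1]
    return list(zip(forward, backward))


# Hypothetical parameters; a trained model would learn separate sets per direction.
p = {"wi": 0.5, "ui": 0.1, "bi": 0.0,
     "wf": 0.4, "uf": 0.2, "bf": 1.0,
     "wo": 0.3, "uo": 0.1, "bo": 0.0,
     "wg": 0.8, "ug": 0.3, "bg": 0.0}

states = bidirectional_lstm([0.5, -1.0, 2.0], p, p)
print(states)
```

Each pair in `states` summarises the frames before and after that time step, which is the property that makes the bidirectional variant attractive for classifying a whole clip.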

GMM-Based Deep Learning Models Sampled
Valenzise et al. [43] presented an audio-based video surveillance system that automatically detects anomalous audio events, such as screams or gunshots, in a public square and localises the position of the sound source to steer a video camera accordingly. The system employed two parallel GMM classifiers trained on different features to discriminate screams and gunshots from noise. The position of the sound source was estimated using time difference of arrivals (TDOA) at a microphone array and a linear-correction least-squares localisation algorithm. The experimental results demonstrated a high precision in event detection and accurate source localisation. The proposed approach differed from previous works by emphasising the phase of feature selection and providing insights into camera zooming based on localisation confidence. The system was planned for real-time implementation in a public square.
Clavel et al. [41] introduced a model for automatic emotion recognition in speech, focusing specifically on fear-type emotions during abnormal, life-threatening situations. The proposed model utilised the SAFE corpus, a collection of 7 h of audiovisual sequences from fiction movies containing recordings of both normal and abnormal situations. The corpus provides a wide range of emotional manifestations and addresses the lack of corpora illustrating strong emotions. The annotation strategy described the emotion and situation evolution in context. The emotion recognition system was based on dissociated acoustic models for voiced and unvoiced speech, which were merged during the classification step. The results showed promise, with an error rate of about 30%. The paper further discussed the integration of emotion recognition into automatic surveillance systems and its potential role in understanding human behaviour and diagnosing abnormal situations. The model development involved acquiring emotional data, manually annotating emotional content, describing the acoustic features, and developing machine learning algorithms. The study represented preliminary work in the emerging field of emotion recognition, particularly in the context of audio-video surveillance, with the aim of addressing the challenges posed by heterogeneous, noisy data, and specific emotional classes.
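A time difference of arrival between two microphones can be estimated in several ways; the sketch below uses a plain cross-correlation peak, which is an assumption made for illustration and not necessarily the estimator used in [43]. The signals, the sampling rate, and the function names are hypothetical, and the subsequent least-squares localisation step is not shown.

```python
def cross_correlation_lag(reference, delayed, max_lag):
    """Return the lag, in samples, at which the two signals align best."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t, value in enumerate(reference):
            u = t + lag
            if 0 <= u < len(delayed):
                score += value * delayed[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tdoa_seconds(reference, delayed, sample_rate, max_lag):
    """Convert the best-aligning lag from samples to seconds."""
    return cross_correlation_lag(reference, delayed, max_lag) / sample_rate


# Toy impulse heard at microphone A and, three samples later, at microphone B.
mic_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]

lag = cross_correlation_lag(mic_a, mic_b, max_lag=5)
delay = tdoa_seconds(mic_a, mic_b, sample_rate=8000, max_lag=5)
print(lag, delay)
```

With delays from several microphone pairs and known microphone positions, a localisation algorithm can then solve for the source position.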
Ntalampiras et al. [40] presented an efficient methodology for acoustic surveillance of atypical situations, specifically focusing on the recognition of specific sound events such as screams, explosions, and gunshots. The proposed model utilised a probabilistic hierarchical scheme based on GMM and carefully selected sound parameters. Notably, the model included an adaptation loop so that it could adjust to different sound environments. The GMM is a probabilistic model that represents a probability distribution as a combination of multiple normal distributions. It is usually used to model complex data distributions that cannot be described accurately with a single Gaussian distribution. In a GMM, the underlying assumption is that the observed data points are generated from a mixture of K Gaussian distributions, where each Gaussian distribution in the mixture represents a distinct component or cluster in the data. The GMM assigns a probability to each data point, indicating the likelihood of it belonging to each component. This allows for the modelling of data points that may come from different sources or follow different patterns [49]. Extensive experimentation and testing, including real-world installation and operational detection rates over three days, demonstrated the effectiveness of the system in terms of recognition accuracy, miss probability, and false alarm rates. The methodology was applied to various environments, including metro stations, urban areas, and military settings. The paper also highlighted the integration of visual and infrared sensors for enhanced detection of hazardous events.
Ntalampiras et al. [42] explored the use of novelty detection in acoustic surveillance of abnormal situations. The proposed model aimed to identify unknown or novel audio data that deviated significantly from the trained data. A multi-domain feature vector was constructed using various acoustic parameters to capture diverse characteristics of audio signals. Three probabilistic novelty detection methodologies were employed and evaluated using real-world recordings from different locations with normal and abnormal sound events. The results demonstrated the effectiveness of probabilistic novelty detection in accurately identifying abnormal audio events. The paper proposed a framework that utilised multi-domain audio descriptors and achieved a high detection accuracy. Proposed future steps involved combining the acoustic component with other modalities, such as CCD and IR cameras, to improve accuracy and facilitate human behaviour detection and interpretation. Further experiments were planned to evaluate the methodology and identify any limitations.
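To make the GMM description and the idea of probabilistic novelty detection more concrete, the following is a minimal sketch in Python. It is not the implementation used in [40] or [42]: it assumes a single one-dimensional acoustic feature, and the mixture parameters (`WEIGHTS`, `MEANS`, `STDS`) and the threshold `LOG_THRESHOLD` are hypothetical values chosen only for illustration. In practice, the parameters would be estimated from training data, for example with the expectation-maximisation algorithm.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, means, stds):
    """Mixture density p(x) = sum_k w_k * N(x | mean_k, std_k^2)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


def responsibilities(x, weights, means, stds):
    """Probability that x belongs to each component (the values sum to 1)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def is_novel(x, weights, means, stds, log_threshold):
    """Flag x as novel when its log-likelihood under the mixture is too low."""
    density = mixture_density(x, weights, means, stds)
    if density == 0.0:  # numerical underflow: x is extremely far from all components
        return True
    return math.log(density) < log_threshold


# Hypothetical two-component mixture over one acoustic feature (assumed values).
WEIGHTS = [0.7, 0.3]
MEANS = [0.0, 5.0]
STDS = [1.0, 1.5]
LOG_THRESHOLD = -8.0  # illustrative only; would normally be tuned on held-out data

resp = responsibilities(0.5, WEIGHTS, MEANS, STDS)
typical_flag = is_novel(0.5, WEIGHTS, MEANS, STDS, LOG_THRESHOLD)
outlier_flag = is_novel(20.0, WEIGHTS, MEANS, STDS, LOG_THRESHOLD)
```

In this sketch, a value close to one of the component means receives a high likelihood and is treated as known, whereas a value far from every component falls below the threshold and is flagged as novel.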
In conclusion, the studies presented above address anomaly detection from surveillance data in different contexts. These studies explored the use of advanced techniques to address the challenges associated with recognising specific sound events and detecting abnormal behaviours in surveillance videos. The proposed models leveraged features extracted from audio, optical flow, skeletal images, or spatiotemporal tubes, and applied classification algorithms to accurately differentiate between normal and abnormal events. The studies highlight the importance of considering spatial and temporal information, exploiting locality, addressing background bias, and incorporating emotion recognition for a comprehensive understanding of abnormal situations. These advancements contribute to improving the effectiveness of surveillance systems, enhancing public safety, and enabling early intervention in potentially hazardous situations.

What Audio and/or Video Datasets Are Relevant to the Detection of Anti-Social Behaviours towards Women Commuters?
Studies carried out in anomaly detection have proposed many training and validation datasets that are suitable for different applications. With respect to the detection of anti-social behaviours relevant to the present study, five training and validation datasets were selected from the literature. These datasets are explained in detail below and a summary of their main features is presented in Table 3.

BEHAVE
The BEHAVE video dataset, proposed in [50], includes videos with crime-oriented abnormal behaviour that were recorded in the real world using a commercial tripod-mounted camcorder. This dataset was adopted in five manuscripts in the sample. It has around 90,000 frames of humans identified with bounding boxes, providing ground-truth tracking information along with descriptions of behaviours for interacting groups. This dataset covers four video clips recorded at 25 frames per second with a resolution of 640 × 480 pixels. In total, the dataset contains 125 different people and 83,545 bounding boxes covering the interacting persons. BEHAVE was annotated at the level of individual bounding boxes and frame-by-frame behaviour to segregate abnormal and normal events. The anti-social behaviours included in this dataset that are relevant to this study are chasing, fighting, following, and running together. No training and test partitions are provided in this dataset.

ShanghaiTech Campus
This video dataset was first proposed in [32] and was adopted in 11 manuscripts in the sample, making it the most used.This dataset was created using multiple surveillance cameras with different view angles installed at different spots to capture real events at a university campus.ShanghaiTech has challenging light conditions and is a very large dataset, containing 13 scenes, 130 abnormal events, and over 270,000 training frames.The resolution of each video frame is 480 × 856 pixels.For training and validation, this dataset has 330 training videos and 107 testing ones.The abnormal behaviours included in this dataset that are relevant to this research are fighting, throwing objects, chasing, brawling, and pushing.The training set contains only normal videos, while the test set contains a large number of normal and abnormal videos, which might lead to a low performance in anomalous behaviours detection.Additionally, the anomalies in the test set are annotated at the pixel level [51].

UCF-Crime
UCF-Crime, which is a large-scale video training dataset introduced in [52], was used in seven manuscripts in the sample. The authors generated this dataset from videos on YouTube and LiveLeak using text search terms with slight variations of each anomaly term, e.g., "car crash" and "road accident". It includes 13 anomalies with a high impact on public safety, and the anti-social behaviours pertinent to our research are abuse, arrest, assault, burglary, shooting, stealing, fighting, robbery, and vandalism. UCF-Crime consists of 1900 untrimmed surveillance videos, of which half contain anomalies, and some of the videos have multiple anomalies. The dataset is divided into two parts: the training sample, consisting of 800 normal and 810 anomalous videos, and the testing data sample, including the remaining 150 normal and 140 anomalous videos [31]. Unlike the static backgrounds in ShanghaiTech, UCF-Crime consists of complicated and diverse backgrounds, which may make it difficult to detect violent behaviours [53].

XD-Violence
XD-Violence is an audio and visual violence dataset proposed in [54]. This dataset was utilised in two manuscripts in the sample. To create this dataset, the authors collected videos from both movies (91 in total) and YouTube by using text search queries. Violent movies were used to collect both violent and non-violent events, and non-violent movies were only used to collect non-violent events. XD-Violence is the largest-scale anomaly dataset, with a total of 217 h and 4754 untrimmed videos, consisting of 2405 violent videos and 2349 non-violent videos. This dataset includes six anomaly events and four are relevant to this research, namely, abuse, fighting, riot, and shooting. The dataset was split into two parts: the training set, containing 3954 videos, and the test set, including 800 videos, where the test set consists of 500 violent videos and 300 non-violent videos. The training set contains video-level annotations, while the test set contains frame-level annotations [51]. Because it contains audio-visual signals, models trained on this dataset can leverage multi-modal information, which may provide more confidence in the detection algorithm's outcomes.

SAFE Corpus
The Situation Analysis in a Fictional and Emotional (SAFE) Corpus is an audio and video training dataset introduced in [41] and adopted in one manuscript in the sample. It consists of 400 audio-visual sequences from 8 s to 5 min in English generated from a collection of 30 movies from various genres: thrillers, psychological drama, horror movies, and movies that aim at reconstituting dramatic news items or historical events or natural disasters. A sequence is a movie section illustrating one type of situation, e.g., kidnapping, physical aggression, or flood. A total of 7 h and 5275 segments is obtained in which speech represents 76% of the data, and of which 71% depicts abnormal situations with fear-type emotional manifestations, among other emotions. The anomalies illustrated in the dataset that are of relevance to this study are physical/psychological threats and aggression against human beings. The speech data include about 400 different speakers. The distribution of speech duration according to gender is 47% male speakers and 31% female speakers. The remaining 20% of the spoken duration consists of overlaps between speakers, including oral manifestations of the crowd (2%). The annotation of a given segment was performed by a human labeller and influenced both by audio and video information contained in the whole sequence.
The datasets described above have two good traits. Firstly, they are large-scale datasets, which is beneficial for training generalisable methods for abnormal behaviour detection. Secondly, they include a diversity of scenarios. In this way, abnormal behaviour detection models can respond to complicated and diverse environments and be more robust, and their outcomes can be less prone to bias. However, most of these datasets were created from synthetic data, that is, data artificially manufactured. One of the limitations of synthetic data is that the images generally do not present a uniform pixel quality, whereas dataset images with consistent pixel quality improve the model's performance. Another limitation is that synthetic data may not accurately reflect real-world situations and may not be representative of the underlying data distribution. Accordingly, synthetic datasets may not capture the complexity and variability of real-world data, resulting in deep learning models that generate erroneous behaviour predictions on real-life data [55,56].

What Are the Challenges in the Use of Deep Learning Models for the Detection of Anti-Social Behaviour(s) in Public Transport?
Challenges in anomaly detection from audio and/or video surveillance data were a topic covered in most of the publications in the sample.These challenges are due to many factors, such as environmental conditions and the nature of the technology, and can affect the performance (accuracy) of a deep learning model.A deep learning-based detection model should be able to stably and continuously output the correct result, which is the fundamental requirement in abnormal behaviour detection [28].While a false positive is when the target behaviour did not occur, but the detection model predicted it happened, a false negative is when the target behaviour occurred, but the detection model did not predict it.False positives and false negatives can both result in wrong decisions and lead to severe consequences [17,28].To avoid incorrect alerts, the accuracy of the detection model must be maximised [28,57].We classify the identified challenges based on their factors and describe them as follows.
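The definitions of false positives and false negatives given above can be illustrated with a short Python sketch. The labels and predictions below are hypothetical (1 denotes the target behaviour, 0 denotes normal behaviour), and the function names are our own. The example also shows why accuracy alone can be misleading when abnormal events are rare: a model can reach a high accuracy while still missing half of the target events.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = target behaviour)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # behaviour occurred and was detected
        elif pred == 1 and truth == 0:
            fp += 1  # false positive: alert raised, but nothing happened
        elif pred == 0 and truth == 1:
            fn += 1  # false negative: behaviour occurred, but was missed
        else:
            tn += 1  # normal behaviour correctly ignored
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Derive common performance measures from the four counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical ground truth and model output for ten video clips.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = summary_metrics(*counts)
```

Here the model is correct on 8 of 10 clips, yet it detects only one of the two target events and raises one false alarm, so precision and recall are both 0.5.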

Ambiguity
This challenge refers to the ambiguous nature of anti-social activities.In real-world scenarios, the boundary between abnormal and normal behaviours is unclear.Humans can easily recognise abnormal or typical events based on common sense, but deep learning models need to use features learned from the differences between data representing abnormal and normal behaviours to detect these events [16,17,31].For example, a deep learning-based model could classify two friends playing by chasing each other as anti-social behaviour.Additionally, some normal data samples may present an abnormal feature, which will reduce the detection accuracy of the model [34,58].

Background
This challenge relates to the environment where surveillance systems operate. Since this environment varies in illumination conditions over time, deep learning-based detection techniques might have difficulty detecting abnormal behaviours in some locations/time periods [16,17,31,34]. Another issue is when the background is dynamic and complex, for instance, when many people are walking past the cameras all the time. Space, video resolution, and target occlusion might limit the detection accuracy as well [15,17,34,40,59]. If the detection model was only trained in a specific space, the detection accuracy of the model might be lower when the abnormal behaviour happens in a space different from that in the training dataset. With respect to video resolution, the detection model may not work well when the resolution of the surveillance video is different from that of the training dataset. Target occlusion means that the target activity is blocked by objects or individuals. Consequently, the detection model cannot detect the target behaviour, and the performance of the algorithm decreases. Additionally, the poor positioning of CCTV cameras, including camera angles, can substantially impact the performance of detection models, as the appearance of behaviours may change depending on the different positions and angles of the cameras [29,34,59,60]. In some cases, the same individual in different poses can be detected by humans with ease. However, it might be difficult for deep learning models to detect and track the same target with various poses [61].

Data Imbalance
This challenge indicates the imbalance between data on abnormal and normal events.Although normal events happen every day, the frequency of anti-social events is low, which leads to a data imbalance [28,34,38,61].A data imbalance makes deep learning models difficult to train, as these models must analyse a set of data containing abnormal and normal events to learn to detect the target behaviour(s).Hence, if the training and validation datasets are imbalanced, it becomes arduous not only to train the detection model but also to evaluate its performance [28].Moreover, the anti-social events may not be recorded, and a deep learning model may not be able to detect an anti-social behaviour from surveillance videos if the behaviour is not present in the training dataset [62].
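One common way to mitigate data imbalance during training, offered here only as an illustration and not as a method taken from the sampled studies, is to weight each class inversely to its frequency so that rare abnormal events contribute as much to the training loss as frequent normal events. The sketch below uses hypothetical label counts; the function name `inverse_frequency_weights` is our own.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training set: 95 normal clips and 5 abnormal clips.
labels = ["normal"] * 95 + ["abnormal"] * 5
weights = inverse_frequency_weights(labels)

# Total weight contributed by each class after re-weighting.
total_normal = weights["normal"] * 95
total_abnormal = weights["abnormal"] * 5
```

With these weights, each abnormal clip counts 19 times as much as a normal clip, and both classes carry the same total weight. Re-weighting does not create new information, so it cannot compensate for behaviours that are entirely absent from the training data.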

Dependency and Diversity
This challenge concerns both the contextual dependency of the definition of anti-social behaviours and their diverse nature.Although there are numerous studies on abnormal behaviour detection, there is not a unified definition of abnormal behaviour [28,58], and some of the definitions introduced are dependent on the context, and thus, cannot be adopted in other abnormal behaviour detection tasks.Even the same abnormal events are unlikely to have the same characteristics and they may vary in different backgrounds.The contextual dependency of abnormal behaviours makes the detection models not adaptable [58].Moreover, anti-social behaviour is diverse.Its scope is wide and not limited to a specific behaviour; it rather refers to a wide class of behaviours.Therefore, there is a challenge in pre-defining the structure or class of anti-social events [40,61].

Data Quality
Another challenge is dealing with audio or video data with varying levels of quality [34,58].Audio or video data recorded from the same equipment can have various levels of quality due to variation or malfunction of the equipment, poor lighting conditions or other environmental characteristics affecting video recording quality, background noise affecting audio recording quality, or high data compression leading to loss of resolution or decoding artefacts.

Privacy and Availability
This challenge refers to data privacy limitations and the consequent lack of available training and testing datasets.During the construction of surveillance systems for transportation networks, there are certain rules issued by the government or relevant authorities that must be followed.They specify the needs and the justification of the CCTV system.These rules cover the legislative and privacy limitations [28,63].One of the main ethical concerns in the domain of CCTV is privacy.Video analytics could identify an individual's interactions with other individuals and objects.The individual privacy of facial and behavioural information in surveillance might be compromised, especially if the data are open [28,58].This leads to a lack of open-source datasets from real-life cases [17,34,58,61].Furthermore, surveillance data are considered sensitive and relevant to confidential and legal issues, which restricts data sharing [30].

Uncertainty
Another challenge relates to the changes that can be made to video or image data before they are input to the detection model and that affect the performance of the model. These changes are called adversarial attacks [15,31,36,64] and could be, for instance, rotating the image, adding white noise to the image, and/or changing the scale of the image. These perturbations are usually too small to be perceptible to humans [28], yet they could significantly disrupt security systems such as surveillance systems. For example, adversarial attacks can be implemented such that an anti-social activity is not detected by the model, and then no action is taken to address/report the incident.
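The input changes listed above (rotation, added white noise, and a change of scale) can be sketched in Python as follows. This is an illustration under simplifying assumptions: the frame is a small hypothetical greyscale image stored as a list of rows, the noise level is arbitrary, and no detection model is attacked. Attacks that craft perturbations from the gradients of a specific model are not shown.

```python
import random


def add_white_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel, clipping values to [0, 255]."""
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def downscale_by_2(image):
    """Halve the resolution by averaging each 2x2 block (assumes even dimensions)."""
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical 4 x 6 greyscale frame with pixel values 0..23.
frame = [[float(r * 6 + c) for c in range(6)] for r in range(4)]

noisy = add_white_noise(frame, sigma=2.0, rng=rng)
rotated = rotate_90_clockwise(frame)
small = downscale_by_2(frame)
```

A robustness check could feed such transformed frames to a trained detection model and compare its outputs with those obtained on the original frames.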

Trade-offs
A balanced trade-off between real-time processing and the desired level of accuracy is needed in the application of deep learning models for detecting anti-social behaviours [34,59]. The high accuracy of the detection and localisation of abnormal behaviours by deep learning-based methods is achieved at the cost of high computational complexity and a long processing time [34]. A few important factors directly relevant to achieving real-time performance include the model architecture, hardware accelerators, data pre-processing, and threshold setting. Deep learning models usually consist of many hidden layers with millions of trainable parameters, which require substantial computational resources, making them very challenging to deploy in real time. Model optimisation techniques (e.g., model compression, model pruning) are among the common approaches adopted to reduce the model complexity without compromising the model accuracy by a significant margin. Furthermore, at the hardware level, GPU-enabled chips are used as hardware accelerators to support the deep learning models' deployment in real time. In terms of input data, a pre-processed data feed rather than a raw data feed is often used to improve the real-time performance.
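Two of the factors mentioned above, model pruning and threshold setting, can be illustrated with a minimal Python sketch. It assumes unstructured magnitude-based pruning, which is only one of several pruning strategies, and it operates on a hypothetical list of weights rather than on a real network. The function names and all numeric values are our own and are chosen only for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Set the given fraction of weights with the smallest magnitude to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for index in order[:n_prune]:
        pruned[index] = 0.0
    return pruned


def raise_alert(score, threshold):
    """Raise an alert when the anomaly score reaches the decision threshold."""
    return score >= threshold


# Hypothetical weights of one small layer.
layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.9, 0.1]
pruned = magnitude_prune(layer, sparsity=0.5)

# The same anomaly score triggers an alert or not, depending on the threshold.
alert_low_threshold = raise_alert(0.6, threshold=0.5)
alert_high_threshold = raise_alert(0.6, threshold=0.7)
```

Zeroed weights can be skipped or stored compactly, which reduces computation only when the deployment hardware and software exploit sparsity. A lower decision threshold yields more alerts (fewer missed events but more false alarms), whereas a higher threshold has the opposite effect.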

Discussion
Although computer vision has evolved in the last decade as a key technology for numerous applications replacing human supervision, such as abnormal behaviour detection, our article sample shows that, so far, no academic research on abnormal behaviour detection has proposed a deep learning-based model to detect anti-social behaviours towards women commuters. Nevertheless, all sampled articles considered one or more behaviours that are of relevance to this research, such as abuse [15,31,35,38,65] and assault [35,37,38,53,65]. Some of the sampled applied research articles did not mention what abnormal behaviour is detected by the proposed deep learning model, but they described what dataset was used to train/validate the model. As such, we could infer what anti-social behaviours can be detected by the introduced model. It is important to mention that the sampled articles that clarified the anti-social behaviour targeted by the introduced technology did not specifically define the behaviour. For instance, a targeted behaviour in some of the sampled articles was abuse. Abuse has a broad definition and can come in many forms, such as physical abuse, psychological abuse, and verbal abuse. Nevertheless, none of the sampled publications that considered abuse as a target behaviour introduced a definition for it. We believe this is because all sampled articles are concerned with technical aspects rather than social considerations. More importantly, no sampled applied research manuscript described the patterns/features in the dataset that the introduced deep learning model looked for to identify the target behaviour(s), which leaves readers guessing how to train their own algorithms to detect the same behaviour(s) using the same training dataset.
Despite that, this scoping review presented deep learning models that are potentially good candidates for the detection of anti-social behaviours against women while using public transport.They can therefore be used as benchmarks when developing a deep learning algorithm for that task.However, taking into consideration the ambiguity [16,17,31,34,58] and dependency and diversity challenges [28,40,58,61], before using algorithms from the literature as benchmarks, it is necessary to specify what the target anti-social behaviours are, their visual and audio features, and in what public transport setting the behaviours will be detected, e.g., on a bus or at a platform.
This scoping review also introduced other relevant themes from the literature on abnormal behaviour detection, namely, the benefits of automated detection technology and considerations on audio and video training datasets, and reviewed the available audio and video training datasets relevant to the detection of anti-social behaviours towards women commuters from surveillance data. There are five training and testing datasets that contain anti-social behaviours relevant to this research. The advantages of these datasets are that they are large-scale and present a range of scenarios. However, three of them consist of synthetic data. From an algorithm development perspective, synthetic data have two disadvantages. Firstly, images in a synthetic dataset usually do not have consistent pixel quality, and having dataset images from the same capturing device(s), with the same consistent pixel quality, significantly impacts the model's performance. Secondly, real-world data define the scope of the problem very well; that is, they are considered to be more representative of the actual condition. Therefore, when implementing a deep learning model for the detection of target anti-social behaviours, it is important to develop a training and testing dataset from real data from the actual use case.

Conclusions
Anti-social behaviour impacts public transport satisfaction and is among the key factors why travellers choose not to use public transport.Studies have highlighted the various forms of violence experienced by women commuters.Accordingly, predicting and preventing anti-social behaviours towards women within public transport environments is imperative to enhance their safety.
This systematic scoping review aimed to comprehensively understand the practical and theoretical implications of utilising deep learning algorithms for identifying abnormal behaviours towards women commuters using surveillance data.Our study successfully addressed the identified research questions.

Existing Landscape and Gap Analysis
This study presented a limited number of proposed deep learning models specifically designed for detecting anti-social behaviours in public spaces, despite substantial advancements achieved in other domains.Notably, this review identified the absence of automated technology focusing on behaviours impacting women passengers' perceived safety.Therefore, omitting the search terms "woman" and "women" facilitated the identification of deep learning models that have not been applied to detect behaviours relevant to women but are potentially suitable candidates.

Potential of Deep Learning Models and Identified Challenges
While our findings suggest the potential utility of deep learning models in detecting behaviours directed at women commuters, several challenges must be addressed.The discussion of the review results highlights these challenges, emphasising the need for further research.Specifically, future studies should explicitly delineate the patterns within training and validation data sets that model the target behaviours.Moreover, they should detail innovative approaches to overcoming common challenges like ambiguous data and data imbalance in this domain.

Proposals for Stakeholders
The current findings underscore the potential for transport operators to harness deep learning models in identifying undesirable behaviours from CCTV video footage.Nevertheless, the efficacy of these models depends on several critical factors, including the quality of the video footage, the accuracy of the training data, and the complexity of the behaviours targeted for detection.Privacy concerns can impede access to real-world case data, thereby potentially impacting the models' effectiveness.Consequently, researchers and developers must consider these factors when proposing novel deep learning solutions aimed at detecting anti-social behaviours affecting women commuters.Furthermore, there exists an opportunity for researchers, serving as catalysts for progress and innovation, to propose innovative solutions that overcome these challenges, thus contributing significantly to the enhancement of public safety within transport environments.

Limitations of the Present Study
Although this study benefits many stakeholders in the domain of deep learning model applications, it is important to recognise some of its limitations.Despite rigorous search strategies, certain databases, articles in non-English languages, or unpublished studies may have been excluded from the article sample, leading to potential gaps in the review.Additionally, due to its nature, the present scoping review may not delve deeply into specific research questions or provide definitive answers to focused queries.

Figure 2. PRISMA flow diagram for paper selection.


Table 3. Summary of datasets.