Novel Hand Gesture Alert System

Abstract: Sexual assault can cause great societal damage, with negative socio-economic, mental, sexual, physical and reproductive consequences. According to Eurostat, the number of crimes increased in the European Union between 2008 and 2016. However, despite the increase in security tools such as cameras, it is usually difficult to know from an individual's posture whether he or she is being assaulted. Hand gestures are seen by many as a natural means of nonverbal communication when interacting with a computer, and a considerable amount of research has been performed on them. In addition, the identifiable hand placement characteristics provided by modern inexpensive commercial depth cameras can be used in a variety of gesture recognition-based systems, particularly for human-machine interaction. This paper introduces a novel gesture alert system that uses a combination of Convolutional Neural Networks (CNNs). The overall system can be subdivided into three main parts: firstly, human detection in the image using a pretrained “You Only Look Once (YOLO)” method, which extracts the bounding boxes containing the subject's hands; secondly, the gesture detection/classification stage, which processes the bounding box images; and thirdly, a module called “counterGesture”, which triggers the alert.


Introduction
In a sexual assault, the assailant attacks the victim quickly and brutally, without any prior contact, usually at night in a public place. This can be done by physical force or threats of force, or by the abuser giving the victim alcohol as part of the crime. Sexual assault includes rape and sexual coercion [1]. In the United States, a significant number of women face sexual assault every day, and about one in three women have been victims of this crime [2]. Similarly, the number of sexual assaults in Europe increased between 2008 and 2016 [3], despite the tremendous increase in security tools such as cameras, which require dedicated human attention to analyze the scene [4][5][6]. Most of the studies proposed in the past few years try to answer the question: was it a sexual assault? However, few of them focus on the early detection of this crime [7,8]. Indeed, this is a difficult task, as both individuals (the assailant and the victim) may lie on the same abscissa in the camera view, as shown in Figure 1. Regardless of the victim's position, a part of his/her body is likely to be visible most of the time, especially the hands, which can form a specific pattern if the victim knows the "security-gestures" described in this article. This article presents a new hand gesture alert system, which relies on a defined set of gestures that trigger a warning when the described computer vision system detects them. To achieve this aim, the system uses human detection, hand extraction, and Convolutional Neural Networks (CNNs) [9][10][11][12]. It detects human bodies in a video, extracts the region of interest (the hands), and detects the hand gesture, which triggers the alert if it corresponds to one of the predefined hand gestures.
The remaining part of this article is structured as follows: first of all, the related work will be presented; secondly, the proposed architecture will be addressed; thirdly, experiments, results, and discussion will be presented; lastly, a summary of the proposed work and further perspectives are offered in the conclusion.

Related Work
Several researchers focus on human detection and surveillance; in addition, the detection of gestures is attracting increasing interest, as is the classification of gestures. The following section gives the reader an overview of the state-of-the-art work in these areas.

Human Detection
In today's frameworks, the extraction of Regions of Interest (ROIs) and the representation of features are the two main factors under study [13][14][15][16][17][18]. In [19], the intensity difference of individual pixels is incorporated into shape-oriented features to capture salient features. However, that framework selects its significant thresholds based on assumptions. In [20], the foreground is separated by background subtraction and then classified by a Support Vector Machine (SVM). However, that framework only detects the upper part of the human body. Histogram of Oriented Gradients (HOG) [21] features [22] are combined into a composite local feature. Nevertheless, the increased dimension of the composite feature raises the processing cost of the system. Other commonly used feature descriptors are the shapelet, the Edge Orientation Histogram (EOH) feature and the Haar wavelet. In the case of partial occlusion, it is more effective to detect parts of the human body [23] rather than the whole body. However, the improved accuracy also increases processing costs: the system proposed in [24] requires several seconds to process a single frame.
Researchers are also interested in building tracking algorithms that use RGB (Red Green Blue) video streams. A tracking algorithm typically consists of two components: a local component, which includes the features extracted from the target being tracked; and a global component, which determines the probability of locating the target. Using a CNN to extract the features of the target being tracked is a prevalent and effective method. This approach focuses primarily on object detection and on local features of the target to be used for tracking purposes [6][7][8][9]. Qi et al. [23] used two different CNN structures to distinguish the target from other distractors in the scene. Wang et al. [24] devised a structure composed of two distinct parts: a part shared by all the training videos and a multi-domain part that classifies the different videos in the training set. The first part extracts the common features to be used for tracking. Fu et al. [25] designed a CNN-based discriminant filter to obtain local features.

Hand Gesture Recognition (Detection and Classification)
Previous studies deal in different ways with the classification and localization issues related to gestures. Regarding hand localization, also known as hand detection, the authors of [26][27][28] extract the hand from the body using depth information and a threshold estimated at a specific moment. The authors of [29,30] used skin color maps, and the authors of [31,32] achieved better segmentation results by combining depth thresholding with color-based skin detection. In terms of classification, several CNN-based approaches relied on hand-crafted features [33,34], which can capture information about the silhouette, shape, and structure.
In [35], the authors presented a 3D dynamic system that recognizes gestures using hand pose information. More precisely, the authors used the natural structure of the topology of the hand, called the skeletal data of the hand, to extract kinematic descriptors from the hand in the gesture sequence. Using a Fisher kernel and a multilevel temporal pyramid, respectively, the descriptors were encoded in a statistical and a temporal representation. Given a feature vector calculated over the entire pre-segmented gesture, recognition can be improved by placing a linear SVM classifier directly at the end (see Figure 2).
In terms of social factors, browsing the literature revealed the need to acquire statistical data. Indeed, many databases are available in the United States and offer real visibility on the increasing crime rate, both in general and in relation to sexual assault. Based on the data collection in [33], Figure 3 presents a forecast that reveals the areas at high risk of experiencing a significant number of sexual assaults in the United States over the next three years.

Proposed Model
In this part, the alert system based on a hierarchical convolutional neural network is described. Its architecture is composed of three main parts: the extraction of the bounding boxes containing the subject's hands, the gesture detection/classification stage, and the counterGesture. After presenting the architecture, we describe each component and then move to the experiment section.
For the architecture, we envision an environment where many people walk on the streets. The system below takes as its input the video stream of the scene and processes it to identify if a subject is facing sexual assault. To do this, the video stream will be segmented into several frames that will be processed continuously. Based on Figure 4, we can distinguish the extraction of the regions of interest, which are the hands of everyone present in the scene. It is essential to specify that each hand is associated with its owner so that we know which subject triggers the alert.

Human Detector and Regions of Interest (ROIs)
Researchers are addressing the problem of multiple object tracking (MOT) with neural networks, primarily by building robust models that capture information about movement, appearance, and interactions between objects. For MOT, we adopt a conventional methodology that tracks the different hypotheses with recursive Kalman filtering and frame-by-frame association. To illustrate the idea, consider the following situation: when an object is occluded for an extended period, successive Kalman filter predictions increase the uncertainty associated with the position of the object (see Figure 4). We use a standard Kalman filter with a constant-velocity motion model and a linear observation model, where we take the bounding coordinates (u, v, and h) as the direct observation of the state of the object.
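The constant-velocity Kalman filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the six-dimensional state layout [u, v, h, du, dv, dh], the noise matrices Q and R, and the initial values are all assumptions chosen for the example. The loop runs several predictions without any measurement, which mimics an occlusion and shows the positional uncertainty growing.

```python
import numpy as np

def make_constant_velocity_model(dt=1.0, n_obs=3):
    """Build F and H for the assumed state [u, v, h, du, dv, dh]."""
    F = np.eye(2 * n_obs)
    F[:n_obs, n_obs:] = dt * np.eye(n_obs)  # position += velocity * dt
    # Linear observation model: only (u, v, h) is observed directly.
    H = np.hstack([np.eye(n_obs), np.zeros((n_obs, n_obs))])
    return F, H

def predict(x, P, F, Q):
    """Kalman prediction step: propagate mean and covariance."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    """Kalman update step with measurement z."""
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Illustrative (assumed) noise levels and initial state.
F, H = make_constant_velocity_model()
Q = 0.01 * np.eye(6)
R = 0.1 * np.eye(3)
x = np.array([100.0, 50.0, 80.0, 2.0, 0.0, 0.0])
P = np.eye(6)

# Five predictions with no measurement, as during an occlusion.
traces = []
for _ in range(5):
    x, P = predict(x, P, F, Q)
    traces.append(float(np.trace(H @ P @ H.T)))  # uncertainty in (u, v, h)
```

Each entry of `traces` is larger than the previous one, which is the growth in uncertainty the paragraph refers to; a subsequent call to `update` with a matched detection shrinks it again.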
For each track k (the bounding boxes associated with the same identifier, ID), we count the number of frames a_k since the last successful measurement association. This counter is incremented during the Kalman filter prediction and reset to 0 when the track has been associated with a measurement. Algorithm 1 describes how the bounding boxes associated with IDs are processed in the human detector module. Tracks that exceed a predefined maximum age A_max are considered to have left the scene and are removed from the set of tracks. New track hypotheses are initiated for each detection that cannot be associated with an existing track. These new tracks are classified as tentative during their first three frames, during which we expect a successful measurement association at each time step. Tracks that are not successfully associated with a measurement within their first three frames are deleted.
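The track bookkeeping described above can be sketched as a small state machine. This is a hedged illustration only: the class and function names are hypothetical, the value of A_MAX is an assumption (the text does not state it), and the creation of a track is assumed to count as its first successful association.

```python
from dataclasses import dataclass

A_MAX = 30   # assumed maximum age in frames; not given in the text
N_INIT = 3   # tentative period: the first three frames

@dataclass
class Track:
    track_id: int
    hits: int = 1                  # creation counts as the first association
    frames_since_update: int = 0   # the counter a_k
    state: str = "tentative"

    def predict(self):
        # The counter is incremented during Kalman filter prediction.
        self.frames_since_update += 1

    def mark_associated(self):
        # Reset the counter on a successful measurement association.
        self.hits += 1
        self.frames_since_update = 0
        if self.state == "tentative" and self.hits >= N_INIT:
            self.state = "confirmed"

    def mark_missed(self):
        # Tentative tracks are deleted on any miss; confirmed tracks
        # are deleted once they exceed the maximum age.
        if self.state == "tentative" or self.frames_since_update > A_MAX:
            self.state = "deleted"

def step(tracks, matched_ids):
    """Advance all tracks by one frame and drop the deleted ones."""
    for t in tracks:
        t.predict()
        if t.track_id in matched_ids:
            t.mark_associated()
        else:
            t.mark_missed()
    return [t for t in tracks if t.state != "deleted"]
```

Detections left unmatched after `step` would each start a new tentative `Track`; that part depends on the association step and is left out here.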
As the uncertainty grows during an occlusion, the probability mass spreads out in the state space and the observation likelihood becomes less peaked. Intuitively, the association metric should take this spread of the probability mass into account by increasing the measurement-to-track distance. Counter-intuitively, when two tracks compete for the same detection, the Mahalanobis distance, computed between the projection of the i-th track distribution into measurement space, denoted (y_i, S_i), and the j-th bounding box detection, denoted d_j, favors the track with the larger uncertainty, because a larger uncertainty effectively reduces the distance, measured in standard deviations, of any detection to the projected track mean. This behavior is undesirable because it can lead to increased track fragmentation and unstable tracks. Therefore, we use a matching cascade that gives priority to the most frequently seen objects in order to encode our notion of probability spread in the association likelihood.
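To make the counter-intuitive behavior concrete, the sketch below evaluates the standard squared Mahalanobis distance, (d_j − y_i)^T S_i^{-1} (d_j − y_i), for two competing tracks. The formula is the textbook definition and is assumed to be the form intended here; the numbers are invented for illustration. The occluded track is much farther from the detection in pixels, yet its large covariance makes it the closer one in Mahalanobis terms.

```python
import numpy as np

def squared_mahalanobis(d, y, S):
    """Squared Mahalanobis distance between detection d and track (y, S)."""
    diff = d - y
    return float(diff @ np.linalg.solve(S, diff))

# One detection and two competing tracks (illustrative values only).
d = np.array([10.0, 0.0])

# Recently updated track: close to the detection, small uncertainty.
y_recent, S_recent = np.array([8.0, 0.0]), 1.0 * np.eye(2)

# Long-occluded track: far from the detection, large uncertainty.
y_occluded, S_occluded = np.array([0.0, 0.0]), 100.0 * np.eye(2)

dist_recent = squared_mahalanobis(d, y_recent, S_recent)        # 2^2 / 1
dist_occluded = squared_mahalanobis(d, y_occluded, S_occluded)  # 10^2 / 100
```

Here `dist_occluded` is smaller than `dist_recent`, so a purely distance-based assignment would hand the detection to the occluded track. A cascade that matches recently seen tracks first avoids this.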
The human detector presented in this article is based on the "You Only Look Once (YOLO)" [36][37][38][39] method, which discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the shape of the object. The network also combines the predictions of several feature maps with different resolutions to handle objects of various sizes naturally.
The human detector is given as Algorithm 1. Our pre-trained human detector, YOLOv3 [36,40,41] (Table 1), is configured to return, for nearly every human in the image I_i, a bounding box that surrounds him or her. Consider a video V subdivided into n frames, where the subjects in frame I_i may also be present in frame I_{i+1}. To keep track of all these subjects, we use object tracking, which uses multiple detections to identify a specific object over time. To address this requirement, the model uses an easy and fast algorithm called SORT (Simple Online and Real-time Tracking) [42], which maintains identities for the objects in the image. Therefore, instead of the regular detections, which include the coordinates of the bounding box and a class prediction, we obtain tracked subjects. These also include an object ID, which is associated with the ROI, so that for each frame we know which hand belongs to which subject.
Each hand bounding-box area A_{F_i} is further extracted as an individual frame F_i, which is concatenated (time-wise, t) with those of the same subject from the primary image (I).
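The association between tracked IDs and hand regions can be sketched as below. This is an illustrative sketch under assumptions: the function names, the `(x1, y1, x2, y2)` box format, and the shape of the tracker output (a list of `(subject_id, hand_box)` pairs per frame) are hypothetical and stand in for whatever the detector and SORT actually return.

```python
from collections import defaultdict
import numpy as np

def crop(frame, box):
    """Cut the region box = (x1, y1, x2, y2) out of an H x W x C frame."""
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]

def collect_hand_sequences(frames, tracked_hands):
    """Group hand crops by subject ID, in temporal order.

    tracked_hands[i] is an assumed list of (subject_id, hand_box) pairs
    for frames[i].
    """
    sequences = defaultdict(list)
    for frame, detections in zip(frames, tracked_hands):
        for subject_id, box in detections:
            sequences[subject_id].append(crop(frame, box))
    return dict(sequences)

# Toy input: three blank frames, two subjects, subject 2 hidden in frame 1.
frames = [np.zeros((120, 160, 3), dtype=np.uint8) for _ in range(3)]
tracked_hands = [
    [(1, (10, 10, 42, 42)), (2, (50, 20, 82, 52))],
    [(1, (12, 10, 44, 42))],
    [(1, (14, 10, 46, 42)), (2, (52, 20, 84, 52))],
]
sequences = collect_hand_sequences(frames, tracked_hands)
```

Each value in `sequences` is the time-ordered list of hand frames for one subject, which is what lets a later alert be attributed to the person who made the gesture.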
For each submitted video, the individual per-subject videos obtained in this way are combined into a single input V_c for the next module.

Detector and Classifier
Over the past few years, CNN-based models have shown impressive results on gesture and action recognition tasks. 3D CNN architectures stand out in video analysis because they exploit the temporal relationships between frames. A new framework is described in the following section, whose goal is the detection and recognition of a specific hand gesture that will trigger the alarm.
Detector: Since we have no limitations regarding the size of the model, any architecture with excellent classification performance can be selected as the classifier. This leads us to use two recent 3D CNN architectures [43]: the previously described YOLOv3 as the detector, and a custom 3D CNN with a novelty, the introduction of "counterGesture", which is activated once a gesture is classified as belonging to our specific set. In the context of this paper, the detector is the component responsible for processing sequential frames (video) and activating the classifier if there is a potential gesture in the video. It is worth mentioning that the output of the human detector module described above is a collection of hand images that describe a gesture and are associated with each subject. The procedure is given in Algorithm 2.
Algorithm 2
For j = 1 to m, do
    For each "frame window" G_j, do
        Process a batch of hand images
        classifierIsActivated ← True
        ...
        If (max_1 − max_2) ≥ t_early, then
            isEarlyDetect ← "True"
            Return gesture (max_1)
        i ← i + 1
    j ← j + 1
Classifier: Within the proposed model, any classifier with good accuracy can be used. Table 2 lists the parameters of the classifier used in this article, which is a 3D convolutional neural network. We designed the model so that its number of parameters, P(3D), is greater than the number of parameters, P(2D), of a conventional 2D convolutional neural network. It should be mentioned that 3D CNNs require more training data to avoid overfitting. For this reason, we first trained our classifier on a well-known public dataset, "Jester" [24], which is the largest hand gesture dataset; the model was then fine-tuned on the nvGesture dataset, with direct consequences for accuracy and training time.
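The early-detection test that survives in Algorithm 2, (max_1 − max_2) ≥ t_early, can be sketched as follows. This is a hedged illustration: the function name and the assumption that max_1 and max_2 are the two largest class probabilities output by the classifier are ours, and the threshold value in the usage lines is arbitrary.

```python
import numpy as np

def early_detect(class_probs, t_early):
    """Return the top class index if it leads the runner-up by t_early.

    class_probs is assumed to hold the classifier's per-class scores
    for the current frame window. Returns None when the margin between
    the two best classes is too small for an early decision.
    """
    order = np.argsort(class_probs)[::-1]   # indices, best class first
    max1 = class_probs[order[0]]
    max2 = class_probs[order[1]]
    if max1 - max2 >= t_early:
        return int(order[0])
    return None

confident = early_detect(np.array([0.05, 0.85, 0.10]), t_early=0.5)
ambiguous = early_detect(np.array([0.40, 0.45, 0.15]), t_early=0.5)
```

In the first call the margin is 0.75, so class 1 is returned immediately; in the second the margin is only 0.05, so the decision is deferred to later windows.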
CounterGesture: The system comes with a novel module, the counterGesture, which is responsible for counting the occurrences of gestures that match the predefined set of gestures (Figure 5). As mentioned in the description of Figure 6, the counterGesture has two operators in its architecture: the first one sets the flag to 1, and listener 1 starts the timer; the second operator checks the elapsed time and the value held in the incrementor, and triggers the alert if the condition below is satisfied:
getTime < n and Counter == 3    (5)
where n represents a predefined maximum duration allowed to collect the three gestures.
Figure 5. A video is submitted to a module called the human detector, which extracts images containing a human. These frames are processed using sliding windows in which the detection queue is placed at the very beginning of the classifier assignment queue. If the detector recognizes an action/gesture, the classifier is activated via the Post Processing Service (PPS) [44] and, if it corresponds to one of the gestures contained in our specific set, the alert is triggered.
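The trigger condition in Equation (5) can be sketched as a small class. This is an assumption-laden illustration rather than the authors' module: the class layout, the gesture labels, passing timestamps explicitly, and resetting the counter when the time window expires or after an alert are all choices made for the example.

```python
class CounterGesture:
    """Counts matching gestures and fires when 3 occur within n seconds."""

    def __init__(self, alert_gestures, max_duration, required=3):
        self.alert_gestures = set(alert_gestures)
        self.max_duration = max_duration   # n in Equation (5)
        self.required = required           # 3 in Equation (5)
        self._reset()

    def _reset(self):
        self.flag = 0
        self.start_time = None
        self.counter = 0

    def observe(self, gesture, now):
        """Feed one classified gesture; return True if the alert fires."""
        if gesture not in self.alert_gestures:
            return False
        # Assumed behavior: an expired window starts over.
        if self.flag == 1 and now - self.start_time >= self.max_duration:
            self._reset()
        if self.flag == 0:
            # First operator: set the flag and start the timer.
            self.flag = 1
            self.start_time = now
        self.counter += 1
        # Second operator: getTime < n and Counter == 3.
        elapsed = now - self.start_time
        if elapsed < self.max_duration and self.counter == self.required:
            self._reset()
            return True
        return False

cg = CounterGesture({"help"}, max_duration=10.0)
results = [
    cg.observe("help", 0.0),
    cg.observe("wave", 1.0),   # not in the predefined set, ignored
    cg.observe("help", 2.0),
    cg.observe("help", 4.0),   # third match within 10 s
]
```

Only the last call returns True. Timestamps are passed in rather than read from a clock so that the logic can be exercised deterministically.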

Experiments
The overall system can be divided into three parts: first, the extraction of the ROI as an output of the human detection module; second, the image classification, which categorizes all of the hand frames into one possible gesture; and third, a match that will trigger the alert. During the training part, we trained these three components separately. Furthermore, the counterGesture counts only the number of correspondences between the predicted gesture and an element of the predefined set of gestures. The latter is shown in Figure 6.
The EgoGesture dataset is a new multimodal dataset for the egocentric recognition of hand gestures [45], and it was created not only for the detection of gestures in continuous data but also for the classification of segmented gestures. This dataset contains eighty-three classes of dynamic and static gestures collected from six outdoor and indoor scenes. We organized the training, validation, and test sets by splitting the subjects with a 3:1:1 ratio, which gave, respectively, 1239 training videos, 411 validation videos, and 431 test videos, with 14,416, 4768, and 4977 gesture samples. All models were first pre-trained on the Jester dataset [45]. For test set evaluations, we used both the training and the validation sets for training.
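A subject-wise 3:1:1 split of the kind described above can be sketched as follows. The function name, the random seed, and the use of ten subject IDs in the usage lines are illustrative assumptions; the point is only that whole subjects, not individual videos, are assigned to each partition.

```python
import random

def split_by_subject(subject_ids, ratios=(3, 1, 1), seed=0):
    """Partition subjects into train/validation/test sets by ratio."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)   # reproducible shuffle
    total = sum(ratios)
    n_train = len(subjects) * ratios[0] // total
    n_val = len(subjects) * ratios[1] // total
    train = set(subjects[:n_train])
    val = set(subjects[n_train:n_train + n_val])
    test = set(subjects[n_train + n_val:])  # remainder goes to test
    return train, val, test

# Hypothetical example with ten subject IDs.
train, val, test = split_by_subject(range(10))
```

Every video is then placed in the partition of the subject who recorded it, so that no person appears in both training and test data.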
To perform our experiments, we used the hardware described in Table 3. We were able to extract the ROI (Figures 7 and 8) and classify the gesture (Figure 7). In addition, Figure 9 shows the accuracy of our classifier after 350 epochs.

Results and Discussion
During the experiment, we first studied the performance of various versions of the VGG-16 neural network and of a custom 3D CNN architecture on the classification task. In addition, we examined how the number of input frames submitted to our gesture classifier affects performance. Figure 9 provides an overview of how well our classifier performs, and the results in Table 4 show that we achieve better performance by increasing the input size for all modalities. This depends strongly on the characteristics of the datasets used, especially the average duration of the gestures.
Table 4. Improvement of the custom 3D CNN by increasing the input size and comparison to state-of-the-art techniques applied on the EgoGesture dataset [45].
Considering the video used in our experiment, let us take a close look at the portion of the video where the sexual assault happens. Table 5 shows how Algorithm 1 (human detector) is applied to the video to detect a human, and it also indicates when Algorithm 2 (hand gesture) detects the gesture. The hand gesture module activates the counterGesture every time a specific gesture is identified. However, in the case of a total occlusion of the victim's hands, the system cannot detect the gesture and hence cannot trigger the alert. As future work, we will explore the possibility of adding other factors to the decision making. Also, as shown in Table 5, the alarm is triggered once counterGesture = 3.
Table 5. Step-by-step application of our algorithms on a sexual assault video.
When processing images from video surveillance, material quality (poor image quality, low light level, blur, pixelation of small objects) can be an obstacle; to address it, we propose enhancing the images (Figure 10) before submitting them to the human detector module (Figure 5). In addition, the RGB-D (depth sensor) frames are examined for different input sizes. Over the course of the experiment, the depth modality proved essential for increasing performance compared with a simple RGB input. Indeed, the depth component allowed the movement to be filtered from the background and helped to focus on the movement of the hand, resulting in more discriminative features with the depth modality.

Conclusions
To address sexual assault, this paper proposes a new hierarchical architecture with three models for hand gesture alert systems. The proposed architecture enables efficient resource utilization and early detection for essential hand gesture alert applications. We obtained comparable results on both datasets when we evaluated our proposed model.
We defined a set of hand gestures that were identified by our classifier, and we introduced a module called "counterGesture". The latter allowed us to count the number of occurrences of a predefined gesture and trigger the alert. We also found that the training time was far too long on the Jester dataset at a learning rate varying between 0.0001 and 0.001. In future work, we plan to incorporate facial expression into the alert decision and to investigate the impact of ResNeXt [48][49][50][51] or a Faster Region-based CNN model used as both the detector and the classifier [52,53]. Further study should consider the combination of lighter deep neural networks (DNNs) to maintain accuracy and improve speed.