A Deep Learning Framework for Recognizing Both Static and Dynamic Gestures

Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures, using simple RGB vision (without depth sensing). This feature makes it suitable for inexpensive human-robot interaction in social or industrial settings. We employ a pose-driven spatial attention strategy, which guides our proposed Static and Dynamic gestures Network (StaDNet). From the image of the human upper body, we estimate his/her depth, along with the region-of-interest around his/her hands. The Convolutional Neural Network (CNN) in StaDNet is fine-tuned on a background-substituted hand gestures dataset. It is used to detect 10 static gestures for each hand as well as to obtain the hand image-embeddings. These are subsequently fused with the augmented pose vector and then passed to the stacked Long Short-Term Memory blocks. Thus, human-centred frame-wise information from the augmented pose vector and from the left/right hand image-embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments, we show that the proposed approach surpasses the state-of-the-art results on the large-scale Chalearn 2016 dataset. Moreover, we transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the obtained results also outscore the state-of-the-art on this dataset.


Introduction
The modern manufacturing industry requires human-centered smart frameworks, which focus on human abilities rather than requiring humans to adapt to the technology at hand. In this context, gesture-driven user interfaces exploit humans' prior knowledge and are vital for intuitive interaction of humans with smart devices [1]. Gesture recognition has been widely studied for developing human-computer/machine interfaces that offer an alternative to traditional input devices (e.g., mouse, keyboard, teach pendants and touch interfaces). Its applications include robot control [2][3][4], health monitoring systems [5], interactive games [6] and sign language recognition [7].
The aim of our work is to develop a robust, vision-based gestures recognition strategy suitable for human-robot/computer interaction tasks in social or industrial settings. Industrial applications where human safety is critical, often require specialized sensors compatible with safety standards such as ISO/TS 15066. Yet, scenarios which require human sensing in industry or in social settings are broad. Monocular cameras offer benefits which specialized or multi-modal sensors do not have, such as being lightweight, inexpensive, platform independent and easy to integrate. This is desirable for robotic assistants in commercial businesses such as restaurants, hotels, or clinics. We therefore, pro-

Related Work
Gesture detection techniques can be mainly divided into two categories: wearable strategies and non-wearable methods. The wearable strategies include electronic/glove-based systems [13,14] and marker-based vision methods [15]. However, these are often expensive, counter-intuitive and limit the operator's dexterity in his/her routine tasks. Conversely, non-wearable strategies, such as pure vision-based methods, do not require structuring the environment and/or the operator, and they make interaction with robots/machines easy. Moreover, consumer vision sensors have rich output, are portable and low cost, even when depth is also measured by the sensor, as with Microsoft Kinect or Intel Realsense cameras. Therefore, in this research we opt for a pure vision-based method and review only the works on vision-based gesture detection.
Traditional activity recognition approaches aggregate local spatio-temporal information via hand-crafted features. These visual representations include the Harris3D detector [16], the Cuboid detector [17], dense sampling of video blocks [18], dense trajectories [19] and improved trajectories [20]. Visual representations obtained through optical flow, for example, Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF) and Motion Boundary Histograms (MBH) also achieved excellent results for video classification on a variety of datasets [18,21]. In these approaches, global descriptors of the videos are obtained by encoding the hand-crafted features using Bag of Words (BoW) and Fisher vector encodings [22]. Subsequently, the descriptors are assigned to one or several nearest elements in a vocabulary [23], while the classification is typically performed through Support Vector Machines (SVMs). In [24], the authors segmented the human silhouettes from the depth videos using Otsu's method of global image thresholding [25]. They extracted a single Extended-Motion History Image (Extended-MHI) as a global representation for each gesture. Subsequently, the maximum correlation coefficient was utilized to recognize gestures in a One-Shot learning setting. Other works that utilized One-Shot Learning for gesture recognition include [26][27][28]. Lately, the tremendous success of deep neural networks on image classification tasks [29,30] instigated their application in the activity recognition domain. The literature on the approaches that exploit deep neural networks for gesture/activity recognition is already enormous. Here, we focus on notable related works that have inspired our proposed strategy.

3D Convolutional Neural Networks
Among the pioneering works in this category, [31] adapted Convolutional Neural Networks (CNNs) to 3D volumes (3D-CNNs), obtained by stacking video frames, to learn spatio-temporal features for action recognition. In [32], Baccouche et al. proposed an approach for learning the evolution of temporal information through a combination of 3D-CNNs and Long Short-Term Memory (LSTM) recurrent neural networks [33]. The short video clips of approximately 9 successive frames were first passed through a 3D-CNN feature extractor, while the extracted features were subsequently fed to the LSTM network. However, Karpathy et al. in [34] found that the stacked-frames architecture performed similarly to the one with single-image input.
A Few-Shot temporal activity detection strategy is proposed in [35], which utilized 3D-CNNs for features extraction from the untrimmed input video as well as from the few-shot examples. A two-stage proposal network was applied on top of the extracted features while the refined proposals were compared using cosine similarity functions.
To handle resulting high-dimensional video representations, the authors of [36] proposed the use of random projection-based ensemble learning in deep networks for video classification. They also proposed rectified linear encoding (RLE) method to deal with redundancy in the initial results of the classifiers. The output from RLE is then fused by a fully-connected layer that produced the final classification results.

Multi-Modal Multi-Scale Strategies
The authors of [7] presented a multi-modal multi-scale detection strategy for dynamic poses of varying temporal scales as an extension to their previous work [37]. They utilized the RGB and depth modalities, as well as the articulated pose information obtained through the depth map. The authors proposed a complex learning method which included pretraining of individual classifiers on separate channels and iterative fusion of all modalities on shared hidden and output layers. This approach involved recognizing 20 categories from Italian conversational gestures, performed by different people and recorded with an RGB-D sensor. This strategy was similar in function to [34] except that it included depth images and pose as additional modalities. However, it lacked a dedicated mechanism to learn the evolution of temporal information and may fail when understanding long-term dependencies of the gestures is required.
In [38], authors proposed a multi-modal large-scale gesture recognition scheme on the Chalearn 2016 Looking at People Isolated Gestures recognition dataset [39]. In [40], ResC3D network was exploited for feature extraction, and late fusion combined features from multi-modal inputs in terms of canonical correlation analysis. The authors used linear Support Vector Machine (SVM) to classify final gestures. They proposed a key frame attention mechanism, which relied on movement intensity in the form of optical flow, as an indicator for frame selection.

Multi-Stream Optical Flow-Based Methods
The authors of [41] proposed an optical flow-based method exploiting convolutional neural networks for activity recognition along the same lines as [34]. They presented the idea of decoupling spatial and temporal networks. The proposed architecture in [41] is related to the two-stream hypothesis of the human visual cortex [42]. The spatial stream in their work operated on individual video frames, while the input to the temporal stream was formed by stacking optical flow displacement fields between multiple consecutive frames.
The authors of [43] presented improved results in action recognition, by employing a trajectory-pooled two-stream CNN inspired by [41]. They exploited the concept of improved trajectories as low level trajectory extractor. This allowed characterization of the background motion in two consecutive frames through the estimation of the homography matrix taking camera motion into account. Optical flow-based methods (e.g., the key frame attention mechanism proposed in [38]) may help emphasizing frames with motion, but are unable to differentiate motion caused by irrelevant objects in the background.

CNN-LSTM and Convolutional-LSTM Networks
The work in [44] proposed aggregation of frame-level CNN activations through (1) Feature-pooling method and (2) LSTM network for longer sequences. The authors argued that the predictions on individual frames of video sequences or on shorter clips as performed in [34], might only contain local information of the video description, while it could also confuse classes if there are fine-grained distinctions.
The authors of [45] proposed a Long-term Recurrent Convolutional Network (LRCN) for multiple situations including sequential input and static output for cases like activity recognition. The visual features from RGB images were extracted through a deep CNN, which were then fed into stacked LSTM in distinctive configurations corresponding to the task at hand. The parameters were learned in an "end-to-end" fashion, such that the visual features relevant to the sequential classification problem were extracted.
The authors in [46] proposed a method to process sequential images through Convolutional-LSTM (ConvLSTM), which is a variant of LSTM containing a convolution operation inside the LSTM cell. In [47], the authors studied redundancy and attention in ConvLSTM by deriving several of its variants for gesture recognition. They proposed Gated-ConvLSTM by removing spatial convolutional structures in the gates, as they scarcely contributed to the spatio-temporal feature fusion in their study. The authors evaluated results on the Chalearn 2016 dataset and found that the Gated-ConvLSTM achieved a reduction in parameter size and in computational cost. However, it did not considerably improve detection accuracy.

Multi-Label Video Classification
The authors of [48] presented a multi-label action recognition scheme. It was based on a Multi-LSTM network which handled multiple inputs and outputs. The authors fine-tuned VGG-16, pre-trained on ImageNet [49], on the Multi-THUMOS dataset at the individual frame level. Multi-THUMOS is an extension of the THUMOS dataset [50]. A fixed-length window of 4096-dimensional "fc7" features of the fine-tuned VGG-16 was passed as input to the LSTM through an attention mechanism that weighted the contribution of individual frames in the window.

Attention-Based Strategies
The application of convolutional operations on entire input images tends to be computationally expensive. In [12], Rensink discussed the idea of visual representation, which implied that the humans do not form detailed depiction of all objects in a scene. Instead, their perception focuses selectively on the objects needed immediately. This was supported by the concept of visual attention applied for deep learning methods as in [51].
Baradel et al. [52] proposed a spatio-temporal attention mechanism conditioned on human pose. The proposed spatial-attention mechanism was inspired by the work of Mnih et al. [51] on glimpse sensors. A spatial attention distribution was learned conjointly through the hidden state of the LSTM network and through the learned pose feature representations. Later, Baradel et al. extended their work in [53] and proposed that the spatial attention distribution can be learned only through an augmented pose vector, which was defined by the concatenation of the current pose, velocity and acceleration of each joint over time.
The authors of [54] proposed a three streams attention network for activity detection. These were statistic-based, learning-based and global-pooling attention streams. Shared ResNet was used to extract spatial features from image sequences. They also proposed a global attention regularization scheme to enable the employed recurrent networks to learn dynamics based on global information.
Lately, the authors of [55] presented the state-of-the-art results on the Chalearn 2016 dataset. They proposed a novel multi-channel architecture, namely FOANet, built upon a spatial focus of attention (FOA) concept. They cropped the regions of interest occupied by the hands in the RGB and depth images, through the region proposal network and Faster R-CNN method. The architecture comprised 12 channels in total: 1 global (full-sized image) channel and 2 focused (left and right hand crops) channels for each of the 4 modalities (RGB, depth, and the optical flow fields extracted from the RGB and from the depth images). The softmax scores of each modality were fused through a sparse fusion network.

Datasets
For dynamic gestures classification, we use the Chalearn 2016 Isolated Gestures dataset [39], referred to simply as Chalearn 2016 in the rest of the paper. It is a large-scale dataset which contains Kinect V1 color and depth recordings in 320 × 240 resolution of 249 dynamic gestures recorded with the help of 21 volunteers. The gestures vocabulary in Chalearn 2016 is mainly drawn from nine groups corresponding to different application domains: body language gestures, gesticulations, illustrators, emblems, sign language, semaphores, pantomimes, activities and dance postures. The dataset has 47,930 videos, with each video (color + depth) representing one gesture. It has to be noted that Chalearn 2016 does not take into account any specific industry requirements, and that Kinect V1 is obsolete. However, we intend to target a broader human-robot interaction domain which includes the fast-growing field of socially assistive as well as household robotics. This requires robots to have the capacity to capture, process and understand human requests in a robust, natural and fluent manner. Considering that the Chalearn 2016 dataset offers a challenging set of gestures taken from a comprehensive gestures vocabulary with inter-class similarities and intra-class differences, we considered it suitable for training and benchmarking the results of our strategy.
To demonstrate the utility of our approach on a different gesture dataset, we evaluate the performance of our model on the Praxis gesture dataset [56] as well. This dataset is designed to diagnose apraxia in humans, which is a motor disorder caused by brain damage. This dataset contains RGB (960 × 540 resolution) and depth (512 × 424 resolution) images of 60 subjects plus 4 clinicians, recorded with Kinect V2. In total, 29 gestures were performed by the volunteers (15 static and 14 dynamic gestures). In our work, only the dynamic gestures (that is, 14 classes) are considered, while their pathological aspect is not taken into account; that is, only gestures labeled "correct" are selected. Thus, the total number of considered videos in this dataset is 1247, with the mean length of all samples equal to 54 frames. StaDNet is trained exclusively on color images of these datasets for dynamic gestures detection.

Our Strategy
In this work, we develop a novel unified strategy to model human-centered spatio-temporal dependencies for the recognition of static as well as dynamic gestures. Our Spatial Attention Module localizes and crops hand images of the person, which are subsequently passed as inputs to StaDNet, unlike previous methods that take entire images as input, for example, [44,45]. Contrary to [48], where a pre-trained state-of-the-art network is fine-tuned on entire image frames of gesture datasets, we fine-tune Inception V3 on a background-substituted hand gestures dataset and use it as our CNN block. Thus, our CNN has learned to concentrate on image pixels occupied exclusively by hands. This enables it to accurately distinguish subtle hand movements. We have fine-tuned Inception V3 with a softmax layer to classify 10 ASL static hand gestures, while the features from the last fully connected (FC) layer of the network are extracted as image-embeddings of size 1024 elements. These are used as input to the dynamic gestures detector in conjunction with the augmented pose vector, which we explain in Sections 5.1.2 and 6.1. Moreover, in contrast to the previous strategies for dynamic gestures recognition/video analysis [7,52,53], which employed 3D human skeletons (and the corresponding sensor modalities) to learn large-scale body motion, we only utilize a 2D upper-body skeleton as an additional modality to our algorithm. However, scale information about the subjects is lost in monocular images. To address this, we also propose learning-based depth estimators, which determine the approximate depth of the person from the camera and the region-of-interest around his/her hands from upper-body 2D skeleton coordinates only.
In a nutshell, StaDNet only exploits the RGB hand images and an augmented pose vector obtained from 8 upper-body 2D skeleton coordinates, unlike other existing approaches like [55], which include full-frame images in addition to hand images, depth frames and even optical flow frames altogether.
To reiterate, our method does not require depth sensing. We only utilized the (raw) depth map from Kinect V2 offline, to obtain ground truth depth values of a given 2D skeleton for our learning-based depth estimators. These values can be obtained from any state-of-the-art depth sensor. Once the depth estimators are trained, our method only requires the RGB modality to process images and detect gestures on-line. We employ OpenPose [57], which is an efficient discriminative 2D pose extractor, to extract the human skeleton and human hands' keypoints in images. OpenPose also works exclusively on RGB images. Thus, our method can be deployed on a system with any RGB camera, be it a webcam or an industrial color (or RGB-D) camera. Nevertheless, we only tested OpenPose in laboratory or indoor domestic environments and not in a real industrial setting. Yet, since our framework is not restricted to the use of OpenPose, we could integrate another pose extractor system, better suited for the target application scenario.

Spatial Attention Module
Our spatial attention module is divided into two parts: the Pose Pre-processing Module and the Focus on Hands Module (see Figure 1). We detail these modules in the following.

Pose Pre-Processing Module
We first resize the dataset videos to 1080 × C pixels, where C is the value of resized image columns obtained with respect to new row value, that is, 1080, while maintaining the aspect ratio of the original image (1440 in our work). The necessity to resize the input videos will be explained in Section 5.1.3. After having resized the videos, we feed them to OpenPose, one at a time, and the output skeleton joint and hand keypoint coordinates are saved for offline pre-processing. The pose pre-processing is composed of three parts, detailed hereby: skeleton filter, skeleton position and scale normalization and skeleton depth estimation.
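The resizing rule above amounts to a single scale factor. The following minimal sketch, with a hypothetical helper name `resized_shape`, reproduces the arithmetic for the 320 × 240 Chalearn 2016 frames; it is an illustration only, not the authors' code.

```python
def resized_shape(orig_rows, orig_cols, target_rows=1080):
    """Return (rows, cols) after scaling to target_rows while keeping the aspect ratio."""
    scale = target_rows / orig_rows
    return target_rows, round(orig_cols * scale)


# Chalearn 2016 frames are 320 x 240 (columns x rows), so C = 1440.
print(resized_shape(240, 320))  # (1080, 1440)
```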

Skeleton Filter
For each image, OpenPose extracts N skeleton joint coordinates, depending on the selected body model, but it does not track the pose between images. The occasional jitter in the skeleton output and missing joint coordinates between successive frames may hinder gesture learning. Thus, we develop a two-step pose filter that rectifies the occasional disappearance of joint coordinates and smooths the OpenPose output. The filter operates on a window of K consecutive images (K is an adjustable odd number, 7 in this work), while the filtered skeleton is obtained in the center frame. We denote by $p^i_k = (x^i, y^i)$ the image coordinates of the i-th joint in the skeleton output by OpenPose at the k-th image within the window. If OpenPose does not detect joint i on image k, then $p^i_k = \emptyset$.
In a first step, we replace the coordinates of the missing joints. Only $\bar{r}$ (we use $\bar{r} = 7$) consecutive replacements are allowed for each joint i, and we monitor this via a coordinate replacement counter, denoted $r_i$. The procedure is driven by the following two equations:
$$p^i_K = \emptyset \;\wedge\; p^i_k \neq \emptyset \;\; \forall k = 1, \dots, K-1 \;\wedge\; r_i \leq \bar{r} \;\Rightarrow\; p^i_K \leftarrow p^i_{K-1} \;\wedge\; r_i \leftarrow r_i + 1 \tag{1}$$
$$p^i_K = \emptyset \;\wedge\; r_i > \bar{r} \;\Rightarrow\; p^i_k \leftarrow \emptyset \;\; \forall k = 1, \dots, K \;\wedge\; r_i \leftarrow 0 \tag{2}$$
Equation (1) states that the i-th joint at the latest (current) image K is replaced by the same joint at the previous image K − 1 under three conditions: if it is not detected, if it has been detected in all previous images, and if it has not already been replaced more than $\bar{r}$ consecutive times. If any of the conditions is false, we do not replace the coordinates and we reset the replacement counter for the considered joint: $r_i = 0$. Similarly, (2) states that the i-th joint coordinates over the window should not be taken into account (that is, the joint will be considered missing) if it is not detected in the current image K and if it has already been replaced more than $\bar{r}$ consecutive times (we allow only $\bar{r}$ consecutive replacements driven by (1)). This also resets the replacement counter value for the considered joint.
Moreover, the i-th joint in all of the window's K − 1 images is set to its position in the current image K, if it has never been detected in the window up to the current image.
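The first filter step can be sketched for a single joint as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `replace_missing_joint`, the use of `None` for an undetected joint, and the order in which the rules are tested are all choices made here.

```python
def replace_missing_joint(window, r_i, r_max=7):
    """Apply the replacement rules to one joint over a window of K frames.

    window: list of K entries, each an (x, y) tuple or None (not detected);
            the last entry is the current frame K.
    r_i:    number of consecutive replacements made so far for this joint.
    Returns the updated (window, r_i).
    """
    window = list(window)
    current, previous = window[-1], window[:-1]

    if current is not None:
        # Joint detected in the current frame: nothing to replace.
        if all(p is None for p in previous):
            # Never detected before in this window: back-fill with the current position.
            window = [current] * len(window)
        return window, 0

    if all(p is not None for p in previous) and r_i <= r_max:
        # Rule of Equation (1): copy the previous position and count the replacement.
        window[-1] = window[-2]
        return window, r_i + 1

    if r_i > r_max:
        # Rule of Equation (2): the joint is considered missing over the whole window.
        return [None] * len(window), 0

    # Conditions of (1) not met: leave the window as-is and reset the counter.
    return window, 0


window, r = replace_missing_joint([(10, 20)] * 6 + [None], r_i=0)
print(window[-1], r)  # (10, 20) 1
```

In a streaming setting the function would be called once per new frame and per joint, with the window sliding by one image each time.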
In the second step, we apply Gaussian smoothing to each $p^i$ over the window of K images. Applying this filter removes jitter from the skeleton pose and smooths out the joint movements in the image at the center of the filter window. Figure 2 shows the output of our skeleton filter for one window of images. Observe that, thanks to Equation (1), our filter has added the right wrist coordinates (shown only in the central image). These are obtained from the K-th frame, while they were missing in all raw skeletons from frame 1 to K − 1.

Skeleton Position and Scale Normalization
Figure 1 includes a simple illustration of our goal for skeleton position and scale normalization. We focus on the 8 upper-body joints shown in Figure 3: $p^0, \dots, p^7$, with $p^0$ corresponding to the Neck joint, which we consider as the root node. Position normalization consists in eliminating the influence of the user's position in the image, by subtracting the Neck joint coordinates from those of the other joints. Scale normalization consists in eliminating the influence of the user's depth. We do this by dividing the position-shifted joint coordinates by the neck depth $d_n$ on each image, so that all joints are replaced according to:
$$p^i \leftarrow \frac{p^i - p^0}{d_n}, \quad i = 0, \dots, 7 \tag{3}$$
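The position and scale normalization described in this section (subtract the Neck coordinates, then divide by the neck depth) can be sketched as follows. The function name `normalize_skeleton` and the convention that the Neck is stored at index 0 are assumptions made for the illustration.

```python
def normalize_skeleton(joints, neck_depth):
    """Shift joints so that the Neck (index 0) is the origin, then divide by the neck depth."""
    neck_x, neck_y = joints[0]
    return [((x - neck_x) / neck_depth, (y - neck_y) / neck_depth) for x, y in joints]


# Two joints only, for brevity: the Neck and one other joint.
print(normalize_skeleton([(100.0, 50.0), (120.0, 90.0)], neck_depth=2.0))
# [(0.0, 0.0), (10.0, 20.0)]
```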
Since our framework must work without requiring a depth sensor, we have developed a skeleton depth estimator to derive an estimate of the neck depth, $\hat{d}_n$, and use it instead of $d_n$ in (3). This estimator is a neural network, which maps a 97-dimensional pose vector, derived from the 8 upper-body joint positions, to the depth of the Neck joint. We explain it in the following.
Figure 3. In the left image, we show 8 upper-body joint coordinates (red), vectors connecting these joints (black) and angles between these vectors (green). From all upper-body joints, we compute a line of best fit (blue). In the right image, we show all the vectors (purple) between unique pairs of upper-body joints. We also compute the angles (not shown) between the vectors and the line of best fit. From 8 upper-body joints, we obtain 97 components of the augmented pose vector.

Skeleton Depth Estimation
Inspired by [7], which demonstrated that augmenting pose coordinates may improve the performance of gesture classifiers, we develop a 97-dimensional augmented pose vector $x_n$ (subscript n means Neck here) from 8 upper-body joint coordinates. From the joint coordinates, we obtain a line of best fit via least squares. In addition to 7 vectors from anatomically connected joints, 21 vectors between unique pairs of all upper-body coordinates are also obtained. The lengths of individual augmented vectors are also included in $x_n$. We also include the 6 angles formed by all triplets of anatomically connected joints, and the 28 angles between the 28 (anatomically connected plus augmented) vectors and the line of best fit. The resultant 97-dimensional augmented pose vector concatenates: 42 elements from abscissas and ordinates of the augmented vectors, their 21 estimated lengths and 34 relevant angles.
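The bookkeeping behind the 97 components (42 + 21 + 6 + 28) can be checked with a short sketch. The joint ordering, the list of anatomical connections `BONES` and the six angle triplets `TRIPLETS` below are hypothetical, since the text does not list them; only the counts are taken from the description above.

```python
import math
from itertools import combinations

# Hypothetical joint order: 0 Neck, 1 Nose, 2-4 right shoulder/elbow/wrist,
# 5-7 left shoulder/elbow/wrist. Connectivity and triplets are assumptions.
BONES = [(0, 1), (0, 2), (2, 3), (3, 4), (0, 5), (5, 6), (6, 7)]
TRIPLETS = [(1, 0, 2), (1, 0, 5), (0, 2, 3), (2, 3, 4), (0, 5, 6), (5, 6, 7)]


def angle_between(u, v):
    """Unsigned angle in radians between 2D vectors u and v (0 if either is null)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    cos = (u[0] * v[0] + u[1] * v[1]) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))


def best_fit_direction(joints):
    """Direction vector of the least-squares line y = a*x + b through the joints."""
    n = len(joints)
    mx = sum(x for x, _ in joints) / n
    my = sum(y for _, y in joints) / n
    sxx = sum((x - mx) ** 2 for x, _ in joints)
    sxy = sum((x - mx) * (y - my) for x, y in joints)
    return (0.0, 1.0) if sxx == 0 else (1.0, sxy / sxx)


def augmented_pose(joints):
    """Build a 97-component vector from 8 (x, y) joints."""
    def vec(a, b):
        return (joints[b][0] - joints[a][0], joints[b][1] - joints[a][1])

    extra = [p for p in combinations(range(8), 2) if p not in set(BONES)]  # 21 pairs
    line = best_fit_direction(joints)
    out = []
    for a, b in extra:
        out.extend(vec(a, b))                                   # 42 vector components
    out += [math.hypot(*vec(a, b)) for a, b in extra]           # 21 lengths
    out += [angle_between(vec(b, a), vec(b, c)) for a, b, c in TRIPLETS]  # 6 angles
    out += [angle_between(vec(a, b), line) for a, b in BONES + extra]     # 28 angles
    return out


pose = [(0, 0), (0, -1), (-1, 0), (-1, 1), (-1, 2), (1, 0), (1, 1), (1, 2)]
print(len(augmented_pose(pose)))  # 97
```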
To obtain the ground-truth depth of the Neck joint, denoted $d_n$, we utilize the OpenSign dataset. OpenSign is recorded with Kinect V2, which outputs the RGB and the registered depth images with resolution 1080 × 1920. We apply our augmented pose extractor to all images in the dataset and, for each image, we associate $x_n$ with the corresponding Neck depth. A 9-layer neural network $f_n$ is then designed to optimize parameters $\theta_n$, given the augmented pose vector $x_n$ and ground-truth $d_n$, to regress the approximate distance value $\hat{d}_n$ with a mean squared error of $8.34 \times 10^{-4}$. Formally:
$$\hat{d}_n = f_n(x_n; \theta_n) \tag{4}$$
It is to be noted that the estimated depth $\hat{d}_n$ is a relative value and not in metric units, and that the resolution of ground truth images in OpenSign is 1080 × 1920. For scale normalization (as explained in Section 5.1.2), we utilize the estimated depth $\hat{d}_n$. Thus, the input images from the Chalearn 2016 dataset are resized such that the row count of the images is set to 1080. This is required as we need to re-scale the predicted depth to the original representation of the depth map in OpenSign (or to that of Kinect V2). Yet, the StaDNet input image size can be adapted to the user's needs if the depth estimators are not employed.

Focus on Hands Module
This module focuses on hands in two steps: first, by localizing them in the scene, and then by determining the size of their bounding boxes, in order to crop hand images.

Hand Localization
One way to localize hands in an image is to exploit the Kinect SDK or middleware like OpenNI (or its derivatives). These libraries, however, do not provide accurate hand-sensing and are deprecated as well. Another way of localizing hands in an image is via detectors, possibly trained on hand images as in [58]. Yet, such strategies struggle to distinguish left and right hands, since they operate locally and thus lack contextual information. To keep the framework generic, we decided not to employ specific hand sensing functionalities from Kinect (be it V1 or V2) or other more modern sensing devices. Instead, we localize the hand via the hand key-points obtained from OpenPose. This works well for any RGB camera and therefore does not require a specific platform (e.g., Kinect) for hand sensing.
OpenPose outputs 42 (21 per hand) hand key-points on each image. We observed that these key-points are more susceptible to jitter and misdetections than the skeleton key-points, particularly on the low resolution videos of the Chalearn 2016 dataset. Therefore, we apply the same filter of Equations (1) and (2) to the raw hand key-points output by OpenPose. Then, we estimate the mean of all $N_j$ detected hand key-point coordinates $p_j$, to obtain the hand center in the image:
$$\frac{1}{N_j} \sum_{j=1}^{N_j} p_j \tag{5}$$
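The hand center is simply the average of the key-points that were actually detected. A minimal sketch, assuming that undetected key-points are represented by `None` (a convention chosen here, not stated in the text):

```python
def hand_center(keypoints):
    """Mean (x, y) of the detected key-points; None entries are missed detections."""
    detected = [p for p in keypoints if p is not None]
    if not detected:
        return None  # no key-point detected for this hand
    n = len(detected)
    return (sum(x for x, _ in detected) / n, sum(y for _, y in detected) / n)


print(hand_center([(0, 0), (2, 4), None]))  # (1.0, 2.0)
```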

Hand Bounding-Box Estimation
Once the hands are located in the image, the surrounding image patches must be cropped for gesture recognition. Since at run-time our gestures recognition system relies only on the RGB images (without depth), we develop two additional neural networks, $f_l$ and $f_r$, to estimate each hand's bounding box size. These networks are analogous to the one described in Section 5.1.2. Following the scale-normalization approach, for each hand we build a 54-dimensional augmented pose vector from 6 key-points. These augmented pose vectors ($x_l$ and $x_r$) are mapped to the ground-truth hand depth values ($d_l$ and $d_r$) obtained from the OpenSign dataset, through two independent neural networks:
$$\hat{d}_l = f_l(x_l; \theta_l) \tag{6}$$
$$\hat{d}_r = f_r(x_r; \theta_r) \tag{7}$$
In (6) and (7), $f_l$ and $f_r$ are 9-layer neural networks that optimize parameters $\theta_l$ and $\theta_r$, given augmented poses $x_l$ and $x_r$ and ground-truth depths $d_l$ and $d_r$, to estimate depths $\hat{d}_l$ and $\hat{d}_r$. The mean squared errors for $f_l$ and $f_r$ are $4.50 \times 10^{-4}$ and $6.83 \times 10^{-4}$, respectively. The size of each bounding box is inversely proportional to the corresponding depth ($\hat{d}_l$ or $\hat{d}_r$) obtained by applying (6) and (7) to the pure RGB images. The orientation of each bounding box is estimated from the inclination between the corresponding forearm and the horizon. The final outputs are the cropped images of the hands, $i_l$ and $i_r$. Since our depth estimators $f_n$, $f_l$ and $f_r$ have now been trained, we do not require explicit depth sensing either to normalize the skeleton or to estimate the hand bounding boxes.
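As a rough illustration of how a bounding box could be derived from an estimated depth and the forearm direction, consider the sketch below. The proportionality constant `k`, the square box shape and the function name `hand_box` are hypothetical; the text only states that the size is inversely proportional to the estimated depth and that the orientation follows the forearm inclination.

```python
import math


def hand_box(center, elbow, wrist, depth, k=100.0):
    """Return a square box description for one hand.

    side:      inversely proportional to the estimated hand depth (k is a made-up constant).
    angle_deg: inclination of the forearm (elbow -> wrist) with respect to the image
               horizontal; the image y axis points downwards.
    """
    side = k / depth
    angle = math.degrees(math.atan2(wrist[1] - elbow[1], wrist[0] - elbow[0]))
    return {"center": center, "side": side, "angle_deg": angle}


box = hand_box(center=(320.0, 240.0), elbow=(300.0, 300.0), wrist=(320.0, 280.0), depth=2.0)
print(box["side"])  # 50.0
```

Halving the estimated depth doubles the side of the crop, which keeps the hand at a roughly constant size in the cropped image.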

Video Data Processing
Our proposed spatial attention module conceptually allows end-to-end training of the gestures. However, we train our network in multiple stages to speed up the training process (the details of which are given in Section 8). Yet, this requires the videos to be processed step-by-step beforehand. This is done in four steps, that is, (1) 2D pose estimation, (2) feature extraction, (3) label-wise sorting and zero-padding and (4) train-ready data formulation. While prior 2D pose estimation may be considered a compulsory step (even if the network is trained in an end-to-end fashion), the other steps can be integrated into the training algorithm.

Dynamic Features: Joints Velocities and Accelerations
As described in Section 5, our features of interest for gestures recognition are the skeleton and hand images. The concept of augmented pose for scale normalization has been detailed in Section 5.1.2. For dynamic gestures recognition, velocity and acceleration vectors from 8 upper-body joints, containing information about the dynamics of motion, are also appended to the pose vector $x_n$ to form a new 129-component augmented pose $x_{dyn}$. Inspired by [7], joint velocities and accelerations are computed at each image k as first and second derivatives of the scale-normalized joint coordinates (Equations (8) and (9)). The velocities and accelerations obtained from (8) and (9) are scaled by the video frame-rate to make values time-consistent, before appending them to the augmented pose vector $x_{dyn}$. For every frame output by the skeleton filter of Section 5.1.1, the scale-normalized augmented pose vectors $x_{dyn}$ (as explained in Section 5.1.2) plus the left $i_l$ and right $i_r$ cropped hand images (extracted as explained in Section 5.2) are appended to three individual arrays.
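The sketch below shows one way to obtain velocities and accelerations and to check that the resulting vector has 129 components (97 + 16 + 16). The backward finite differences and the zero values on the first frame are assumptions made here; the exact difference scheme of Equations (8) and (9) may differ.

```python
def dynamics(poses, fps):
    """Frame-wise velocities and accelerations of flattened joint coordinates.

    poses: list of per-frame lists with 16 values (8 joints x 2 coordinates).
    Uses backward differences scaled by the frame rate (an assumed scheme);
    the first frame gets zeros.
    """
    zeros = [0.0] * len(poses[0])
    vel = [list(zeros)] + [[(c - p) * fps for c, p in zip(cur, prev)]
                           for prev, cur in zip(poses, poses[1:])]
    acc = [list(zeros)] + [[(c - p) * fps for c, p in zip(cur, prev)]
                           for prev, cur in zip(vel, vel[1:])]
    return vel, acc


poses = [[0.0] * 16, [1.0] * 16, [3.0] * 16]
vel, acc = dynamics(poses, fps=10)
print(vel[2][0], acc[2][0])  # 20.0 100.0

# Appending both to a 97-component pose vector gives the 129 components of x_dyn.
x_dyn = [0.0] * 97 + vel[2] + acc[2]
print(len(x_dyn))  # 129
```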

Train-Ready Data Formulation
The videos in Chalearn 2016 are randomly distributed. Once the features of interest ($i_l$, $i_r$ and $x_{dyn}$) are extracted and saved in .h5 files, we sort them with respect to their labels. It is natural to expect the dataset videos (previously sequences of images, now arrays of features) to be of different lengths. The average video length in this dataset is 32 frames, while we fix the length of each sequence to 40 images in our work. If the length of a sequence is less than 40, we pad zeros symmetrically at the start and end of the sequence. Alternatively, if the length is greater than 40, we perform symmetric trimming of the sequence. Once the lengths of sequences are rectified (padded or trimmed), we append all corresponding sequences of a gesture label into a single array. At the end of this procedure, we are left with the 249 gestures in the Chalearn 2016 dataset, along with an array of the ground-truth labels. Each feature of the combined augmented pose vectors is normalized to zero mean and unit variance, while for hand images we perform pixel-wise division by the maximum intensity value (e.g., 255). The label-wise sorting presented in this section is only necessary if one wants to train a network on selected gestures (as we will explain in Section 8). Otherwise, creating only a ground-truth label array should suffice.
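The symmetric padding and trimming step can be sketched as below. The function name `fix_length` and the handling of an odd number of missing or extra frames (the extra frame goes to the end) are choices made for this illustration.

```python
def fix_length(seq, target=40, pad_value=0.0):
    """Pad with zero frames or trim, symmetrically, so that len(result) == target.

    seq: non-empty list of per-frame feature vectors (lists of equal width).
    """
    n = len(seq)
    if n >= target:
        start = (n - target) // 2            # drop frames evenly from both ends
        return list(seq[start:start + target])
    missing = target - n
    before = missing // 2                    # an odd remainder is padded at the end
    after = missing - before
    width = len(seq[0])
    def pad(count):
        return [[pad_value] * width for _ in range(count)]
    return pad(before) + list(seq) + pad(after)


short = [[1.0, 2.0] for _ in range(32)]      # the average Chalearn 2016 length
print(len(fix_length(short)))                # 40
```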

Dynamic Gesture Recognition
To classify dynamic gestures, StaDNet learns to model the spatio-temporal dependencies of the input video sequences. As already explained in Sections 5.2 and 6.1, we obtain the cropped hand images $i_l$ and $i_r$ as well as the augmented pose vector $x_{dyn}$ for each frame in a video sequence. These features are aggregated in time through Long Short-Term Memory (LSTM) networks to detect the dynamic gestures performed in the videos. However, we do not pass raw hand images; instead, we extract image embeddings of 1024 elements per hand. These image embeddings are extracted from the last fully connected layer of our static hand gesture detector and can be considered as rich latent space representations of hand gestures. This is done according to:
$$(p_{l,r},\ e_{l,r}) = g_{sta}(i_{l,r};\ \theta_{st})$$
with:
• $g_{sta}$ the static hand gesture detector, which returns the frame-wise hand gesture class probabilities $p_{l,r}$ and the embedding vectors $e_{l,r}$ from its last fully connected layer;
• $\theta_{st}$ the learned parameters of $g_{sta}$.
For each frame of a video sequence of length $N$, the obtained hand image embeddings $e_l$, $e_r$ and the augmented pose vector $x_{dyn}$ are subsequently fused in a vector $\psi$, and then passed to stacked LSTMs followed by the $g_{dyn}$ network. This network outputs the dynamic gesture probability $p_{dyn}$ for each video:
$$\psi = [e_l;\ x_{dyn};\ e_r]$$
$$p_{dyn} = g_{dyn}(\mathrm{LSTMs}(\psi;\ \theta_{LSTMs});\ \theta_{dyn})$$
The $g_{dyn}$ network consists of a fully connected layer and a softmax layer, and takes the output of the LSTMs as input; $\theta_{LSTMs}$ and $\theta_{dyn}$ are the model parameters to be learned for the detection of dynamic gestures, while $p_{dyn}$ is the detected class probability obtained as output from the softmax layer. Our network is illustrated in Figure 4. We employ dropout regularization between successive layers to prevent over-fitting and improve generalization, and batch normalization to accelerate training.
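To make the dimensions of the fused input concrete, the following sketch builds the per-frame vector from two 1024-element hand embeddings and a 129-component augmented pose, as stated in the text, which gives 1024 + 129 + 1024 = 2177 values per frame. The function names and the use of plain lists are illustrative assumptions; the LSTM and classification stages are not modeled here.

```python
from typing import List, Sequence

EMBED_DIM = 1024  # per-hand embedding size stated in the text
POSE_DIM = 129    # augmented pose size stated in the text


def fuse_frame(e_l: Sequence[float], x_dyn: Sequence[float],
               e_r: Sequence[float]) -> List[float]:
    """Concatenate [e_l; x_dyn; e_r] for a single frame."""
    if len(e_l) != EMBED_DIM or len(e_r) != EMBED_DIM:
        raise ValueError("hand embeddings must have 1024 elements")
    if len(x_dyn) != POSE_DIM:
        raise ValueError("augmented pose must have 129 components")
    return list(e_l) + list(x_dyn) + list(e_r)


def fuse_sequence(E_l, X_dyn, E_r) -> List[List[float]]:
    """Fuse the three per-frame feature streams of one video."""
    if not (len(E_l) == len(X_dyn) == len(E_r)):
        raise ValueError("feature streams must have the same length")
    return [fuse_frame(a, b, c) for a, b, c in zip(E_l, X_dyn, E_r)]
```

With sequences fixed to 40 frames, each video would therefore be presented to the recurrent layers as a 40 × 2177 array.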

Training
The proposed network is trained on a computer with an Intel Core i7-6800K (3.4 GHz) CPU, dual Nvidia GeForce GTX 1080 GPUs, 64 GB of system memory and the Ubuntu 16.04 operating system. The neural network is designed, trained and evaluated in Python using Keras with a TensorFlow back-end, while skeleton extraction with OpenPose is performed in C++.
The Chalearn 2016 dataset has 35,875 videos in the provided training set; the 47 most frequent gestures (with gestures arranged in descending order of the number of samples) alone account for 34% of all videos. The numbers of videos in the provided validation and test sets are 5784 and 6271, respectively. The distribution of train, validation and test data in our work is slightly different from the one proposed in the challenge. We combine and shuffle the provided train, validation and test sets together, leading to 47,930 videos in total. For weight initialization, 12,210 training videos of 47 gestures are used for pre-training with a validation split of 0.2. We subsequently train our network for all 249 gestures on 35,930 videos, initializing the parameters with the pre-trained model weights. In this work, we use the holdout validation method, which aligns with the original setup of the Chalearn 2016 challenge. Thus, we optimize the hyper-parameters on validation data of 6000 videos, while the results are reported on test data consisting of the remaining 6000 videos.
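The split sizes quoted above can be checked with a short sketch of a shuffled holdout split. The function name, the fixed random seed and the order in which the three subsets are cut from the shuffled indices are assumptions for illustration; only the counts come from the text.

```python
import random
from typing import List, Tuple


def holdout_split(n_items: int, n_val: int, n_test: int,
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle item indices and cut them into train/validation/test sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Provided training, validation and test sets merged into one pool.
total_videos = 35_875 + 5_784 + 6_271
train_idx, val_idx, test_idx = holdout_split(total_videos, 6_000, 6_000)
```

The merged pool contains 47,930 videos, so holding out 6000 for validation and 6000 for testing leaves the 35,930 training videos mentioned in the text.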
As already explained in Section 3, we utilize only 1247 videos for 14 correctly performed dynamic gestures from the Praxis Cognitive Assessment Dataset. Given the small size of this dataset, we adapt the network hyper-parameters to avoid over-fitting.

Results
For the Chalearn 2016 dataset, the proposed network is initially trained on 47 gestures with a low learning rate of $1 \times 10^{-5}$. After approximately 66,000 epochs, a top-1 validation accuracy of 95.45% is obtained. The parameters learned for 47 gestures are used to initialize the weights for training on the complete data for 249 gestures, as previously described. The network is trained in four phases. In the first phase, we perform weight initialization, inspired by the transfer learning concept of deep networks, by replacing the classification layer (with softmax activation function) with a new one whose number of output neurons corresponds to the number of class labels in the dataset. In our case, we replace the softmax layer of the network trained for 47 gestures, as well as the FC layer immediately preceding it. The proposed model is trained for the 249 gesture classes with a learning rate of $1 \times 10^{-3}$ and a decay value of $1 \times 10^{-3}$ using the Adam optimizer. The early iterations are performed with all layers of the network locked except the newly added FC and softmax layers. As the number of epochs increases, we successively unlock the network layers, starting from the deepest ones (closest to the output).
In the second phase, the network layers down to the last LSTM block are unlocked. All LSTM blocks and then the complete model are unlocked in the third and fourth phases, respectively. After approximately 2700 epochs, our network achieves 86.69% top-1 validation accuracy for all 249 gestures and 86.75% top-1 test accuracy, surpassing the state-of-the-art methods on this dataset. The prediction time for each video sample is 57.17 ms, excluding pre-processing of the video frames. Thus, we are confident that online dynamic gesture recognition can be achieved in interaction time. The training curve of the complete model is shown in Figure 5, while the confusion matrix/heat-map with evaluations on the test set is shown in Figure 6. Our results on the Chalearn 2016 dataset are compared with the reported state-of-the-art in Table 1.
Inspecting the training curves, we observe that the network progresses towards slight over-fitting in the fourth phase, when all network layers are unlocked. Specifically, the first time-distributed FC layer is considered the culprit for this phenomenon. Although we already have a dropout layer immediately after this layer, with a dropout rate of 0.85, we do not investigate this issue further here. However, we expect that substituting this layer with the strategy of pose-driven temporal attention [53] or with the adaptive hidden layer [61] may help reduce this undesirable phenomenon and ultimately further improve the results. Moreover, recent studies argue that data augmentation, that is, the technique of perturbing data without altering class labels, is able to greatly improve model robustness and generalization performance [62]. As we do not use any data augmentation on the videos when training the model for dynamic gestures, applying it might help to reduce over-fitting here.
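The four-phase unlocking schedule described in the training procedure above can be summarized in a small sketch that returns which layers are trainable in each phase. The layer names and the number of LSTM blocks are hypothetical placeholders (the actual architecture is given in Figure 4); only the order in which groups of layers are unlocked follows the text.

```python
from typing import List

# Hypothetical layer names, ordered from input to output.
LAYERS = ["td_fc", "lstm_1", "lstm_2", "fc_new", "softmax_new"]


def trainable_layers(phase: int, layers: List[str] = LAYERS) -> List[str]:
    """Return the layers unlocked for training in the given phase (1-4)."""
    lstm_positions = [i for i, name in enumerate(layers)
                      if name.startswith("lstm")]
    if phase == 1:
        first = len(layers) - 2      # only the new FC and softmax layers
    elif phase == 2:
        first = lstm_positions[-1]   # down to the last LSTM block
    elif phase == 3:
        first = lstm_positions[0]    # all LSTM blocks
    elif phase == 4:
        first = 0                    # the complete model
    else:
        raise ValueError("phase must be between 1 and 4")
    return layers[first:]
```

In Keras, such a schedule would typically be applied by setting the `trainable` attribute of each layer accordingly and recompiling the model before resuming training.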
For the Praxis dataset, the optimizer and the values of learning rate and decay are the same as for the Chalearn 2016 dataset. The hyper-parameters, including the number of neurons in the FC layers and the sizes of the hidden and cell states of the LSTM blocks, are reduced to avoid over-fitting. Our model obtains 99.6% top-1 test accuracy on 501 samples. The training curve of StaDNet on the Praxis dataset is shown in Figure 7, the normalized confusion matrix on this dataset is shown in Figure 8, and the comparison of the results with the state-of-the-art is shown in Table 2. We also quantify the performance of our static hand gesture detector on a test set of 4190 hand images. The overall top-1 test accuracy is found to be 98.9%. The normalized confusion matrix for the 10 static hand gestures is shown in Figure 9.
We devised robotic experiments for gesture-controlled safe human-robot interaction tasks, as already presented in [11]. These are preliminary experiments that allow the human operator to communicate with the robot through static hand gestures in real time, while the integration of dynamic gestures is yet to be done. The experiments were performed on the BAZAR robot [63], which has two Kuka LWR 4+ arms with two Shadow Dexterous Hands attached at the end-effectors. We used OpenPHRI [64], an open-source library, to control the robot while ensuring the safety of the human operator. A finite state machine was developed to control the behavior of the robot, which is determined by sensory information such as hand gestures, the distance of the human operator from the robot and joint-torque sensing. The experiment is decomposed into two phases: (1) a teaching-by-demonstration phase, where the user manually guides the robot to a set of waypoints, and (2) a replay phase, where the robot autonomously goes to every recorded waypoint to perform a given task, here force control.
A video of the experiment is available online (http://youtu.be/lB5vXc8LMnk, accessed on 22 March 2021) and snapshots are given in Figure 10.
Figure 10. Snapshots of our gesture-controlled safe human-robot interaction experiment taken from [11] with the authors' permission. The human operator manually guides the robot to waypoints in the workspace, then asks the robot to record them through a gesture. The human operator can transmit other commands to the robot, such as replay, stop, resume and reteach, using only hand gestures.
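To illustrate how gesture commands could drive a teach-and-replay behavior, the sketch below implements a small finite state machine over the commands named in the caption (record, replay, stop, resume, reteach). It is not the state machine used in [11]: the states, the transition table and the class name are assumptions, and the other sensory inputs mentioned above (operator distance, joint torques) are left out.

```python
from enum import Enum, auto


class State(Enum):
    TEACHING = auto()
    REPLAYING = auto()
    STOPPED = auto()


# Assumed transition table: (current state, command) -> next state.
TRANSITIONS = {
    (State.TEACHING, "record"): State.TEACHING,
    (State.TEACHING, "replay"): State.REPLAYING,
    (State.REPLAYING, "stop"): State.STOPPED,
    (State.STOPPED, "resume"): State.REPLAYING,
    (State.REPLAYING, "reteach"): State.TEACHING,
    (State.STOPPED, "reteach"): State.TEACHING,
}


class GestureFSM:
    """Toy controller mapping recognized gesture commands to robot modes."""

    def __init__(self):
        self.state = State.TEACHING
        self.waypoints = []

    def step(self, command, pose=None):
        key = (self.state, command)
        if key not in TRANSITIONS:
            return self.state  # ignore commands not valid in this state
        if command == "record":
            self.waypoints.append(pose)  # store the current robot pose
        elif command == "reteach":
            self.waypoints.clear()       # discard the previous demonstration
        self.state = TRANSITIONS[key]
        return self.state
```

Ignoring commands that are not valid in the current state is one simple way to keep a misrecognized gesture from triggering an unintended robot motion.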

Conclusions
In this paper, StaDNet, a unified framework for simultaneous recognition of static hand gestures and dynamic upper-body gestures, is proposed. A novel learning-based depth estimator is also presented, which predicts the distance of the person and his/her hands using only the upper-body 2D skeleton coordinates. By virtue of this feature, monocular images are sufficient and the proposed framework does not require depth sensing. Thus, the use of StaDNet for gesture detection is not limited to any specialized camera and can work with most conventional RGB cameras. Monocular images are indeed sensitive to changing lighting conditions and might fail in extreme conditions, for example, during sand blasting operations in industry or during fog and rain outdoors. To develop robustness against such lighting corruptions, data augmentation strategies such as [65] can be exploited. One might argue that employing HSV or HSL color models instead of RGB might be more appropriate to deal with changing ambient light conditions. However, StaDNet relies on OpenPose for skeleton extraction and on the hand gesture detector from our previous work [11]. OpenPose is the state-of-the-art in skeleton extraction from a monocular camera and takes RGB images as input. Furthermore, our static hand gesture detector also takes RGB images as input and performs well, with 98.9% top-1 test accuracy on 10 static hand gestures, as we show in Figure 9. Nevertheless, we are aware that HSV and HSL have been commonly used for hand segmentation in the literature, by thresholding the values of Hue, Saturation and Value/Lightness. This motivates us to train deep models for hand gesture detection in these color spaces and compare their performance, which we plan to do in our future work.
Our pose-driven hard spatial attention mechanism directs the focus of StaDNet on the upper-body pose to model large-scale movements of the limbs and on the hand images to model subtle hand/finger movements. This enables StaDNet to outscore the existing approaches on the Chalearn 2016 dataset. The presented weight initialization strategy addresses the imbalance in class distribution in the Chalearn 2016 dataset, and thus facilitates parameter optimization for all 249 gestures. Our static gesture detector outputs the predicted label frame-wise at approximately 21 fps with state-of-the-art recognition accuracy. However, class recognition for dynamic gestures is performed on videos of isolated gestures, executed by a single individual in the scene. We plan to extend this work to continuous dynamic gesture recognition to demonstrate its utility in human-machine interaction. One way to achieve this is to develop a binary motion detector that detects the start and end instants of the gestures. Although a multi-stage training strategy is presented here, we envision an end-to-end training approach for online learning of new gestures.