Dynamic Hand Gesture Recognition Based on a Leap Motion Controller and Two-Layer Bidirectional Recurrent Neural Network.

Dynamic hand gesture recognition is one of the most significant tools for human-computer interaction. In order to improve the accuracy of the dynamic hand gesture recognition, in this paper, a two-layer Bidirectional Recurrent Neural Network for the recognition of dynamic hand gestures from a Leap Motion Controller (LMC) is proposed. In addition, based on LMC, an efficient way to capture the dynamic hand gestures is identified. Dynamic hand gestures are represented by sets of feature vectors from the LMC. The proposed system has been tested on the American Sign Language (ASL) datasets with 360 samples and 480 samples, and the Handicraft-Gesture dataset, respectively. On the ASL dataset with 360 samples, the system achieves accuracies of 100% and 96.3% on the training and testing sets. On the ASL dataset with 480 samples, the system achieves accuracies of 100% and 95.2%. On the Handicraft-Gesture dataset, the system achieves accuracies of 100% and 96.7%. In addition, 5-fold, 10-fold, and Leave-One-Out cross-validation are performed on these datasets. The accuracies are 93.33%, 94.1%, and 98.33% (360 samples), 93.75%, 93.5%, and 98.13% (480 samples), and 88.66%, 90%, and 92% on ASL and Handicraft-Gesture datasets, respectively. The developed system demonstrates similar or better performance compared to other approaches in the literature.


Introduction
The main purpose of human-computer interaction is to allow users to freely control the device with some simple operations [1]. The human-computer interaction techniques include face recognition, language recognition, text recognition, and so on. As one of the important and powerful interaction methods, dynamic hand gesture recognition has attracted wide attention and been used in various fields, such as the video game industry, food industry, and machinery industry [2][3][4].
Dynamic gesture recognition can be used in many applications, such as virtual reality [5,6], the remote operation of robot [7,8], video games [9,10], sign language interpretation [11,12], and so on. In general, dynamic hand gesture recognition can be mainly divided into vision-based gesture recognition and wearable device-based gesture recognition. In the past, many researches on dynamic gesture recognition basically used a monocular camera [13] to capture images of dynamic gestures, segmented images, and extracted the gesture model from a series of frames. One disadvantage of this method is that a large amount of computation is required to segment the hand information. The other method adopted a data glove [14] to collect data that focuses on the coordinates of palms and fingertips and the bending degrees of finger joints. It can intuitively obtain hand information without a lot of calculation. However, a user must wear gloves, and people using the same device may get sign language by using LMC [25]. In the system, the authors presented a modified Long Short-Term Memory model which has three input gates and an output gate, adding a 2D-Convolutional Neural Network to the input gate that receives feature inputs. At last, the system had been tested with Indian Sign Language and got the average accuracies of 89.5% and 72.3% on isolated sign words and signed sentences, respectively. The Arabic Sign Language recognition system with two LMCs was shown in [26]. Deriche et al. adopted the Linear Discriminant Analysis approach and a Bayesian approach that contains the Gaussian Mixture Model. As a result, an accuracy of nearly 92% was obtained.
In order to improve the performance of the dynamic hand gesture recognition on American Sign Language (ASL) and Handicraft Gesture datasets from several works, we present a gesture recognition system in this paper, which consists of the LMC and the two-layer Bidirectional Recurrent Neural Network (BRNN). In the first stage, the proposed algorithm accurately determines the start and end of dynamic hand gestures by calculating the changes in hand rotation angle and palm speed between two adjacent frames, which can ensure the validity of features. Then, to obtain a better model, the features of a single finger and adjacent fingers are introduced into the input vector of the model. In the next stage, we compare the effects of changes in the hyper-parameter on the accuracy of the classifier and improve the performance of the model. In the third stage, the validation and comparison are performed on the proposed system. Here, the challenges are how to collect real-time dynamic hand gestures efficiently and how to search for a better model with hyper-parameters. The first problem is solved, to a large extent, by using an algorithm to discriminate hand movement, and then the second problem is tackled by using a two-layer BRNN. In summary, the contributions of the proposed system are shown: Based on the LMC, we propose a method to accurately determine when to start and end collecting dynamic gestures through its C++ function library.
The selection of features, based on fingers and palm, are highly discriminative for recognizing some gestures from the ASL and Handicraft-Gesture datasets.
For the first time, we combine a two-layer BRNN with the special features extracted by the LMC in the field of the dynamic gesture recognition. Moreover, the accuracy is higher than that reported in similar works found in the literature.

Materials and Methods
In this section, we present the architecture of our gesture recognition system. The systematic framework includes four parts, which are feature extraction, data collection and processing, two-layer BRNN, and model details. The architecture of gesture recognition system is shown in Figure 1, where the LMC is used to acquire the dynamic hand sequences, and data are processed by specific methods and sent into the two-layer BRNN model. Finally, the BRNN model outputs the recognition result.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 18 recognition system for continuous sign language by using LMC [25]. In the system, the authors presented a modified Long Short-Term Memory model which has three input gates and an output gate, adding a 2D-Convolutional Neural Network to the input gate that receives feature inputs. At last, the system had been tested with Indian Sign Language and got the average accuracies of 89.5% and 72.3% on isolated sign words and signed sentences, respectively. The Arabic Sign Language recognition system with two LMCs was shown in [26]. Deriche et al. adopted the Linear Discriminant Analysis approach and a Bayesian approach that contains the Gaussian Mixture Model. As a result, an accuracy of nearly 92% was obtained.
In order to improve the performance of the dynamic hand gesture recognition on American Sign Language (ASL) and Handicraft Gesture datasets from several works, we present a gesture recognition system in this paper, which consists of the LMC and the two-layer Bidirectional Recurrent Neural Network (BRNN). In the first stage, the proposed algorithm accurately determines the start and end of dynamic hand gestures by calculating the changes in hand rotation angle and palm speed between two adjacent frames, which can ensure the validity of features. Then, to obtain a better model, the features of a single finger and adjacent fingers are introduced into the input vector of the model. In the next stage, we compare the effects of changes in the hyper-parameter on the accuracy of the classifier and improve the performance of the model. In the third stage, the validation and comparison are performed on the proposed system. Here, the challenges are how to collect real-time dynamic hand gestures efficiently and how to search for a better model with hyper-parameters. The first problem is solved, to a large extent, by using an algorithm to discriminate hand movement, and then the second problem is tackled by using a two-layer BRNN. In summary, the contributions of the proposed system are shown: Based on the LMC, we propose a method to accurately determine when to start and end collecting dynamic gestures through its C++ function library.
The selection of features, based on fingers and palm, are highly discriminative for recognizing some gestures from the ASL and Handicraft-Gesture datasets.
For the first time, we combine a two-layer BRNN with the special features extracted by the LMC in the field of the dynamic gesture recognition. Moreover, the accuracy is higher than that reported in similar works found in the literature.

Materials and Methods
In this section, we present the architecture of our gesture recognition system. The systematic framework includes four parts, which are feature extraction, data collection and processing, twolayer BRNN, and model details. The architecture of gesture recognition system is shown in Figure 1, where the LMC is used to acquire the dynamic hand sequences, and data are processed by specific methods and sent into the two-layer BRNN model. Finally, the BRNN model outputs the recognition result.

Feature Extraction
The LMC can detect and track the human hand and wrist which appears 25mm to 600mm above the device plane, returning information on coordinate, velocity, acceleration, and others of five fingers, palm, and wrist. Each dynamic motion can be taken as a sequence of consecutive frames.

Feature Extraction
The LMC can detect and track the human hand and wrist which appears 25 mm to 600 mm above the device plane, returning information on coordinate, velocity, acceleration, and others of five fingers, palm, and wrist. Each dynamic motion can be taken as a sequence of consecutive frames. Each frame can be considered as the combination of different hand features. In this part, we see a frame as a vector which consists of following features: Fingertip distance where, D i is the Euclidean distance between a fingertip coordinate and the palm coordinate, O is the origin of the three-dimensional coordinate system of Leap Motion, F i is the fingertip coordinate in the three-dimensional coordinate system of Leap Motion, C is the palm coordinate in the three-dimensional coordinate system of Leap Motion. Fingertip angle where, θ i is the angle between vector → CF p i and vector → CF i , F i p is the projection of F i in the palm plane.
Fingertip height where, H i is the vertical distance from point F i to palm plane, → N is the palm normal ( Figure 2), · is the dot product, * is the multiplication sign.

Data Collection and Processing
Dynamic gesture recognition relies on real-time and accurate gesture tracking. LMC uses binocular RGB high-definition cameras to improve gesture positioning accuracy and reduce the problems caused by occlusions between fingers. The infrared camera is used to filter images, which greatly reduces the impact of the background environment. Finally, a convolutional neural network is used to perform multi-layer convolution filtering on the image to extract feature data and provide it to the user. In terms of real-time and accuracy, LMC can provide gesture data more stably and accurately, providing a stable data guarantee for applications in some fields.
However, a challenge in dynamic hand gesture data collection is how to determine the start and end of a dynamic hand gesture. When LMC performs gesture acquisition, it obtains time-based dynamic sequences. Therefore, during the gesture collection, it is necessary to determine when the gesture execution starts and stops according to the threshold. This work uses the palm rotation threshold in a three-dimensional coordinate system and finger speed threshold to determine the start and end points of the dynamic hand gesture. The rotation of the palm needs to be obtained through comparison and calculation of the current frame and the historical frames, and the change in finger speed can be obtained through the library function that comes with LMC. The Algorithm 1 is shown below: Algorithm 1 If(Hand is not empty) { If ( sqrt ( pow ( frame. rotation Angle(last frame, x), 2)+ pow( frame. rotation Angle(last frame, y), 2)+ pow( frame. rotation Angle(last frame, z), 2) ) > threshold1 The distance of adjacent fingertips The angle of adjacent fingertips where, Θ i is the angle between vector → CF i and vector → CF i+1 . The coordinate of the palm x palm , y palm , z palm The above features are derived from the work [21,27,28] and they are extended and modified. Note that they are independent of each other and the final feature vector at time t is: X t = D 1 , · · · , D 5 , θ 1 , · · · , θ 5 , H 1 , · · · , H 5 , DD 1 , · · · , DD 4 , Θ 1 , · · · , Θ 4 , x, y, z This proposed vector not only solves the mislabeling problem caused by making dynamic hand gestures in different positions but also helps to recognize the difference which exists between adjacent fingers when people show different dynamic hand gestures.

of 17
A full dynamic gesture vector can be represented by: where, T is the number of frames required to detect a full dynamic gesture and varies according to different dynamic gestures.

Data Collection and Processing
Dynamic gesture recognition relies on real-time and accurate gesture tracking. LMC uses binocular RGB high-definition cameras to improve gesture positioning accuracy and reduce the problems caused by occlusions between fingers. The infrared camera is used to filter images, which greatly reduces the impact of the background environment. Finally, a convolutional neural network is used to perform multi-layer convolution filtering on the image to extract feature data and provide it to the user. In terms of real-time and accuracy, LMC can provide gesture data more stably and accurately, providing a stable data guarantee for applications in some fields.
However, a challenge in dynamic hand gesture data collection is how to determine the start and end of a dynamic hand gesture. When LMC performs gesture acquisition, it obtains time-based dynamic sequences. Therefore, during the gesture collection, it is necessary to determine when the gesture execution starts and stops according to the threshold. This work uses the palm rotation threshold in a three-dimensional coordinate system and finger speed threshold to determine the start and end points of the dynamic hand gesture. The rotation of the palm needs to be obtained through comparison and calculation of the current frame and the historical frames, and the change in finger speed can be obtained through the library function that comes with LMC. The Algorithm 1 is shown below: LMC allows access to a wide variety of features contained in the original dataset. After extraction and calculation, and obtaining specific features, the high-dimensional hand gesture sequences will make the model too computationally intensive in learning. In addition, every gesture in the acquisition process will have problems such as inconsistent shapes and inconsistent range sizes. Therefore, data need to be pre-processed, so that it is standardized to eliminate noise interference and reduce model learning costs. The same attribute of all samples is normalized separately, and all the samples become a series of data with a mean of 0 and a variance of 1.

The Basic Long Short-Term Memory Unit
The Long Short-Term Memory unit, as shown in Figure 3, is seen as the memory block and consists of three multiplicative gates which are input gate (update gate), output gate, and forget gate in the network. Additionally, the memory block also includes memory cells that are self-connected and save the temporal information of the network. The input gate is responsible for processing the input of the information. The output gate generates the flow of the output of cell activations to the rest of the units. The forget gate adaptively forgets some information in the memory cell. At time t, the formulas in the unit are defined as follow: where, Γ u , Γ o , Γ f , and c t are the input gate, the output gate, the forget gate, and the new memory cell, respectively. The symbols b u , b f , b o and b c are bias vectors, respectively. Then a t , c t , a t−1 and c t−1 are the output of current block, final memory cell, output of previous block and previous memory cell. These vectors have the same length. W u , W o , W f and W c are the weights of the input gate, the output gate, the forget gate, the new memory cell, and the prediction, respectively. tanh is the hyperbolic tangent function. g is the softmax function. σ is the logistic sigmoid activation function. is the element-wise product of the vectors.
learning costs. The same attribute of all samples is normalized separately, and all the samples become a series of data with a mean of 0 and a variance of 1.

The Basic Long Short-Term Memory Unit
The Long Short-Term Memory unit, as shown in Figure 3, is seen as the memory block and consists of three multiplicative gates which are input gate (update gate), output gate, and forget gate in the network. Additionally, the memory block also includes memory cells that are self-connected and save the temporal information of the network. The input gate is responsible for processing the input of the information. The output gate generates the flow of the output of cell activations to the rest of the units. The forget gate adaptively forgets some information in the memory cell. At time t, the formulas in the unit are defined as follow: where, , , , and are the input gate, the output gate, the forget gate, and the new memory cell, respectively. The symbols , , and are bias vectors, respectively. Then , , and are the output of current block, final memory cell, output of previous block and previous memory cell. These vectors have the same length.
, , and are the weights of the input gate, the output gate, the forget gate, the new memory cell, and the prediction, respectively. is the hyperbolic tangent function.
is the softmax function. is the logistic sigmoid activation function. ⊙ is the element-wise product of the vectors.

Bidirectional Recurrent Neural Network
The common Recurrent Neural Network provides an extremely useful method to handle timebased sequences which shows correlations between closely linked data elements in the sequence. Figure

Bidirectional Recurrent Neural Network
The common Recurrent Neural Network provides an extremely useful method to handle time-based sequences which shows correlations between closely linked data elements in the sequence. Figure 4 shows a basic Recurrent Neural Network architecture with three units. Meanwhile, the previous Long Short-Term Memory unit provides its memory cell and output for the current unit as time goes on. In this figure, the input vector x which contains T frames is input into the network frame by frame. Compared with some other time-based models like Multilayer perceptron and time delay neural networks [29], this structure utilizes most of the available input information, from the start frame to current frame, to output vector y without the limitation of using a fixed-format input. 4 shows a basic Recurrent Neural Network architecture with three units. Meanwhile, the previous Long Short-Term Memory unit provides its memory cell and output for the current unit as time goes on. In this figure, the input vector x which contains T frames is input into the network frame by frame. Compared with some other time-based models like Multilayer perceptron and time delay neural networks [29], this structure utilizes most of the available input information, from the start frame to current frame, to output vector y without the limitation of using a fixed-format input.  However, in the one-way Recurrent Neural Network, the current unit can only output the outcome based on the information of previous units. Especially, in some problems, the output of the current unit is not only related to previous units, but also related to the future units. In this case, it is possible to use two separate recurrent neural networks and then somehow merge outputs. In the single layer Bidirectional Recurrent Neural Network, one Recurrent Neural Network goes in a forward direction. On the contrary, another one goes in backward direction. At each time point, the input is provided to two independent Long Short-Term Memory units in opposite directions and they combine their outcomes based on the hidden state. In our work, we adopt this structure to build a two-layer BRNN and combine outcomes of the final Long Short-Term Memory unit of both networks ( Figure 5). There are many other merging procedures to combine outcomes and it is generally not clear how to merge network outcomes in an optimal way since different networks trained on the same data can no longer be regarded as independent [29]. The formulas of the basic BRNN unit at time t are given by the following equations: However, in the one-way Recurrent Neural Network, the current unit can only output the outcome based on the information of previous units. Especially, in some problems, the output of the current unit is not only related to previous units, but also related to the future units. In this case, it is possible to use two separate recurrent neural networks and then somehow merge outputs. In the single layer Bidirectional Recurrent Neural Network, one Recurrent Neural Network goes in a forward direction. On the contrary, another one goes in backward direction. At each time point, the input is provided to two independent Long Short-Term Memory units in opposite directions and they combine their outcomes based on the hidden state. In our work, we adopt this structure to build a two-layer BRNN and combine outcomes of the final Long Short-Term Memory unit of both networks ( Figure 5). There are many other merging procedures to combine outcomes and it is generally not clear how to merge network outcomes in an optimal way since different networks trained on the same data can no longer be regarded as independent [29]. The formulas of the basic BRNN unit at time t are given by the following equations: Sensors 2020, 20, 2106 8 of 17 the output of the two-layer BRNN is defined as follows: where, → represents the forward Recurrent Neural Network and ← represents the backward Recurrent Neural Network. The scales of these four RNNs are the same.
current frame, to output vector y without the limitation of using a fixed-format input.  However, in the one-way Recurrent Neural Network, the current unit can only output the outcome based on the information of previous units. Especially, in some problems, the output of the current unit is not only related to previous units, but also related to the future units. In this case, it is possible to use two separate recurrent neural networks and then somehow merge outputs. In the single layer Bidirectional Recurrent Neural Network, one Recurrent Neural Network goes in a forward direction. On the contrary, another one goes in backward direction. At each time point, the input is provided to two independent Long Short-Term Memory units in opposite directions and they combine their outcomes based on the hidden state. In our work, we adopt this structure to build a two-layer BRNN and combine outcomes of the final Long Short-Term Memory unit of both networks ( Figure 5). There are many other merging procedures to combine outcomes and it is generally not clear how to merge network outcomes in an optimal way since different networks trained on the same data can no longer be regarded as independent [29]. The formulas of the basic BRNN unit at time t are given by the following equations:

Loss Function
The loss function is defined as below: where,ŷ is prediction of the sample X, y is the label of the sample X, and y t is the label of the X t in the sample X. In addition, m is the number of samples and T is the length of one sample based on the time sequence. Further, · is the inner product. This formulation is obtained by referring to and modifying the cross-entropy proposed in [30].

Learning Rate
Learning rate is the most important hyper-parameter in all kinds of neural networks. Consequently, Cyclical Learning Rate method was adopted in this work, which can eliminate the need to vary the learning rate yet achieve near optimal classification accuracy. Instead of monotonically decreasing the learning rate, CLR lets the learning rate vary cyclically between reasonable boundary values which are achieved through, linearly, increasing the learning rate for a few epochs. The corresponding codes are the given as: Sensors 2020, 20, 2106 9 of 17 where [x] is to take the largest integer less than x, |x| is to take the absolute value of x, and max(x, y) is to take the larger of x and y. The en is the sequence number of epochs. More details can be found in [31].

Experimental Results and Discussion
This section shows a series of dynamic hand gesture recognition experiments. The systematic framework includes six parts: dynamic gesture datasets, selection of the optimal dropout rate on two datasets, experiment results on ASL dataset, experiment results on Handicraft-Gesture dataset, k-fold and Leave-One-Out cross-validation, and comparisons between our results with the works found in the literature.

Dynamic Gesture Datasets
Here, we created two new datasets composed of 12 hand gestures and 10 hand gestures, respectively. They are called ASL Dataset ( Figure 6) and Handicraft-Gesture Dataset [21] (Figure 7). In particular, the ASL Dataset is picked from the American Sign Language and the other dataset is chosen from the paper [21].
Sensors 2020, 20, x FOR PEER REVIEW 9 of 18 values which are achieved through, linearly, increasing the learning rate for a few epochs. The corresponding codes are the given as: where [x] is to take the largest integer less than x, |x| is to take the absolute value of x, and max(x, y) is to take the larger of x and y. The en is the sequence number of epochs. More details can be found in [31].

Experimental Results and Discussion
This section shows a series of dynamic hand gesture recognition experiments. The systematic framework includes six parts: dynamic gesture datasets, selection of the optimal dropout rate on two datasets, experiment results on ASL dataset, experiment results on Handicraft-Gesture dataset, kfold and Leave-One-Out cross-validation, and comparisons between our results with the works found in the literature.  Here, we created two new datasets composed of 12 hand gestures and 10 hand gestures, respectively. They are called ASL Dataset ( Figure 6) and Handicraft-Gesture Dataset [21] (Figure 7). In particular, the ASL Dataset is picked from the American Sign Language and the other dataset is chosen from the paper [21].

ASL Dataset
Currently, most of datasets are based on images. In order to test the performance of the model we proposed, we built this dataset with an LMC. This dataset contains a series of gestures from ASL. The 12 gestures are bathroom, blue, green, j, like, milk, past, pig, sorry, where, yellow, and z. The arrows are the motion trajectories of the hand or the fingers.

ASL Dataset
Currently, most of datasets are based on images. In order to test the performance of the model we proposed, we built this dataset with an LMC. This dataset contains a series of gestures from ASL. The 12 gestures are bathroom, blue, green, j, like, milk, past, pig, sorry, where, yellow, and z. The arrows are the motion trajectories of the hand or the fingers. This work is mainly focused on single-hand dynamic gestures. In order to compare with other works, like [21,23] which has the dataset containing bathroom, blue, finish, green, hungry, milk, past, pig, store, where, j, and z, we replaced finish, hungry, and store with sorry, like, and yellow which are nearly equivalent to them in terms of difficulty.

Handicraft-Gesture Dataset
In order to evaluate the proposed model with more different dynamic hand gestures, we introduced the Handicraft-Gesture dataset. This dataset includes 10 gestures, namely poke, scrape, circle, pull, pinch, slap, cut, key type, press, and mow.
All gestures are dynamic and based on a single hand, coming from four different people aged 25. In order to compare with other literatures, there are two types of ASL datasets which contain 360 and 480 samples, respectively. In particular, the dataset with 480 samples is from [23] which has a dataset composed of 720 static hand sequences and 480 dynamic hand sequences. In the one that contains 360 samples, each type of gesture has 30 samples that are performed by four different persons. According to [21], 70% of the data are used as the training set and 30% are used as the testing set. In the other one, each type of gesture has 40 samples that are performed by four different persons. According to [23], 65% of the data are used as the training set and 35% are used as the testing set. The Handicraft-Gesture dataset is composed of 300 samples and each type of gesture has 30 samples that are performed by four different persons. According to [21], 70% of the data are used as the training set and 30% are used as the testing set.

Selection of the Optimal Dropout Rate on Two Dataset
Although BRNN has shown excellent ability to capture deep information between sequences, BRNN architectures are prone to overfit [32,33]. In order to avoid the problem of overfitting, one of the most useful techniques in neural networks, namely dropout regularization, was taken. Several experiments on the selection of the dropout rate were conducted.
The Figure 8a shows that the best dropout rate for the ASL dataset is between 0.3 and 0.4, and the Figure 8b indicates that the best dropout rate for the Handicraft-Gesture dataset is between 0.4 and 0.5. However, as the number of iterations increases, the dropout rate still needs to be tuned properly in these two ranges.
Sensors 2020, 20, x FOR PEER REVIEW 11 of 18 and 0.5. However, as the number of iterations increases, the dropout rate still needs to be tuned properly in these two ranges.

Experiment Results on ASL Dataset
Using an LMC and a computer with an Intel i5 3.2GHz CPU and 8GB RAM, experiments were performed to estimate the performance of the proposed two-layer BRNN model. The model was trained using cross-entropy loss function, Adam gradient descent algorithm [34], and varying learning rate. The model was implemented by using a framework in C++, written by ourselves. Figure 9a-f shows changes in the recognition accuracy of the model on the training and testing sets as the number of iterations increases and loss curves which represent the sum of the errors provided for each training and test sample. On the ASL dataset with 360 samples, after 630 epochs (18, 900 iterations), the curve of the train accuracy converges to 100% and the curve of the test accuracy converges to 96.2963%, respectively. On the ASL dataset with 480 samples, after 1170 epochs (70, 200 iterations), the curve of the train accuracy converges to 100% and the curve of the test accuracy converges to 95.238%, respectively. In Figure 9c,d, a sudden and unexpected leap of value happens over the epoch of 340 for accuracies and losses on both training and testing sets. To explore the cause we did another three types of experiments, each performed five times. The first experiment used the CLR but did not use the dropout, the second experiment used the dropout but did not use the CLR, and the third experiment did not use any of them. The result shows that the leap happens randomly when the model uses the dropout and CLR. It seems to be a random and transient instability because this leap only happens when the learning rate varies during the first period (0-400 epoch), and after this the model will consistently converge with smaller learning rate. This leap happens infrequently, and it does not appear in experiments on the other two datasets.
Sensors 2020, 20, x FOR PEER REVIEW 11 of 18 and 0.5. However, as the number of iterations increases, the dropout rate still needs to be tuned properly in these two ranges.  In order to better evaluate the proposed approach, three significant metrics were taken, namely Precision, Recall, and F1-score. As a standard, these indexes can be used to measure the performance of the model [35]. The results are presented in Table 1. In addition, heat-maps of the confusion matrix were also computed and drawn ( Figure 10). Each column of the confusion matrix represents the predictive value while each row represents the real value. In the confusion matrix, each diagonal element represents the prediction accuracy of the corresponding row identifier. The elements below or above the diagonal are misclassified. As depicted in Figure 10, for each gesture, there is a low accuracy for gesture 'green' (78%) on the dataset with 360 samples. However, the accuracy of each gesture is above 93% on the dataset with 480 samples, which is a good performance.

Experiment Results on Handicraft-Gesture Dataset
Similar experiments and evaluations were executed on the Handicraft-Gesture dataset. As shown in Figure 9, after 630 epochs (15,750 iterations), the curve of the train accuracy converges to about 100% and the curve of the test accuracy converges to about 96.6667%, respectively. In addition, Table 2 shows the F1-score of 96.6563% and Figure 11 describes the heat-map of the confusion matrix for recognizing hand gestures on this dataset.
Except for low accuracies for gesture 'key tap' (78%) and 'poke' (89%), all other accuracies are 100%. It is clear that, at some moments, key tap and poke have similar hand shapes and the same motion trajectories in different directions. Different people make this gesture with different force and directions. For this reason, the key tap and poke are misclassified in some cases.

Experiment Results on Handicraft-Gesture Dataset
Similar experiments and evaluations were executed on the Handicraft-Gesture dataset. As shown in Figure 9, after 630 epochs (15,750 iterations), the curve of the train accuracy converges to about 100% and the curve of the test accuracy converges to about 96.6667%, respectively. In addition, Table 2 shows the F1-score of 96.6563% and Figure 11 describes the heat-map of the confusion matrix for recognizing hand gestures on this dataset.

K-Fold and Leave-One-Out Cross-Validation
As a statistical analysis method, cross-validation is commonly used to verify the performance of the recognition model. In the work, we adopted the 5-fold cross-validation, 10-fold cross-validation, and Leave-One-Out cross-validation to verify the model. In the 5-fold cross-validation, the dataset is divided into 5 subsets. In each subset, there are 72 samples for the ASL dataset with 360 samples, 96 samples for the ASL dataset with 480 samples, and 60 samples for the Handicraft-Gesture dataset. In 10-fold cross-validation, the dataset is divided into 10 subsets. In each subset, there are 36 samples for the ASL dataset with 360 samples, 48 samples for the ASL dataset with 480 samples, and 30 samples for the Handicraft-Gesture dataset. In the Leave-One-Out cross-validation, for each dataset, one sample is used as the testing set and the others are used as the training set. Leave-One-Out crossvalidation is robust, but it is computationally expensive. In order to reduce the computation time, we chose to use pre-training (Transfer Learning) [36] to conduct Leave-One-Out cross-validation. In our experiments, for each dataset, pre-training starts with a model resulted from about 200 epochs of computation on the same dataset. We noticed that the model converges quickly on these datasets. Therefore, we only ran nearly five epochs (nearly 150 iterations) for the Leave-One-Out crossvalidation experiments to obtain comparably good results. The results are shown in Table 3-5.  Table 4. Recognition accuracies (%) for the three datasets by using the 10-fold cross-validation method. Except for low accuracies for gesture 'key tap' (78%) and 'poke' (89%), all other accuracies are 100%. It is clear that, at some moments, key tap and poke have similar hand shapes and the same motion trajectories in different directions. Different people make this gesture with different force and directions. For this reason, the key tap and poke are misclassified in some cases.

K-Fold and Leave-One-Out Cross-Validation
As a statistical analysis method, cross-validation is commonly used to verify the performance of the recognition model. In the work, we adopted the 5-fold cross-validation, 10-fold cross-validation, and Leave-One-Out cross-validation to verify the model. In the 5-fold cross-validation, the dataset is divided into 5 subsets. In each subset, there are 72 samples for the ASL dataset with 360 samples, 96 samples for the ASL dataset with 480 samples, and 60 samples for the Handicraft-Gesture dataset. In 10-fold cross-validation, the dataset is divided into 10 subsets. In each subset, there are 36 samples for the ASL dataset with 360 samples, 48 samples for the ASL dataset with 480 samples, and 30 samples for the Handicraft-Gesture dataset. In the Leave-One-Out cross-validation, for each dataset, one sample is used as the testing set and the others are used as the training set. Leave-One-Out cross-validation is robust, but it is computationally expensive. In order to reduce the computation time, we chose to use pre-training (Transfer Learning) [36] to conduct Leave-One-Out cross-validation. In our experiments, for each dataset, pre-training starts with a model resulted from about 200 epochs of computation on the same dataset. We noticed that the model converges quickly on these datasets. Therefore, we only ran nearly five epochs (nearly 150 iterations) for the Leave-One-Out cross-validation experiments to obtain comparably good results. The results are shown in Tables 3-5.

Comparisons
Firstly, we compared the proposed method with several significant works [21,23,37,38] on the ASL dataset. American Sign Language is one of the most important communication tools between the deaf-mute and contains many challenging dynamic hand gestures. Most of works chose a subset of the American Sign Language to build dataset and tested recognition systems. Table 6 shows that the proposed method surpasses the accuracies of three works on the ASL dataset and, compared to [23], we obtained similar results on the dataset. In particular, the overall accuracy in [23] is 96.4102%, but the dataset contains 18 static hand gestures (720 samples) and 12 dynamic hand gestures (480 samples). We removed the static hand gesture, and the accuracy of the dynamic hand gesture is calculated to be 95.2%. The recognition of static hand gestures has matured in recent years and, sometimes, the accuracy is up to 100%. Therefore, it is reasonable for us to compare with it. The other works are all based on the same or a smaller sample size, and they have different models and collection approaches of dynamic hand gestures. Furthermore, this work also compared the proposed method with the work in [21] on Handicraft-Gesture dataset and, as shown in Table 7, our method achieved better performance. In addition, in its confusion matrix, it has low accuracies in the recognition of gesture 'pinch' (82.5%) and 'pull' (87.5%). However, our work only gets a low accuracy in the recognition of 'key tap' (78%) (Figure 11). Table 7. Comparison of the accuracy among significant works on the Handicraft-Gesture dataset.

Method Dataset Accuracy
This paper's method Handicraft-Gesture (300 samples) 96.7% Hidden Conditional Neural Field on Depth Features [21] Handicraft-Gesture (300 samples) 95% The computation time of our experiments is shown in the Table 8.

Conclusions
In this work, a recognition system of dynamic hand gesture recognition method is presented. It is based on a two-layer Bidirectional Recurrent Neural Network. In addition, an efficient dynamic gesture acquisition framework is proposed, and 26 discriminative features based on angles, positions, and distances between fingers are used. The system was able to obtain high accuracy results and our method demonstrated similar or better performance than several works on ASL and Handicraft-Gesture datasets. Future work will be focused on larger and more difficult dynamic hand gestures, combining two or more LMCs to capture dynamic hand gesture sequences. Furthermore, we are also working on creating a new dataset that includes more challenging dynamic hand gestures from ASL and other sign languages which supports multiple formats of data, such as time-based sequences and depth images. This dataset will serve as benchmark for testing on different models.