Hand Posture Recognition Using Skeletal Data and Distance Descriptor

In this paper, a method for the recognition of static hand postures based on skeletal data is presented. A novel descriptor is proposed; it encodes information about distances between particular hand points. Five different classifiers were tested, including four common methods and a proposed modification of the nearest neighbor classifier, which can distinguish between posture classes differing mostly in hand orientation. The experiments were performed using three challenging datasets of gestures from Polish and American Sign Languages. The proposed method was compared with other approaches found in the literature and outperforms every compared method, including our previous work, in terms of recognition rate.


Introduction
Automatic hand posture recognition is an important research topic in computer science [1]. The overall goal is to understand body language and thereby create more functional and efficient human-computer interfaces. Application areas are vast: from driver support via hand-controlled cockpit elements [2], through home automation with consumer electronics driven by gestures [3], gaming applications [4], and interaction with virtual objects [5], to technological support for people with disabilities [6]. Solutions available on the market are either limited, very simple, or have background and illumination requirements that are difficult to meet in real-life scenarios. Therefore, it is desirable to continue research on the automatic interpretation of hand gestures.
Available approaches can be divided into two groups: (i) methods using special gloves with sensors [7] and (ii) computer vision methods using cameras [8]. Vision-based solutions seem more attractive because they are more comfortable, do not require any additional equipment limiting the user's freedom of movement, mimic natural interaction, and avoid stigmatization. However, building a reliable vision-based system is quite a challenge. Color-based methods, frequently used to segment hands, fail in the case of complex backgrounds containing other skin-colored objects or users wearing short-sleeved clothing [9]. It is also very difficult to achieve color constancy under varying scene illumination. Fortunately, new depth-sensing devices have recently appeared on the market. They combine the visible and near-infrared parts of the spectrum to obtain good-quality depth maps. Some of them, based on the time-of-flight principle, can even work in a completely dark room.
Therefore, a shift to the depth modality has been observed in recent literature on hand gesture recognition [10]. The new devices can acquire good-quality 3D data, which can then be used to extract a hand skeleton containing information about the spatial configuration of the bones corresponding to the fingers. The main advantage of skeletal data, compared to images, point clouds, and depth maps, is its small size. Features calculated from skeletons can also be a good addition to typical image-based or depth-based features. We can also expect increasingly accurate devices providing this type of data. Therefore, there is a need to develop new recognition algorithms based on hand skeletons.
In this paper, the problem of hand posture recognition based on skeletal data extracted by a depth sensor was tackled. The method is based on a novel hand descriptor combined with a previously developed one. The experimental tests were performed using four different classification methods and a proposed modification of the nearest neighbor classifier.
The main contributions of this paper are as follows: 1. The novel hand descriptor encoding information about distances between selected hand points; 2. The modified nearest neighbor classifier, suitable for posture recognition in the case where some of the classes differ mostly in hand orientation; 3. Experimental verification of the proposed methods using challenging datasets.
The remaining parts of this paper are organized as follows. The related works are characterized in Section 2. The proposed hand posture recognition method is presented in Section 3. Section 4 discusses the used datasets and performed experiments. Section 5 concludes the paper.

Related Work
One of the devices that can be used to obtain skeletal data for hands is the Leap Motion (LM) sensor [11]. In the literature, there are several works devoted to the study of its usefulness for gesture recognition. In [12], the authors estimated the accuracy and repeatability of hand position measurement and found that this sensor outperforms competitive solutions in a similar price range. In [13], the sensor's usefulness for recognizing Australian Sign Language was assessed. Hand shapes for which the sensor does not work were identified. The authors concluded that the solution has great potential, but requires refinement of the API. In [14], the usefulness of the device for hand tracking was assessed. The authors noted that further development of the sensor is needed to implement professional systems.
In [15], a subset of hand shapes from the American Finger Alphabet was recognized using features based on skeletal data: angle, distance, and elevation of the fingertips. The support vector machine (SVM) classifier was used, and the recognition rate was 80.88%. After adding features obtained with the Kinect sensor (curvature and correlation), the recognition rate increased to 91.28%.
The 26 letters of the American Finger Alphabet, shown by two people, were also recognized in [16]. The feature vector consisted of pinch strength, grab strength, average distance, spread and tri-spread between fingertips, determined from skeletal data. The recognition rate was 72.78% for the nearest neighbor (kNN) classifier and 79.83% for SVM.
In [25], 28 letters of the Arabic Finger Alphabet were recognized using 12 of the 23 values measured by the LM sensor: finger length; finger width; average tip position with respect to the x-, y-, and z-axes; hand sphere radius; palm position with respect to the x-, y-, and z-axes; and hand pitch, roll, and yaw. A 98% recognition rate was obtained for the Naive Bayes classifier, while 99% was achieved for the Multilayer Perceptron.
Ten static hand shapes shown by 14 users were recognized in [18]. The following features were used: fingertip angles, distances, elevations, and positions. For the SVM classifier, the recognition rate was 81.5%. After adding features obtained from the Kinect's depth image, the recognition rate increased to 96.5%.
The Indian Finger Alphabet letters from A to Z and the numbers from 1 to 9, shown by ten users, were recognized in [29]. The feature vector consisted of the following distances: the fingertips-the middle of the palm, the index-the middle finger, the index-the ring finger, and the index-the little finger. For the kNN classifier, a recognition efficiency of 88.39% for the Euclidean metric and 90.32% for the cosine distance was obtained.
In [28], two LM controllers were used to prevent individual fingers from being obstructed. The 28 letters of the Arabic Finger Alphabet were recognized. The feature vector was a concatenation of the following features measured by the two controllers: finger length; finger width; average tip position with respect to the x-, y-, and z-axes; hand sphere radius; palm position with respect to the x-, y-, and z-axes; and hand pitch, roll, and yaw. The Linear Discriminant Analysis classifier was used. A recognition rate of 97.7% was obtained for a fusion of features and 97.1% for a fusion of classifiers.
Ten static hand shapes that can be used for rehabilitation after cerebral palsy were recognized in [35]. Features determined by the LM controller and three classification methods (decision tree, kNN, and SVM) were used. The obtained recognition rates of individual gestures ranged from 76.96% to 100%.
In [20], 26 letters of the American Finger Alphabet were recognized. The features measured by the LM controller and the Multilayer Perceptron (MLP) classifier were used. A recognition rate of 96.15% was achieved.
Forty-nine static gestures (twenty-six letters of the alphabet, ten numbers, and nine words) of Indian Sign Language were recognized in [31]. The feature vector was composed of the distances between the middle of the hand and the fingertips. The kNN classifier with four different measures of similarity: Euclidean distance measure, cosine similarity, Jaccard similarity, and Dice similarity was used. For gestures performed by ten people, recognition rates ranging from 83.11% to 90% were obtained.
In [27], forty-four static gestures (twenty-eight letters, ten numbers, and sixteen words) from the Arabic Sign Language were recognized. Two variants of the feature vector consisting of 85 and 70 scalar values measured by the LM controller were considered. The training set consisted of 200 performances of each gesture by two people. Tests were carried out on 200 executions of individual gestures by a third person. Three variants of the classifier were considered: SVM, kNN, and artificial neural network (ANN). The best recognition rate of 99% was obtained for the kNN classifier and the first considered feature vector variant.
Twenty-six letters and ten numbers of the American Finger Alphabet were recognized in [23]. Six different combinations of the following features were considered: standard deviation of palm position, palm curvature radius, the distance between the palm center and each fingertip, and the angle and distance between two adjacent fingertips. For gestures performed by twelve people and leave-one-subject-out protocol, the recognition rate was 72.79% for SVM and 88.79% for the deep network.
The 24 characters of the American Finger Alphabet shown by five people were recognized in [24]. Skeletal data in the form of angles between adjacent bones of the same finger and angles between adjacent fingers were used, as well as infrared images obtained with an LM sensor. For the classifier based on deep networks and leave-one-subject-out protocol, a recognition rate of 35.1% was obtained.
In [34], 48 static hand shapes from the Polish Finger Alphabet and Polish Sign Language were recognized. Gestures were shown 500 times by five users. Two different positions of the LM sensor and changes in the orientation of the hand were considered. Several classifiers, as well as their fusion, were tested. For the leave-one-subject-out protocol, the best recognition rate was 56.7%.
Twenty-four static gestures of the American Finger Alphabet, shown ten times by twelve people, were recognized in [22]. Features based on the skeletal data returned by the LM controller were used. The classification was carried out using hidden Markov models. For the leave-one-subject-out protocol, the recognition rate was 86.1%.
In [21], 12 dynamic and 18 static gestures of American Sign Language, shown by 20 people, were recognized. The feature vector was composed of the following: the internal angles of the joints between distal and intermediate phalanges, and intermediate and proximal phalanges; 3D displacements of the central point of the palm; 3D displacements of the fingertip positions; and the intrafinger angles. Recursive neural networks were used for the classification. The recognition rate was 96%.
The problem of finger occlusion, occurring when performing gestures, was considered in [36]. For this purpose, three LM controllers were used. The feature vector consisted of a hand rotation matrix, the y-components of the finger directions, and the distal phalanges' rotation quaternions. Three classification methods available in the scikit-learn library were tested: logistic regression, SVC, and XGBClassifier. An 89.32% recognition rate was obtained for six selected hand gestures.
In [37], letters from the Israeli Sign Language (ISL) alphabet were recognized using the SVM classifier and a feature vector consisting of the Euclidean distances between the fingertips and the center of the palm. The training dataset consisted of six letters performed 16 times by eight users. The system was able to translate fingerspelling into a written word with a recognition accuracy between 85% and 92%.
The challenging problem of recognizing fist-like signs, occurring in ASL, was described in [38]. In the proposed method, the areas of several polygons defined by the fingertip and palm positions, estimated using the Shoelace formula, were used. The letters from the ASL alphabet were classified using decision trees (DT). Seven letters performed 30 times by four persons were used as the training set. For 100 repetitions of each gesture by another user, the method achieved a recognition accuracy of 96.1%.
In [39], static hand postures recognition for a humanlike robot hand was described. Ten digits from ASL were recognized using the multiclass-SVM classifier. The feature vector consisted of the distances between the palm position and each fingertip and the distance between fingertips. The method was validated using a test dataset composed of 2000 static posture samples with an accuracy of 98.25%.
In [40], five one-handed and five two-handed static gestures of Turkish Sign Language, performed three times by two users, were recognized using artificial neural networks, deep learning, and decision trees. The 3D positions of all bones in the skeletal hand model, measured by the LM controller, were used. The recognition accuracies between 93% and 100% were achieved, depending on the classifier and the number of features used.
In [41], a deep-learning-based method for skeleton-based hand gesture recognition was described. The network architecture consists of a convolutional layer for extracting features and a long short-term memory layer for modeling the temporal dimension. Ten static and ten dynamic hand gestures performed 30 times were recognized with an accuracy of 99%.
Based on the literature review, the following conclusions can be drawn:
• Although the currently available devices for obtaining skeletal data are imperfect, a significant increase in interest in using this modality for gesture recognition has recently been observed; several new publications have appeared in the last two years.
• Most authors do not provide the data used in their experiments, which makes verification and comparative analysis difficult. Only three datasets, available in works [15,18,24,34], are known to the authors.
• Many cited works omit tests using the leave-one-subject-out protocol. These tests are more reliable because they show the method's dependence on the person performing the gestures.
• Some of the proposed feature vectors use features directly measured by the sensor; such features are not independent of the size of the hand.
• Some of the works relate to the recognition of dynamic gestures, in which the hand movement trajectory is a great help.
The problem of recognizing hand postures based on skeletal data has not been fully solved, and further work in this area is advisable.

Proposed Method
Skeletal data obtained using the Leap Motion sensor was used [42]. The feature vector was the concatenation of the Point Pair Descriptor (PPD) introduced in [34] and the Distance Descriptor (DD) proposed in this paper. Five different classifiers were tested. Four of them proved to be the best among the 18 tested in [34]. The fifth classifier is a novel modification of the kNN method proposed in this paper. Details are given in Sections 3.1-3.3.

Point Pair Descriptor
Let P_c be the palm center, n_c the normal to the palm at point P_c, P_i the tip of the i-th finger, and n_i the vector pointed by that finger (Figure 1).
For each finger, three features are calculated:

α_i = v_i · n_i, (1)
φ_i = n_i · (P_c − P_i) / |P_c − P_i|, (2)
Θ_i = arctan2(w_i · n_i, u · n_i), (3)

where the vectors u, v_i, and w_i define the so-called Darboux frame [44]:

u = n_c,  v_i = ((P_i − P_c) × u) / |(P_i − P_c) × u|,  w_i = u × v_i,

and · denotes the scalar and × the vector product. The Point Pair Descriptor consists of the 15 features obtained by calculating Formulas (1)–(3) for each of the five fingers. The features were normalized to the interval [0, 1].
Features α and Θ can be interpreted as the pan and yaw angles between the vectors pointed by the fingers and the palm normal. Feature φ is the angle between the vector pointed by a finger and the line connecting the fingertip with the initial point of the palm normal. PPD is an alternative to other angular features describing the hand skeleton. Such features are most often the angles corresponding to the orientation of each fingertip projected onto the palm plane [15,18], between the lines connecting the palm center with the fingers [45], between adjacent bones of the same finger, or between adjacent fingers [24]. Unlike the PPD descriptor, these features do not use the normal to the palm, calculated at its center, to determine the angular relations.
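A minimal Python sketch of the three per-finger angles is given below. Since Formulas (1)–(3) are not reproduced verbatim in this excerpt, the frame construction and angle conventions follow the standard Darboux-frame construction of [44] and should be read as an assumption rather than the authors' exact implementation; the final normalization to [0, 1] is omitted.

```python
import numpy as np

def ppd_finger_features(p_c, n_c, p_i, n_i):
    """Sketch of the three PPD angles for one finger.
    Assumption: a PFH-style Darboux frame built from the palm normal;
    the paper's exact Formulas (1)-(3) may differ in detail."""
    u = n_c / np.linalg.norm(n_c)                  # frame axis from the palm normal
    d = p_i - p_c                                  # palm center -> fingertip
    v = np.cross(d, u)
    v = v / np.linalg.norm(v)                      # assumes fingertip is not along n_c
    w = np.cross(u, v)
    n_i = n_i / np.linalg.norm(n_i)
    alpha = float(np.dot(v, n_i))                  # "pan"-like relation to the finger vector
    phi = float(np.dot(n_i, (p_c - p_i) / np.linalg.norm(d)))  # finger vector vs. connecting line
    theta = float(np.arctan2(np.dot(w, n_i), np.dot(u, n_i)))  # "yaw"-like relation
    return alpha, phi, theta
```

Calling this for all five fingertips yields 15 raw values, which would then be normalized to [0, 1] as described above.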

Distance Descriptor
The Distance Descriptor is a novel method proposed in this paper. It encodes information about distances between hand points corresponding to fingertips and the palm center. It uses only information about the positions of these points. Normal vectors and vectors pointed by the fingers are not required. The descriptor can be computed as follows.
1. For each point P_i:
1.1. Compute the distances (using the Euclidean or city block metric) to the other points P_j, j ≠ i.
The purpose of step 3 is not only to reduce the number of features. After this step, for each point P_i, the descriptor determines not only which of the remaining points P_j are its nearest neighbors, but also which of the points P_j consider P_i as their nearest neighbor. The features of the Distance Descriptor are normalized to the interval [0, 1] by dividing them by 2(n − 1), where n = 6 is the number of points.
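Since the intermediate steps of the descriptor are only summarized above, the Python sketch below illustrates one plausible reading of the idea: each pairwise feature combines the neighbor rank of P_j as seen from P_i with the rank of P_i as seen from P_j, which yields values naturally normalized by 2(n − 1). The function name and the exact ranking scheme are assumptions, not the authors' published algorithm.

```python
import numpy as np

def distance_descriptor(points, metric="euclidean"):
    """Hypothetical sketch of the Distance Descriptor: for each pair
    (i, j), combine the neighbour rank of j as seen from i with the
    rank of i as seen from j, normalised by 2(n - 1)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if metric == "euclidean":
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    else:  # city block
        dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=-1)
    # rank[i, j]: position of point j in the sorted neighbour list of i (1 = nearest)
    rank = np.zeros((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(dist[i])          # point i itself sorts first (rank 0)
        for r, j in enumerate(order):
            rank[i, j] = r
    feats = []
    for i in range(n):
        for j in range(i + 1, n):
            feats.append((rank[i, j] + rank[j, i]) / (2 * (n - 1)))
    return feats
```

For the paper's n = 6 points (five fingertips and the palm center), this produces 15 mutual-rank features in (0, 1].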
DD is an alternative to other positional and distance-based features describing the hand skeleton. Such features are simple 3D or 2D coordinates, distances between the fingertips and the palm center [15,18,39,40,45], or distances between the fingertips and the palm plane [15,18]. Some of them are normalized with respect to the hand position and orientation; however, such normalization is not fully accurate. Moreover, even after normalization, these features, unlike the DD descriptor, are scale-dependent, which makes it difficult for such methods to recognize gestures performed by people with different hand sizes (especially children's hands). Some of these features are also not distinctive enough to differentiate between similar hand postures, because they do not include the positional relations between fingertips (only between each fingertip and the palm center or palm plane), whereas DD features include the relations between each pair of fingertips and between the fingertips and the palm center.
The Matlab codes for Point Pair Descriptor and Distance Descriptor can be downloaded from our website [46].

Classification
Four classifiers were tested: support vector machine with linear kernel function (SVM-Lin) [47], linear discriminant (LD) [48], an ensemble of multiple decision trees (tree bagger, TreeBag) [49], and the weighted k-nearest neighbors classifier with k = 10 (10NN-W) [50]. The parameters of the tested classifiers are listed in Table 1. We chose the parameters starting from the default values of our classification tool; we then changed each of them in turn and observed whether the change improved the results. Some of the recognized shapes differ only in the spatial orientation of the hand. Therefore, the following modification of the nearest neighbor classifier was proposed. When searching for the nearest neighbor, only those samples from the training set whose orientation is similar to that of the tested one are taken into account. A training sample is marked as a potential nearest neighbor only if

arccos( (n_cx · n_ct) / (|n_cx| |n_ct|) ) ≤ γ_t,

where n_cx and n_ct are the normals to the palm of the classified and training sample, respectively, · denotes the scalar product, and γ_t is a threshold. We named the proposed method the nearest neighbor classifier with orientation restriction (NNOR). It should be considered one of the novelties proposed in this paper.
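The orientation check above translates directly into code. The following is a minimal Python sketch of NNOR, assuming a brute-force 1-NN search; the function name and data layout are illustrative, not taken from the paper.

```python
import numpy as np

def nnor_predict(x, n_cx, train_feats, train_normals, train_labels,
                 gamma_t_deg=35.0, metric="euclidean"):
    """Sketch of the nearest neighbour classifier with orientation
    restriction (NNOR): only training samples whose palm normal lies
    within gamma_t of the query's palm normal are candidates."""
    n_cx = n_cx / np.linalg.norm(n_cx)
    best_label, best_dist = None, np.inf
    for feat, n_ct, label in zip(train_feats, train_normals, train_labels):
        n_ct = n_ct / np.linalg.norm(n_ct)
        angle = np.degrees(np.arccos(np.clip(np.dot(n_cx, n_ct), -1.0, 1.0)))
        if angle > gamma_t_deg:
            continue                      # orientation too different: reject
        if metric == "euclidean":
            d = np.linalg.norm(np.asarray(x) - np.asarray(feat))
        else:                             # city block
            d = np.abs(np.asarray(x) - np.asarray(feat)).sum()
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label
```

With γ_t = 35°, as used in the experiments, two training samples with identical feature vectors but palm normals 90° apart are no longer confusable: only the one matching the query's orientation is considered.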

Datasets
The experiments were performed using two datasets recorded by the Leap Motion sensor. Dataset 1 was introduced in [34] and can be downloaded from our website [46] (see Figure 2).  It consists of 48 hand posture classes from the Polish Finger Alphabet (PFA) and Polish Sign Language (PSL). Each gesture was performed 500 times by 5 people, which is 120,000 executions in total. During the recordings, the sensor was lying horizontally on the table.
To perform an additional evaluation of our method and compare the results with other works, two additional datasets were used: Dataset 2 provided by Marin et al. [18] and Dataset 3 provided by Tao et al. [24]. Dataset 2 consists of 10 posture classes corresponding to letters from American Sign Language. Each gesture was performed 10 times by 14 people, which is 1400 executions in total. Dataset 3 consists of 24 posture classes corresponding to letters from American Sign Language. Each gesture was performed 450 times by 5 people, which is 54,000 executions in total.
Gestures from each dataset are represented by the coordinates of the fingertips and the palm center, the vector normal to the palm, and the vectors coinciding with the fingers' pointing directions.

Results
The results of 10-fold cross-validation obtained for Dataset 1 are shown in Table 2.

Table 2. Results of 10-fold cross-validation obtained for Dataset 1.

The proposed method achieved 100% accuracy with the weighted k-nearest neighbors and tree bagger classifiers. However, the results of the leave-one-subject-out, 5-fold cross-validation, presented in Table 3, are significantly worse.

Table 3. Results of leave-one-subject-out, 5-fold cross-validation tests obtained for Dataset 1.

Some posture classes from Dataset 1 (e.g., 2, 100, TM, or H, U, or N, Nw) differ only in hand orientation, and none of the tested classifiers was able to distinguish between them. Therefore, the NNOR classifier was proposed and tested in two variants: with the Euclidean (NNOR-Euc) and the city block (NNOR-CB) distance. The threshold γ_t was experimentally set to 35 degrees. The results of leave-one-subject-out validation obtained for Dataset 1 are shown in Table 4. The best accuracy, 63.9%, is 5.4% higher than in the case of the previously tested classifiers.

Table 4. Results of leave-one-subject-out, 5-fold cross-validation tests obtained for Dataset 1 using the nearest neighbor classifier with orientation restriction (NNOR).

Features                  NNOR-Euc   NNOR-CB
PPD + DD (Euclidean)      63.9       63.9
PPD + DD (city block)     56.7       58.1

Table 5 presents the comparison of the best results obtained by the method proposed in this paper and by the method from our previous work. The new method outperforms the previously proposed algorithm by 11.6%, which confirms the usefulness of the novel descriptor DD and the novel classifier NNOR. The authors know of only two publicly available sets of static hand skeletal data for which a comparative analysis using our method is possible: Dataset 2 and Dataset 3, described in Section 4.1. Table 6 presents the comparison of the recognition rates obtained for Dataset 2 by our method (in two configurations) and by the methods proposed in two other works. The proposed algorithm outperforms the other methods by more than 9%. Dataset 3 can be considered very challenging, since its authors [24] achieved a recognition rate of only 35.1% with skeletal data. Table 7 presents the comparison of the recognition rates obtained for this dataset by our method and the method proposed in another work. The experiments were performed using Matlab R2018b with the Classification Learner toolbox on a PC with an Intel Core i5-8300H 2.3 GHz CPU and 16 GB RAM. The average time of feature extraction is about 5 ms. The total recognition time (feature extraction and classification) does not exceed 70 ms. Therefore, the time delay measured from gesture execution to the predicted response of the program is barely noticeable by the user.
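The leave-one-subject-out protocol used in these experiments can be sketched as a simple loop over subjects; `train_and_eval` below is a hypothetical placeholder standing in for any of the classifiers discussed above.

```python
def leave_one_subject_out(samples, subjects, train_and_eval):
    """Leave-one-subject-out evaluation: for each subject, train on the
    samples of all other subjects and test on that subject's samples.
    `train_and_eval(train, test)` is a placeholder returning an accuracy."""
    accuracies = []
    for s in sorted(set(subjects)):
        train = [x for x, u in zip(samples, subjects) if u != s]
        test = [x for x, u in zip(samples, subjects) if u == s]
        accuracies.append(train_and_eval(train, test))
    return sum(accuracies) / len(accuracies)
```

Because the test subject's samples never appear in the training split, this protocol measures how well the method generalizes to unseen users, which is why its scores are lower than those of ordinary 10-fold cross-validation.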

Conclusions
Hand posture recognition is a classical task in computer vision [1,6,8]. Although many methods perform robustly under certain limitations, the problem is still open. Challenging barriers persist in creating a recognition system able to function in real-world conditions. The most important among them are the occlusion of fingers resulting from the projection of the 3D scene onto the 2D image plane, the scalability of the considered gesture dictionaries, varying background illumination, high computational cost, and the repeatability of gesture execution by potential users. Currently, new devices working in visible and near-infrared light are being developed, which allow accurate 3D information about the observed scene to be obtained. There is a chance that the usage of these devices will eliminate some of the restrictions mentioned above; therefore, a shift to the depth modality is observed in recent literature on hand gesture recognition. In this paper, a novel descriptor was proposed, which encodes information about distances between particular hand points. It is a scale-independent and distinctive alternative to other positional and distance-based features describing the hand skeleton; its features include the relations between each pair of fingertips and between the fingertips and the palm center. Unlike most other works, the method was tested on a dataset containing 48 classes, among which many similar shapes can be identified. It was observed that the independence of the proposed approach from hand orientation is not always desirable and can lead to difficulties in recognizing some hand configurations. Therefore, a modified version of the nearest neighbor classifier was proposed, which can distinguish between very similar or identical postures differing only in hand orientation. The n-fold cross-validation tests were performed on three challenging datasets.
For each dataset, the leave-one-subject-out protocol was used, which usually gives the worst results, but is the most trustworthy. It shows how the method deals with different gesture performances by individual users. The experimental results were compared with our previous work as well as with other methods found in the literature. A significant improvement of results over the compared methods was observed.
Summarizing, among the methods for static hand posture recognition based only on hand skeletal data with which our method could be compared (the comparison being limited mainly by the availability of public datasets), we did not find any that are better than our method in terms of recognition rate.
The proposed Distance Descriptor is invariant to position and scale. It is also invariant to rotation; however, this property is desirable only for datasets that do not contain classes differing only in orientation. Using the proposed nearest neighbor classifier with orientation restriction makes the recognition method partially dependent on orientation, enabling it to distinguish between such gestures. The threshold parameter γ_t has to be set experimentally based on the orientations of similar postures in the considered sign language; it should be smaller than the smallest angle between the palm normals of any similar posture classes. However, it also cannot be too small, since that would make the recognition method too strongly dependent on hand orientation. It is worth noting that the descriptors and features proposed in the literature are not always invariant to hand size and orientation.
The features of PPD contain information about the angular relations of skeletal joints, whereas the information carried by DD features is positional and relies on distances between the joints. The results in Tables 3 and 5 indicate that these two types of information complement each other. The proposed recognition method is fast and does not require a specific background, lighting conditions, or any special outfit, e.g., gloves. The main reason for the weaker results of the leave-one-subject-out tests is the imperfection of the sensor, which has issues with the proper detection of occluding fingers. Therefore, further work may include obtaining more accurate and reliable hand skeletal data using two calibrated depth sensors. Another future study topic may be the recognition of letter sequences (fingerspelling), understood as quick, highly coarticulated motions. Finally, the developed PPD and DD descriptors can be adopted in a method for the recognition of human actions based on whole-body skeletons (e.g., obtained from a Kinect camera).
Author Contributions: Conceptualization, methodology, D.W. and T.K.; software, D.W. and T.K.; datasets, T.K.; experiments design, discussion of the results, D.W.; writing-review and editing, D.W. and T.K. All authors have read and agreed to the published version of the manuscript.
Funding: This project is financed by the Minister of Science and Higher Education of the Republic of Poland within the "Regional Initiative of Excellence" program for years 2019-2022. Project number 027/RID/2018/19, amount granted 11 999 900 PLN.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

PPD       Point Pair Descriptor
DD        Distance Descriptor
SVM-Lin   support vector machine with linear kernel function
LD        linear discriminant
TreeBag   ensemble of multiple decision trees (tree bagger)
10NN-W    weighted k-nearest neighbors classifier with k = 10
NNOR-CB   nearest neighbor with orientation restriction and city block distance
NNOR-Euc  nearest neighbor with orientation restriction and Euclidean distance