Recognition of American Sign Language Gestures in a Virtual Reality Using Leap Motion

Featured Application: We describe a system that uses a Leap Motion device to recognize the gestures performed by users while immersed in Virtual Reality (VR). The developed system can be applied for the development of VR applications that require identification of the user's hand gestures for control of virtual objects. Abstract: We perform gesture recognition in a Virtual Reality (VR) environment using data produced by the Leap Motion device. Leap Motion generates a virtual three-dimensional (3D) hand model by recognizing and tracking the user's hands. From this model, the Leap Motion application programming interface (API) provides hand and finger locations in 3D space. We present a system that is capable of learning gestures by using the data from the Leap Motion device and the Hidden Markov classification (HMC) algorithm. When recognizing gestures of the American Sign Language (ASL), we achieved a gesture recognition accuracy (mean ± SD) of 86.1 ± 8.2% and a gesture typing speed of 3.09 ± 0.53 words per minute (WPM).


Introduction
Hand gesture recognition is widely researched as it can be applied in different areas such as human-computer interaction [1], robotics [2], computer games [3], education [4], automatic sign-language interpretation [5], decision support for medical diagnosis of motor skills disorders [6], recognition of children with autism [7], home-based rehabilitation [8,9], virtual training [10] and virtual surgery [11]. In industry, gesture recognition can be used in areas requiring very high precision, such as controlling robot hands [12] or industrial equipment.
Hand gestures can be employed to control a virtual reality (VR) environment using a Leap Motion Controller. The controller tracks the movements of a VR environment operator's hands and fingers moving over the Leap Motion device in a specific sequence. Then, an operation corresponding to the recognized gesture is executed on the system to which the Leap Motion device is connected. The operating principle of Leap Motion is similar to that of a computer mouse or a touch screen but its operation is based on video recognition. Using two infrared (IR) cameras, this device can recognize human hands and allows the user to explore the virtual world and interact with the elements of this world. Although the Leap Motion device is capable of recognizing human hands, it cannot directly recognize the gestures displayed by users. It is able to model the human hand and present its data in three-dimensional space. There are software libraries with features capable of recognizing some gestures, such as gripping, but the VR environment requires recognition of many different gestures.
Leap Motion has been used before for the recognition of the Arabic [13,14], Indian [15,16], Turkish [17], Greek [18], Thai [19], Indonesian [20] and American [21][22][23][24] sign languages. Ameur et al. [25] used a Support Vector Machine (SVM) trained on spatial feature descriptors representing the coordinates of the fingertips and the palm centre to recognize gestures with an accuracy rate of about 81%, while Chuan et al. [21] achieved an accuracy of 79.83% using an SVM trained on the average distance between fingertips, the spread distance between adjacent fingertips and the tri-spread area between two adjacent fingertips. Fok et al. [26] achieved an average recognition rate of 93.14% using data fusion of two Leap Motion sensors and the Hidden Markov Model (HMM) classifier trained on orientation and distance-ratio features (the relative orientations of the distal phalanges to the orientation of the palm; the ratio of the distance between a fingertip and the palm to the sum of distances between fingertips and the palm; the ratio of the distance between fingertips to the total distance among fingertips). Hisham and Hamouda [27] achieved accuracies of 97.4% and 96.4% on Arabic signs using palm and bone feature sets, respectively, with Dynamic Time Warping (DTW) for dynamic gesture recognition. Lu et al. [28] used the Hidden Conditional Neural Field (HCNF) classifier to recognize dynamic hand gestures, achieving 89.5% accuracy on two dynamic hand gesture datasets. Avola et al. [24] trained a Recurrent Neural Network (RNN) on features defined as the angles formed by the finger bones of the human hands, achieving over 96% accuracy on the American Sign Language (ASL) dataset.
In VR applications, Leap Motion has been used in an educational context to learn the laws of classical mechanics, demonstrated by the application of physical forces on bodies and how the forces influence the motion of bodies [29], while McCallum and Boletsis [30] used Leap Motion for gesture recognition in an Augmented Reality (AR) based game for the elderly. Sourial et al. [31] used Leap Motion in a virtual therapist system, which focused on helping the patient perform physical exercises at home in a gamified environment and on providing guidance on exercising, observing the condition of a patient, correcting movement errors and evaluating the exercising achievements of the patient. Valentini and Pezutti [32] evaluated the accuracy of Leap Motion for use in interactive VR applications such as virtual object manipulation and virtual prototyping. Pathak et al. [33] used the Leap Motion device to interact with three-dimensional (3D) holograms by recognizing hand and finger movements and gestures. Jimenez and Schulze [34] used the Oculus Rift VR device with a Leap Motion controller for continuous-motion text input in VR. Komiya and Nakajima [35] used Leap Motion for implementing text input in Japanese, reaching an average input speed of 43.15 characters per minute (CPM) for the input of short words. Finally, Jiang et al. [36] used the fusion of signals captured using force myography (FMG), which registers muscular activity during hand gestures, with the Leap Motion data for grasping virtual objects in VR. The use of moving hands for controlling the virtual space in VR games and applications has been confirmed as being important in making VR environments realistic and immersive [37]. This paper presents a system that uses a Leap Motion device to record the positions of the user's hands as they perform gestures and uses these data to recognize the corresponding gestures in a VR environment.
The system uses the HMM classification [38], which is used to recognize gesture sequences in an unsupervised way. The system can be applied for the development of VR projects that require identification of the user's hand gestures for control of virtual objects in VR environments.

The Leap Motion Device and Gesture Recognition
Leap Motion has two monochrome IR cameras and three IR light-emitting diodes (LEDs). These LEDs generate a 3D dot pattern, which is registered by the monochrome cameras. From the two 2D images obtained with the monochrome IR cameras, Leap Motion generates a spatial pattern of the user's hands. Unlike Microsoft's Kinect, which tracks the complete human skeleton, Leap Motion follows only the hands of users and can predict the position of the fingers, palm or wrist in case these are occluded. Leap Motion operates at a distance of 25 to 600 mm within a 150-degree field of view, allowing the user to perform gestures freely in space.
Using the IR cameras, one can determine the coordinates of each hand point. In order to recognize hand gestures, it is necessary to process a large amount of data by determining the parts of the forearm, wrist, hand and fingers. The Leap Motion software receives a 3D spatial skeleton from the 3D image, analyses it and aggregates it into objects that hold the corresponding hand-part information. The Leap Motion controller has three main hand objects: full arm, palm and fingers. A full-arm object provides information about the position of a hand in space, its length and width. The hand object holds information about the hand (left or right) position and the finger list of that hand. The most important part of Leap Motion required for gesture handling is the fingertip object. This object holds the basic bone data for each of the user's fingertips.
Although the Leap Motion device is capable of recognizing human hands, it cannot directly recognize the gestures displayed by users. It is only able to simulate the spatial model of a human hand; the device does not have the functionality which, based on these data, could tell when a user shows, for example, a single pointed-finger gesture. The Leap Motion device presents a 3D spatial model of a human hand (see Figure 1). With this model, one can get the coordinates and turning angles of each hand, its bones, the palm centre and other necessary information. If the device is always located in the same position in front of the user and the user displays the same gesture, the device will provide almost the same data with only a small error.
This means that each hand gesture that does not involve any movement can be characterized by the position of the forearm and of each finger in space. If we record the Leap Motion data for such a gesture to the database, we can then use it as a pattern for recognizing that gesture. Only three Degrees of Freedom (DoF) are required for the recognition of static gestures: deviation, inclination and pitching. If the angles between the forearm and fingers are similar, the displayed static gesture is recognized. The recognition of dynamic gestures, which involve a certain movement, is similar. The gesture database contains spatial data typical for each gesture. When a user is showing a dynamic gesture, the algorithm checks how the spatial data for each image frame varies. If the change is similar to the data in the database, the dynamic gesture is recognized.
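As a minimal sketch of this template-matching idea (the gesture names, angle vectors and tolerance below are illustrative assumptions, not values from the described system), a static gesture could be recognized by comparing the measured forearm-to-finger angles against stored patterns:

```python
import numpy as np

def match_static_gesture(angles, templates, tolerance_deg=10.0):
    """Return the name of the closest stored template whose per-angle
    deviation stays within the tolerance, or None if no template fits."""
    best_name, best_err = None, float("inf")
    for name, ref in templates.items():
        # Worst-case angular deviation between measured and stored gesture.
        err = np.max(np.abs(np.asarray(angles) - np.asarray(ref)))
        if err < tolerance_deg and err < best_err:
            best_name, best_err = name, err
    return best_name

# Hypothetical templates: one angle (degrees) per finger.
templates = {"point": [170, 20, 15, 10, 25], "fist": [30, 25, 20, 15, 20]}
print(match_static_gesture([168, 22, 14, 12, 26], templates))  # prints "point"
```

Dynamic gestures would instead compare how such vectors change from frame to frame against recorded sequences.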

Network Service Gesture Identification System
The amount of data needed to recognize a gesture can grow exponentially depending upon the number of gestures. For ten or more gestures, the algorithms of the gesture recognition system require a considerable amount of time, on average about half a minute. However, the alphabet of ASL has thirty-two different gestures (signs). As it requires large computation resources for online retraining, the system was implemented using a cloud-based network service. For implementation, we have employed microservices, a case of the service-oriented architecture (SOA) that defines an application as a stack of loosely coupled services [39].
The connection of the gesture recognition system to the network service (Figure 2) has made it possible to easily allocate the resources necessary for the system between several computers. All gesture identification data are stored in the remote gesture database. By launching several services which send requests to the same database, it has been possible to manage algorithmic training. At the same time, it is possible to record new gestures, carry out research and teach algorithms to recognize new gestures.

The network service provides easy access to the gesture recognition system from different environments. The Leap Motion device can be used in games created using the Unity or Unreal Engine gaming engines and easily integrated into any Windows application or web page. The gesture recognition system launched as a network service allows many different systems to communicate with it. For the time being, due to its easier installation, the Simple Object Access Protocol (SOAP) was used but the system can easily be expanded to accept Representational State Transfer (REST) data requests. This functionality would allow the gesture recognition system to be accessed from any environment. Data recorded with the Leap Motion device are stored in a Microsoft Structured Query Language (MS SQL) database, which allows for the creation of personalized gesture collections as in Reference [40].

Gesture Identification
Leap Motion continuously provides frames of the user's hands. The problem arises when we want to filter out the sequence of frames in which a gesture is performed. If we send all the data to the gesture recognition system, the system will recognize the gestures poorly. This is due to the fact that certain gestures may consist of several other types of gestures. This is often seen with a motion gesture: the motion gesture consists of a large number of frames, among which there are gestures without motion. To solve this problem, the states of the system are defined as follows:
• Start. The system is waiting for the user to start moving. If the hand starts moving, the system transitions to the Waiting state.

Gesture Identification
• Waiting state. The system waits until the state changes. If the system does not see the hand, the system returns to the Start state. If the user does not move the hand, the system goes to the Stationary gesture lock state.
• Stationary gesture lock state. If the user does not move the hand for two seconds, the gesture is fixed: the recorded hand model data are saved and the system transitions to the Gesture recognition state. If the user moves the hand within the two seconds, the system's state changes to the Motion detection state.
• Motion detection state. If the device can no longer follow the user's hand, the recorded hand model data are saved and the system's state is changed to the Gesture recognition state.
• Gesture recognition state. Data captured in this state are sent to the gesture recognition subsystem. When the subsystem returns results, they are presented to the user and the system goes to the Data clearing state.
• Data clearing state. The system clears unnecessary data and returns to the Start state.
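The transitions between the states above can be sketched as a simple state machine (a minimal illustration under our own naming; the flags `hand_visible`, `hand_moving` and `still_for_2s` are assumed inputs derived from the Leap Motion frames, not part of the described system's API):

```python
from enum import Enum, auto

class State(Enum):
    START = auto()
    WAITING = auto()
    STATIONARY_LOCK = auto()
    MOTION_DETECTION = auto()
    GESTURE_RECOGNITION = auto()
    DATA_CLEARING = auto()

def next_state(state, hand_visible, hand_moving, still_for_2s=False):
    """One transition step of the gesture-capture state machine."""
    if state is State.START:
        # Wait for the user to start moving.
        return State.WAITING if hand_moving else State.START
    if state is State.WAITING:
        if not hand_visible:
            return State.START
        return State.WAITING if hand_moving else State.STATIONARY_LOCK
    if state is State.STATIONARY_LOCK:
        if hand_moving:
            return State.MOTION_DETECTION
        # Hand held still for two seconds: fix the gesture and recognize it.
        return State.GESTURE_RECOGNITION if still_for_2s else State.STATIONARY_LOCK
    if state is State.MOTION_DETECTION:
        # Tracking lost: save the recorded data and recognize the gesture.
        return State.MOTION_DETECTION if hand_visible else State.GESTURE_RECOGNITION
    if state is State.GESTURE_RECOGNITION:
        return State.DATA_CLEARING
    return State.START  # DATA_CLEARING always returns to START
```

In the real system the recognition state would also dispatch the saved frames to the network service before clearing.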

Feature Extraction and Pre-Processing
The general approach for feature extraction presented here is shown in Figure 4. We extract four types of hand features: the 3D positions of the fingertips, the fingertip distances from the hand centroid, the elevations of the fingertips above the plane of the palm and the angles between the fingertip-to-palm-centre vectors. The fingertip angles (adopted from [41]) are the angles representing the fingertip orientation projected on the palm.
The Leap Motion Controller provides the 3D positions of 11 finger joints. For each gesture we calculate the Euclidean distances between the seven main hand vertices, representing the tip positions of the thumb, index, middle, ring and pinky fingers, the palm position and the wrist position. In all, there are 21 distances between the 7 vertices. Additionally, angular features were generated, representing the angles between any three different vertices, yielding another 35 features. In total, 56 features (21 distance and 35 angular) are extracted. To make all features uniform, z-score based normalization is applied, which normalizes the data by subtracting the mean and dividing by the standard deviation.
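A sketch of the distance and angle feature computation described above (one assumption: the angle of each vertex triple is measured at the middle vertex of the triple, which the text does not specify):

```python
import numpy as np
from itertools import combinations

def extract_features(vertices):
    """Build the 56-element feature vector from the 7 hand vertices
    (5 fingertips, palm centre, wrist), each a 3D point:
    C(7,2) = 21 pairwise Euclidean distances plus
    C(7,3) = 35 angles, one per vertex triple."""
    v = np.asarray(vertices, dtype=float)            # shape (7, 3)
    dists = [np.linalg.norm(v[i] - v[j]) for i, j in combinations(range(7), 2)]
    angles = []
    for i, j, k in combinations(range(7), 3):
        a, b = v[i] - v[j], v[k] - v[j]              # apex at middle vertex j
        cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return np.array(dists + angles)                  # 21 + 35 = 56 features

def zscore(X):
    """Column-wise z-score normalization over a set of feature vectors."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

The normalization is applied per feature over the whole training set so that distances (in millimetres) and angles (in radians) become comparable.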
Following [42], we describe the kinematic model of the hand movement as follows:

$$\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{bmatrix} = \begin{bmatrix} \cos\theta\cos\psi & \sin\phi\sin\theta\cos\psi - \cos\phi\sin\psi & \cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi \\ \cos\theta\sin\psi & \sin\phi\sin\theta\sin\psi + \cos\phi\cos\psi & \cos\phi\sin\theta\sin\psi - \sin\phi\cos\psi \\ -\sin\theta & \sin\phi\cos\theta & \cos\phi\cos\theta \end{bmatrix} \begin{bmatrix} u \\ v \\ w \end{bmatrix}$$

here x, y, z are the 3D coordinate components, u, v, w are the velocity components, θ is the roll angle, ψ is the pitch angle and ϕ is the yaw angle of the hand. The total velocity V of the hand's centroid is calculated as:

$$V = \sqrt{u^2 + v^2 + w^2}$$

The angles are calculated as follows:

$$\alpha = \arctan(w/u), \qquad \beta = \arcsin(v/V), \qquad \gamma = \arccos(\cos\alpha\cos\beta) \quad (5)$$

here α is the angle of attack, β is the angle of sideslip and γ is the angle of total attack. The fingertip distances represent the distances of the fingertips from the centre of the palm and are defined as:

$$d_i = \| F_i - C \|, \quad i = 1, \ldots, 5$$

here F_i are the 3D positions of each fingertip and C is the 3D position associated with the centre of the hand palm in the 3D frame of reference.
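These kinematic quantities can be computed directly from the Leap Motion velocity and position data; a sketch (the angle-of-attack and sideslip formulas are the standard flight-mechanics definitions, which we assume here since the extracted text does not show them explicitly):

```python
import numpy as np

def hand_kinematics(u, v, w):
    """Total velocity and attack/sideslip/total-attack angles of the
    hand centroid from its velocity components (u, v, w)."""
    V = np.sqrt(u**2 + v**2 + w**2)                   # total velocity
    alpha = np.arctan2(w, u)                          # angle of attack (assumed form)
    beta = np.arcsin(v / V)                           # angle of sideslip (assumed form)
    gamma = np.arccos(np.cos(alpha) * np.cos(beta))   # angle of total attack, Eq. (5)
    return V, alpha, beta, gamma

def fingertip_distances(F, C):
    """Distances of the five fingertips F (5x3 array) from the palm centre C."""
    return np.linalg.norm(np.asarray(F, dtype=float) - np.asarray(C, dtype=float), axis=1)
```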

Markov Classification
A Hidden Markov Model (HMM) is formally defined as a 5-tuple $\lambda = (N, M, A, B, \pi)$ representing a given process with a set of states and transition probabilities between the states [43]; here N indicates the number of unique possible states, which are not directly observable except through a sequence of M distinct observable symbols, also called emissions; B represents the discrete/continuous probabilities for these emissions; A indicates the state transition probabilities; and π are the starting probabilities. The state sequence $q_1^t = \{q_1, \ldots, q_t\}$ of the Markov chain is implicitly defined by a sequence $y_1^t = \{y_1, \ldots, y_t\}$ of the observed data. Given the observation sequence $y_1^t = \{y_1, \ldots, y_t\}$, where $y_i$ represents the feature vector observed at time i, and a separate HMM $\lambda_i$ for each gesture, the sign language recognition problem can simply be solved by computing:

$$\hat{i} = \arg\max_i P(y_1^T \mid \lambda_i)$$

here i corresponds to the i-th gesture. The probability of the observed sequence $P(y_1^T)$ is found using the joint probability of the observed sequence and the state sequence $P(y_1^T, q_1^T)$ as follows:

$$P(y_1^T, q_1^T) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1}) \prod_{t=1}^{T} P(y_t \mid q_t)$$

here $P(q_1)$ is the initial state probability distribution of q at time 1, $P(q_t \mid q_{t-1})$ is the probability of q at time t given q at time t − 1 and $P(y_t \mid q_t)$ is the emission probability. We calculate the probability $P(y_1^t, q_t)$ of an observed partial sequence $y_1^t$ for a given state $q_t$ using the forward-backward algorithm as a conditional probability in the product form:

$$P(y_1^t, q_t) = P(y_t \mid y_1^{t-1}, q_t)\, P(y_1^{t-1}, q_t) \quad (10)$$

Given that

$$P(y_1^{t-1}, q_t, q_{t-1}) = P(q_t \mid q_{t-1})\, P(y_1^{t-1}, q_{t-1}) \quad (11)$$

and marginalizing over $q_{t-1}$, and defining $\alpha_q(t) = P(y_1^t, q_t = q)$, the above equation is written as:

$$\alpha_q(t) = P(y_t \mid q_t = q) \sum_{q' \in Q_{t-1}} P(q_t = q \mid q_{t-1} = q')\, \alpha_{q'}(t-1)$$

here $Q_t$ is the state space at time t. We calculate the partial probability from time t + 1 to the end of the sequence, given $q_t$, as:

$$\beta_q(t) = P(y_{t+1}^T \mid q_t = q) = \sum_{q' \in Q_{t+1}} P(q_{t+1} = q' \mid q_t = q)\, P(y_{t+1} \mid q_{t+1} = q')\, \beta_{q'}(t+1)$$

Then the probability of the observed sequence $P(y_1^T)$ is calculated as:

$$P(y_1^T) = \sum_{q \in Q_T} \alpha_q(T)$$

The most likely state sequence corresponding to a given observation sequence $y_1^T$ is defined by the probability $P(q_t \mid y_1^T)$, which is the product of the forward and backward variables normalized by the joint distribution of the observation sequence, as follows:

$$P(q_t \mid y_1^T) = \frac{\alpha_{q_t}(t)\, \beta_{q_t}(t)}{P(y_1^T)}$$

The most likely state is found by maximizing $P(q_t \mid y_1^T)$ over $q_t$. After the sequence of observations is calculated, the HMM training is performed by applying the Baum-Welch algorithm [44] to calculate the values of the transition matrix A and the emission matrix B. Following the HMM training, the gesture with the best likelihood corresponding to the feature vector sequence is found using the Viterbi algorithm [44].
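The classification rule above can be illustrated with a self-contained forward-algorithm sketch for discrete emissions (toy matrices; the described system uses continuous feature vectors and models trained with Baum-Welch, which this sketch omits):

```python
import numpy as np

def forward_log_likelihood(y, A, B, pi):
    """Compute log P(y_1^T | lambda) with the scaled forward algorithm.

    y  : sequence of observed symbol indices, length T
    A  : (N, N) transition matrix, A[i, j] = P(q_t = j | q_{t-1} = i)
    B  : (N, M) emission matrix, B[j, k] = P(y_t = k | q_t = j)
    pi : (N,) initial state distribution
    """
    alpha = pi * B[:, y[0]]                    # alpha_q(1)
    log_lik = 0.0
    for t in range(1, len(y)):
        # Rescale at each step to avoid underflow; accumulate log scale factors.
        c = alpha.sum()
        log_lik += np.log(c)
        alpha = (alpha / c) @ A * B[:, y[t]]   # forward recursion
    return log_lik + np.log(alpha.sum())       # log P(y_1^T)

def classify(y, models):
    """Pick the gesture model (A, B, pi) with the highest likelihood."""
    scores = [forward_log_likelihood(y, *m) for m in models]
    return int(np.argmax(scores))
```

With one trained model per gesture, `classify` implements the arg-max rule over per-gesture likelihoods.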

Settings and Data
Participants in the pilot study were twelve (12) people, aged 20 to 48 (mean: 27.2), with a differing range of experience in the use of computer equipment. All subjects were healthy (with no known problems of vision or VR sickness) and were inexperienced users of ASL; therefore, the subjects were given about 1 h of prior training to learn the signs of ASL as well as to familiarize themselves with the developed system and the VR device used. For the experiments, we used a conventional desktop computer with Microsoft Windows 10 and the Leap Motion device placed on the table under normal room lighting conditions. Before the study, the participants were asked to remove rings and watches because these could affect the results. The output of the Leap Motion controller, representing a three-dimensional spatial model of the subjects' hands, was displayed to the subjects using the Oculus Rift DK2 device.
The participants were asked to perform 24 gestures corresponding to the letters of the ASL alphabet [45] (see Figure 5). The gesture of each letter was performed ten times by each participant, resulting in a total of 2880 data samples. We recorded the gestures of the participants' hands in the Leap Motion environment and took pictures of the real gestures shown by hand. Subsequently, the data from this study were analysed.
To evaluate the accuracy of the results, we divide the collected dataset into a train set and a test set using the LOPO (leave-one-person-out) subject-independent cross-validation strategy. The results are averaged to obtain the resulting accuracy.
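A LOPO split can be sketched as follows (a minimal illustration; in practice an off-the-shelf cross-validator such as scikit-learn's `LeaveOneGroupOut` would serve the same purpose):

```python
import numpy as np

def lopo_splits(person_ids):
    """Leave-one-person-out: yield (train_idx, test_idx) index pairs,
    holding out all samples of one participant per fold."""
    ids = np.asarray(person_ids)
    for person in np.unique(ids):
        test = np.where(ids == person)[0]    # all samples of this participant
        train = np.where(ids != person)[0]   # everyone else
        yield train, test
```

With 12 participants this yields 12 folds, each testing on a subject the model has never seen, which is what makes the evaluation subject-independent.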
Processing and visualization of results presented in this paper was done using MATLAB R2013a (The Mathworks, Inc., Natick, MA, USA).

Results
An example of gestures and their representation by Leap Motion is shown in Figure 6. For the text input experiments using the ASL, we used 18 pangrams (i.e., sentences using every letter of the alphabet at least once), including: "A quick movement of the enemy will jeopardize six gunboats."; "All questions asked by five watched experts amaze the judge."; "Amazingly few discotheques provide jukeboxes."; "Back in June we delivered oxygen equipment of the same size."; "The wizard quickly jinxed the gnomes before they vaporized."; and "Woven silk pyjamas exchanged for blue quartz."
The pangrams have been used for typing using gesture recognition by Leap Motion before [46].

The experiments were implemented using stratified 10-fold cross validation and assessed using the macro-accuracy (averaged over classes and folds) performance measure [47]. The results of the gesture recognition are presented in Figure 7. The averaged recognition accuracy (mean ± SD) achieved is 86.1 ± 8.2%. The confusion plot of the classification is presented in Figure 8. We have obtained a true positive rate (TPR) of 0.854, an F-measure of 0.854 and a Cohen's kappa [48] value of 0.987.
In the typing experiment, during the training pre-stage the subjects learned how to use the research system, consisting of a software application and the Leap Motion Controller. Then their task was to type each of the pangrams three times. The pangrams were presented in a random order. In case of error, the subjects were instructed to ignore the errors and keep typing the phrases.
We used words per minute (WPM) as a performance measure and the minimum string distance (MSD) as an error rate, as suggested in Reference [49]. The obtained results are presented in Figures 9 and 10 and are summarized as follows (mean ± SD): 3.09 ± 0.53 WPM and 16.58 ± 5.52 MSD. We performed a linear regression analysis of the relationship between gesture typing speed and error rate and found the following linear relationship, which holds within the 95% confidence level:

$$msd = -5.2 \cdot wpm + 32.7 \quad (18)$$

here msd is the minimum string distance and wpm is the words per minute.
The results, demonstrated in Figure 11, show that more proficient users demonstrate both higher performance and lower error rate and vice versa. Figure 11. Linear regression analysis of relationship between gesture typing performance and error rate with linear trend and 95% confidence ellipse shown. We performed the linear regression analysis of relationship between gesture typing speed and error rate and found the following linear relationship, which is within 95% of confidence: msd = −5.2 * wpm + 32.7 (18) here msd is the minimum string distance and wpm is the words per minute. The results, demonstrated in Figure 11, show that more proficient users demonstrate both higher performance and lower error rate and vice versa. We performed the linear regression analysis of relationship between gesture typing speed and error rate and found the following linear relationship, which is within 95% of confidence: 5.2 * 32.7 = − + msd wpm (18) here msd is the minimum string distance and wpm is the words per minute.
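A fit of this form can be obtained by ordinary least squares over the per-subject (wpm, msd) averages. The following is a minimal pure-Python sketch; the sample points below are hypothetical values chosen near the fitted line, not the study's raw data.

```python
def ols_fit(xs, ys):
    """Ordinary least squares fit y = a*x + b for paired samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx          # slope
    b = my - a * mx        # intercept
    return a, b

# Hypothetical per-subject averages (wpm, msd) for illustration only.
samples = [(2.4, 20.1), (2.8, 18.0), (3.1, 16.5), (3.5, 14.2), (4.0, 11.9)]
slope, intercept = ols_fit([w for w, _ in samples], [m for _, m in samples])
```

On data close to Equation (18), such a fit recovers a negative slope near −5.2 and an intercept near 32.7, reflecting that faster typists make fewer errors.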

Evaluation
We have achieved 86.1% accuracy of recognition of ASL signs. These results are in the range of accuracy achieved by other authors, as indicated by a survey in Reference [50]. Note that the subjects participating in our study were not experienced users of sign languages; therefore, the quality of sign gesturing could have adversely affected the accuracy of recognition. Other authors used stand-alone letters or words in ASL for training, whereas we used complete sentences (pangrams), which was a more difficult task for subjects. Moreover, the gestures were presented to subjects as 3D models of hands using a head-mounted Oculus Rift DK display, so the subjects were not able to view their physical hands during the experiment, and this could have made the gesturing task more difficult as well.
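For clarity, the macro-accuracy figure reported above can be sketched as the per-class accuracy averaged over classes (here computed from a confusion matrix); this is one common formulation, and the toy 3-class matrix below is illustrative only, not our experimental data.

```python
def macro_accuracy(confusion):
    """Average of per-class accuracies from a confusion matrix.

    confusion[t][p] counts samples of true class t predicted as class p.
    """
    per_class = []
    for t, row in enumerate(confusion):
        total = sum(row)
        if total:
            per_class.append(row[t] / total)
    return sum(per_class) / len(per_class)

# Toy confusion matrix (rows: true class, columns: predicted class).
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 2, 8]]
print(round(macro_accuracy(cm), 3))  # per-class 0.8, 0.7, 0.8 -> 0.767
```

In the stratified 10-fold protocol, this quantity is additionally averaged over the ten folds.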
After analysing the recorded gesture recognition data, we have observed that there are problems in detecting the gap between the fingers. A small gap (one centimetre or less) is detected poorly and, for example, it is difficult to discriminate between the gestures of the C and O signs. Gestures that require a precise thumb position are also more difficult to determine. The thumb is often covered by other fingers, which decreases the accuracy of the recognition of the E, M, N, T, H, K, S, V and X signs. The recognition of some gestures requires a very precise 3D image of the hand. This is evident in the gesture of the P sign: the fingers are only partially folded (not clenched into a fist) but the device has identified the fingers as completely curled towards the palm. This problem also occurred with the letter R gesture, that is, the fingers have to be crossed but they are rendered as uncrossed, and such a gesture corresponds to the letter U in the sign language. In some cases, partially folded fingers are treated as completely folded. Our study revealed gaps in the algorithm used by the Leap Motion device for gesture analysis. Problems occur when the Leap Motion device does not see some fingers; then the fingertip positions cannot be captured and the gestures are identified incorrectly.
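The C/O ambiguity can be illustrated with a simple distance test on fingertip positions of the kind the Leap Motion API reports. The sketch below is hypothetical: the function name, the synthetic tip coordinates and the 10 mm threshold are our illustrative choices, and gaps near this threshold are exactly the regime where tracking becomes unreliable.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify_c_or_o(thumb_tip, index_tip, gap_threshold_mm=10.0):
    """Discriminate C vs O by the thumb-index gap: O closes it, C keeps it.

    Tip positions are (x, y, z) coordinates in millimetres; the threshold
    is a hypothetical value for illustration.
    """
    gap = euclidean(thumb_tip, index_tip)
    return "O" if gap < gap_threshold_mm else "C"

# Synthetic tip positions (mm) for illustration.
print(classify_c_or_o((0.0, 0.0, 0.0), (4.0, 2.0, 1.0)))    # small gap -> "O"
print(classify_c_or_o((0.0, 0.0, 0.0), (30.0, 10.0, 5.0)))  # open gap  -> "C"
```

A one-centimetre tracking error in either tip position is enough to flip the decision, which matches the confusions we observed.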
The gesture recognition was implemented as a microservice over the Internet. Sending the data from the Leap Motion device over the network to the microservice does not significantly increase the duration of the gesture recognition. On average, the size of a Leap Motion gesture data batch ranges from 500 to 1500 bytes; transmitting this amount to a network service does not require significant bandwidth or resources. The greatest slowdown occurs at the network service itself, which filters these data and performs the gesture recognition functions. The entire process takes no more than 200 ms.
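To give a sense of these batch sizes, the sketch below serializes one hand frame to JSON and measures the payload. The field names and values are illustrative assumptions, not the actual Leap Motion API schema or our service's wire format.

```python
import json

# Hypothetical frame layout; field names are illustrative only.
frame = {
    "hand": {"palm_position": [12.5, 180.3, -4.1], "confidence": 0.97},
    "fingers": [
        {"type": i, "tip_position": [float(i), 100.0 + i, -2.0],
         "extended": True}
        for i in range(5)
    ],
}

payload = json.dumps(frame).encode("utf-8")
# One frame of this shape is a few hundred bytes, so a batch of a few
# frames falls in the 500-1500 byte range cited above.
print(len(payload))
```

At these sizes, a single TCP packet suffices per batch, which is why network transfer contributes little to the overall 200 ms budget.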

Conclusions
Gesture recognition can be applied to various areas that are not suitable for typical data entry, such as VR environments. In this paper, we have presented a system that can learn gestures by using the data from the Leap Motion device and the Hidden Markov classification (HMC) algorithm. We have achieved a gesture recognition accuracy (mean ± SD) of 86.1 ± 8.2% and a gesture typing speed of 3.09 ± 0.53 words per minute when recognizing the gestures of the American Sign Language (ASL).
We have identified several problems of using the Leap Motion technology for gesture recognition. First of all, if some of the user's fingers are invisible to the IR cameras, Leap Motion makes mistakes in predicting their position; for example, the hand is depicted with folded fingers when they are actually stretched. Similarly, the position of the thumb when it is pressed against the palm or placed between the other fingers is poorly defined and cannot be reliably used to identify the gesture.