Fingertip Gesture Recognition Using Leap Motion and Camera for Interaction with Virtual Environments

Advances in computing and the latest hardware technologies have made natural interaction with computers possible. Gesture-based interaction is one of the prominent fields of natural interaction. The recognition and application of hand gestures in virtual environments (VEs) require extensive calculation due to the complexities involved, which directly affects the performance and realism of interaction. In this paper, we propose a new interaction technique that uses single-fingertip-based gestures for interaction with VEs. The objective of the study is to minimize computational cost, increase performance, and improve usability. The interaction involves navigation, selection, translation, and release of objects. For this purpose, we propose a low-cost camera-based system that uses a colored fingertip for fast and accurate recognition of gestures. We also implemented the proposed interaction technique using the Leap Motion controller and present a comparative analysis of the two recognition setups, the video camera and the Leap Motion sensor, for gesture recognition and operation. A VE was developed for experimental purposes. The key parameters for the analysis were task accuracy, interaction volume, update rate, and spatial distortion of accuracy. We used the System Usability Scale (SUS) for the usability analysis. The experiments revealed that the camera-based implementation offered good performance, less spatial distortion of accuracy, and a larger interaction volume compared to the Leap Motion sensor. We also found the proposed interaction technique highly usable in terms of user satisfaction, user-friendliness, learnability, and consistency.


Introduction
Task performance directly depends on the effectiveness and accuracy of the user interaction type (technique and/or device), which is an essential part of any realistic VE. Human-computer interaction

Related Work
Interaction in a simple fashion assures naturalism and affordance [43]. Interaction using a keyboard, mouse, or touch screen cannot attain user engagement due to its unrealistic 2D nature [44]. Gesture-based interaction is more suitable for virtual reality (VR) applications due to its flexibility and naturalism [45,46]. In VR, different gesture-based interaction techniques have been proposed. Kiyoshi et al. [47] used a magnetic-tracking-based two-hand interaction system and performed object selection and release operations. Lee et al. [48] used a CyberGlove and Polhemus Fastrak for some basic interaction tasks. Kim et al. [49] also used a CyberGlove, for navigation purposes.
A marker-based glove was used by Chen et al. [50] for navigation in a VE, while fiducial markers were used for interaction in VEs by Rehman et al. [51–54].
Shao [55] used the Leap Motion controller for a two-hand gesture-based interaction system and performed rotation and navigation tasks. Kerefeyn et al. [56] also used Leap Motion-based gestures for interaction in VEs, but these gestures were hard to learn. Khundam et al. [15] proposed different palm-based gestures for navigation in VEs using the Leap Motion sensor.
Raees and Ullah [57] used a fingertip-based interaction system for VEs with an RGB camera. They made gestures with colored fingertips and performed different interaction tasks; however, these gestures were complex and hard to learn. Rehman et al. [58] also used colored-fingertip-based gestures for navigation in VEs, where the relative distance between two fingertips controlled the navigation speed of the virtual object.
Most previous research proposed gestures of a complex nature that need more computation to recognize and are hard to learn and pose, which increases the cognitive load on users. To cope with these problems, research must focus on the careful selection of gestures that are easy for the machine to recognize and easy for users to learn and perform. In this paper, we propose an interaction technique that uses simple colored-fingertip-based gestures with the aim of easing recognition, learning, and operation. This study proposes an ordinary camera-based system for recognizing the proposed gestures and interacting with VEs, and then compares it with the commercially available Leap Motion controller using different parameters such as task accuracy, spatial distortion of accuracy, interaction volume, and update rate.

Proposed System
The main objective of the proposed system is to use gestures based on a single index fingertip's position and state (open/closed) for interaction in a VE. Spatial movement of the index finger in the open state is used for free navigation in the VE, while the closed state is used for the selection and release of objects. Movement of the finger drives both free navigation and translation of the selected object. The VE (see Figure 1) is designed in OpenGL, while interaction with it is made via two different systems: the Leap Motion sensor and an RGB camera (see Figure 2). OpenGL (as a front-end tool) receives the fingertip's position and state. Based on these, different gestures are identified, which lead to 3D interaction in the VE. The VE consists of different 3D objects, as shown in Figure 1, and the identified gestures cause the virtual object/camera to navigate accordingly.

Camera-Based System
The camera-based system uses an ordinary laptop camera and image processing for the detection of the colored fingertip. The system recognizes different gestures from the relative positions of the fingers and interacts accordingly in the VE. The proposed system uses OpenCV as a back-end and OpenGL as a front-end tool. OpenCV is used for image processing, which consists of different phases: image acquisition, conversion of the image to HSV (Hue, Saturation, Value), thresholding for the green color, and finally the calculation of the 2D position (x, y) of the specified color, as shown in Figure 3. OpenGL is used for designing and interacting with the VE; based on the position and detected area of the color (green), it identifies the gestures that lead to navigation and selection tasks in the VE. OpenCV performs image acquisition via the ordinary webcam. A green finger cap is worn on the index finger of the right hand. Skin color alone is not used for fingertip detection, as it varies from person to person; instead, a range of Hue, Saturation, and Value is selected for the green color to detect the finger under stable lighting conditions. First, the region of interest (RI_Img) is extracted from the frame image (F_Img) to avoid false detection of green in the background. The RI_Img is segmented from F_Img based on the skin area, which is the most probable area in which to find the fingers; the YCbCr model can distinguish between skin and non-skin colors [59].
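A minimal sketch of this pipeline in Python with OpenCV is given below. The HSV and YCbCr threshold ranges are illustrative placeholders rather than the study's exact values, and the helper name detect_fingertip is ours.

```python
import cv2
import numpy as np

# Illustrative thresholds; the paper does not report its exact ranges.
GREEN_LO = np.array([40, 70, 70])     # HSV lower bound for the green cap
GREEN_HI = np.array([80, 255, 255])   # HSV upper bound
SKIN_LO = np.array([0, 133, 77])      # skin lower bound (OpenCV YCrCb order)
SKIN_HI = np.array([255, 173, 127])   # skin upper bound

def detect_fingertip(f_img):
    """Return the (x, y) centroid and pixel area of the green fingertip cap."""
    # 1. Segment skin (YCbCr model [59]) to find the probable hand region.
    ycrcb = cv2.cvtColor(f_img, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, SKIN_LO, SKIN_HI)
    ys, xs = np.nonzero(skin)
    if xs.size == 0:
        return None
    # 2. Crop the region of interest (RI_Img) bounded by the left-most,
    #    right-most, and down-most skin pixels, avoiding background green.
    lmr, rmr, dmr = xs.min(), xs.max(), ys.max()
    ri_img = f_img[:dmr + 1, lmr:rmr + 1]
    # 3. Threshold the ROI for the green color in HSV space.
    hsv = cv2.cvtColor(ri_img, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, GREEN_LO, GREEN_HI)
    # 4. Compute the 2D centroid and area of the detected color blob.
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:
        return None
    x = m["m10"] / m["m00"] + lmr     # map back to full-frame coordinates
    y = m["m01"] / m["m00"]
    return x, y, cv2.countNonZero(mask)

cap = cv2.VideoCapture(0)             # ordinary built-in webcam
ok, frame = cap.read()
if ok:
    print(detect_fingertip(frame))
cap.release()
```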
After getting the binary image of F_Img (see Figure 4a), RI_Img, with m rows and n columns, is extracted from F_Img using the algorithm of [60] as:

RI_Img = F_Img(1:Dmr, Lmr:Rmr), where m = Dmr and n = Rmr − Lmr + 1

and Lmr, Rmr, and Dmr represent the left-most, right-most, and down-most skin pixels of the hand. The segmented image RI_Img is then thresholded for the green color using the HSV color space (see Figure 5c,d).
Finally, the 2D position (x, y) of the fingertip is calculated dynamically from the image, and the z-axis is derived from the variation in the detected area of the fingertip: the area increases as the finger moves toward the camera eye and decreases as it moves away. This variation is mapped to the z-value, which changes accordingly with the movement of the fingertip.
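The paper does not give the exact area-to-depth mapping, so the sketch below uses an assumed linear model around a calibrated reference area simply to illustrate the idea.

```python
def area_to_z(area, ref_area, scale=0.05):
    """Map the detected cap area (pixels) to a z value.

    ref_area is the blob area at a calibrated rest distance; the blob
    grows as the finger approaches the camera eye and shrinks as it
    retreats, so the normalized difference serves as a z estimate.
    The linear form and scale factor are illustrative assumptions.
    """
    return scale * (area - ref_area) / ref_area
```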


Interaction Mechanism
Interaction with the proposed system is done via camera-based gesture recognition. We used a colored index fingertip to perform the different interactions: free navigation, object selection, object translation, and object release.

Free Navigation
Navigation is carried out before any other task. The spatial movement of the colored index fingertip (open), pointing towards the camera eye, is used for navigation in the VE (see Figure 6a). The system calculates the fingertip pose (x- and y-axes) and area (in pixels) from the recognized colored region (RA), as described earlier. The z-axis is measured from the variation in pixel area as the colored tip moves nearer to or farther from the camera. A virtual hand (VH) follows the movement of the finger.
The gesture with an open index fingertip (RH_IndexTip.Open), i.e., when the detected area (DA) is less than the threshold T1, leads to the assignment of the fingertip pose (RH_IndexTip(x,y,z)) to the virtual hand (VH(x,y,z)) in the VE; the VH then follows the spatial movement of the fingertip.
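A sketch of this rule, with T1 as a placeholder threshold value:

```python
T1 = 1500  # pixel-area threshold separating open/closed tip (illustrative)

def update_virtual_hand(vh_pos, tip_pos, detected_area):
    """Assign the fingertip pose to the virtual hand while the tip is open.

    Following the rule above, the tip counts as open (RH_IndexTip.Open)
    when the detected area DA is below the threshold T1.
    """
    if detected_area < T1:    # DA < T1, i.e., open index fingertip
        vh_pos = tip_pos      # VH(x,y,z) = RH_IndexTip(x,y,z)
    return vh_pos
```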

Object Selection
The selection of an object is done for further manipulation. In the proposed system, the object (Obj(x,y,z)) is selected via the intersection of the VH (VH(x,y,z)) together with the closed index fingertip gesture, as shown in Figure 6b.

Object Translation
Translation is a change of location. After selection, the object can be translated (TR(Obj(x,y,z))) anywhere using spatial movement of the index fingertip (TR(RH_IndexTip(x,y,z))). We can write this as:

TR(Obj(x,y,z)) = TR(RH_IndexTip(x,y,z))

Release of Object
After manipulation, the object can be released at a specific position via the closed index fingertip gesture (RH_IndexTip.Close). After releasing the object, the VH again moves freely with the movement of the index fingertip.
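Putting the camera-side gestures together, the following sketch shows one possible selection/translation/release loop. The intersection radius, the event-style tip_closed flag, and the objects' pos attribute are our assumptions for illustration, not details given in the paper.

```python
import math

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

class InteractionState:
    """Minimal select/translate/release loop for the camera-based system.

    tip_closed must be an event (True only on the frame in which the
    closed-tip gesture is recognized); otherwise a held gesture would
    select and release on consecutive frames.
    """

    def __init__(self, radius=0.1):
        self.selected = None
        self.radius = radius   # assumed intersection radius

    def step(self, vh_pos, objects, tip_closed):
        if self.selected is None:
            if tip_closed:
                # Select the first object the virtual hand intersects.
                for obj in objects:
                    if dist(vh_pos, obj.pos) < self.radius:
                        self.selected = obj
                        break
        elif tip_closed:
            # A second closed-tip gesture releases the object in place.
            self.selected = None
        else:
            # TR(Obj(x,y,z)) = TR(RH_IndexTip(x,y,z)): follow the hand.
            self.selected.pos = vh_pos
```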

Leap Motion Controller
The Leap Motion controller is an innovative hand gesture recognition device that achieves accuracy in the sub-millimeter range [61]. The sensor can be considered an optical tracking system based on stereo vision (see Figure 7).

Interaction Mechanism
The proposed system allows users to perform different interaction tasks, such as free navigation, selection, translation, and release of objects, using simple gestures in the VE.

Free Navigation
Free navigation is very important for exploring and searching for objects in the VE. The open index fingertip gesture (RH_IndexTip.Up) is used for free navigation in the VE (see Figure 8a). A virtual hand (VH(x,y,z)) in the VE follows the real-world movement of the index fingertip (RH_IndexTip(x,y,z)). We can describe this as:

if (RH_IndexTip.Up) then VH(x,y,z) = RH_IndexTip(x,y,z)

where RH_IndexTip.Up represents the right-hand open index fingertip, while RH_IndexTip(x,y,z) and VH(x,y,z) represent the 3D positions of the index fingertip and the virtual hand.
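For reference, polling the right-hand index fingertip might look like the sketch below, assuming the Leap Motion V2 Python SDK (the `Leap` module) is available; the final lines apply the navigation rule above.

```python
import Leap  # Leap Motion V2 Python SDK (assumed available)

controller = Leap.Controller()

def right_index_tip():
    """Return (is_extended, (x, y, z)) for the right-hand index finger,
    or None if no right hand is in view. Positions are in millimeters."""
    frame = controller.frame()
    for hand in frame.hands:
        if not hand.is_right:
            continue
        for finger in hand.fingers:
            if finger.type == Leap.Finger.TYPE_INDEX:
                tip = finger.tip_position   # a Leap.Vector
                return finger.is_extended, (tip.x, tip.y, tip.z)
    return None

# Free navigation: while the tip is up, VH(x,y,z) = RH_IndexTip(x,y,z).
state = right_index_tip()
if state and state[0]:                      # RH_IndexTip.Up
    vh_pos = state[1]
```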

Selection of Object
The selection of an object is the process of choosing it for further manipulation. In the proposed system, selection of an object (Obj(x,y,z)) is done using the intersection of the VH (VH(x,y,z)), which follows the movement of the fingertip (RH_IndexTip(x,y,z)), together with bending of that finger, as shown in Figure 8b. We can describe this mathematically as:

if (k(x,y,z) ≈ 0 AND RH_IndexTip.Down) then Obj(x,y,z) is selected

where RH_IndexTip.Down represents the right-hand index fingertip in the bent (closed) state and k(x,y,z) is the distance between the centers of the virtual hand (VH(x,y,z)) and the object (Obj(x,y,z)).

Translation of Object
Translation is the change of the position of an object and is one of the most important tasks in any VE. In the proposed system, translation of the object (TR(Obj(x,y,z))) is done using physical movement of the index fingertip (TR(RH_IndexTip(x,y,z))).

Release of Object
After manipulation, the object is released simply by bending (closing) the finger again (see Figure 8b). After release, the VH (VH(x,y,z)) again navigates freely in the VE, following the movement of the finger (RH_IndexTip(x,y,z)). Mathematically, we can write this as:

if (Obj(x,y,z) = RH_IndexTip(x,y,z) AND RH_IndexTip.Down) then
Obj(x,y,z) = RH_IndexTip(x,y,z)
VH(x,y,z) = RH_IndexTip(x,y,z)

Experiments and Evaluation
For the evaluation of the proposed interaction technique, we conducted two experimental studies. First, we analyzed the proposed technique using the Leap Motion controller and the camera-based system in terms of different parameters: task accuracy (via task performance and errors), interaction volume, update rate, and spatial distortion of accuracy. Second, for subjective assessment, we used the SUS [62] to evaluate the usability of the proposed technique in terms of learning, user satisfaction, user-friendliness, and consistency.
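SUS responses are scored with Brooke's standard rule: odd items contribute (score − 1), even items contribute (5 − score), and the sum is scaled by 2.5 to a 0–100 range. The sketch below applies this rule to a hypothetical response set.

```python
def sus_score(responses):
    """System Usability Scale score (0-100) from ten Likert items (1-5)."""
    odd = sum(r - 1 for r in responses[0::2])    # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])   # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5

# Hypothetical questionnaire from one participant.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```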

Experimental Setup
For the experimental study, we used Visual Studio 2013 on a Core i3 laptop with a 1.7 GHz processor, 4 GB RAM, and a 1280 × 720 pixel (720p HD) resolution. We used the laptop's low-cost built-in camera for the camera-based system; the Leap Motion controller served as the alternative implementation.
We designed a 3D VE for the experiments. The VE consists of different 3D objects such as a teapot, tables, and a virtual hand (as shown in Figure 2). The system allows users to freely navigate the virtual hand in the VE, with the virtual camera moving accordingly. It also allows users to select, translate, and release virtual objects using their finger. When the finger is visible (to either the camera or the Leap Motion), the system displays the text "DETECTED", along with a short beep, to inform users that interaction is active.

Protocol and Task
Forty (40) users participated in the experimental study. All users were male, aged 21 to 35 years. All were experienced with computer games using keyboard, mouse, and touch screen, but none had experience with gesture-based VEs. The users were randomly divided into two groups (G1 and G2) of twenty users each. Users of G1 used the Leap Motion controller, while users of G2 used the camera-based system. The use of the different gestures for object selection, translation, and release was first demonstrated to both groups. The users then performed three pre-trials of selecting, translating, and releasing a virtual object (teapot) via hand gestures using their assigned systems. A closed index fingertip gesture was used for object selection and release, while an open fingertip was used for free navigation and object translation.
After the training phase, each user of both groups performed the actual task using their assigned system (Leap Motion or camera-based). The task consisted of three trials of gestures for each of the selection, translation, and release operations, making a total of 60 trials per operation and 180 posed gestures per group. Misdetection of a gesture was counted as an error. Each user of both groups was then asked to perform another task of three trials using their assigned system: select the object (teapot), move it to the target position (table), and finally place (release) it on the table via the specified gestures. The task time started from the selection of the object, continued through its translation to the target destination, and ended with the object's release. The task completion time, errors, and 3D hand positions were recorded in a file by the system for further analysis. After completing the experiments, each user of both groups was asked to fill in the SUS questionnaire (see Table 3) for the analysis of the usability of the proposed interaction technique.

Task Accuracy (Task Completion Time and Errors)
In the first phase, both groups performed the task of gesture posing via their assigned systems. The task accuracy of the Leap Motion and the camera-based system is shown in Table 1. The overall results of the proposed interaction technique using both interaction tools confirm the first part of our first hypothesis (H1), i.e., the proposed interaction technique results in high user performance. The translation and release operations are more accurate than object selection for both systems, and the selection and release operations using the camera-based method are slightly more accurate than with the Leap Motion controller. Users often do not completely bend their finger, which leads to misdetection by the Leap Motion controller, whereas the proposed camera-based system can easily detect a bend in the finger; another contributing factor may be the lower update rate of the Leap Motion controller. For the statistical analysis of both systems, we used an independent-samples t-test to determine whether there is statistical evidence that the group means differ significantly. The results show a significant difference in mean task execution time between the camera-based system and the Leap Motion controller (t(118) = 5.325, p < 0.001). The mean task execution time and standard deviation (SD) were Mean = 13.0576 s (SD = 1.09363) for the camera-based system and Mean = 14.15 s (SD = 1.16787) for the Leap Motion controller, as shown in Figure 9. This means that interaction using the proposed camera-based system is faster than with the Leap Motion controller. A possible reason for the higher task completion time of the Leap Motion may be its lower update rate and higher spatial distortion of accuracy.
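The test itself can be reproduced with scipy.stats; the sketch below substitutes synthetic samples drawn from the reported means and SDs for the actual logged trial times.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the logged per-trial completion times (s);
# the real analysis loads the 60 recorded trials of each group.
rng = np.random.default_rng(0)
camera_times = rng.normal(13.0576, 1.09363, 60)  # camera-based group
leap_times = rng.normal(14.15, 1.16787, 60)      # Leap Motion group

# Independent-samples t-test with df = 60 + 60 - 2 = 118.
t, p = stats.ttest_ind(camera_times, leap_times)
print(f"t(118) = {t:.3f}, p = {p:.3g}")
```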

Update Rate
The second parameter of the study is the update rate of both systems, based on the calculation and display of different information in the VE. For this purpose, we performed a simple task of object selection, translation, and release on both systems. To assess the update rate, the system calculated the 3D position of the fingertip, displayed each position on the console, and counted the total positions displayed per second. The results show that the camera-based system has a higher update rate (61.91 updates/s) than the Leap Motion (48.6 updates/s). The low update rate of the Leap Motion may be the reason behind its higher task execution time.
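A measurement loop in the spirit of this console test might look as follows; get_position is a hypothetical callable wrapping whichever backend (camera pipeline or Leap Motion polling) is under test.

```python
import time

def measure_update_rate(get_position, seconds=5.0):
    """Count fingertip positions delivered per second.

    get_position returns a 3D position, or None when no fingertip is
    currently detected; only successful detections are counted.
    """
    count = 0
    start = time.time()
    while time.time() - start < seconds:
        if get_position() is not None:
            count += 1
    return count / seconds
```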

Spatial Distortion of Accuracy
The change of accuracy across different areas of the sensing space of a gesture recognition system/device is referred to as spatial distortion of accuracy, which adversely affects device performance [27]. It is of foremost importance, as some systems' position accuracy degrades when moving away from their center. We studied both systems experimentally to assess their spatial distortion. The point x = 0, y = 0, z = 0 lies at the center of the Leap Motion coordinate system, while x = 0 and y = 0 lie at the upper-left corner of the camera coordinate system. The experimental task consisted of moving the hand from the origin (x = 0) to both the positive and negative sides of the x-axis while keeping the y- and z-axes constant for the Leap Motion, and moving the hand from left to right (x = 0, 1, 2, ...) with the y-axis fixed for the camera-based system. As the z-axis of the camera-based system is deduced from the detected area of the colored fingertip, it was also kept constant.
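One simple way to summarize distortion from such a sweep is sketched below: with y and z held physically constant, any spread in their reported values is distortion. The standard deviation used here is our illustrative summary, not necessarily the paper's exact metric.

```python
import numpy as np

def axis_spread(samples):
    """Distortion summary for a sweep along the x-axis.

    samples: an (N, 3) array of logged fingertip positions recorded
    while the hand moves along x with y and z physically fixed; the
    spread of the reported y and z values indicates distortion.
    """
    samples = np.asarray(samples, dtype=float)
    return samples[:, 1].std(), samples[:, 2].std()
```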
The position accuracy of the Leap Motion and the camera-based system is shown in Figures 10 and 11, respectively. Each sub-figure displays results for each 100 mm step along the y-axis, from 200 mm to 600 mm. The results show that the reported y- and z-positions of the Leap Motion are unreliable/irregular, as also reported by [27], especially when the x- and y-values increase beyond 300 mm. In the case of the camera-based system, the results show relatively little distortion of accuracy when moving towards the edges of the interaction boundaries, as shown in Figure 11. This means that the positioning accuracy of the Leap Motion degrades when moving away from its center, as supported by [27,55,63].

[Figures 10 and 11: fingertip position along the x-, y-, and z-axes (mm).]

Interaction Volume
Interaction volume is one of the most important attributes of any gesture recognition device, as it determines the user's spatial freedom of operation. To assess the interaction volume and field of view of the camera-based system, we measured the region in which the hand's finger remains visible/detected in all directions. The horizontal and vertical fields of view are 53.13° and 42.61°, and the system can detect a hand up to 102 cm along the z-axis. This produces a pyramid with a 102 cm height, 102 cm base length, and 77 cm base width, as shown in Figure 12. The total interaction volume is calculated as: Interaction volume = 1/3 × (width × length) × height = 1/3 × (77 × 102) × 102 = 267,036 cm³ = 9.43 ft³.
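The arithmetic can be checked directly:

```python
def interaction_volume(width_cm, length_cm, height_cm):
    """Pyramidal interaction volume: V = (1/3) * base area * height."""
    return width_cm * length_cm * height_cm / 3.0

v = interaction_volume(77, 102, 102)              # camera-based system
print(f"{v:,.0f} cm^3 = {v / 28316.8:.2f} ft^3")  # 267,036 cm^3 = 9.43 ft^3
```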
The Leap Motion has an interaction volume of roughly 8 ft³ [64]. According to Guna et al. [27], the measured interaction volume is 99,958.47 cm³ (3.53 ft³) (−250 mm < x < 250 mm, −250 mm < z < 250 mm, and 0 mm < y < 400 mm). The Leap Motion interaction volume is shown in Figure 13; its field of view is 120° deep and 150° wide, covering 2 square feet located 2 feet above the device [65]. The overall field of view and interaction volume of both the Leap Motion and the camera-based system are given in Table 2. The results of the comparative study of both interaction tools confirm our second hypothesis (H2): the camera-based recognition system yields better results in terms of task accuracy (task completion time and errors, as shown in Table 1 and Figure 9), update rate, spatial distortion of accuracy (see Figures 10 and 11), and interaction volume (see Figure 12) compared to the Leap Motion sensor.

Conclusions
In this paper, we proposed a new interaction technique that uses single-fingertip-based gestures for interaction in VEs, with the goal of achieving high performance with little computation and high usability through its simple nature. The interaction involves navigation, selection, translation, and release of objects. We also presented a comparative analysis of the Leap Motion and camera-based systems for hand gesture recognition in VEs. The key parameters for the analysis were task accuracy, interaction volume, update rate, and spatial distortion of accuracy. The experiments revealed that the camera-based system has a relatively high update rate and task performance, less spatial distortion of accuracy, and a relatively large interaction volume compared to the Leap Motion. The Leap Motion, on the other hand, provides numerous ways of interacting, owing to its recognition of the fingers of both hands along with all bones. One of the main limitations of the proposed camera-based system is the need for a colored cap over the fingertip, which places an extra burden on the user; mapping the fingertip color during recognition is also tedious. Furthermore, the camera-based system is vulnerable to lighting variations and background colors. The results of the SUS showed high usability of the proposed interaction technique in terms of learning, user-friendliness, user satisfaction, design, and consistency. The results confirm both hypotheses, i.e., the proposed interaction technique results in high performance and usability (H1), and interaction using the camera produces better results than the Leap Motion sensor (H2).