Innovative Collaborative Method for Interaction between a Human Operator and Robotic Manipulator Using Pointing Gestures

The concept of “Industry 4.0” relies heavily on the utilization of collaborative robotic applications. As a result, the need for an effective, natural, and ergonomic interface arises, as more workers will be required to work with robots. Designing and implementing natural forms of human-robot interaction (HRI) is key to ensuring efficient and productive collaboration between humans and robots. This paper presents a gestural framework for controlling a collaborative robotic manipulator using pointing gestures. The core principle lies in the ability of the user to send the robot’s end effector to the location at which he or she points by hand. The main idea is derived from the concept of so-called “linear HRI”. The framework utilizes a collaborative robotic arm UR5e and the state-of-the-art human body tracking sensor Leap Motion. The user is not required to wear any equipment. The paper gives an overview of the framework’s core method and provides the necessary mathematical background. An experimental evaluation of the method is provided, and the main influencing factors are identified. A unique robotic collaborative workspace called Complex Collaborative HRI Workplace (COCOHRIP) was designed around the gestural framework to evaluate the method and provide the basis for the future development of HRI applications.


Introduction
Collaborative robotic applications are, nowadays, well-established and commonly used concepts of technology. The main goal of such applications is to combine the strengths of both humans and robotic systems to achieve maximum effectiveness in completing a specified task while minimizing the risks imposed on the human worker. In the manufacturing process, these applications are crucial in enabling the concept referred to as "Industry 4.0". Industry 4.0 focuses on developing so-called cyber-physical systems or CPS for short, aiming to create highly flexible production systems capable of fast and easy changes, addressing the need for individualized mass production of the current markets [1].
Collaborative applications are becoming increasingly popular on factory floors because, compared with standard robotic systems, they offer lower deployment costs, compact sizes, and easier repurposing [2]. As a result, more factory workers will be required to work in close contact with robotic systems. This, however, introduces new challenges, primarily how to secure the safety of a human worker while ensuring high work effectiveness. The former problem focuses mainly on minimizing potential risks and avoiding accidents during collaborative work between the human worker and a robot, without considering work efficiency. The latter aims to find methods capable of maximizing the overall productivity of such collaboration. Although the safety of the human worker is an absolute priority and currently the most researched topic [3], economic benchmarks are vital in determining further deployment of these applications. Thus, it is necessary to pay more attention to the interaction between human and robotic systems, also referred to as HRI, by researching the impact and influence of the whole human-robot relationship and developing new approaches capable of meeting today's production demands.
In today's conventional robotic application, usually involving robust industrial manipulators, the robot works in a cell that must be wholly separated from the workers' environment by a physical barrier [4]. The human only comes in contact with the robot during maintenance and programming, carried out by highly qualified personnel. All of this is not only expensive, but also time-consuming. Alternatively, in collaborative applications, factory workers may often get into contact with parts of the robot and are even encouraged to influence the robot's actions based on the current circumstances to secure maximum work flexibility. This could pose a severe problem, where technically unskilled workers may suffer great difficulties communicating with the robotic system or conducting some form of control. A solution to this problem lies in developing such HRI methods, which would enable natural and ergonomic human-robot communication without the need for prior technical knowledge of the robotic systems. Hence, the workers could solely focus on the task at hand and not waste time figuring out the details of the application's interface.
For humans, the most natural way of communication is through speech, body, or facial gestures. Implementing these types of interfaces into collaborative applications could yield many benefits, which lead to increased productivity and overall worker contentment [5].
This paper presents a gesture-based collaborative framework for controlling a robotic manipulator with flexible potential deployment in the manufacturing process, the academic field, or even the commercial sector. The proposed framework builds on the previous work of Tölgyessy et al. [6], which introduced the concept of "linear HRI," initially designed for mobile robotic applications. This paper further develops this concept, widening its usability from mobile robotic applications to robotic arm manipulation while utilizing a new state-of-the-art visual-based sensory system. The main goal of the proposed framework is to provide systematic foundations for gesture-based robotic control that support a wide variety of potential use-cases, can be applied universally regardless of the specific software or robotic platform, and provide a basis on which other, more complex applications can be built.

Related Work
Gestural HRI is a widely researched topic across the whole spectrum of robotics. Allowing the robot to detect and recognize human movements and act accordingly is a powerful ability enabling the robot and the human worker to combine their strengths and achieve various challenging tasks. In mobile robotics, gestures can be used to direct and influence the robot's movement. Cho and Chung [7] used a mobile robotic platform, Pioneer 3-DX, and a Kinect sensor to recognize a human body and follow the operator's movement. Tölgyessy et al. [6] proposed a method for controlling a mobile robotic platform, iRobot Create, equipped with a Kinect sensor via pointing gestures. Chen et al. [8] used a Leap Motion Controller to control the movement of the mobile robotic platform with a robotic arm. Both the chassis and the robotic arm could be controlled via hand gestures. Gesture-based control was also used in projects RECO [9] and TeMoto [10], which focused on intuitive multimodal control of mobile robotic platforms. Besides mobile robotic platforms, force-compliant robotic manipulators are ideal for implementing various forms of gestural interaction, potentially deployable in multiple scenarios, such as object handling, assembly, assistance, etc. A significant part of the research is focused on teleoperation, in which the robot mirrors the human operator's movements. Hernoux et al. [11] used a Leap Motion sensor and a collaborative manipulator UR10 to reproduce the movement of the operator's hands. G. Du et al. [12] proposed a similar approach using Leap Motion to control a dual-arm robot with both hands. They used the interval Kalman filter and improved particle filter methods to improve hand tracking. Kruse et al. [13] proposed a gesture control dual-arm telerobotic system, in which the operator controlled the position of an object held by the robotic system. Microsoft Kinect was used to track the human body. 
Many other researchers used Microsoft Kinect or a Leap Motion controller to control the robotic manipulator [14][15][16][17][18][19][20]. Tang and Webb [5] studied the feasibility of gesture control to replace conventional means of direct control through a teach pendant. Zhang et al. [21] proposed gesture control for a delta architecture robot. Cipolla and Hollinghurst [22] investigated a gestural interface for teleoperating a robotic manipulator based on pointing gestures. Using stereo cameras, collineations, and an active contour model, they were able to pinpoint the precise location in the 40 cm workspace of the robot at which the user pointed with a finger. Object picking is another promising application area. A human hand can pick different objects of various sizes and shapes. Therefore, several works aimed to mimic the human hand in order to grasp objects [23][24][25]. An interesting concept was proposed by Razjigaev et al. [26], who developed gestural control for a concentric tube robot. The work focused on the potential use of such an interface in noninvasive surgical procedures. A gestural interface may also be a promising concept in the control of UAV drones; some works [27,28] show potential for future development. Social robotics is another vast research area where gestural interfaces could yield many benefits. Natural interfaces could be especially advantageous in interaction with humanoid robots. Yu et al. [29] proposed gestural control of the NAO humanoid robot. Cheng et al. [30] used this robot and a Kinect sensor to facilitate gestural interaction between humans and the humanoid robot.
Table 1 compares key works related to our interaction method. The vast majority focus mainly on teleoperation of the robotic manipulator, with the user receiving visual feedback of the resulting robot motion. Only three approaches present some form of direct interaction of the operator with the manipulator's workspace, and two of these allow the user to point at and select objects present in the workspace. The major novelty and contribution of our design is that the operator can select objects on the planar surface; furthermore, she or he can point to any spot of the workspace and subsequently navigate the end effector to the desired destination.

Our Approach
In a conventional manufacturing process, a worker usually performs repetitive manual and often unergonomic tasks. Workers do most of these tasks mainly using their hands, such as assembly, machine tending, object handling, material processing, etc. Manual work, however, has many limitations which consequently influence the whole process efficiency. Integrating a collaborative robotic solution could significantly improve efficiency while, at the same time, it can alleviate a human worker's physical and mental workload.
Our proposed concept for the gestural HRI framework for the collaborative robotic arm aims to create such interaction, where the human worker could interact with the robot as naturally and conveniently as possible. Secondly, besides ergonomics, the framework focuses on flexibility, providing the user with functionality that could be used in numerous application scenarios. Lastly, the framework lays basic foundations for further application development, pushing the flexibility even further.
The fundamental principles of the proposed framework are based on the so-called "linear HRI" concept introduced by Tölgyessy et al. [6], which formulates three simple laws for HRI:
1. Every pair of two joints of a human sensed by a robot form a line.
2. Every line defined by the first law intersects with the robot's environment in one or two places.
3. Every intersection defined by the second law is a potential navigation target or a potential object of reference for the robot.
The proposed framework was developed under the laws of linear HRI, where the core functionality lies in the ability to send the end effector of the robotic arm to a specific location on a horizontal/vertical plane using the operator's hand pointing gestures.
Humans naturally use pointing gestures to direct someone's attention to the precise location of something in space. Combined with speech, it can be a powerful communication tool. Pointing gestures have many great use-cases among people to signify the importance of a particular object ("That's the pen I was looking for" (pointing to the specific one)), express intention ("I'm going there!" (pointing to the place)), or specify the exact location of something ("Can you hand me that screwdriver please?" (pointing to the specific location in space)). Pointing gestures are incredibly efficient when the traditional speech is insufficient or impossible due to circumstances, such as loud and noisy environments.
Let us imagine a collaborative application scenario where the worker performs a delicate assembly task that requires specific knowledge. A collaborative robot would fulfill the role of an intelligent assistant/co-worker. The worker could point to the necessary tools and parts out of his reach, which the robot would then bring to him; thus, he could specifically focus on the task while keeping his workspace clutter-free. Due to the natural character of the whole interaction, the worker could control the robot's behavior conveniently without any prior technical skills. Such application would be accessible and easy to use across the entire worker spectrum, with various technical backgrounds. Our gestural framework aims to enable the described interaction in real-world conditions using the principles of linear HRI and state-of-the-art hardware. However, first, the following main challenges need to be addressed:

• Choosing suitable technology and a method of detecting the position and orientation of human joints.
• Calculating the exact position of the intersection between the formed line and the environment surface.
• Selecting the pair of joints to form and calculate the line following the first two principles of linear HRI.
Solving these challenges is crucial for ensuring efficient, reliable, and natural interaction. Additionally, we aimed to make the framework universal, not dependent on the specific robotic platform. For that reason, we chose the Robot Operating System or ROS as our software environment. ROS supports a wide variety of robotic platforms and is considered standard for programming robots in general. For the robot's control and trajectory planning, the ROS package MoveIt was used.
In summary, the primary objectives of our approach lie in natural interaction, application flexibility, and scalability, while following the underlying concept of so-called linear HRI.

The Human Body Tracking and Joint Recognition
Precise recognition and localization of joints are vital in gestural interaction. Several technologies and approaches exist, providing the user with tracking capabilities. According to Zhou and Hu [34], human motion tracking technologies can be divided into non-visual, visual, and robot-aided tracking.
In non-visual tracking, the human motion and joints are mapped via various sensors placed onto the human body; typically, MEMS IMU or IMMU devices are used. These sensors are usually a part of a suit or so-called "data glove", which must be worn by the user [35][36][37][38][39]. Although these systems are precise and relatively cheap, they suffer from many shortcomings, such as the need for the user to wear the sensory suit, which may be uncomfortable and often calibrated for the specific user. The sensors on the suit may require extensive cabling, which can limit the user's range of motion. The sensors themselves may suffer from several issues, such as null bias error, scale factor error, noise, or interference. Due to this, the general application focus shifted towards the use of visual-based human body tracking.
This approach uses optical sensors, such as RGB, IR, or depth cameras, to extract the user's spatial position. Complex image processing algorithms are then applied to pinpoint the precise position and orientation of human joints. These sensors do not require the user to wear any equipment and therefore do not restrict movement. They can be easily calibrated to the specific user, which greatly improves their flexibility. Today's camera technologies and image processing algorithms make them fast, accurate, and relatively cheap. These attributes made these sensors popular and widely used in various commercial applications, mainly in HCI and the video-gaming industry. However, visual-based sensors still have multiple drawbacks, such as sensitivity to lighting conditions and worsened tracking capabilities when the human body is occluded. The most widely recognizable sensor for visual body tracking is the Microsoft Kinect, released in 2010 and initially intended for video gaming. However, due to its capabilities, the sensor became widely used in other applications. The Kinect's principle of body tracking relies on capturing depth data from the depth sensors and then applying image processing methods to produce a so-called "skeletal stream" representing the human figure. The first iteration of the sensor used an IR projection pattern to acquire the depth data. Later versions used the so-called "time-of-flight" (TOF) technology. According to Shotton et al. [40], body part classification (BPC) and offset joint regression (OJR) algorithms, specially developed for Microsoft Kinect, are used to determine the parts of the human body. Following the success of Microsoft Kinect, other similar sensors were released, such as Intel RealSense or ASUS Xtion.
Another widely popular vision-based motion tracking sensor is the Leap Motion controller. This compact device was specially designed to track the human hands and arms at very high precision and accuracy. Alongside gaming and VR applications, the controller's design was initially meant to replace conventional peripheral devices, such as a mouse and keyboard, and provide a more sophisticated and natural HCI. The Leap Motion controller uses two IR cameras, capturing emitted light from three LEDs with a wavelength of 850 nm. The depth data are acquired by applying unique algorithms on the raw data, consisting of infrared brightness values and the calibration data [41]. According to the manufacturer, the accuracy of position estimation is around ±0.01 mm. However, several studies show [42,43] that this is highly dependent on the conditions.

Sensor Choice for the Proposed Concept
Due to the advantages of vision-based body tracking, we decided to use this technology in our concept. We believe that proper HRI should not rely on "human-dependent devices", such as data gloves or body-mounted sensors, but on the perceptive ability of the robot itself, as this, in our opinion, most accurately represents natural and ergonomic interaction.
As for the specific sensor, the Leap Motion controller was picked as the most suitable option. The main reason is the sensor's application focus. The core of the proposed concept centers around the hand gesture interaction between the robot and the human worker, as most manufacturing tasks are done by hand. Furthermore, the whole gestural framework is built on pointing gestures, performed by the arrangement of individual fingers. Leap Motion provides accurate and precise human hand tracking, explicitly focusing on the position and orientation of fingers. Other commercially available sensors on the market are not yet capable of such precision and generally focus on tracking the whole human body. The controller depicted in Figure 1 can track hands within a 3D interactive zone that extends up to 60 cm (24 in) or more from the device in a typical field of view of 140° × 120°, which is illustrated in Figure 2. The controller produces an output in the form of gray-scale images captured by the IR cameras and the skeletal representation of the human hand. The data are shown in Figure 3. Leap Motion's software can differentiate between 27 distinct hand elements, such as bones and joints, and track them even when other hand parts or objects obscure them. The positions of joints are represented in the controller's coordinate system depicted in Figure 4. The connection with the PC is facilitated via USB 2.0 or USB 3.0. The manufacturer also provides a software development kit (SDK) for the Leap Motion controller, allowing developers to create custom applications.

Method Design
The core functionality of the proposed framework follows specific consecutive steps. The operator first performs a pointing gesture, pointing to a particular location in the (planar) workspace of the robot. Then, the robot computes the specified location defined by the planar surface and the half-line formed by the operator's joints. When the operator is satisfied with the pointed place, he performs a command gesture, triggering the signal to move the robot's end effector to the desired location. The whole process is illustrated in Figure 5. The half-line is defined by the direction of the index finger, as it is the most common way to represent a pointing gesture. The following command gesture was specifically designed to be performed naturally, fluently connecting with the previous gesture while not significantly influencing the pointing precision. The command gesture is achieved by extending the thumb, forming a so-called "pistol" gesture. Both gestures are depicted in Figure 6. Custom ROS packages and nodes were created for gesture recognition and intersection computation. The ROS-enabled machine connected to the robotic arm manages these nodes, ensuring proper communication between the individual parts of the whole application. The method's architecture in the ROS framework is depicted in Figure 7. The flowchart of the method's process is in Figure 8. The most prominent geometrical features for the mathematical description of the concept are illustrated in Figure 9. The global coordinate system G defines the planar ground surface π. The Leap Motion sensor determines the position of the joints of the human hand in its coordinate system L; furthermore, the robotic manipulator operates in its coordinate system R. The unification of those coordinate systems is vital to obtain the coordinates of points A and B defining the half-line p, which ultimately determines the coordinates of intersection I.
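The gesture logic described above (index finger alone for pointing, index finger plus thumb for the "pistol" command) can be illustrated with a minimal Python sketch. It is not the authors' ROS node: the `HandState` record, the `classify` function, and the rule that a command fires only on a pointing-to-pistol transition are assumptions introduced here for illustration, and the per-finger flags stand in for whatever the tracking software actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandState:
    """Hypothetical per-frame summary: True means the finger is extended."""
    thumb: bool
    index: bool
    middle: bool
    ring: bool
    pinky: bool


def classify(hand):
    """Return 'pointing', 'command', or 'none' for one tracked frame."""
    others_folded = not (hand.middle or hand.ring or hand.pinky)
    if hand.index and others_folded:
        # Extending the thumb turns the pointing gesture into the "pistol" command.
        return "command" if hand.thumb else "pointing"
    return "none"


def command_triggers(frames):
    """Yield indices of frames where the gesture changes from pointing to command."""
    previous = "none"
    for i, hand in enumerate(frames):
        current = classify(hand)
        if previous == "pointing" and current == "command":
            yield i
        previous = current


# Illustrative frame sequence (made-up data).
OPEN = HandState(True, True, True, True, True)
POINT = HandState(False, True, False, False, False)
PISTOL = HandState(True, True, False, False, False)
frames = [OPEN, POINT, POINT, PISTOL, PISTOL, POINT, PISTOL]
triggers = list(command_triggers(frames))
```

Triggering on the transition rather than on every "command" frame is one way to avoid sending repeated motion requests while the thumb stays extended; whether the original implementation does this is not stated in the text.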
All of the calculations are done in the global coordinate system G. Hence, homogeneous transformations are defined between R and G and between L and G. Analytical geometry and vector algebra are used to calculate the desired coordinates. The data acquired from the Leap Motion sensor are transformed based on Equation (1). The points and vectors used in the calculation are illustrated in Figure 10.
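As an illustration of how a joint position reported in the sensor frame L could be expressed in the global frame G, the following Python sketch applies a 4 × 4 homogeneous transformation to a point. The matrix values are hypothetical (a rotation about the z axis plus a translation chosen only for the example) and do not correspond to the actual mounting of the sensor in the workplace; `make_transform` and `transform_point` are names introduced here.

```python
import math


def make_transform(yaw_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z by yaw_rad, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, p):
    """Map the 3D point p through the homogeneous transform T."""
    ph = [p[0], p[1], p[2], 1.0]  # homogeneous coordinates
    out = [sum(T[r][k] * ph[k] for k in range(4)) for r in range(4)]
    return (out[0], out[1], out[2])


# Hypothetical pose of the sensor frame L expressed in G (millimeters).
T_GL = make_transform(math.pi / 2, (100.0, 0.0, 50.0))

# A fingertip position measured in L (made-up values) mapped into G.
tip_in_L = (10.0, 0.0, 200.0)
tip_in_G = transform_point(T_GL, tip_in_L)
```

The same helper would apply to the transformation between R and G; only the matrix changes.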
The unit direction vector is calculated by dividing the coordinates of vector v by its magnitude. Now, it is possible to define the half-line p by parametric equations.
where t is the parameter of p. The plane π is defined by the x and y axes with an arbitrary normal vector n. Equation (5) can mathematically describe plane π.
By substituting Equation (5) into Equation (4), the parameter t can be calculated as follows.
Now, using substitution, the X and Y coordinates of the intersection I(X, Y) can be calculated by the following equations.
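The equations referenced in this derivation are not reproduced in the text. A plausible reconstruction is sketched below, assuming that π is the ground plane z = 0 of G with normal n = (0, 0, 1) and that points A and B are already expressed in G. The unit vector \hat{\mathbf{v}} and the component subscripts are notation introduced here, and the original equation numbers are deliberately omitted because the mapping cannot be verified.

```latex
\begin{align*}
\mathbf{v} &= B - A, \qquad \hat{\mathbf{v}} = \frac{\mathbf{v}}{\lVert \mathbf{v} \rVert} \\
p:\quad x &= A_x + t\,\hat{v}_x, \quad y = A_y + t\,\hat{v}_y, \quad z = A_z + t\,\hat{v}_z, \qquad t \ge 0 \\
\pi:\quad z &= 0 \\
t &= -\frac{A_z}{\hat{v}_z}, \qquad \hat{v}_z \neq 0 \\
X &= A_x - \frac{A_z}{\hat{v}_z}\,\hat{v}_x, \qquad Y = A_y - \frac{A_z}{\hat{v}_z}\,\hat{v}_y
\end{align*}
```

Under these assumptions, a valid intersection exists only when t ≥ 0, that is, when the finger points towards the plane rather than away from it.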
Similarly, it is possible to calculate the intersection with the plane ω, which is perpendicular to π and is defined by Equation (7).
where d is the distance of the plane along the y axis. The scenario is illustrated in Figure 11.
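The intersection computation for both the horizontal and the vertical plane can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the horizontal plane is taken as z = 0, the vertical plane as y = d, all points are assumed to be expressed in G, and the function names and sample coordinates are hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length direction vector")
    return tuple(c / norm for c in v)


def intersect_half_line(a, b, axis, offset=0.0, eps=1e-9):
    """Intersect the half-line from a through b with the plane coord[axis] == offset.

    axis is 0, 1, or 2 for x, y, or z. Returns the intersection point,
    or None if the half-line is parallel to the plane or points away from it.
    """
    v = unit(tuple(bc - ac for ac, bc in zip(a, b)))
    if abs(v[axis]) < eps:
        return None  # parallel to the plane
    t = (offset - a[axis]) / v[axis]
    if t < 0.0:
        return None  # the plane lies behind the pointing direction
    return tuple(ac + t * vc for ac, vc in zip(a, v))


# Made-up joint positions in G (millimeters): A is the finger base, B the fingertip.
A = (0.0, 0.0, 300.0)
B_down = (30.0, 40.0, 250.0)   # pointing down towards the table
B_wall = (10.0, 100.0, 320.0)  # pointing forwards towards a vertical plane

on_table = intersect_half_line(A, B_down, axis=2, offset=0.0)   # plane z = 0
on_wall = intersect_half_line(A, B_wall, axis=1, offset=500.0)  # plane y = d = 500
missed = intersect_half_line(A, B_wall, axis=2, offset=0.0)     # points upwards
```

Treating both planes as axis-aligned keeps the sketch short; a general plane would need the full dot-product form with its normal vector.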

Method Implementation
The main application focus of the proposed concept is to create a natural HRI workplace where humans and robots can work together efficiently. For this reason, a specialized robotic workplace was built around the core concept's functionality, supporting the ergonomy of the whole interaction between human and the robot, trying to maximize the efficiency and convenience for the worker. Furthermore, it also acts as a modular foundation for implementing, testing, and evaluating other HRI concepts.
The whole workplace, called COCOHRIP, an abbreviation for Complex Collaborative HRI WorkPlace, is depicted in Figure 12. The COCOHRIP consists of three main parts: the sensors, the visual feedback, and the robotic manipulator. The sensory part contains the various sensors that gather data about the human worker and the environment. The visual feedback part consists of two LED monitors, providing the user with visual feedback from the sensor. The robotic part comprises the force-compliant robotic manipulator. All these parts are connected through the ROS machine, which manages all the communication and logic of the application. Transformation matrices between coordinate systems were:

Experimental Evaluation
In this section, we present two experimental scenarios to evaluate the basic concept principles, which quantify the overall usability of the concept and serve as a baseline for further development. The scenarios of each of the two experiments are derived from the proposed gesture pointing method, in which the operator points to a specific location on the workspace plane and performs a command gesture, sending the robot to the desired location. Each of the experiment scenarios has two variants. In the first scenario, the operator points to the designated horizontally placed markers; in the second scenario, the operator points at vertically placed markers. In the first variant, the operator has no visual feedback about where she or he is pointing and must rely solely on a best guess. In the second variant, the user receives visual feedback from the interactive monitors showing the exact place being pointed at. The experimental setup and the precise positions of the markers in the global coordinate system are depicted in Figures 13 and 14. The position values are in millimeters. The process of obtaining one measurement is as follows:
1. The operator's hand is calibrated for optimal gesture recognition.
2. The operator points to the first point and, when satisfied with the estimation, performs a command gesture.
3. When the operator performs a command gesture, the spatial data are logged, and the robot's TCP moves to the pointed location.
4. The operator repeats the previous two steps, iterating through each target point in order, as illustrated in Figures 13 and 14.
In total, the operator points ten times to each of the target markers for the current experiment. During the pointing, the only requirement on the user is to keep the hand approximately 200-250 mm above the Leap Motion sensor, as this is believed to be the optimal distance according to [42] and the empirical data acquired during the method pretesting. The user is encouraged to perform the pointing gestures as naturally as possible and is free to position the body and arm as he sees fit. Seven male participants of different height and physical constitution executed the experiment scenarios.

Results
The measured coordinates for each target marker relative to the desired position are depicted in Figures 15 and 16. The blue markers represent the first variant, where the user had no visual feedback. The red markers represent the second variant of each experiment. The Euclidean distance between each measured point and the desired position was computed to quantify the precision of pointing for each experiment. Figures 17 and 18 show the boxplots of the Euclidean distance data for each experiment scenario. The mean deviation in position and the overall deviation are shown in Tables 2 and 3.
Additionally, the dispersion of measured points in each axis was calculated to evaluate the repeatability of pointing. The results are shown in Tables 4 and 5. All the data are represented in millimeters.
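The two metrics described above can be computed as in the following Python sketch. The text does not state which dispersion measure was used, so the sample standard deviation per axis is an assumption made here, and the coordinates are made-up values, not data from the experiments.

```python
import math
import statistics


def pointing_errors(measured, target):
    """Euclidean distance (mm) between each measured point and the target marker."""
    return [math.dist(p, target) for p in measured]


def axis_dispersion(measured):
    """Sample standard deviation of the measured points along each axis (assumed measure)."""
    xs, ys = zip(*measured)
    return statistics.stdev(xs), statistics.stdev(ys)


# Made-up measurements for a single marker located at the origin (millimeters).
target = (0.0, 0.0)
measured = [(3.0, 4.0), (-3.0, -4.0), (6.0, 8.0), (0.0, 0.0)]

errors = pointing_errors(measured, target)
mean_error = statistics.mean(errors)
disp_x, disp_y = axis_dispersion(measured)
```

The mean of the distances reflects accuracy with respect to the marker, while the per-axis dispersion reflects repeatability independently of any systematic offset.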

Discussion
The data show that the error of pointing to a specific location, relying solely on the user's estimation, is approximately 50 mm. The user's consistency of pointing is around 30-50 mm, depending on the user's distance to the target. The results also show significant improvement in both accuracy and consistency when visual feedback is provided. These results indicate the size of objects that could realistically be manipulated in a real-life scenario. They also highlight the importance of feedback, which should be implemented in potential real use-case applications. However, the data are not yet sufficient to derive general conclusions, and a more thorough analysis is needed. During the experiments, several factors were identified that could significantly influence the accuracy of pointing. The first factor is the posture and position of the human body and the distortion caused by the user's point of view. The second factor is the tracking capability and precision of the Leap Motion controller (LMC). During the experiments, numerous glitches of the LMC were observed. The LMC occasionally misinterpreted the position of joints and fingers, which led to inaccuracies in the measurements. Another factor influencing the results is the choice of joints that represent the half-line intersecting with the working plane. The last identified factor is the influence of the command pistol gesture on pointing accuracy, as extending the thumb may change the intended pointing location. Further evaluation of the method, considering all of the mentioned factors, must be done in the future to determine the full potential of the proposed gestural framework. Overall, we believe that the experiments and the data served their purpose. The main factors influencing the accuracy were identified, and the obtained data can be used as the base for further experiments and evaluation of the method.

Conclusions
In this work, we proposed and successfully implemented a framework for gestural HRI using pointing gestures. The core concept of the method was based on the previous work by Tölgyessy et al. [6], particularly on the idea of linear HRI. In the work, they proposed a similar concept of controlling a mobile robotic platform via pointing gestures. This article further improves the work by extending the overall potential of this innovative concept from mobile platforms to other fields of robotics. Additionally, it promotes the universal nature of linear HRI as the concept uses a different sensory system and a different type of robotic system. The work contains a general overview as well as the mathematical background for the core method. A concept of the collaborative robotic workplace, also known as COCOHRIP, was built around the proposed framework. Practical experiments in a laboratory environment were carried out to prove the described method. The obtained data provided the basis for further development of the whole gestural framework.
The main contribution of this work is a novel approach to the control of the collaborative robotic arm using pointing gestures. In the majority of the cited work, the researchers focus on teleoperation of the robotic arm either through movement mirroring or by binding the control of the robot to a specific gesture. The operator is not directly interacting with the (shared) workspace. The robot is controlled directly by the user, and its movements are entirely subordinate to the operator's movements. In our approach, the operator interacts with the robot through the shared workspace, thus not teleoperating the robot directly. The robot is in the role of an assistant rather than a subordinate. Consequently, this dynamic could facilitate more effective collaboration between humans and robotic systems. Cipolla and Hollinghurst [22] presented the gesture-based interface method that is most comparable to our approach. They similarly use pointing gestures to direct the robot's movement. In their work, hand tracking is achieved using stereo monochromatic CCD cameras. However, their method suffers from shortcomings such as the need for a high-contrast background and specific light conditions. Additionally, the method's workspace is much smaller than ours. Our proposed concept relies on more robust gesture recognition and tracking and is additionally able to recognize multiple gestures. Other cited works in which the user interacts with a workspace require some sort of wearable equipment. We believe the future of HRI lies in the utilization of multiple interconnected visual-based sensors with the ultimate purpose of ridding humans of any additional restrictive equipment, allowing them to interact with robots as naturally and ergonomically as possible, without the need for superior technical skills. Furthermore, the core mathematical method in the proposed framework is universally applicable, not relying on specific hardware or software.
The practical utilization of the proposed framework was outlined in the form of a collaborative robotic assistant that could potentially be used in the manufacturing process, academic field, or research and development. This paper further develops this concept, widening its usability from mobile robotic applications to robotic arm manipulation while utilizing a new state-of-the-art visual-based sensory system. Our framework, along with the built COCOHRIP workplace, could serve as a basis for more complex applications, such as naturally controlled object picking and manipulation, robot-assisted assembly, or material processing. Our future research will focus on improving the system's accuracy based on the identified factors, and designing and implementing more complex algorithms and methods to further increase the usability and flexibility of the proposed gestural framework.