An Integrated Real-Time Hand Gesture Recognition Framework for Human–Robot Interaction in Agriculture

Abstract: Incorporating hand gesture recognition in human–robot interaction has the potential to provide a natural way of communication, thus contributing to a more fluid collaboration toward optimizing the efficiency of the application at hand and overcoming possible challenges. A very promising field of interest is agriculture, owing to its complex and dynamic environments. The aim of this study was twofold: (a) to develop a real-time skeleton-based recognition system for five hand gestures using a depth camera and machine learning, and (b) to enable a real-time human–robot interaction framework and test it in different scenarios. For this purpose, six machine learning classifiers were tested, while the Robot Operating System (ROS) software was utilized for "translating" the gestures into five commands to be executed by the robot. Furthermore, the developed system was successfully tested in outdoor experimental sessions that included either one or two persons. In the latter case, the robot, based on the recognized gesture, could distinguish which of the two workers required help, follow the "locked" person, stop, return to a target location, or "unlock" them. For the sake of safety, the robot navigated with a preset socially accepted speed while keeping a safe distance in all interactions.


Introduction
Toward rendering agricultural practices more targeted, efficient, profitable, and safe, robotic systems seem to be a game changer, also helping to meet the challenge of a shortage of qualified labor [1,2]. To achieve the aforementioned goals, agri-robotic applications are taking advantage of the remarkable technological advancement of information and communications technology (ICT), including artificial intelligence, sensors, and computer vision [3,4]. Robotic solutions have found fertile ground in a number of agricultural applications, such as harvesting, spraying, disease detection, and weeding [5,6]. Usually, robots can carry out programmed actions driven by restricted, task-specific, and noninteractive commands [7]. These prearranged actions can work sufficiently for structured environments, such as stable industrial settings. However, agriculture involves dynamic, ill-defined ecosystems, which are vulnerable to heterogeneity and unforeseeable situations and also involve sensitive live produce [8]. As a means of addressing these challenges, the collaboration of humans and robots constitutes a promising solution through combining the cognitive human capabilities with the repeatable accuracy and strength of robots. The so-called human-robot interaction (HRI) in the agricultural sector is anticipated to provide [...]
As far as the incorporation of hand gesture recognition in agricultural human-robot collaborative systems is concerned, the literature is still scarce. A part of the study presented by Vasconez et al. [46,47] investigated this nonverbal means of communication in conjunction with a faster region-based convolutional neural network (R-CNN) intended for improving the process of avocado harvesting in a simulated workspace. In particular, in [46], a robot could detect the workers and judge if they required assistance in one of the following ways: detection of hand gestures, flag detection, and human activity recognition. Moreover, Zhang et al. [48] studied a virtual assembly system regarding a harvester that was able to recognize dynamic hand gestures relying on a CNN/LSTM (long short-term memory) network.
Taking into account the great potential of using hand gesture recognition for enhancing the efficiency of agricultural collaborative systems [9,10,49], as well as for achieving fluid and safe interaction [50], the scope of the present study was to create a robust innovative paradigm. More specifically, it was aimed at developing an integrated automated communication framework between an unmanned ground vehicle (UGV) and the worker via hand gesture recognition such that the former "collaborator" assists humans when they require its help. Moreover, it focused on testing this synergistic ecosystem in different scenarios in an orchard. This study is an extension of the papers recently published on HRI that included a worker and a UGV for optimizing crop harvesting [51][52][53]. In particular, this application is based on the synchronization of the actions of the UGV and workers, with the UGV following the workers while harvesting and undertaking the transport of crates from the harvesting site to a cargo vehicle. In [51], the deposit height of the robot was evaluated in the framework of occupational health risk prevention, while, in [52,53], wearable sensors were used for the identification of human activity signatures. We now add real-time hand gesture recognition to this line of work for the sake of optimizing natural communication and safety by combining depth videos and skeleton-based recognition. To the authors' knowledge, it is the first time that such a system is verified in an orchard and not in a simulated environment [46][47][48], showing that it is able to face the aforementioned challenges related to vision sensors and outdoor settings.
The remainder of the present paper is structured as follows: Section 2 is divided into two main subsections. The first one describes the developed framework for achieving automatic hand gesture recognition along with the data acquisition process and information about the ML approach. The second subsection elaborates on the methodology regarding the "translation" of the automated hand gesture recognition into practical commands to be executed by a UGV in different outdoor instances. Subsequently, the results are presented in Section 3 regarding the performance of the examined ML algorithms for capturing the hand gestures (Section 3.1), as well as the efficiency of the proposed system in different situations (Section 3.2). Lastly, in Section 4, concluding remarks are drawn in tandem with a discussion in a broader context along with future research directions.

Data Acquisition
The data acquisition process is an essential element in order to properly train an ML algorithm. In the present study, the data acquisition was performed using a ZED 2 depth camera (Stereolabs Inc., San Francisco, CA, USA) placed at a fixed height of 88 cm on a UGV (Thorvald, SAGA Robotics SA, Oslo, Norway). Five participants were involved in the present acquisition of data on both sunny and cloudy days at different sites of an orchard with Pistacia vera trees in the region of Thessaly, in central Greece. Additionally, the distance between the participant and the camera was changed in favor of increasing variability. Concerning the speed of the tracking, it was set equal to 60 fps, while the focus was on the left hand. The predefined hand gestures, which are depicted in Figure 1a-e, are commonly known as "fist", "okay", "victory", "flat", and "rock". For the purpose of identifying the position of hands, the MediaPipe framework was implemented [54], similarly to recent studies such as [55][56][57]. MediaPipe is an open-source framework created by Google that offers various ML solutions for object detection and tracking. An algorithm was developed using Python that implements this framework for gesture recognition, by using the holistic solution including the hand detection (for the left hand) and excluding the face detection. As described in [58], a model regarding the palm detection works on the entire image and returns an oriented bounding frame of the hand. A model of hand landmarks then works on the cropped picture and returns 3D keypoints via a regression process. Each of the identified landmarks has distinct relative x, y coordinates to the picture frame. Consequently, the requirement for data augmentation, including translation and rotation, is remarkably reduced, while the previous frame can be used in case the model cannot identify the hand so as to re-localize it. 
Moreover, the MediaPipe model "learns" a consistent hand pose representation and turns out to be robust even to self-occlusions and partially visible hands. Lastly, the developed algorithm computes the Euclidean distance between all the identified landmarks along with the shoulder and elbow angles. These angles refer to the body side that corresponds to the tracked hand. In Figure 2, the landmarks are depicted for each of the examined hand gestures, with the images taken in the laboratory for the sake of better distinguishing the landmarks.
The dataset, which was used for training, consisted of 4819 rows created from images taken from the RGB-D camera in the orchard. Each row of the dataset represents the generated landmarks from the MediaPipe hand algorithm for each video frame. On the other hand, each column (except the last three) indicates the Euclidean distance between two landmarks in the given frame. The last three columns represent the shoulder angle, the elbow angle, and the class, respectively. As mentioned above, the calculation of the Euclidean distances and angles was carried out by the developed algorithm, while the labeling of the hand gesture was performed manually for each case. In general, when using an ML algorithm, it is of central importance to train it by utilizing a dataset with about the same number of samples for each class. In this analysis, since the participants carried out the five preset hand gestures in the same time period, the classes resulted in approximately the same number of instances and, consequently, in balanced classes, as depicted in Figure 3.
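As a rough illustration of how one such dataset row could be assembled, the sketch below computes all pairwise Euclidean distances between 2D hand landmarks and appends two joint angles and the class label. It is not the algorithm developed in the study: the function names are hypothetical, and the choice of hip, shoulder, elbow, and wrist points for defining the two angles is an assumption made only for this example.

```python
import math
from itertools import combinations


def pairwise_distances(landmarks):
    """Euclidean distance between every unordered pair of (x, y) landmarks."""
    return [math.dist(p, q) for p, q in combinations(landmarks, 2)]


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def feature_row(hand_landmarks, hip, shoulder, elbow, wrist, label):
    """Distances first, then shoulder angle, elbow angle, and class label.

    Assumption: the shoulder angle is taken between hip-shoulder-elbow and
    the elbow angle between shoulder-elbow-wrist.
    """
    return pairwise_distances(hand_landmarks) + [
        joint_angle(hip, shoulder, elbow),
        joint_angle(shoulder, elbow, wrist),
        label,
    ]
```

With the 21 landmarks that the MediaPipe hand model returns, this layout would give 210 distance columns followed by the two angles and the label.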

Data Preprocessing
Due to the type of the aggregated data, only two preprocessing actions were required. The first was the oversampling procedure using the SMOTE algorithm [59] for the Scikit-learn library [60]. The latter is a Python module incorporating a variety of ML algorithms for both supervised and unsupervised problems. The second preprocessing action dealt with data normalization, for which the min-max scaler of the same library was used.
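The two preprocessing steps can be sketched in plain Python as follows. This is a simplified illustration rather than the library code used in the study: `min_max_scale` applies the usual rescaling x' = (x - min) / (max - min) per column, and `smote_like_sample` (a hypothetical helper) interpolates between a minority sample and its single nearest neighbour, whereas SMOTE proper chooses among the k nearest neighbours.

```python
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1]: x' = (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def smote_like_sample(minority, rng):
    """One synthetic point on the segment between a sample and its nearest neighbour."""
    base = rng.choice(minority)
    others = [m for m in minority if m is not base]
    neighbour = min(
        others, key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m))
    )
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]
```

Passing an explicit `random.Random` instance keeps the synthetic samples reproducible between runs.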

Machine Learning Algorithms Tested for Classification Predictive Modeling
Classification commonly refers to problems aiming at recognition and grouping of objects and data into preset categories. The algorithms, which are used as classifiers in ML, make use of input training data in order to predict the probability that the data fall into one of the predetermined classes. There are a plethora of classification algorithms available in the literature [61][62][63]. In the present analysis, the following algorithms were used:

• Logistic regression (LR): Applied to estimate discrete values from a set of independent variables for predicting the likelihood of an event through fitting data to a logit function;
• Linear discriminant analysis (LDA): A technique whose objective is to project the features of a higher-dimensional space onto a lower one with the intention of avoiding dimensional costs;
• K-nearest neighbor (KNN): A pattern recognition algorithm, which utilizes training datasets so as to find the closest neighbors of future samples;
• Classification and regression trees (CART): A tree-based model relying on a set of "if-else" conditions;
• Naïve Bayes (NB): A probabilistic classifier, which assumes that the existence of a specific feature in a class is independent of the existence of any other feature;
• Support vector machine (SVM): A methodology in which raw data are plotted as points in an n-dimensional space (n represents the number of features). Subsequently, each feature's value is tied to a specific coordinate so as to enable data classification.
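Of the classifiers listed above, KNN is the simplest to write out in full. The following minimal sketch, with a hypothetical function name and an arbitrary default of k = 3, predicts the majority label among the k training samples closest to a query vector; it is meant only to illustrate the idea and is not the Scikit-learn implementation.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training samples closest to the query."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda sample: math.dist(sample[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Because the prediction depends directly on distances, features on very different scales would dominate the vote, which is one reason min-max normalization matters for this classifier.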

Real-Time Human-Robot Interaction Based on Hand Gesture Recognition
In this subsection, the methodology pipeline regarding the implementation of the automated hand gesture recognition is described as a means of achieving robust HRI in different cases. In brief, a UGV can detect the worker requiring assistance and lock onto them ("fist" gesture). Once the person has been locked onto, they are the only authorized worker who can collaborate with the UGV by using one of the other four commands/gestures, thus minimizing potential safety concerns [6]. Subsequently, on the basis of the detected hand gesture, the UGV can (a) unlock the selected worker ("okay" gesture), (b) follow the authorized worker while keeping a safe distance ("victory" gesture), (c) stop the current action ("flat" gesture), or (d) return to a predetermined target location ("rock" gesture). This location can be a cargo vehicle, similarly to [46], or any other site of the field according to the application at hand. The five available gestures, each corresponding to a single class and a unique action to be executed by the UGV, are summarized in Table 1.

Table 1. Summary of the examined hand gestures along with the corresponding classes and actions to be executed by the UGV.

Recognized Hand Gesture | Corresponding Class | UGV Action
"Rock" | 3 | Return to the target location
"Victory" | 4 | Follow

The implemented UGV was equipped, apart from the RGB-D camera needed for gesture recognition, with several sensors to facilitate the present agricultural application. In particular, a 3D laser scanner (Velodyne Lidar Inc., San Jose, CA, USA) was used in order to scan the surrounding environment and generate a two-dimensional map, and an RTK GPS (S850 GNSS Receiver, Stonex Inc., Concord, NH, USA) and inertial measurement units (RSX-UM7 IMU, RedShift Labs, Studfield, Victoria, Australia) were used for providing information about velocity, positioning of the UGV (with latitude and longitude coordinates), and time toward optimal robot localization, navigation, and obstacle avoidance. To cope with the current requirements in computational power, all the vision-oriented operations were orchestrated by a Jetson TX2 Module [64] with CUDA support. In Figure 4, the UGV platform used in the present study is depicted equipped with the necessary sensors. The selection of the above sensors was made on the basis of the capability for Robot Operating System (ROS) integration [65], which is an open-source framework related to robotic applications and constitutes an emerging framework in various fields, including agriculture [66,67]. When reconfiguring a robot, the implemented software should be modular and capable of being adapted to new configurations without requiring recompilation of the robot's code. Toward that direction, ROS was selected to be the software of the present study. All processes running on the UGV are registered to the same ROS master, constituting nodes within the same network [68]. The ROS framework was also used to navigate the UGV in the orchard. The accurate position was provided by the onboard RTK-GNSS system. In Figure 5, the operation flow of the developed system is presented. In brief, as an input, video frames from the depth camera are used.
The image data are then transferred from an ROS topic to an ROS node that extracts the RGB channels along with the depth values from each frame. As a next step, the MediaPipe algorithm node uses the RGB channels for extracting all the essential landmarks to feed into the developed ML algorithm for gesture recognition. In parallel with the previous step, the RGB image data are passed via a topic to a node associated with human detection. To that end, the YOLO v3 model pretrained on the COCO dataset [69] was used. To communicate with the rest of the system, an ROS node was developed that requires as an input topic the RGB image of the depth camera. The output of this node is a topic with a custom message that consists of the relative coordinates of the bounding box of the identified person and the distance between the UGV and the collaborating person. In the gesture recognition node, similarly to human recognition, the input topic consists of the RGB image. The gesture recognition node implements the MediaPipe algorithm and predicts the hand gestures. The output topic constitutes, in practice, the probability of the identified gesture. Once a gesture is detected and published, the HRI node relates it with the identified person and publishes an ROS transform (tf) between the robot and the person using a distinct ROS node.
The tf publisher/subscriber is a tool provided by the ROS framework that keeps the relationship among coordinate frames in a tree structure, allowing the user to transform vectors, points, and so on. The tree structure enables dynamic changes to this structure requiring no additional information, apart from the directed graph edge. In case an edge is published to a node referencing a separate parent node, the tree is going to resolve to the new parent node. The main purpose of this tool is to create a local coordinate system from the robot and its surrounding environment. In most cases, the "0, 0" point of each local coordinate system is either the first registered point of the operational environment during the mapping procedure or the center of the robot. In more complex entities (for example, the UGVs), the entity consists of multiple tf publishers (one for each distinct part). For the scope of this study, the human entities consist of one tf publisher. Further details about the tf library can be found in [70].
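What chaining frames along such a tree amounts to can be shown with a small planar example. The sketch below does not use the ROS tf API; the function `compose` and the frame names in the usage note are hypothetical, and poses are reduced to (x, y, yaw) for brevity.

```python
import math


def compose(parent_to_child, child_to_point):
    """Chain two 2D poses (x, y, yaw): express the inner pose in the outer frame."""
    x, y, yaw = parent_to_child
    px, py, pyaw = child_to_point
    return (
        x + px * math.cos(yaw) - py * math.sin(yaw),
        y + px * math.sin(yaw) + py * math.cos(yaw),
        yaw + pyaw,
    )
```

For instance, if the robot sits at (2.0, 1.0) in a map frame with a yaw of 90 degrees and a person is detected 1.5 m straight ahead of it, `compose((2.0, 1.0, math.pi / 2), (1.5, 0.0, 0.0))` places the person at approximately (2.0, 2.5) in the map frame.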
The main scope of the HRI node is to control the UGV behavior. On the basis of the identified hand gesture (class), summarized in Table 1, the algorithm performs the following steps:
• Waits until the identified person performs a gesture. When the gesture is registered, the tf publisher is initialized (class 0). For each identified human, a unique identifier (ID) is assigned. When one of the identified persons performs the lock gesture, the HRI node is activated. Finally, the person who performed the lock gesture has the remit to control the UGV and collaborate with it. In order for a person with a different ID to be able to control the UGV, the "unlock" command must be detected;
• "Unlocks" the person and removes the tf publisher (class 1);
• Enables a tf following sequence and obstacle avoidance. The UGV moves autonomously in the field while keeping a predefined safe distance. This must be greater than or equal to 0.9 m, similarly to [46], in order to be within the so-called socially acceptable zone (class 2);
• Disables the tf following sequence, but does not "unlock" the person (class 3);
• Navigates the UGV to a specific predefined location (class 4).
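The steps above can be read as a small state machine. The sketch below is one possible interpretation, not the HRI node itself: the class name, the mode names, and the choice to key transitions on gesture names rather than class numbers are all assumptions made for illustration.

```python
class HriStateMachine:
    """Tracks which person is locked and which behavior the UGV should run."""

    def __init__(self):
        self.locked_id = None  # ID of the authorized person, if any
        self.mode = "idle"

    def on_gesture(self, person_id, gesture):
        # Nobody is locked yet: only the lock gesture has any effect.
        if self.locked_id is None:
            if gesture == "fist":
                self.locked_id = person_id
                self.mode = "locked"
            return self.mode
        # Gestures from anyone other than the locked person are ignored.
        if person_id != self.locked_id:
            return self.mode
        if gesture == "okay":        # unlock and forget the person
            self.locked_id = None
            self.mode = "idle"
        elif gesture == "victory":   # follow at a safe distance
            self.mode = "following"
        elif gesture == "flat":      # stop, but stay locked
            self.mode = "stopped"
        elif gesture == "rock":      # go to the predefined location
            self.mode = "returning"
        return self.mode
```

Keeping the lock check ahead of every other transition is what ensures that a second person in view cannot redirect the UGV until an "unlock" has been received.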
In order for the UGV to successfully keep a safe distance and velocity from the identified people and possible obstacles, the open-source package "ROS Navigation Stack" [71,72] was used similarly to previous studies [67,73]. An ROS node creates a tf publisher from the robot for the identified obstacles and locked person and, according to the tf values, a velocity controller keeps a safe distance. This package requires that the UGV runs ROS, has a tf transform tree, and publishes sensor data by utilizing the right ROS message types [71].
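The distance-keeping behavior can be illustrated with a simple proportional speed rule. This is a sketch only and not the controller of the ROS Navigation Stack: the function name and the gain are hypothetical, while the 0.9 m distance and the 0.2 m/s speed limit are the values reported in this paper.

```python
def follow_speed(distance_m, safe_distance_m=0.9, max_speed_mps=0.2, gain=0.5):
    """Forward speed that closes the gap to the person without entering the safe zone."""
    error = distance_m - safe_distance_m
    if error <= 0.0:
        return 0.0  # at or inside the safe distance: do not advance
    return min(max_speed_mps, gain * error)
```

Far from the person the command saturates at the speed limit, and it falls to zero as the measured distance approaches the safe distance.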

Results
This section elaborates on the results obtained from the six ML algorithms utilized for detecting the five different hand gestures. In addition, preliminary results are summarized for testing step by step the efficiency of this nonverbal communication in real field conditions for three different scenarios.

Comparison of the Machine Learning Algorithms Performance for Classification of the Hand Gestures
Confusion matrices constitute a useful and popular measure for evaluating and visualizing the performance of ML algorithms while solving both binary and multiclass classification problems by comparing the actual values against those predicted by the ML model. In particular, a confusion matrix is an N × N matrix, where N stands for the number of classes. In the present analysis, N was equal to the number of examined hand gestures, i.e., five, as depicted in Figure 6, while the correspondence between the class numbers and the hand gestures is presented in Table 1. The confusion matrices derived using the six ML algorithms, namely LR, LDA, KNN, CART, NB, and SVM, are presented in Figure 6a-f, respectively. The diagonal cells show the number of hand gestures that were correctly classified, whereas the off-diagonal cells show the number of misclassified gestures. Higher diagonal values in Figure 6 indicate a better performance of the corresponding algorithm. Overall, the most problematic hand gestures were the "rock" (label 3) and "victory" (label 4), especially for LR, LDA, CART, and SVM algorithms.
The classification report depicted in Table 2 summarizes the individual performance metrics for each ML algorithm. In simple terms, these metrics indicate the proportion of (a) true results among the entire number of examined cases (accuracy), (b) predicted positives that turned out to be truly positive (precision), and (c) actual positives that turned out to be correctly predicted (recall). Lastly, the F1-score combines the precision and recall by considering their harmonic mean. As can be deduced from Table 2, KNN, LDA, and SVM seemingly outperformed the other investigated algorithms, while the NB algorithm gave the worst results.
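The link between a confusion matrix and these metrics can be made concrete with a short sketch. The function name is hypothetical, and the convention assumed here is that rows hold the actual classes and columns the predicted ones.

```python
def report_from_confusion(cm):
    """Accuracy plus per-class (precision, recall, F1) from a confusion matrix.

    cm[i][j] counts samples of actual class i that were predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum: TP + FP
        actual = sum(cm[i])                          # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_class.append((precision, recall, f1))
    return accuracy, per_class
```

Averaging the per-class tuples would give macro-averaged scores; whether Table 2 reports macro or weighted averages is not stated here, so the sketch stops at the per-class values.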
Figure 7 depicts the box plots pertaining to the accuracy of the investigated classifiers as a means of offering a visual summary of the data regarding the identification of the mid-point of the data (the orange line), as well as the skewness and dispersion of the dataset. As an example, if the median is closer to the bottom or to the top of the box, the distribution tends to be skewed. In contrast, if the median is in the middle of the box, and the length of the whiskers is approximately the same on both sides, the distribution is symmetric. In addition, the interquartile range (IQR), i.e., the range of the box, includes 50% of the values; thus, a longer box indicates more dispersed data. Concerning the present analysis, the NB classifier appeared to be positively skewed, while outliers were also observed (plotted as circles beyond the whiskers) for this case. Focusing on the KNN, LDA, and SVM, which demonstrated the best performance, the obtained accuracy values had about the same median and were not symmetrical, spreading out more on the upper side. Moreover, KNN and LDA had approximately the same range of data, which was larger than the corresponding range of SVM. Lastly, the dispersion of accuracy values was the same for the aforementioned three algorithms, as can be deduced from their IQR values.

Demonstration of the Proposed System in Different Scenarios
For safety purposes, a step-by-step approach was followed. We initially tested the efficacy of the proposed framework in a case where only one person shared the working environment with the UGV. The latter was stationary at a specific farm location, waiting to lock onto its "collaborator" once the "fist" gesture was detected. Taking into account the above results concerning the performance of the investigated ML algorithms in predicting the five hand gestures, only the KNN algorithm was used for brevity. The UGV was able to respond to all the commands given via the hand gestures (Table 1), recorded by the RGB-D camera and recognized by the proposed framework. The fact that the UGV aborted its current task whenever the "flat" gesture was recognized is very important considering that a possible failure to comply with this command can provoke harmful consequences [10]. Lastly, a maximum speed of 0.2 m/s and a minimum safe distance of 0.9 m [46], which were programmed in the developed model, were necessary prerequisites on all occasions.
In the next experimental set, a second person was added in the working region to investigate how the proposed framework could cope with this challenge. It can be concluded that, again, the robot was able to accurately "lock" onto the selected participant in spite of the presence of the second person while also obeying the other four commands. During this scenario, although the UGV could obey only the authorized person, it was able to recognize the second person and avoid them similarly to the other existing obstacles of the orchard. Figure 8 illustrates the case where the UGV was in the "following" mode for the locked person in the case of a single person (Figure 8a) or when both persons were in the field of view of the RGB-D camera (Figure 8b). Figure 8 was generated by RVIZ [74], which is a 3D visualization tool for the ROS software. Lastly, the proposed synergistic HRI system was tested in a more complex scenario. In this case study, two persons were present in the same working space with the UGV. The first person asked to be locked onto by the UGV ("fist" gesture) (Figure 9a), and then requested to be followed ("victory" gesture) (Figure 9b) before performing a stop command ("flat" gesture) (Figure 9c). Afterward, the locked person asked to be unlocked ("okay" gesture) (Figure 9d). A second person in the field of view of the RGB-D camera asked to be locked onto (Figure 9e). By locking onto the second person, the UGV followed the participant (Figure 9f), and was then ordered to go to the target site of the farm (Figure 9g), where it arrived after navigating a few meters (Figure 9h) following the set of gestures presented in Table 1. Some indicative images taken from the pilot study are shown in Figure 9a-h, while a video is also provided in the Supplementary Materials (Video S1).

Discussion and Main Conclusions
In the present study, a very crucial aspect of HRI was investigated to meet the needs of enabling natural communication. In this context, we proposed a nonverbal interaction via real-time hand gesture recognition captured by a depth camera. This combination has the potential to overcome common problems associated with illumination changes and complex backgrounds, providing reliable results [13]. In the direction of additionally facing the challenges of dynamic agricultural environments, a skeleton-based recognition methodology was designed on the basis of ML. The accuracy of six ML algorithms (LR, LDA, KNN, CART, NB, and SVM) was tested in this study on five certain hand gestures representing different actions that the UGV, equipped with an RGB-D camera and other sensors, could carry out on the basis of the detected commands. Among the examined classifiers, KNN, LDA, and SVM demonstrated the best performance.
Firstly, the UGV must detect the assigned hand gesture to "lock" onto the person with whom it is going to collaborate in a relatively crowded environment. To that end, a distinct hand gesture is used, which represents the first class of the ML implementation. Once the person is locked onto, the UGV can perform four other actions once the corresponding hand gesture is detected, namely, following its collaborator while keeping the preset safe distance and speed, returning to the destination site, which may be the cargo vehicle zone similarly to [46] or any other site depending on the application, stopping its current task at any time, or unlocking the person.
In summary, the present integrated hand gesture recognition framework incorporated and combined several kinds of ICT. In particular, the data acquisition was performed using a ZED 2 depth camera (Stereolabs Inc., San Francisco, CA, USA) mounted on a UGV (Thorvald, SAGA Robotics SA, Oslo, Norway). The open-source MediaPipe framework [56] was used for object detection and tracking in conjunction with an algorithm developed for the present study. For the preprocessing of the data, the SMOTE algorithm [57] was implemented. Furthermore, the UGV used in this study was equipped with a variety of sensors including (i) the depth camera mentioned above, (ii) a 3D laser scanner (Velodyne Lidar Inc., San Jose, CA, USA), (iii) an RTK GPS (S850 GNSS Receiver, Stonex Inc., Concord, NH, USA), and (iv) IMUs (UM7 IMU, RedShift Labs, Studfield, Victoria, Australia). To combine these technologies, an algorithm was developed within the ROS framework using Python, whose operational flow is described in Figure 5. Moreover, different tf publishers were developed within the tf ROS package [70], while the ROS navigation stack package [72] was used for safe navigation. To meet the increased requirements necessary for the detection ML model, a Jetson TX2 Module [64] with CUDA support was embedded in the robotic platform.
Overall, the challenge of properly classifying the five hand gestures was successfully fulfilled, while the preliminary results on real conditions are encouraging enough. In fact, three different scenarios took place considering either one or two persons. In all cases, the KNN classifier was used for the sake of brevity. It was demonstrated that a fluid HRI in agriculture can be successful in sharing the same working space with the robot being navigated within a predetermined safe range of speed and distance.
Toward establishing a viable solution, each aspect of HRI design (e.g., human factors, interaction roles, user interface, automation level, and information sharing [10]) should be tackled separately, so as to identify and address possible issues. We tried to add human comfort, naturalness, and the sense of trust to the picture, which are considered very important in HRI and have received limited attention in agriculture so far. Comfort and trust can contribute, to a great extent, to the feeling of perceived safety, which is among the key issues that should be guaranteed in synergistic ecosystems [6,75,76]. In other words, users must perceive robots as safe to interact with. On the other hand, naturalness refers to the social navigation of the robot through adjusting its speed and distance from the workers.
To conclude, the present study investigated the possibility of using nonverbal communication based on hand gesture detection in real conditions in agriculture. This framework can also be implemented in other sectors depending on the specific application at hand. Possible future work includes the expansion of the present system to incorporate more hand gestures for the purpose of other practical applications. Toward that direction, recognition models that can identify a sequence of gestures could also be useful. In addition, it would be interesting to examine the potential of combining hand gesture recognition with human activity recognition [52,53] in order to propose a more fluid HRI and test the framework in more complex applications. More experimental tests should be carried out in different agricultural environments to identify possible motion problems of the UGV, which may lead to the application of cellular neural networks [77] to address them, similar to previous studies [78]. Lastly, future research could involve workers operating at their own pace in real conditions and investigate not only the efficiency of the system, but also the workers' opinions in order to provide a viable and socially accepted solution [6].
Supplementary Materials: The following supporting information can be downloaded at https://drive.google.com/file/d/1qHJXW5NFognF3mOVRxV1y09juVIHb4yz/view. Video S1. Video recording from an experimental demonstration session for testing the efficacy of the proposed synergistic framework.