GazeEMD: Detecting Visual Intention in Gaze-Based Human-Robot Interaction

In gaze-based Human-Robot Interaction (HRI), it is important to determine human visual intention for interacting with robots. One typical HRI scenario is that a human selects an object by gaze and a robotic manipulator picks up the object. In this work, we propose an approach, GazeEMD, that can be used to detect whether a human is looking at an object for HRI applications. We use the Earth Mover's Distance (EMD) to measure the similarity between the hypothetical gazes at objects and the actual gazes. Then, the similarity score is used to determine if the human visual intention is on the object. We compare our approach with a fixation-based method and HitScan with a run length filter in the scenario of selecting daily objects by gaze. Our experimental results indicate that the GazeEMD approach has higher accuracy and is more robust to noise than the other approaches. Hence, users can lessen their cognitive load by using our approach in real-world HRI scenarios.


Introduction
Mobile eye-tracking devices, i.e., eye-tracking glasses, usually comprise eye camera(s) for detecting the pupils and a world camera for capturing the image of the scene. Gaze is calculated from the pupil images and projected onto the image of the scene, which can reveal a human's visual intention. Gazes can be identified as different eye movements. Fixation and saccade are two of the most common types of eye movement events. Fixation can be viewed as gaze being stably kept in a small region, and saccade can be viewed as rapid eye movement [1]. They can be computationally identified from eye tracking signals by different approaches, such as Identification by Dispersion Threshold (I-DT) [2], Identification by Velocity Threshold (I-VT) [2], Bayesian-method-based algorithms [3] and machine-learning-based algorithms [4].
In gaze-based HRI, fixation is often used as an indication of the visual intention of a human. In [5], when a fixation is detected, an image patch is cropped around the fixation point and fed to a neural network to detect a drone. In [6], fixation is used to determine if a human is looking at Areas of Interest (AOIs) in a mixed-initiative HRI study. In [7,8,9], fixations were used for selecting an object to grasp; they were also used for selecting a grasping plane of an object in [7]. However, there are limitations to using fixation to select an object for further actions. Consider a scenario such as that displayed in Figure 1, where a human wearing a mobile eye-tracking device can select detected objects on the table by gaze and then let a robot pick the object up for him or her. The robot receives the information of the selected object and plans the grasping task; it does not share the human gaze points.
One approach to determine if the human visual intention is focused on an object is to use fixation. When a fixation event is detected, the gaze points in the event are in a small region in the world image. If the fixation center is on the object, the visual intention is considered to be on the object. However, the gaze points identified as saccades are discarded, so saccadic gazes on the object are lost. Intuitively, considering all gaze points can overcome this problem. HitScan [11,12] uses all gaze points when determining if the visual intention is on the object. However, there exists one more kind of noise, which we refer to as gaze drift error; both the fixation-based method and HitScan have this problem. On some occasions, the center of a gaze point may fall outside the bounding box while the human is still looking at the object. This kind of noise has various sources. The first is the fluctuation of the size of the bounding box caused by the object detection algorithm. Figure 2b gives one example: it shows two consecutive frames in one sequence, where the gazes in the two frames are located at the same position on the object, but the bounding box of the object changes. Second, a poor calibration would also result in this error. Moreover, the head-mounted mobile eye tracking device may accidentally be moved after calibration, and the detected gaze will be shifted. Both the fixation-based method and HitScan are based on checking if the gaze/fixation center is inside the object bounding box, so both suffer from the gaze drift error. In addition, the fixation-based method loses information due to saccades inside the bounding box.
When using gaze to select an object to interact with a robot, one issue is that the robot does not know if the human has decided on an object for interaction, even if the human visual intention is on the object. This is the Midas problem [13]. One solution is to use a long dwell time to confirm the selection [14]. Using fixation or HitScan with a long dwell time will also be less efficient due to the saccades and gaze drift error.
We propose GazeEMD, an approach to detect human visual intention which can overcome the limitations mentioned above. We form the question of detecting visual intention from a different perspective than checking if the gaze points are inside a bounding box. We compare the hypothetic gaze distribution over an object and the real gaze distribution to determine the visual intention. For a detected object, we generate sample gaze points within the bounding box. They can be interpreted as a hypothesis of where a human being's visual focus is located. These sample points are formed as the hypothetic gaze distribution. The gaze signals from the mobile eye-tracking device provide information of the actual location a human is looking at. The actual gaze points are formed as the actual gaze distribution. The similarity between hypothetic gaze distribution and actual gaze distribution is calculated by Earth Mover's Distance (EMD) distance. The EMD similarity score is used to determine if the visual intention is on the object. We conduct three experiments and compare GazeEMD with a Fixation-based approach and HitScan. The results show that the proposed method can significantly increase accuracy in predicting human intention with the presence of saccades and gaze shift noise.
To the best of our knowledge, we are the first to deploy EMD similarity to detect if the visual intention is on an object. Until recently, the fixation-based method and the method checking all gazes were still widely used in HRI applications, although both have the problem of gaze drift error, and little research focusing on solving this problem has been reported. The novel contributions of our work compared to the state-of-the-art are:

1. We propose GazeEMD, which can overcome the gaze drift error that the current state-of-the-art methods do not solve, and which captures saccadic eye movements when referring objects to a robot;
2. We show that GazeEMD is more efficient than the fixation-based method and HitScan when confirming a selection using a long dwell time, which has not been reported in the literature before. The eye gaze is not required to be held in a small region.
The rest of the paper is organized as follows. In Section 2, we review the related work. In Section 3, we explain our method in detail. We describe the experimental setup and evaluation method in Sections 4 and 5. Section 6 presents the results, and Section 7 is the discussion.

Related Work
The I-DT and I-VT [2] are two widely used algorithms for identifying fixation events. I-DT detects fixations based on the locations of gazes: if the gazes are located within a small region, i.e., the coordinates of the gazes are under the dispersion threshold, the gazes are considered a fixation event. I-VT detects fixations based on the velocities of gazes: gazes whose velocities are under the velocity threshold are considered a fixation event.
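A dispersion-based identifier in the spirit of I-DT can be sketched as follows. This is a minimal illustration, not the paper's implementation; the dispersion measure (x range plus y range), the 30-pixel threshold and the 5-sample minimum window are our assumptions:

```python
import numpy as np

def idt_fixations(gaze, dispersion_thresh=30.0, min_samples=5):
    """Identification by Dispersion Threshold (I-DT), a minimal sketch.
    gaze: (N, 2) array of gaze coordinates in the world image (pixels).
    Dispersion is measured as (x range + y range); the threshold and
    window length here are illustrative, not the paper's parameters.
    Returns (start, end) index pairs of fixation events, end exclusive."""
    def dispersion(w):
        return np.ptp(w[:, 0]) + np.ptp(w[:, 1])

    fixations = []
    i, n = 0, len(gaze)
    while i + min_samples <= n:
        j = i + min_samples
        if dispersion(gaze[i:j]) <= dispersion_thresh:
            # Grow the window while the dispersion stays under the threshold.
            while j < n and dispersion(gaze[i:j + 1]) <= dispersion_thresh:
                j += 1
            fixations.append((i, j))
            i = j
        else:
            i += 1  # saccadic sample; slide the window forward
    return fixations

# Synthetic trace: a fixation, a straight saccade, then a second fixation.
rng = np.random.default_rng(0)
fix1 = rng.normal([100, 100], 1.5, size=(30, 2))
sacc = np.linspace([100, 100], [400, 300], 10)
fix2 = rng.normal([400, 300], 1.5, size=(30, 2))
events = idt_fixations(np.vstack([fix1, sacc, fix2]))
```

On this trace the identifier recovers the two stable periods and skips the saccadic samples between them.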
When a fixation event is detected, the fixation center is compared with the bounding box of the object to determine the visual intention. Some works use all gaze points to detect visual intention on objects. In [14,15], the accumulated gazes on objects are used to determine the object at which the user is looking. HitScan [11] detects the visual intention by counting the number of gazes entering and exiting a bounding box. If the count of gazes inside a bounding box is higher than the entering threshold, an event is started; if the count of gazes consecutively located outside the bounding box is higher than the exit threshold, the event is closed.
In gaze-based HRI, an important issue in referring to objects is the Midas Touch problem [13]. If the gaze dwell time fixating on an object is too short, a selection is activated even if that is not the human intention. To overcome this problem, an additional activation needs to be made to confirm the human intention. The activation can come from additional input devices [16,17], hand gestures [18,19] or eye gestures [20,21]. A common solution to overcome the Midas Touch problem is using a long gaze dwell time [14,22] to filter out involuntary fixations, whose durations rarely exceed 300 ms. Several HRI works have adopted this solution. In [23,24], the dwell time is set to 500 ms to activate the selection of AOIs. In [8,9], the 2D gazes obtained from eye-tracking glasses are projected into 3D with the help of an RGB-D camera. The fixation duration is set to two seconds to confirm the selection of an object in [8]. A total of 15 gaze points on the right side of the bounding box are used to determine the selection in [9]. In [21], gaze is used to control a drone: a remote eye-tracker captures the gazes on a screen, several zones with different commands are drawn on the screen to guide the drone, and dwell times from 300 ms to 500 ms are tested for selecting a command. In [25], a mobile eye-tracker and a manipulator are used to assist surgery in the operating theatre; a user can select an object by looking at it for four seconds. Extending the gaze dwell time to overcome the Midas Touch problem in HRI has proven valid regardless of the type of eye-tracking device and application scenario. However, a long dwell time increases the user's cognitive load [17]: the user must deliberately extend the gaze duration, which requires extra effort to maintain the fixation on the object or AOI. Furthermore, if a user fails to select an object, the long dwell time makes the process less efficient and further increases the user's effort. Such disadvantages are critical for users who need to use the device frequently, such as disabled people using gaze to control wheelchairs.
Dwell time has also been evaluated against other selection modalities. In [26], dwell time is compared with a clicker, on-device input, gesture and speech in a VR/AR application; dwell time is preferred as a hands-free modality. In [17], dwell time performs worse than the combination of dwell time and a single-stroke gaze gesture in a wheelchair application. When controlling a drone [21], using gaze gestures is more accurate than using dwell time, although it takes more time to issue a selection. Depending on the dwell times and applications, the results of these works differ. There is no rule of thumb for selecting the best modality; dwell time still has the potential to reduce user discomfort and increase efficiency, provided that the problems mentioned in Section 1 are overcome.
Our proposed approach uses EMD as the metric to measure the similarity of two distributions. EMD was first introduced into the computer vision field in [27,28]. The EMD distance was also used in image retrieval [27,29]. The information from image histograms was used to construct the image signatures P = {(p_1, w_{p_1}), ..., (p_m, w_{p_m})} and Q = {(q_1, w_{q_1}), ..., (q_n, w_{q_n})}, where p_i, w_{p_i}, m and q_j, w_{q_j}, n are the cluster means, weighting factors and numbers of clusters of the respective signatures. The distance matrix D holds the ground distances between P and Q, and the flow matrix F describes the cost of moving "mass" from P to Q. The EMD distance is the normalized optimal work for transferring the "mass". In [29], EMD is compared with other metrics, i.e., Histogram Intersection, Histogram Correlation, χ² statistics, Bhattacharyya distance and Kullback-Leibler (KL) divergence, for measuring image dissimilarity in color space. EMD had better performance than the other metrics. It was also shown that EMD can avoid saturation and maintain good linearity when the mean of the target distribution changes linearly.
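The linearity and saturation behaviors can be checked numerically. The sketch below is our own illustration (not an experiment from [29]); it uses SciPy's 1D `wasserstein_distance` as the EMD and compares it against histogram intersection as a shifted copy of a distribution moves away from the original:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5000)
bins = np.linspace(-5.0, 13.0, 60)

def hist_intersection(a, b):
    # Overlap of two normalized histograms: 1.0 = identical, 0.0 = disjoint.
    ha, _ = np.histogram(a, bins=bins)
    hb, _ = np.histogram(b, bins=bins)
    return np.minimum(ha / ha.sum(), hb / hb.sum()).sum()

shifts = [0.0, 2.0, 4.0, 8.0]
emd = [wasserstein_distance(reference, reference + s) for s in shifts]
overlap = [hist_intersection(reference, reference + s) for s in shifts]
# emd grows linearly with the shift; overlap collapses toward 0 and stays there.
```

Because EMD accounts for how far mass must move, it keeps discriminating after the two histograms stop overlapping, whereas bin-overlap metrics saturate.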
A similar work [30] also uses the EMD distance to compare the gaze scan path collected from the eye tracker and the gaze scan path generated from images. The main differences between their work and ours are: (i) in [30], the authors use a remote eye tracker to record gaze scan paths while the participants are watching images, whereas we use First Person View (FPV) images and gazes recorded by a head-mounted eye tracking device; (ii) we want to detect if the human intention is on a certain object, while their work focuses on generating the gaze scan path based on the image.

Methodology
Our methodology will be applied in the scenario described in Section 1. There are three objects, namely, a cup, scissors and a bottle, in the scene (Figure 1). All objects are placed on a table. The user can select one of the objects and a robotic manipulator can pick up the desired object. Figure 3 shows an overview of the GazeEMD visual intention detection system. We use a head-mounted eye tracking device that provides the world image I_w and gaze point g(x, y), where x and y are the coordinates in I_w. We first detect all the objects by feeding the world image I_w to the object detector. Then, we generate hypothetic gaze samples on the detected objects and compare them with the actual gazes obtained from the eye tracking device. Finally, the similarity score between the hypothetic gaze distribution and the actual gaze distribution is used to determine if the human visual intention is on an object.

Object Detection
We use the deep-learning-based object detector YOLOv2 [31] to detect the objects in our scene. The network is trained on the COCO dataset [32]. The YOLOv2 detector Y = [B, C] takes the world image I_w as input and predicts the bounding boxes B and class labels C.

For each detected object, we crop an image patch I_obj from I_w by the size of the object bounding box. From all pixels of I_obj, k pixels are sampled following a Gaussian distribution. The sampled pixels are interpreted as hypothetic gaze points. Next, we calculate the Euclidean distance between each of the k pixels and the center of the bounding box. This distance distribution is denoted as the hypothetic gaze distribution π_s. To form the actual gaze distribution, we collect k gaze points from the eye tracking device and calculate their Euclidean distances to the center of the bounding box. The resulting distance distribution is denoted as the actual gaze distribution π_g.
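The construction of the two distance distributions can be sketched as below. The paper only states that k pixels are sampled following a Gaussian distribution, so the spread (a quarter of the box size) and the clipping to the box are our assumptions:

```python
import numpy as np

def hypothetic_distances(bbox, k, rng):
    """Sample k hypothetic gaze points inside a bounding box (Gaussian around
    its center; the std choice, box_size/4, is our assumption) and return
    their Euclidean distances to the box center.
    bbox = (x_min, y_min, x_max, y_max) in world-image pixels."""
    cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
    sx, sy = (bbox[2] - bbox[0]) / 4, (bbox[3] - bbox[1]) / 4
    pts = rng.normal([cx, cy], [sx, sy], size=(k, 2))
    # Clip so every hypothetic gaze point stays inside the box.
    pts[:, 0] = np.clip(pts[:, 0], bbox[0], bbox[2])
    pts[:, 1] = np.clip(pts[:, 1], bbox[1], bbox[3])
    return np.linalg.norm(pts - [cx, cy], axis=1)

def actual_distances(gaze_points, bbox):
    """Distances from k actual gaze points to the bounding-box center."""
    cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
    return np.linalg.norm(np.asarray(gaze_points) - [cx, cy], axis=1)

rng = np.random.default_rng(2)
bbox = (200, 150, 360, 330)  # a detected object in a 640x480 frame
pi_s = hypothetic_distances(bbox, k=60, rng=rng)
pi_g = actual_distances(rng.normal([280, 240], 10, (60, 2)), bbox)
```

Both outputs are length-k distance samples, ready to be histogrammed and compared with EMD in the next step.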

Similarity between Distributions
EMD is used as the measure of the similarity between the distributions π_s and π_g. In order to use EMD, the distributions need to be transformed into signatures. We first calculate the geometric distance histograms H_s = {b_s^i}, i = 1, ..., m, for π_s and H_g = {b_g^j}, j = 1, ..., n, for π_g, where m and n are the numbers of bins. The range of the histogram values depends on the size of the world image; if the image size is 640 × 480, the maximum value will be 800 (the length of the image diagonal). The signatures s_s and s_g are calculated similarly to [29], where b_s and b_g are the bin values from H_s and H_g, and the weighting factors w_s and w_g are the middle values of the respective bin intervals [29]. The distance matrix D_sg = [d_ij] holds the ground distances between the bins of π_s and π_g, and the flow matrix F_sg = [f_ij] describes the cost of moving the "mass" from π_s to π_g. The work function is

WORK(F_sg, D_sg) = Σ_i Σ_j f_ij d_ij,

and the EMD distance is calculated as

EMD(π_s, π_g) = (Σ_i Σ_j f_ij d_ij) / (Σ_i Σ_j f_ij).

Then, the similarity score, i.e., the EMD distance, is used as a metric to determine whether the visual intention is on an object, given the detected object bounding boxes B and a set of consecutive gazes g. The EMD visual intention V_EMD(B, g) is calculated as

V_EMD(B, g) = C_i, if EMD_i < T_i; 0, otherwise,

where C_i and T_i are the label of the ith object and the threshold for the ith object, respectively. V_EMD(B, g) is the object a human is looking at; V_EMD(B, g) = 0 represents that the human is not looking at any object. The threshold T_i is required for the binary classification. We use the Receiver Operating Characteristic (ROC) curve to select appropriate threshold values for GazeEMD. The threshold T_i for the ith object is chosen as the candidate maximizing

TPR_j − FPR_j,

where TPR_j and FPR_j are the jth True Positive Rate and the jth False Positive Rate on the ROC curve.
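As an illustration under our assumptions (SciPy's 1D `wasserstein_distance` as the EMD solver, bin centers as signature positions, bin counts as weights, and placeholder threshold values), the similarity score and the decision rule can be sketched as:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_score(pi_s, pi_g, bins=10, hist_range=(0.0, 715.0)):
    """EMD between the hypothetic and actual distance distributions.
    Bin centers act as signature positions and bin counts as weights."""
    h_s, edges = np.histogram(pi_s, bins=bins, range=hist_range)
    h_g, _ = np.histogram(pi_g, bins=bins, range=hist_range)
    centers = (edges[:-1] + edges[1:]) / 2.0
    # Tiny epsilon keeps the weight sums positive even for empty histograms.
    return wasserstein_distance(centers, centers, h_s + 1e-9, h_g + 1e-9)

def emd_visual_intention(scores, labels, thresholds):
    """Return the label of the first object whose EMD score falls below its
    threshold; 0 means the human is not looking at any object."""
    for score, label, t in zip(scores, labels, thresholds):
        if score < t:
            return label
    return 0

rng = np.random.default_rng(3)
hyp = rng.uniform(0, 80, 60)     # hypothetic distances for one object
near = rng.uniform(0, 60, 60)    # actual gazes close to the box center
far = rng.uniform(400, 700, 60)  # actual gazes far from the object
s_near, s_far = emd_score(hyp, near), emd_score(hyp, far)
```

Gazes near the object produce a small EMD score that falls under the per-object threshold, while gazes far away produce a large score and the intention defaults to 0.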

Fixation-Based Method
The GazeEMD is compared with the fixation-based method and the HitScan. We describe the fixation-based method here and the HitScan in Section 3.4. The I-DT algorithm detects fixation events and calculates the fixation center f_p = IDT(t_d) for each detected fixation event, where the parameter t_d is the fixation duration. The rest of the gaze samples are all considered saccadic events. The fixation visual intention V_f is calculated as

V_f(B, f_p) = C_i, if f_p ∈ B_i; 0, otherwise, for i = 1, ..., n,

where n is the number of detected objects. If f_p is within the ith bounding box B_i, the intention is assigned to that object.
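The fixation-based decision reduces to a point-in-box test over the detected objects; a minimal sketch (the box format and example labels are illustrative):

```python
def point_in_box(p, box):
    """True when point p = (x, y) lies inside box = (x_min, y_min, x_max, y_max)."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]

def fixation_visual_intention(fixation_center, boxes, labels):
    """Assign the intention to the first object whose bounding box contains
    the fixation center; 0 means no object is attended."""
    for box, label in zip(boxes, labels):
        if point_in_box(fixation_center, box):
            return label
    return 0

boxes = [(0, 0, 100, 100), (200, 200, 300, 300)]
labels = ["cup", "bottle"]
```

This simplicity is also the method's weakness: a fixation center drifting just outside the box immediately flips the decision to 0, which is exactly the gaze drift error discussed in Section 1.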

HitScan with Run Length Filtering
We implement the HitScan with run length filtering approach proposed in [11]. For a given gaze sequence g and the bounding box B of one object, HitScan checks whether each gaze point g_i is located inside the bounding box:

H_i = 1, if g_i ∈ B; 0, otherwise.

The run length filter defines two constraints, T_1 and T_2. T_1 is the minimal number of consecutive gaze points located inside the bounding box; similarly, T_2 is the minimal number of consecutive gaze points located outside the bounding box. The run length filter uses T_1 and T_2 to define whether a "look" L is on an object. L is equivalent to our visual intention for single objects, and a set of gazes g_L forms a look when n ≥ T_1, where n is the length of g_L. A HitScan event consists of the gaze points in the look L. In the case of multiple objects, the visual intention V_hr is calculated by iterating the HitScan and run length filter over all objects.
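The following sketch shows our reading of the run length filter (it is not the reference implementation from [11]): a look opens after T_1 consecutive hits and closes after T_2 consecutive misses.

```python
def hitscan_looks(hits, t1, t2):
    """Run length filtering over a boolean 'hit' sequence (gaze inside the
    bounding box). A look opens after t1 consecutive hits and closes after
    t2 consecutive misses; returns (start, end) index pairs, end exclusive.
    This is our interpretation of the filter in [11]."""
    looks = []
    start = None  # index where the current look began
    hit_run = miss_run = 0
    for i, h in enumerate(hits):
        if h:
            hit_run += 1
            miss_run = 0
            if start is None and hit_run >= t1:
                start = i - t1 + 1
        else:
            hit_run = 0
            if start is not None:
                miss_run += 1
                if miss_run >= t2:
                    looks.append((start, i - t2 + 1))
                    start, miss_run = None, 0
    if start is not None:  # close a look that runs to the end of the data
        looks.append((start, len(hits) - miss_run))
    return looks
```

Short excursions outside the box (fewer than t2 misses) are absorbed into the ongoing look, which is why HitScan merges gazes into long events, as discussed in the results.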

Experiment
We conduct three experiments to evaluate the performance of our proposed algorithm: first, the performance on single objects; second, the case with multiple objects; and last, free viewing. In all experiments, three daily objects, i.e., a bottle, a cup and scissors, are used. Although we only test with three objects, GazeEMD can generalize to different objects, since the hypothetic gaze distribution is generated based on the size of the bounding box. The objects are placed on the table. The participant wears the eye-tracking glasses, sits next to the table and conducts the experiments.

Single Objects
In this experiment, each participant performs three experiment sessions. In each session, a different object is used. At the beginning of one session, the participants look at the object first and then look away from the object. During the "look away" period, the participants can freely look at any place in the scene except the object.

Multiple Objects
In this experiment, each participant performs one experiment session. In the session, all objects are placed on the table at the same time. Participants look at objects one by one, in order, and repeat the process several times. For instance, a participant looks at the scissors, bottle and cup sequentially, and then looks back at the scissors and performs the same sequence.

Free Viewing
The scene setup of this experiment is the same as in the Multiple Objects experiment; however, instead of looking at the objects in a sequence, the participants can freely look at anything, anywhere in the scene.

Data Collection and Annotation
We asked seven people to participate in the experiments. All participants were aged between 20 and 40, and all of them were researchers with backgrounds in engineering. All participated voluntarily. One of the participants had experience with eye tracking; the rest had no prior experience.
A researcher with eye-tracking knowledge and experience labelled the dataset. The annotator used the world image to label the data; the object bounding boxes and gaze points are drawn in the world images. All datapoints are annotated sample by sample. For the Single Objects experiment, an algorithm can be viewed as a binary classifier, i.e., whether the visual intention is on an object or not. The annotations are clear, since the phases of visual intention on and off the object are distinguishable. In the phase of looking at the object, even if gazes fall outside of the bounding box, they can be labeled as "intention on object". Conversely, in the phase of looking away from the object, all data can be labeled as "intention not on object". In the Multiple Objects and Free Viewing experiments, an algorithm acts as a multi-class classifier. If a participant looks at none of the objects, it is treated as a null class. The labeling in the Multiple Objects experiment is also clear, since the sequence of shifting visual intention between objects is distinguishable. During the free viewing period, the annotator labels the data subjectively, based on experience.

Implementation
We used Pupil Labs eye-tracking glasses [33] for eye tracking. The frame rate of the world image and the eye-tracking rate are both set to 60 fps. The YOLOv2 object detector is implemented by [34].
For the GazeEMD, we calculated the optimal thresholds for each object by Equation (5), using the data from the Single Objects experiments. The thresholds were also applied in the Multiple Objects and Free Viewing cases. The number of bins and the histogram range in GazeEMD are 10 and 715, respectively, in all experiments. The I-DT implementation for fixation detection is also provided by [33]; the dispersion value for the I-DT is three degrees in all experiments. The selection of the parameters T_1 and T_2 is described in Section 5.1.

Evaluation
For all experiments, we carry out sample-to-sample analysis and event analysis. We compare our algorithm with the fixation-based approach and the HitScan with run length filtering proposed in [11].
The Single Objects experiment serves three purposes. First, the optimal thresholds are determined. As described in Section 3.2.2, our algorithm needs a threshold value for the binary classification in GazeEMD. We first evaluated the performances with different thresholds with the data from the Single Objects experiment. For each object in the experiment, we selected one threshold for further evaluation and the threshold value is also used in the Multiple Objects experiment and the Free Viewing experiment. Second, we evaluated the sample-to-sample accuracy with different gaze lengths to show that GazeEMD can deal with the gaze drift error better than fixation and HitScan. Third, the event analysis with a long gaze length is equivalent to using a long dwell time to confirm the object selection. We evaluated the performance on the event level to show that GazeEMD is more efficient in confirming the selection.
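The threshold selection from the Single Objects data can be sketched as a scan over candidate thresholds that maximizes TPR − FPR (our reading of the ROC-based rule; the example scores are made up):

```python
import numpy as np

def roc_threshold(scores, labels):
    """Select the EMD threshold that maximizes TPR - FPR over the ROC
    operating points. labels[i] is True when the intention was on the
    object; lower EMD scores indicate 'on object'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = labels.sum(), (~labels).sum()
    best_t, best_j = None, -np.inf
    # Candidate thresholds: every observed score, plus one above the maximum.
    for t in np.append(np.unique(scores), scores.max() + 1.0):
        pred = scores < t
        tpr = (pred & labels).sum() / max(pos, 1)
        fpr = (pred & ~labels).sum() / max(neg, 1)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

# Hypothetical EMD scores from a Single Objects session: low scores while
# looking at the object, high scores while looking away.
on_obj = [12.0, 8.5, 20.1, 15.3]
off_obj = [310.0, 275.4, 402.8]
t = roc_threshold(on_obj + off_obj, [True] * 4 + [False] * 3)
```

With well-separated score distributions (as Table 1 suggests), the selected threshold cleanly splits the two phases.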
The Single Objects experiment setup is constrained: the participants are asked to look at the object and look away, and there is only one object in the scene. We conduct the same evaluation on the Multiple Objects and Free Viewing experiments to see whether GazeEMD can overcome the gaze drift error when there are fewer constraints.

Selection of Event Length
All algorithms are evaluated with three different event lengths: 90 ms, 1000 ms and 2000 ms. They correspond to a short, medium and long dwell time. The short dwell time is within the normal fixation duration [35], and the medium and long dwell times can be used to distinguish involuntary fixations. Since the algorithms detect events differently, it is not possible to set the event lengths exactly the same. We set the parameters of each algorithm so that they have approximately the same event length. The event length of our algorithm depends on the size of the distribution k, which is set to 5, 60 and 120, respectively. The fixation duration t_d determines the event lengths of the fixation-based approach and is set to 90 ms, 1000 ms and 2000 ms. For the HitScan with run length filtering, the parameter T_1 decides the event length; T_1 is sample-based and is thus also set to 5, 60 and 120, respectively. T_2 is set to 17 according to the experiment in [11].

Sample-to-Sample Analysis
The prediction of an algorithm contains a set of gaze points. For labeling the ground truth for sample-to-sample analysis, each gaze in the data is assigned a label. To analyze the results sample-wise, the algorithm predictions are compared with the ground truth sample by sample, i.e., the predicted label of each gaze point in one prediction is compared with the ground truth label of that gaze point. We use Cohen's Kappa to evaluate sample-to-sample accuracy instead of other commonly used metrics such as Precision-Recall and the F1 score. Cohen's Kappa measures the agreement between two sets of data: a value of zero means no agreement and a value of one means perfect agreement. Cohen's Kappa is commonly used to evaluate the agreement between different annotators, but it can also be used to compare the predictions of algorithms with the ground truth [4]. The Kappa score can then be interpreted as a measure of accuracy. As pointed out in [4], Cohen's Kappa is a better option than Precision-Recall and the F1 score when evaluating imbalanced data. In our experiment, we compare the results of the algorithms with annotated data. When determining visual intention using the fixation-based method, the data is imbalanced due to the high number of fixations; thus, using Cohen's Kappa gives a better understanding of the results.
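For reference, Cohen's Kappa on per-sample label sequences can be computed directly from the observed and chance agreements (a generic sketch, not tied to the paper's data):

```python
import numpy as np

def cohens_kappa(pred, truth):
    """Cohen's Kappa between per-sample predictions and ground-truth labels:
    (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined when the chance agreement is exactly 1."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    po = np.mean(pred == truth)                        # observed agreement
    pe = sum(np.mean(pred == c) * np.mean(truth == c)  # chance agreement
             for c in np.union1d(pred, truth))
    return (po - pe) / (1.0 - pe)
```

Unlike raw accuracy, the chance-agreement term discounts a classifier that trivially predicts the majority class, which is why Kappa behaves better on the imbalanced fixation data.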

Event Analysis
We first define the events for the event analysis. An event consists of a set of consecutive gaze points whose label is on the object. For the data in a session, two sets of events (the events in the predictions and the events in the ground truth) are obtained. The calculations of the predicted events of GazeEMD, the fixation-based approach and HitScan are described in Sections 3.2-3.4. The event analysis metrics are similar to the ones in [36,37]. We derive segments from the events in the predictions and the events in the ground truth and score the segments for evaluation. Segments are partitions of a sequence that have one-to-one relations between the events in the predictions and the ground truth: a segment is partitioned either by the boundary of an event in the prediction or by the boundary of an event in the ground truth. Figure 5 shows how segments are partitioned from the events.
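The segment partitioning and scoring for the binary, single-object case can be sketched as follows (our reading of the procedure in [36,37]; events are sample-index intervals):

```python
def segment_scores(pred_events, truth_events, length):
    """Score the segments partitioned by every event boundary in the
    predictions and the ground truth (binary, single-object case).
    Events are (start, end) sample-index pairs, end exclusive; returns
    the (TP, FP, TN, FN) segment counts."""
    bounds = {0, length}
    for s, e in pred_events + truth_events:
        bounds.update((s, e))
    bounds = sorted(bounds)

    def covered(t, events):
        return any(s <= t < e for s, e in events)

    tp = fp = tn = fn = 0
    for seg_start in bounds[:-1]:
        p = covered(seg_start, pred_events)
        g = covered(seg_start, truth_events)
        if p and g:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn
```

The event F1 then follows from the summed segment scores as 2·TP / (2·TP + FP + FN).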
The segments are scored with True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). In the single-object case, the segment scoring is binary. A TP segment means that, in this segment, the algorithm detects an intention event and there is also an intention event in the ground truth. An FP segment has an algorithm detection but no detection in the ground truth. TN means that neither the algorithm nor the ground truth has a detection in the segment, and FN means that the algorithm fails to detect the intention event in the ground truth. In the multiple-object case, all scoring is the same as in the single-object case, except for FN. An FN in the multiple-object case still holds the definition of the single-object case; in addition, if an algorithm detects an intention event but its label is not the same as in the ground truth, the segment is also an FN segment. Then, the scores of all segments are summed and we calculate the F1 score for the full sequence. We use F1 instead of Cohen's Kappa for the following reason: for the long gaze length (2000 ms), the numbers of predictions made by Fixation and HitScan are extremely low, and there are cases where it is not possible to calculate the Kappa score.

Single Objects

As defined in Equations (4) and (5), the GazeEMD acts as a binary classifier. We use the Area Under the Curve (AUC) score of the ROC curve to assess the binary classification performance. Table 1 shows the means and standard deviations of the AUC scores of all participants. As shown in the table, our algorithm performs almost perfectly on all objects with all gaze lengths, which implies that there are clear boundaries in the EMD scores. Figure 6 displays the EMD distances and Euclidean distances for the object bottle from one participant with different gaze lengths. The blue dots are the EMD distances between the gaze distribution and the hypothetic gaze distribution. The Euclidean distance is represented by red dots, showing the geometric distance between the actual gaze points and the center of the bounding box.
Figure A1a-c in Appendix A shows the distances for all tested objects. The phases of "looking at the object" and "looking away" can be observed from the figures. For instance, in Figure 6a, the participant switches from "looking at the object" to "looking away" at the 1570th sample. The EMD distance values can be easily separated between the two phases. With a longer gaze length, it is easier to distinguish whether a participant is looking at an object. Moreover, the EMD distances demonstrate a good correlation with the Euclidean distances, which means that, in addition to its use as a similarity score, the EMD value also conveys the geometric distance between gaze and bounding box, i.e., a higher EMD value means that the gazes in a gaze distribution are farther from the center of the bounding box. Table 2 shows the means and standard deviations of the Kappa scores in the sample-to-sample analysis. The Kappa scores suggest that our algorithm generally performs better than the other two algorithms. For all three objects with all gaze lengths, the GazeEMD has the highest mean Kappa scores, which are all above 0.9, and the lowest standard deviations. GazeEMD is less affected by the gaze length: as shown in Table 2, the mean Kappa scores for all objects are comparable across all gaze lengths. The fixation-based method is severely affected by the gaze length. For the object bottle, when the gaze length is increased from 90 ms to 1000 ms and 2000 ms, the mean Kappa score drops to 0.398 and 0.143; for the cup and scissors, it drops to 0.389, 0.13 and 0.539, 0.283, respectively. The mean Kappa of HitScan also decreases, to 0.657 and 0.559 for the bottle, 0.75 and 0.67 for the cup, and 0.794 and 0.733 for the scissors. For each object, a certain percentage of gaze points are located outside of the bounding box while the human intention is still on the object: 19.8% of the gazes when looking at the bottle are located outside the bounding box.
For the cup and scissors, the percentages are 11.2% and 8%. The analysis with the 90 ms gaze length shows that GazeEMD can better deal with the gaze drift error. The predictions with 90 ms assign sample labels on the basis of five samples, which is more precise than 1000 ms and 2000 ms. Although the mean Kappa of all algorithms decreases with more gaze drift error, GazeEMD has the highest mean Kappa and lowest standard deviation. This shows that GazeEMD is more accurate in the presence of the gaze drift error. In this experiment, the participants are asked to look at the object and then look away. The majority of gazes belong either to the period of looking at the object or to the period of looking away, and the gazes are easily distinguishable; the factor affecting the performance is the gaze drift error. The higher Kappa of GazeEMD indicates that it can better overcome the drift error compared to Fixation and HitScan. Table 3 shows the means and standard deviations of the event F1 scores. The GazeEMD has the best mean and the best standard deviation for all objects with all gaze lengths, and the effect of a longer gaze length is trivial. Similarly to the sample-to-sample analysis, the mean F1 scores of Fixation and HitScan decrease with the increase in gaze length. Table 4 shows the numbers of events detected by all algorithms; the detected events of all participants for each object and each gaze length are summed together. HitScan is not strongly affected by the gaze length: the number of detected events is close in all three gaze lengths. HitScan decides when an event is closed by checking whether the gaze remains outside of the bounding box for T_2 frames. This merges the gazes inside the bounding box into a long event, and thus the total number of events is lower. Furthermore, our algorithm detects more events than the fixation algorithm in all cases (Table 4); the fixation algorithm cannot capture the saccadic gaze movements, while the GazeEMD does.
The additional events are mostly detected during saccades. In addition, when a long dwell time is used to confirm the selection, more events and higher event accuracy mean that GazeEMD is more efficient than Fixation and HitScan. Fewer events in Fixation means that the remaining gazes are detected as saccade events. When a saccade event occurs inside the object bounding box, the fixation is interrupted and the participant needs to start a new fixation event in order to confirm the selection. For HitScan, the smaller number of events is caused by the algorithm itself: a HitScan event contains more gazes than a GazeEMD event, which means that it takes longer to detect an event. Fewer detected events and a lower F1 in Fixation and HitScan mean that a participant may need several attempts to confirm the selection.
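The EMD comparison between hypothetical gazes on an object and the actual gazes can be sketched as follows. This is a simplified illustration, not the authors' implementation: it sums per-axis 1D Wasserstein distances instead of computing a full 2D EMD, and the hypothetical gazes are drawn uniformly inside the bounding box (both are assumptions for brevity).

```python
import numpy as np
from scipy.stats import wasserstein_distance

def gaze_emd_score(gaze_points, bbox, n_hypo=200, rng=None):
    """Approximate EMD between actual gazes and hypothetical gazes on an object.

    gaze_points: (N, 2) array of gaze coordinates in image space.
    bbox: (x_min, y_min, x_max, y_max) of the detected object.
    Simplification: sum of per-axis 1D Wasserstein distances rather than
    a full 2D EMD (illustrative only).
    """
    rng = np.random.default_rng(rng)
    x_min, y_min, x_max, y_max = bbox
    # Hypothetical gazes: uniform samples inside the bounding box.
    hypo_x = rng.uniform(x_min, x_max, n_hypo)
    hypo_y = rng.uniform(y_min, y_max, n_hypo)
    gaze = np.asarray(gaze_points, dtype=float)
    return (wasserstein_distance(gaze[:, 0], hypo_x)
            + wasserstein_distance(gaze[:, 1], hypo_y))
```

A score below a gaze-length-specific threshold would then be interpreted as "looking at the object"; a gaze distribution centered far from the box yields a clearly larger score.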

Multiple Objects
In this experiment, we extend the evaluation from Single Objects to Multiple Objects. The means and standard deviations of the sample-to-sample Kappa are shown in Table 5. GazeEMD has the highest mean Kappa scores for all gaze lengths. It also has the lowest standard deviations for the gaze lengths of 90 ms and 1000 ms. Although the fixation-based method has the lowest standard deviation at the 2000 ms gaze length, the mean Kappa of GazeEMD is still 0.307 higher. In the Single Objects experiment, we showed that GazeEMD outperformed Fixation and HitScan in the presence of gaze drift error. With multiple objects in the scene, GazeEMD again has the highest mean Kappa, showing that it can also deal with gaze drift error in a multi-object scene.

The means and standard deviations of the event F1 are shown in Table 5. GazeEMD has the highest means and lowest standard deviations for all gaze lengths. The numbers of detected events are displayed in Table 6. HitScan detects considerably more events than in the single-object case when the gaze length is 90 ms: 1559 events in this experiment, compared to 25, 24 and 19 for the bottle, cup and scissors in the Single Objects experiment. The reason for this is that the participants keep switching their visual intention between different objects, so the exit parameter T2 closes events accordingly. When the gaze lengths are 1000 ms and 2000 ms, the enter parameter T1 is set to 1000 ms and 2000 ms, which means that a participant needs to look at an object for 1000 ms or 2000 ms to start an event. Thus, the numbers of detected events are 92 and 29, significantly lower than at the 90 ms gaze length. Overall, GazeEMD still has the highest number of detected events and the highest event F1, which indicates that the confirmation of object selection is also more efficient in the multiple-objects scenario.
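The enter/exit behavior of HitScan described above can be sketched as a run-length rule over per-frame hits. This is an illustrative reconstruction of the T1/T2 logic, not the authors' code; frame counts stand in for the millisecond parameters.

```python
def hitscan_events(hits, t_enter, t_exit):
    """Detect 'looking at object' events from a boolean hit sequence.

    hits: per-frame booleans, True if the gaze lands inside the bounding box.
    t_enter: consecutive hits needed to open an event (enter parameter T1).
    t_exit: consecutive misses needed to close an event (exit parameter T2).
    Returns a list of (start_frame, end_frame) events.
    """
    events = []
    in_event, run_in, run_out, start = False, 0, 0, 0
    for i, hit in enumerate(hits):
        if not in_event:
            if hit:
                run_in += 1
                if run_in == t_enter:      # enough hits: open an event
                    in_event, start, run_out = True, i - t_enter + 1, 0
            else:
                run_in = 0
        else:
            if hit:
                run_out = 0                # gaze back inside: event continues
            else:
                run_out += 1
                if run_out == t_exit:      # gaze stayed outside: close event
                    events.append((start, i - t_exit))
                    in_event, run_in = False, 0
    if in_event:
        events.append((start, len(hits) - 1))
    return events
```

Because brief excursions outside the box shorter than `t_exit` do not close the event, separate hit runs are merged into one long event, which is why HitScan reports fewer events overall.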
Free Viewing

Table 7 shows the means and standard deviations of the Kappa and F1 scores. In the sample-to-sample analysis, GazeEMD has the highest mean Kappa for all gaze lengths. HitScan has the lowest standard deviations at 90 ms and 1000 ms, and Fixation has the lowest standard deviation at 2000 ms. The sample-to-sample results show that GazeEMD deals with gaze drift error more accurately when the participants are not instructed. On the event level, GazeEMD has the highest mean F1 at 1000 ms, while HitScan has the highest mean F1 scores at 90 ms and 2000 ms. However, at the 2000 ms gaze length, the mean F1 of GazeEMD is 0.334, close to the 0.365 of HitScan. The event analysis does not represent the scenario of confirming a selection by a long dwell time, since the participants were freely looking at anything, without instruction to look at a particular object for a long time.

On both the sample-to-sample level and the event level, the Kappa and F1 scores in Free Viewing (Table 7) are lower than those in Multiple Objects (Table 5). One cause of this is ambiguity in the annotation. The annotations in Single Objects and Multiple Objects are clear, since whether the gaze is on an object is distinguishable. In Free Viewing, however, the gaze intention is not as clear as in the other two experiments. Especially when the gazes are close to the bounding boxes, it cannot be determined whether a participant is looking at the edge of an object or deliberately looking at the area around the object. Although the annotations of the gazes contain uncertainties, they create a scenario with noisier data; all algorithms face the same uncertainties, and we can see how they perform on the noisier data. The higher sample-level Kappa of GazeEMD shows that it performs better not only in the constrained experiments (clean data), but also in the experiment without constraints (noisier data). This demonstrates that GazeEMD can generalize well.
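The sample-to-sample agreement reported throughout (Tables 2, 5 and 7) is Cohen's kappa between the per-sample predictions of an algorithm and the annotated ground truth. A minimal self-contained sketch for binary labels (the example label sequences are hypothetical):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary label sequences of equal length.

    Here, 1 = 'looking at the object' and 0 = 'looking away'.
    """
    n = len(labels_a)
    # Observed agreement: fraction of samples where both sequences match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies.
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-sample ground truth and predictions:
ground_truth = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
predicted    = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(ground_truth, predicted)
```

Kappa corrects the raw per-sample accuracy for chance agreement, which matters here because "looking away" samples dominate long recordings.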
The numbers of detected events are displayed in Table 6. Similarly to the Multiple Objects experiment, the number of events detected by HitScan at the 90 ms gaze length is significantly higher than at 1000 ms and 2000 ms. The reason for this is described in Section 6.2.
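The event-level F1 scores can be computed by matching detected events against annotated ground-truth events. The matching rule below (temporal intersection-over-union above a threshold) is an assumption for illustration; the paper's exact matching criterion may differ.

```python
def event_f1(pred_events, true_events, min_iou=0.5):
    """Event-level F1 for lists of (start_frame, end_frame) intervals.

    A predicted event counts as a true positive when its temporal
    intersection-over-union with an unmatched ground-truth event
    reaches min_iou (hypothetical matching rule).
    """
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union

    matched, tp = set(), 0
    for p in pred_events:
        for j, t in enumerate(true_events):
            if j not in matched and iou(p, t) >= min_iou:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(pred_events) if pred_events else 0.0
    recall = tp / len(true_events) if true_events else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, an algorithm that merges several ground-truth events into one long event loses recall, which is consistent with the lower event F1 observed for HitScan at long gaze lengths.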

Discussion and Conclusions
In this work, we propose a new approach to determining visual intention for gaze-based HRI applications. More specifically, our algorithm, GazeEMD, determines which object a human is looking at by calculating the similarity between the hypothetical gaze points on the object and the actual gaze points acquired by mobile eye-tracking glasses. We evaluate our algorithm in different scenarios by conducting three experiments: Single Objects, Multiple Objects and Free Viewing. There are two constraints in Single Objects: the scene is rather clean, with only one object present at a time, and the participants are asked to look at the object first and then look away. We use this constrained setting for three reasons. First, it is easier to evaluate the performance of GazeEMD as a binary classifier and to select appropriate threshold values for different gaze lengths. Second, we can remove the noise from annotation to better evaluate the noise caused by gaze drift error and variations in bounding boxes. Finally, we can create sequences with medium and long gaze lengths (1000 ms and 2000 ms), which are essential for confirming the selection of an object by a long dwell time. Evaluating the long gaze lengths is equivalent to solving the Midas problem with a long dwell time in HRI applications.
The results demonstrate that GazeEMD performs excellently and can tolerate the gaze drift error. Tables 2 and 3 show that GazeEMD has the highest mean Kappa and F1 scores on both the sample-to-sample level and the event level. When the gaze length is 1000 ms or 2000 ms, its mean Kappa and F1 scores are significantly higher than those of the fixation-based method and HitScan. For the bottle, 19.8% of the gazes fall outside of the bounding box while the participants look at the object; GazeEMD nevertheless performs better than the fixation-based method and HitScan, which indicates that it handles gaze drift errors better. The same conclusion can be drawn for the cup and the scissors. We extend the evaluation from the single-object case to scenes with multiple objects and free viewing. GazeEMD still has higher Kappa and F1 scores (Tables 5 and 7) than Fixation and HitScan, except for the 90 ms and 2000 ms cases in the Free Viewing experiment, where its F1 scores are 0.055 and 0.031 lower than those of HitScan. Nevertheless, the results are still comparable in these two cases.
In many gaze-based HRI applications, a human needs to interact with objects. A common case is the selection of an object to be picked up by a robotic manipulator. One key issue in this kind of interaction is confirming the selection of an object: the robot knows which object a human is looking at, but, without additional information, it does not know whether he or she has confirmed the selection of a certain object, i.e., the Midas problem. One approach to this is looking at the intended object for a longer time, i.e., using a long dwell time. This scenario is equivalent to the 2000 ms case in the Single Objects experiment: if the gaze dwells on the object for 2000 ms, the object is considered selected and confirmed, and the robot can perform further steps. For the fixation-based approach, voluntarily increasing the fixation duration helps increase the number of successful confirmations, but it also increases the cognitive load [38]. Another downside of fixation is that it cannot capture the saccadic gazes located within the bounding box. This means that the fixation-based approach misses the information gained from the human gaze moving to different parts of the object. Such saccades interrupt a long fixation; hence, the human needs to try harder to select an object for interaction, which increases the cognitive load. An algorithm such as HitScan, which considers all gazes within the bounding box, could eliminate this problem. However, both the fixation-based approach and HitScan still cannot deal with the gaze drift problem, so more trials would potentially be needed to confirm the selection. GazeEMD can overcome these problems: it detects more events and has higher accuracy than Fixation and HitScan (Tables 3 and 4), which indicates that it achieves more successful confirmations of the selection than the other two methods.
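The dwell-time confirmation described above can be sketched as a simple run-length check over per-frame similarity scores. The helper and its parameters are hypothetical, for illustration only; lower scores are assumed to mean the gaze distribution is closer to the object.

```python
def confirm_selection(scores, threshold, dwell_frames):
    """Confirm object selection by a long dwell time.

    scores: per-frame GazeEMD-style similarity scores (lower = closer
            to the object's hypothetical gaze distribution).
    threshold: score below which a frame counts as 'on the object'.
    dwell_frames: consecutive on-object frames required, e.g. the number
                  of frames corresponding to a 2000 ms dwell.
    Returns the frame index at which the selection is confirmed, or None.
    """
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s < threshold else 0  # reset on any off-object frame
        if run >= dwell_frames:
            return i
    return None
```

Because GazeEMD tolerates gazes that drift slightly outside the bounding box, the run is less likely to be reset spuriously than with a hit-based or fixation-based criterion, which is what makes its confirmations more efficient.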
This is important in real applications where users constantly use gaze for interaction, such as disabled people who need wheelchairs and manipulators to help with daily life. GazeEMD also performs excellently with a short dwell time (90 ms). It can thus be applied to cases in which the gaze and the object need to be evaluated but interaction with the object is not required, such as analyzing gaze behavior during assembly tasks [39] or during the handover of an object between humans or between a human and a robot [40].
We propose using GazeEMD to detect whether the human intention is on an object or not. We compared GazeEMD with the fixation-based method and HitScan in three experiments. The results show that GazeEMD has a higher sample-to-sample accuracy. Since the experimental data contain gaze drift error, i.e., the intention is on the object while the gaze points are outside of the object bounding box, the higher accuracy of GazeEMD indicates that it can overcome the gaze drift error. In HRI applications, a human often needs to confirm the selection of an object so that the robot can perform further actions. The event analysis with long gaze lengths in the Single Objects experiment shows the effect of using a long dwell time for confirmation. The results show that GazeEMD has higher accuracy on the event level and detects more events, which indicates that it is more efficient than Fixation. The proposed method can currently detect the human intention in scenarios where the detected bounding boxes of the objects do not overlap. One future research direction could be further developing the algorithm so that the gaze intention can be detected correctly when two bounding boxes overlap.