An Investigation on the Feasibility of Uncalibrated and Unconstrained Gaze Tracking for Human Assistive Applications by Using Head Pose Estimation

This paper investigates the possibility of accurately detecting and tracking human gaze by using an unconstrained and noninvasive approach based on the head pose information extracted by an RGB-D device. The main advantages of the proposed solution are that it can operate in a totally unconstrained environment, it does not require any initial calibration and it can work in real time. These features make it suitable for assisting humans in everyday life (e.g., remote device control) or in specific actions (e.g., rehabilitation), and in general in all those applications where it is not possible to ask for user cooperation (e.g., when users with neurological impairments are involved). To evaluate gaze estimation accuracy, the proposed approach has been extensively tested and the results have been compared with the leading methods in the state of the art, which, in general, rely on strong constraints on people's movements, invasive or additional hardware and supervised pattern recognition modules. Experimental tests demonstrated that, in most cases, the errors in gaze estimation are comparable to those of the state-of-the-art methods, even though the proposed approach works without additional constraints, calibration or supervised learning.


Introduction
Gaze tracking plays a fundamental role in understanding human attention, feelings and desires [1]. Automatic gaze tracking has several applications in the fields of human-computer interaction (HCI) and human behavior analysis, and therefore several techniques and methods have been investigated in recent years. When a person is in the field of view of a static camera, gaze can give information about the focus of attention of the subject, enabling gaze-controlled interfaces for disabled people [2], driver attention monitoring [3], pilot training [4], provision of virtual eye contact in conferences [5] or the analysis of marketing strategies [6].
A survey of existing works and a detailed classification of extant methods can be found in [7]. Most gaze tracking methods are based on the Pupil Center Corneal Reflection (PCCR) technique [8][9][10][11][12][13][14]. They obtain the pose of the eye using the center of the pupil contour and the reflections (glints) produced on the corneal surface by point light sources, usually one or multiple infrared (IR) lights. Normally, this kind of approach is not appropriate for generic interactive applications, since a high-resolution camera is needed and a careful calibration is required for coupling the IR lights and the camera. Less invasive solutions that do not use IR are also available. The work of [15] proposes a method to achieve gaze estimation from multimodal Kinect data that is invariant to head pose, but it needs a learned person-specific 3D mesh model. In [16], after a one-time personal calibration, facial features are tracked and then used to estimate the 3D visual axis by means of a 3D geometrical model of the eye. The method requires accurate detection of the eye corners in order to create a complete 3D eye model. In [17], gaze tracking is performed using a stereo approach to detect the position and the orientation of the pupil in 3D space. A calibration procedure must be provided and, moreover, the reconstruction of the elliptic eye model cannot be well defined for all gaze orientations. A low-cost system for 2D eye gaze estimation with low-resolution webcam images is presented in [18]: a binary deformable eyeball template is modeled, and 2D gaze estimation is performed depending on the displacement in eye movements, after a rigid calibration procedure. A method that, after an initial calibration, tracks the motion of the user's eye and gaze with a single webcam in a simplified special case (i.e., when the face is still while the eyes move) is instead proposed in [19]. Valenti et al. [20] combine head pose and eye location information to accurately estimate the gaze track. Their results are suitable for several applications but, unfortunately, a calibration phase that makes use of a target plane is needed in order to get the reference positions from which the eye gaze direction is extracted.
All the aforesaid methods operate in constrained conditions (e.g., a very short range of head pose variations), and they need a learning phase in which manually labeled data are used to train one or more classifiers. Furthermore, a calibration phase is often needed to set up the parameters that map the real world onto the computational models embedded in the algorithmic procedures. The predominant idea behind the works in this research field has been that head pose information can supply only a rough estimation of the human gaze. In other words, only the area of interest of the person can be retrieved and, therefore, head pose considered alone is inadequate to obtain an accurate estimation of the gaze direction and to enable applications such as remote device control or rehabilitation. For example, in [21] the authors assert that the head pose contributes to about 70% of the visual gaze, and that focus of attention estimation based on head orientation alone can reach an average accuracy of 88.7% in a meeting application scenario. The influence that head orientation exerts on the perception of the eye-gaze direction is investigated in [22], where the authors conclude that an image-based mechanism is responsible for the influence of head profile on gaze perception, whereas the analysis of nose angle involves the processing of face features. In [23,24] head (and possibly body) pose information is used for estimating where a person is looking. In [25,26] the visual focus of attention is recognized by evaluating only head pose information. More recently, the authors in [27] introduced a scene-specific gaze estimator for visual surveillance: it models the interactions between head motion, walking direction and appearance in order to recover gaze directions.
However, in recent years this point of view has been changing, thanks to the new perspectives opened by the most advanced sensing technologies, which allow pose estimation to become more and more accurate. For example, [28] reports a pioneering attempt to control a mouse with head pose information extracted by a complex supervised algorithm working on 2D images: the experimental results qualitatively showed the promise of the algorithm. To overcome the drawbacks of the related works, an early study on the estimation of the visual focus of attention using fuzzy fusion of head rotations and eye gaze directions has recently been introduced in [29]. Two novel techniques to estimate head rotations, based on local and appearance information, are introduced and then fused in a common framework. However, this framework does not aim at inferring an exact gaze estimation; rather, it uses fuzzy logic to assign degrees of confidence to the hypotheses that a person is looking towards a specific point.
Unfortunately, no studies have yet been performed on the feasibility of an accurate gaze estimator based only on head pose information. To fill this gap, this work introduces an innovative approach that derives the exact position of the gaze ray from data acquired by a low-cost depth sensor. The proposed solution estimates the head pose of a subject freely moving in the environment, requiring only the presence of the head in the field of view of the sensor, in order to directly derive the 3D gaze ray. Neither a training nor a calibration phase is required to accomplish the gaze estimation task. This is another important contribution of this paper with respect to the leading approaches in the state of the art. Quantitative evaluations of the gaze estimation accuracy have been obtained through an extensive experimental phase, with several different distance ranges and different people with diverse levels of knowledge of the task being performed. The remaining sections of the paper are organized as follows: Section 2 discusses the methodological steps of the proposed approach, whereas the experimental setup is introduced in Section 3. Finally, the experimental outcomes are reported in Section 4 and discussed in Section 5.

Overview of the Proposed System
The proposed solution works as follows: the input data are acquired from a commercial depth sensor providing both RGB and depth data, which are the input for the following algorithmic steps. First of all, face detection is performed on the RGB image by matching appearance with predetermined models. The detected faces are then tracked over time using local features and topological information. The available depth information is then used to iteratively match the tracked points with a 3D point cloud representing the face geometry. The 3D head pose is finally estimated in terms of yaw, pitch and roll angles, and the gaze vector is computed as the vector whose origin is the midpoint between the two detected 3D positions of the eyes and whose direction is given by the estimated head pose. A block diagram of the overall system is shown in Figure 1.

Face Tracking and 3D Head Pose Estimation
The first algorithmic step performed on the acquired RGB images is aimed at detecting human faces. This is done by an approach that consists of three steps [30]. A linear pre-filter is first used to increase the detection speed. Then, a boosting chain [31] is applied to remove most of the non-faces from the candidate list and, finally, a color filter and an SVM filter are used to further reduce false alarms. When the system detects a face, characteristic points are identified and a parameterized face mask is automatically overlapped on the human face. After the first detection, it is then possible to track the detected face over time, thus reducing the computational load needed to process the input images. Detection of the characteristic points of the face and their temporal tracking are based on the Active Appearance Model (AAM) [32]. The AAM contains a statistical model of the shape and its representation as grey-level appearance. The core of the algorithm is the matching procedure, which finds the model parameters that minimize the difference between the given appearance and the synthesized model example projected into the image. In order to improve tracking performance, a temporal matching constraint and color information are included in the model, as suggested in [33]. The next step consists of building a 3D model of the detected face. This is done by the Iterative Closest Point (ICP) [34] technique, by which a 3D point cloud model is iteratively aligned with the available 2D facial features (target). At each iteration, the algorithm revises the transformation, i.e., the combination of translation and rotation, needed to minimize the distance between the model and the target. The 3D face model used is Candide-3 [35], a parameterized mask specifically developed for model-based coding of human faces. It allows fast reconstruction with small computing overhead, it is invariant to operating conditions and it does not depend on a specific person.
This model is based on 121 linked feature points, which are stored in a vector $\bar{g}$ containing their (x, y, z) coordinates. The model is reshaped by the equation:

$g_{t+1} = \bar{g} + S\sigma + A\alpha$ (1)

where $g_{t+1}$ is the updated vector, $S$ and $A$ are the Shape and Animation Units and $\sigma$, $\alpha$ contain the shape and animation parameters. When the distance between the Candide-3 model and the target face is minimized, the depth information of the 121 feature points is extracted from the available depth map and it represents the input of the head pose estimation block.
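The reshaping step can be illustrated with a minimal Python sketch. It assumes the usual linear Candide-3 form, in which the updated vector is the mean shape plus a weighted sum of Shape Units and Animation Units; the function name, the storage of each unit as a flat list and the toy two-vertex data are hypothetical and chosen only for illustration.

```python
def reshape_candide(g_bar, S, sigma, A, alpha):
    """Return g = g_bar + S*sigma + A*alpha for flat coordinate vectors.

    g_bar : list of 3N floats (mean shape, x/y/z interleaved)
    S     : list of shape units, each a list of 3N displacements
    sigma : list of shape parameters, one per shape unit
    A     : list of animation units, each a list of 3N displacements
    alpha : list of animation parameters, one per animation unit
    """
    if len(S) != len(sigma) or len(A) != len(alpha):
        raise ValueError("one parameter is needed per unit")
    g = list(g_bar)
    # Add every unit, scaled by its parameter, to the mean shape.
    for unit, weight in list(zip(S, sigma)) + list(zip(A, alpha)):
        if len(unit) != len(g):
            raise ValueError("unit length must match the model vector")
        for i, displacement in enumerate(unit):
            g[i] += weight * displacement
    return g


# Toy model with two vertices (hypothetical data, not the real mask).
g_bar = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
S = [[0.0, 0.0, 0.0, 0.5, 0.0, 0.0]]    # hypothetical shape unit: widen along x
A = [[0.0, -1.0, 0.0, 0.0, -1.0, 0.0]]  # hypothetical animation unit: lower both points
g = reshape_candide(g_bar, S, [2.0], A, [0.1])
```

Because the model is linear in the parameters, fitting it to a tracked face reduces to adjusting the weights in `sigma` and `alpha`, which is one reason why the reconstruction is cheap.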
The head pose estimation supplies the rotation angles, in terms of yaw, pitch and roll, and the translations, in meters, which in this paper are expressed with respect to the center of the sensor. Head pose estimation is a problem with 6 Degrees of Freedom (DoF), and it can be represented with the parameter vector $p = [\omega_x, \omega_y, \omega_z, t_x, t_y, t_z]$, where $\omega_x, \omega_y, \omega_z$ are the rotation parameters and $t_x, t_y, t_z$ are the translation parameters. They define the 3-DoF rotation matrix $R_{3\times3}$ and the 3-DoF translation vector $T_{3\times1} = [t_x, t_y, t_z]^T$. The rigid motion of a head point $X = [x, y, z, 1]^T$ between time $t$ and time $t+1$ is $X(t+1) = M \cdot X(t)$, where $M$ is the transformation matrix defined in [36]. Let point $X(t)$ be projected on the image plane in $u = [u_x, u_y]^T$. The perspective projection function can then be written explicitly in terms of the rigid motion parameters and of the coordinates of the point at $t+1$, with $f_L$ denoting the focal length. In order to fuse rotation and translation information into the Candide-3 model, Equation (1) is modified by applying the rotation, the translation and a scale factor $s$ to the model, and the parameter vector $p$ is extended accordingly. At the end, the position of the user's head is expressed in world coordinates X, Y and Z, which are reported in a right-handed coordinate system with the origin at the sensor, Z pointing towards the user and Y pointing up. Figure 2 shows the 3D mask overlapped on the 2D facial image in three different frames. From the figure it is possible to observe that the face tracker also works in the presence of non-frontal views.
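A possible explicit form of the rigid motion and of the perspective projection described above is sketched below. It assumes the small-rotation (twist) approximation that is common in model-based head tracking; whether [36] defines $M$ in exactly this form is an assumption, so the block should be read as an illustration consistent with the symbols $\omega_x, \omega_y, \omega_z, t_x, t_y, t_z$ and $f_L$, not as the original equations.

```latex
% Assumed small-rotation (twist) form; not confirmed by the source text.
X(t+1) = M \, X(t), \qquad
M \approx
\begin{bmatrix}
 1         & -\omega_z &  \omega_y & t_x \\
 \omega_z  &  1        & -\omega_x & t_y \\
-\omega_y  &  \omega_x &  1        & t_z \\
 0         &  0        &  0        & 1
\end{bmatrix}

% Perspective projection of the moved point, with focal length f_L.
u(t+1) = \frac{f_L}{-x\,\omega_y + y\,\omega_x + z + t_z}
\begin{bmatrix}
 x - y\,\omega_z + z\,\omega_y + t_x \\
 x\,\omega_z + y - z\,\omega_x + t_y
\end{bmatrix}
```

Under this assumption, the numerators and the denominator of the projection are the first three rows of $M \, X(t)$, so the image motion of a point depends linearly on the six pose parameters up to the perspective division.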

Gaze Estimation
The gaze estimation step is based on the 3D model coming from the previous blocks. It geometrically models the gaze ray direction with regard to the 3D position of the sensor, and thus it can be categorized as a model-based method. First of all, the 3D positions of the eye centers are extracted from the current position of the overlapped 3D face model. After that, in order to define a point on the face from which the computed gaze vector takes its origin, a conventional point in the middle of the segment connecting the 3D eye center positions is taken. This point approximately corresponds to the nose septum and it is used as the origin of the gaze track. Note that small occlusions are handled by the system, and that the eye center point is always estimated when the overlapping of the model with the face succeeds. Moreover, in this way it is not necessary to use a precise pupil detector, since the computed point on the face is enough to completely solve the geometric problem. At this point, exploiting the available head pose information, the direction of the gaze track is derived from the angles $\omega_x$ and $\omega_y$ of the vector $p$, corresponding respectively to pitch and yaw. Then the intersection of the gaze track with a vertical plane, parallel to the image plane of the sensor, is computed. In practice, the intersection point is computed separately for the x and y axes. Figure 3 shows the procedure along the x axis, which is described in the following. The depth sensor provides the length of the segment AC as the component $t_z$ of the translation vector $T$. It follows that, knowing a side and an angle, the right-angled triangle ABC can be completely solved. In particular, $AB = AC / \cos\omega_y$ and $BC = \sqrt{AB^2 - AC^2}$.
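The per-axis triangle solution can be sketched in a few lines of Python. The sketch assumes that AC is the perpendicular distance between the gaze origin and the plane, that the angle is given in radians, and that a positive angle moves the intersection point towards the positive axis direction; the function name and the sign convention are hypothetical and not taken from the paper.

```python
import math


def gaze_offset_on_plane(depth, angle_rad):
    """Solve the right-angled triangle ABC for one axis.

    depth     : length of AC (the t_z component), e.g. in meters
    angle_rad : head rotation for this axis (yaw for x, pitch for y)
    Returns (AB, BC): the gaze-ray length and the signed offset of the
    intersection point on the plane (sign convention is an assumption).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if abs(angle_rad) >= math.pi / 2:
        raise ValueError("the gaze ray never reaches the plane")
    ab = depth / math.cos(angle_rad)               # hypotenuse AB
    bc = math.sqrt(max(ab * ab - depth * depth, 0.0))  # leg BC on the plane
    return ab, math.copysign(bc, angle_rad)


# Example: user at 1.5 m from the plane, head turned by 10 degrees.
ab, bc = gaze_offset_on_plane(1.5, math.radians(10.0))
```

The same function can be called once with the yaw angle and once with the pitch angle to obtain the two coordinates of the intersection point; note that BC computed this way equals `depth * tan(angle)`, which is a convenient cross-check.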
Using the same coordinate system, it is also possible to compute the Cartesian equation of the gaze ray as the straight line passing through the points $A = (x_A, y_A, z_A)$ and $B = (x_B, y_B, z_B)$, expressed as:

$\frac{x - x_A}{x_B - x_A} = \frac{y - y_A}{y_B - y_A} = \frac{z - z_A}{z_B - z_A}$

with $z_A = 0$ for the particular plane under consideration.
In the case of translations along the x and y axes, the translation vector can be algebraically added to the computed value, in order to translate the gaze vector to the right position. Finally, in order to represent on a monitor the actual intersection point between the gaze vector and the plane, the world coordinates are normalized to the image plane, where $(X, Y)$ and $(x, y)$ are the world and image plane coordinates, respectively; $L$, $R$, $T$, $B$ are the left, right, top and bottom bounds of the considered user's field of view; $I_x$ and $I_y$ are the width and height of the display area on the monitor (in pixels).
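The normalization to the display area can be illustrated with the following Python sketch. It assumes a plain linear mapping between the field-of-view bounds and the pixel range, with pixel (0, 0) at the top-left corner of the display; the orientation of the y axis, the function name and the parameter names are assumptions, since the exact mapping is not recoverable from the text.

```python
def world_to_pixel(X, Y, left, right, top, bottom, width_px, height_px):
    """Map a point on the world plane to display coordinates.

    (X, Y)                 : intersection point in world coordinates
    left/right/top/bottom  : bounds of the user's field of view (L, R, T, B)
    width_px, height_px    : size of the display area (I_x, I_y)
    Assumes pixel (0, 0) is the top-left corner, so y grows as Y decreases.
    """
    if right == left or top == bottom:
        raise ValueError("field-of-view bounds must span a non-zero range")
    x = (X - left) / (right - left) * width_px
    y = (top - Y) / (top - bottom) * height_px
    return x, y


# Example: a 2 m x 2 m field of view centered on the sensor, 640 x 480 display.
center = world_to_pixel(0.0, 0.0, -1.0, 1.0, 1.0, -1.0, 640, 480)
```

Any head translation along x and y would be added to the world coordinates before this mapping is applied, so that the pointer follows the translated gaze vector.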

Experimental Setup
The experimental setup was defined as follows: a Microsoft Kinect device was used as the depth sensor and it was positioned at a height of 150 cm from the ground. Behind the sensor, a square panel (2 m per side) was positioned and 15 circular markers were stuck on it. The markers were distributed on three rows, 5 markers on each row, at a distance of 50 cm from each other. As shown in Figure 4, the markers were divided into subsets, from P1 to P5, in order to group together points at the same distance from the sensor along the x axis, the y axis or both, while P0 corresponds to the depth sensor position. For example, P4 contains the points at a distance of 1 meter from the sensor along the x axis and aligned with it along the y axis, and so on. The depth sensor was placed in correspondence with the marker at the centre of the panel. A scheme of the experimental setup illustrating the three different user positions is shown in Figure 7. Notice that the above operating ranges allow the user to freely move the head in all directions and, in particular, unlike most of the state-of-the-art methods, which make use of the Viola-Jones face detector [31], large rolling movements can also be handled.

Experimental Results
The proposed method was tested with nine different persons. In line with similar works in this research field (for example [38]), and in order to get a comprehensive study, the persons were divided into three groups of three persons each. The first group was composed of experienced persons, i.e., persons who knew how the system works and who had already tried the system before the test session. The second group was composed of persons who were trying the system for the first time but who had been informed about how the system works. In light of that knowledge, it was more probable that they would move the head even when sequentially pointing at close markers on the panel. Finally, the last group consisted of unaware people, who were just placed in front of the sensor and asked to point towards the markers. No constraints were given to the participants in terms of eyeglasses, beard or hairstyle and, in order to allow for in-the-wild settings, no panel or uniform background color was put behind the participants. These three experimental benchmarks made it possible to verify the system's accuracy in relation to different levels of user awareness, which may be encountered in different application contexts.
The experiment was carried out as follows for all three groups: the persons were asked to look at each of the markers on the panel, in a predefined order. The gaze direction relative to a given marker was the one estimated by the system when the person confirmed, by oral feedback, that the marker was his or her point of regard. The estimated gaze direction was projected onto the panel and then compared with manually computed ground truth data. The errors were therefore measured as the distances between the estimated intersection points and the ground truth data. The errors were expressed both in centimeters and as the difference between the angles described by the estimated and ground truth rays.
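The two error measures can be sketched in Python as follows. The sketch assumes that the panel lies on the plane z = 0, that all coordinates are expressed in centimeters in the same reference frame and that the angular error is taken as the single angle between the estimated and the ground truth rays leaving the gaze origin; the function name and these conventions are hypothetical, and per-axis (yaw and pitch) errors would be computed analogously from the x and y components alone.

```python
import math


def pointing_errors(estimated, truth, head):
    """Compare an estimated intersection point with its ground truth marker.

    estimated, truth : (x, y) points on the panel plane z = 0, in cm
    head             : (x, y, z) gaze origin in the same frame, in cm
    Returns (distance_cm, angle_deg) between the two rays leaving `head`.
    """
    distance = math.dist(estimated, truth)
    # Rays from the gaze origin to the two points on the panel.
    ray_est = (estimated[0] - head[0], estimated[1] - head[1], -head[2])
    ray_gt = (truth[0] - head[0], truth[1] - head[1], -head[2])
    dot = sum(a * b for a, b in zip(ray_est, ray_gt))
    norm = math.hypot(*ray_est) * math.hypot(*ray_gt)
    if norm == 0:
        raise ValueError("head must not coincide with a panel point")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return distance, math.degrees(math.acos(cos_angle))


# Example: user 150 cm from the panel, estimate 10 cm to the right of the marker.
dist_cm, angle_deg = pointing_errors((10.0, 0.0), (0.0, 0.0), (0.0, 0.0, 150.0))
```

Reporting both values is useful because the same angular error corresponds to a larger distance on the panel as the user moves away from it.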
The outcomes of the experimental tests are shown in Tables 1-3. The first column reports the labels of the markers under consideration (see Section 3 for label assignment), whereas the second column shows the tested distances, i.e., 70 cm, 150 cm and 250 cm. Errors were computed separately for each group and for the yaw and pitch angles. The standard deviation of the error, in degrees, is also reported in the tables. Note that "n.a." stands for not available data, corresponding to a failed overlapping of the Candide-3 model on the current face. From Table 1, it can be observed that the results for the first group of persons were very accurate: the average error was at most about 3 degrees, encountered when the persons stood at 70 cm from the panel, with a standard deviation of the error of about 1.5 degrees in both directions. This proves that the proposed system is well suited for those application contexts where the users can be trained to make the best use of its functionalities, e.g., remote control of a device's cursor in cases of physical impairments. Table 2 demonstrates that the system reported encouraging results also on the second benchmark of persons, i.e., the informed ones: the average errors slightly increased, compared with those achieved for group 1, to about 4.5 degrees, and some peaks in the standard deviation occurred because the corresponding person was slow to become familiar with the system. The results reported for this group demonstrate that the proposed system can be exploited in all those application contexts where users can be preliminarily informed about the right way to interact (e.g., for gaming purposes, where a quick information session generally precedes the start of the game).
Finally, from Table 3, which reports the results for the third group of persons, it is possible to derive that the errors still remained under 12 degrees (often around 5 degrees), with a standard deviation ranging from 2 to 6 degrees. In our opinion, this is a very interesting result, considering that the persons in this group were completely unaware of how the system works. This demonstrates that the proposed system can be exploited also in those applications where it is not possible to constrain the user behavior, e.g., in assistive applications involving persons with neurological impairments or in audience measurement.
Notice also that the gaze estimation accuracy, independently of the benchmark under investigation, remained satisfactory even when the user's distance from the depth sensor increased (in particular, the errors encountered at a distance of 150 cm were very encouraging). To our knowledge, this paper represents the first investigation of the performance of gaze estimation in such challenging conditions: all the state-of-the-art methods, in fact, report the accuracy achieved at a distance of less than 1 m between user and target. In light of this, in order to evaluate the accuracy of the proposed method with respect to the leading approaches in the literature, a further experiment was performed, reproducing the testing scenario most commonly used in the literature. In that scenario the users are placed in front of a screen, at a distance in the range [54 cm, 67 cm], and they are asked to look at a list of points on the screen.
Table 4 reports the errors in estimating the gaze direction for the proposed approach and for the leading methods mentioned in Section 1. Here we point out again that all the compared approaches make use of supervised algorithms and/or invasive devices and/or a calibration phase and/or they limit the head movements. From the table, it can be seen that, in the case of experienced or informed users, only the state-of-the-art methods making use of additional training and/or calibration phases outperform the proposed approach in accuracy. In the case of unaware users, the observed accuracy is slightly lower than that of some of the compared approaches, but it remains well suited for most of the main application domains of gaze tracking systems (e.g., remote rehabilitation, healthcare, monitoring, audience measurement, etc.). This is a very encouraging result that opens a new way to deal with the gaze estimation issue: it demonstrates that new sensing technologies combined with robust head pose estimation algorithms can lead to a relaxation of the environmental constraints and to a simplification of the algorithmic steps involved in gaze estimation approaches.
During the above experimental phases, a further evaluation was carried out on the users in groups 1 and 2 in order to check the actual possibility of using the proposed system to remotely control a device. After each experimental session, each person was asked to look at a screen and to try to control the mouse pointer by gaze, and finally to give feedback about usability and familiarity. All of the participants felt comfortable and were able to easily use the system as a control device.
Concerning computational aspects, the algorithms were implemented in the Microsoft Visual C++ development environment. Running on an Ultrabook with an Intel i3 CPU @ 1.8 GHz and 4 GB of RAM, with RGB and depth images acquired at a resolution of 640 × 480 and 30 fps, the system was able to work in real time during the experimental phase.

Conclusions
This work presented an investigation on the feasibility of a gaze estimation system that works in an unconstrained and noninvasive environment and that does not require any additional hardware (such as IR light sources, wearable devices, etc.). The proposed solution makes use of a low-cost commercial depth sensor and it estimates head pose information by combining RGB and depth data. The method has been tested with both trained and untrained persons in an unconstrained setting, and the errors have been quantitatively measured. In addition, it has been carefully compared with the leading approaches in the state of the art, showing that the errors are comparable even though the compared approaches work well only under particular conditions and/or if specialized (sometimes invasive) hardware is available. For this reason, the proposed approach is more suitable for most of the main application domains of unconstrained gaze tracking systems, such as remote rehabilitation, therapy supply or ambient assisted living. Another advantage of the proposed solution is that it makes use of commercial hardware and no calibration phase is required. This could make it usable also by non-expert users, so that it can become a technological support in related research fields, for example for studying social human behaviors or in socially assistive robotics during human-robot interaction. Furthermore, we are aware that in particular application domains, e.g., for studying particular neurological diseases or for realizing hands-free control of mobile devices, it is also indispensable to combine head pose information with an accurate pupil center locator; this investigation will be the subject of our future research work.

Author Contributions
Dario Cazzato conceived the proposed approach, designed the architecture and prepared the manuscript. Marco Leo carried out the experiments, processed and interpreted results and contributed to the figures and manuscript preparation. Cosimo Distante co-designed the experimental strategies, contributed to the interpretation and discussion of results at all stages, and critically edited the manuscript. They all read and approved the final draft.