Highly Accurate and Fully Automatic 3D Head Pose Estimation and Eye Gaze Estimation Using RGB-D Sensors and 3D Morphable Models

This work addresses the problem of automatic head pose estimation and its application in 3D gaze estimation using low quality RGB-D sensors without any subject cooperation or manual intervention. Previous works on 3D head pose estimation using RGB-D sensors require either an offline step for supervised learning or 3D head model construction, which may require manual intervention or subject cooperation for complete head model reconstruction. In this paper, we propose a 3D pose estimator based on low quality depth data, which is not limited by any of the aforementioned steps. Instead, the proposed technique relies on modeling the subject’s face in 3D rather than the complete head, which, in turn, relaxes all of the constraints in the previous works. The proposed method is robust, highly accurate and fully automatic. Moreover, it does not need any offline step. Unlike some of the previous works, the method only uses depth data for pose estimation. The experimental results on the Biwi head pose database confirm the efficiency of our algorithm in handling large pose variations and partial occlusion. We also evaluated the performance of our algorithm on the EYEDIAP database for 3D head pose and eye gaze estimation.


Introduction
Head pose estimation is a key step in understanding human behavior and can have different interpretations depending on the context. From the computer vision point of view, head pose estimation is the task of inferring the direction of head from digital images or range data compared to the imaging sensor coordinate system. In the literature, the head is assumed to be a rigid object with three degrees of freedom, i.e., the head pose estimation is expressed in terms of yaw, roll and pitch. Generally, the previous works on head pose estimation can be divided into two categories: (i) the methods based on 2D images; and (ii) depth data [1]. The pose estimators based on 2D images generally require some pre-processing steps to translate the pixel-based representation of the head into some direction cues. Several challenges such as camera distortion, projective geometry, lighting or changes in facial expression exist in 2D image-based head pose estimators. A comprehensive study of pose estimation is given in [1] and the reader can refer to this reference for more details on the literature.
Unlike the 2D pose estimators, the systems based on 3D range data or their combination with 2D images have demonstrated very good performance in the literature [2][3][4][5][6][7]. While most of the work on 3D pose estimation in the literature is based on non-consumer level sensors [8][9][10], recent advances in production of consumer level RGB-D sensors such as the Microsoft Kinect or the Asus Xtion has

Related Work on 3D Gaze Estimation Using RGB-D Sensors
Based on the context, the term Gaze Estimation can be interpreted as one of the following closely related concepts: (i) 3D Line of Sight (LoS); (ii) 3D Line of Gaze (LoG); and (iii) Point of Regard (PoR). Within the eyeball coordinate system, the LoG is simply the optical axis, while the LoS is the ray passing through the fovea and the eyeball rotation center. The PoR is a 3D point in the scene to which the LoS points. Figure 1 demonstrates a simplified schematic of the human eye with the LoS, LoG and PoR. In the literature, non-intrusive gaze estimation approaches generally fall into one of the following categories: (i) feature-based approaches; and (ii) appearance-based approaches [14]. Feature-based approaches extract eye-specific features such as the eye corners, eye contour, limbus, iris and pupil. These features may be aggregated with the reflections of an external light setup on the eye (called glints or Purkinje images) to infer the gaze. These methods are generally divided into two categories: (i) model-based (geometric); and (ii) interpolation-based.
Model based methods rely on the geometry of the eye. These methods directly calculate the point of regard by calculating the gaze direction (LoS) first. Next, the intersection of the gaze direction and the nearest object of the scene (e.g., a monitor in many applications) generates the point of regard. Most of the model-based approaches require some prior knowledge such as the camera calibration or the global geometric model of the external lighting setup [15][16][17][18][19][20].
Unlike the model-based methods, the interpolation-based approaches do not perform an explicit calculation of the LoS. Instead, they rely on a training session based on interpolation (i.e., a regression problem in a supervised learning context). In these methods, the feature vector between the pupil center and the corneal glint is mapped to the corresponding gaze coordinates on a frontal screen. The interpolation problem is formalized using a parametric mapping function such as a polynomial transformation function. The function is used later to estimate the PoR on the screen during the testing session. A calibration board may be used during the training session [21][22][23][24][25][26][27]. The main challenge with the interpolation-based approaches is that they cannot handle head pose movements [14]. Notice that feature-based approaches in general need high resolution images to precisely extract the eye-specific features as well as the glints. Moreover, they may require external lighting setups, which are not ubiquitous. This motivates researchers to train appearance-based gaze estimators, which rely on low quality eye images (a holistic approach instead of a feature-based one). However, appearance-based approaches are generally less accurate.
As opposed to feature-based approaches, appearance-based methods do not rely on eye-specific features. Instead, they learn a one-to-one mapping from the eye appearance (i.e., the entire eye image) to the gaze vector. Appearance-based methods do not require camera calibration or any prior knowledge of the scene geometry. The reason is that the mapping is made directly on the eye image, which makes these methods suitable for gaze estimation from low resolution images, but with less accuracy. In this respect, they share some similarities with interpolation-based approaches. Similar to the interpolation-based methods, appearance-based methods do not handle head pose.
Baluja and Pomerleau [28] first used the interpolation concept from image content to the screen coordinate. Their method is based on training a neural network. However, their method requires more than 2000 training samples. To reduce such a large number of training examples, Tan et al. [29] proposed using a linear interpolation and reconstructed a test sample from the local appearance manifold within the training data. By exploiting this topological information of eye appearance, the authors reduced the training samples to 252. Later, Lu et al. [30] proposed a similar approach to exploit topological information encoded in the two-dimensional space of gaze space. To further reduce the number of the training data, Williams et al. [31] proposed a semi-supervised sparse Gaussian process regression method S3GP. Note that most of these methods assumed a fixed head pose. Alternatively, some other researchers used head mounted setups, but these methods are no longer non-intrusive [32,33].
With the main intention of designing a gaze estimator robust to head pose, Funes and Odobez [7,34] proposed the first model-based pose estimator by building a subject-specific model-based face tracker using Iterative Closest Point (ICP) and 3D Morphable Models. Their system is not only able to estimate the pose, but is also able to track the face and stabilize it. A major limitation of their method is the offline step for subject-specific 3D head model reconstruction. For this purpose, they manually placed landmarks (eye corners, eyebrows, mouth corners) on the RGB image of the subject, and consequently added an extra term to the cost function in their ICP formulation. In other words, their ICP formulation is supported by a manual term. Moreover, the user has to cooperate with the system and turn their head from left to right. More recently, the authors proposed an updated version of their system [11] that does not need manual intervention.

Contribution of the Proposed Approach
Unlike [2,3,7,34], our proposed system does not require any commercial system to learn a subject's head or any offline step. A key contribution of our approach is a method to automatically learn a subject's 3D face rather than the entire 3D head. Consequently, we no longer need the subject's cooperation (i.e., turning their head from left to right), which is required by previous model-based pose estimation systems. In addition, unlike [7], our system does not require any manual intervention for model reconstruction. Instead, we rely on Haar features and boosting for facial feature detection, which, in turn, is used for face model construction. Note that we use only one RGB frame for model reconstruction. The tracking step is based on depth frames only. After learning a subject's face, the pose estimation task is performed by a fully automatic, user non-cooperative and generic ICP formulation without any manual term. Our ICP formulation is robustified with Tukey functions in tracking mode. Thanks to the Tukey functions, our method successfully tracks a subject's face in challenging scenarios. The outline of the paper is as follows: The method details are explained in Section 2. Afterwards, the experimental results are discussed in Section 3. Finally, the conclusions are drawn in Section 4.

Method Details
Our method consists of four key steps: (i) geometry processing of a generic face model and the first depth frame; (ii) generic face model initialization (i.e., model positioning at the location of the head); (iii) subject-specific face model construction by morphing the initialized generic model; and (iv) tracking the face in the next depth frames using the subject-specific face model. In our proposed system, we only model the face of the subject rather than the entire head, which, in turn, helps us to design a very robust, accurate and non-cooperative head tracking and pose estimation system. To accomplish this goal, a generic model is positioned on the subject's face in the depth data. Next, it learns the subject's face and finally starts to track it. Both positioning a generic model on the subject's face and tracking it in the depth data are accomplished using an ICP-based technique. However, the ICP registration used for positioning faces a major challenge: the generic model is a model of a complete head and not only of the face. On the other hand, the depth data contain not only the face of the subject, but also other body parts such as the torso, as well as the background (Figure 2). Note that a major difficulty with ICP is its sensitivity to outliers and missing data between two 3D point clouds. To tackle this problem in model initialization, we perform geometry processing, which is explained next.

Geometry Processing
In this step, the goal is to trim the depth data and the generic model to remove spurious data and outliers. Note that we perform this step on the first depth frame only, because the model should be initialized (positioned) at the location of the subject's head before tracking starts. For this purpose, we capture the entire environment using the Kinect. Next, we filter out the spurious points and keep only the region of interest in the first depth frame, i.e., the facial surface; the first depth frame is automatically trimmed to discard the residual data. To this end, we need to automatically detect and localize the facial features (i.e., eyes, nose, and mouth) on the point cloud to determine how the depth data should be trimmed. Detecting the facial features directly from a noisy depth frame is challenging. Fortunately, the Kinect also provides the first RGB frame. Thus, the face and facial features are first detected on the first RGB frame using Haar features and boosting [35]. Figure 3 demonstrates an example of the face and facial feature detection. As some false detections may occur, the next step is to reject them automatically. This is accomplished by utilizing prior knowledge about the structure of a face and the relative positions of the eyes, nose and mouth on a detected face. After the features are detected on the first RGB frame, their 3D loci are determined on the first depth frame through back projection using the Kinect calibration data. To trim the depth data, a 3D plane passing through the 3D coordinates of the eyes and mouth is defined and shifted by an offset equal to the distance between the left and right eye. The shifted plane is called the cropping plane. Next, the depth data beneath the plane are discarded. Figure 4 shows the 3D loci of the facial features on the corresponding depth data trimmed by the cropping plane.
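The cropping-plane step can be illustrated with a minimal sketch. It assumes that the 3D eye and mouth positions are already available from back projection, that the plane is shifted away from the sensor so that the facial surface is retained, and that "beneath" means the side facing away from the sensor. The function name `crop_with_plane` and the `toward_camera` argument are hypothetical and chosen for illustration only.

```python
import numpy as np

def crop_with_plane(points, left_eye, right_eye, mouth, toward_camera):
    """Keep the points lying on the sensor side of the shifted eye-mouth plane.

    points: (N, 3) point cloud; left_eye, right_eye, mouth: 3D landmarks;
    toward_camera: any vector pointing roughly from the face to the sensor
    (an assumption used only to orient the plane normal).
    """
    points = np.asarray(points, dtype=float)
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)
    mouth = np.asarray(mouth, dtype=float)

    # Plane through the three facial landmarks.
    normal = np.cross(right_eye - left_eye, mouth - left_eye)
    normal /= np.linalg.norm(normal)
    # Orient the normal toward the sensor.
    if np.dot(normal, toward_camera) < 0:
        normal = -normal

    # Shift the plane away from the sensor by the inter-eye distance.
    offset = np.linalg.norm(right_eye - left_eye)
    plane_point = left_eye - offset * normal

    # Discard everything behind (beneath) the cropping plane.
    signed_dist = (points - plane_point) @ normal
    return points[signed_dist >= 0.0]
```

With a sensor looking along +z and a face at roughly z = 1 m, points on the nose and cheeks are kept, while the torso and background, which lie further from the sensor than the shifted plane, are removed.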
Once the subject's face is captured and trimmed in 3D, the next step is to construct a model which simulates the subject's face (rather than the complete head). The type of 3D model we use to simulate the subject's identity belongs to a family of Active Appearance Models (AAMs) called 3D Morphable models. Using these models, a subject's 3D head scan can be reconstructed by adding a set of weighted principal components (PCs) to the mean shape (the mean shape is the mean of the face scans of all 200 subjects in the database). Here, we first focus on the mean shape of the model. Similar to trimming the depth data, the mean shape of the 3D Morphable model is trimmed to facilitate subject-specific model construction through registration. In this context, a plane similar to that in Figure 4 is fitted to the model's mean shape. Once the mean shape is trimmed, it should be scaled to the size of the subject's face in 3D space. To this end, the model is scaled so that the distance between the left and right eyes of the model equals that of the subject's face scan (i.e., the first depth frame). Figure 5 demonstrates the mean shape of the model, m, before and after trimming.
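The reconstruction and scaling described above can be sketched as follows. This is an illustrative sketch, not the Basel Face Model loader: the array layout (a flattened mean shape of length 3V and a 3V × K matrix of PCs), the landmark indices and the function names are assumptions, and any per-component scaling by the shape variance is omitted.

```python
import numpy as np

def reconstruct_shape(mean_shape, shape_pcs, weights):
    """Mean shape plus weighted PCs.

    mean_shape: (3V,), shape_pcs: (3V, K), weights: (K,).
    Returns the (V, 3) vertex positions.
    """
    flat = mean_shape + shape_pcs @ weights
    return flat.reshape(-1, 3)

def scale_to_subject(vertices, left_eye_idx, right_eye_idx, subject_eye_distance):
    """Scale the model so its inter-eye distance matches the subject's."""
    model_distance = np.linalg.norm(vertices[left_eye_idx] - vertices[right_eye_idx])
    return vertices * (subject_eye_distance / model_distance)
```

Setting all weights to zero returns the mean shape, which is the starting point used for positioning before the subject-specific weights are estimated.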

Generic Model Positioning
After processing both model and depth data, the trimmed model is positioned on the face of the subject using rigid ICP.
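As a rough illustration of rigid ICP, the sketch below alternates closest-point matching with a closed-form rigid fit (the SVD-based Kabsch solution). It is a plain point-to-point variant with brute-force matching, chosen for brevity; it is not the energy minimized in this paper, and the function names are hypothetical.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t approximates dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the SVD solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

def rigid_icp(source, target, iterations=30):
    """Align source (N, 3) to target (M, 3); returns the accumulated R, t."""
    R_total, t_total = np.eye(3), np.zeros(3)
    moved = source.copy()
    for _ in range(iterations):
        # Closest target point for every source point (brute force).
        d2 = ((moved[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
        matches = target[d2.argmin(axis=1)]
        R, t = best_rigid_transform(moved, matches)
        moved = moved @ R.T + t
        # Compose the new update with the transform found so far.
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

The brute-force distance matrix is only practical for small point clouds; a spatial index such as a k-d tree would normally replace it.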

Learning and Modeling the Subject's Face
Capturing the subject's facial shape variations by morphing the mean shape is the main objective of this step. This problem can be considered as finding the weights of the shape PCs in the 3D Morphable model, where each weight describes the contribution of its corresponding PC in simulating a subject's face. Similar to the generic ICP problem, this part can also be described as the minimization of a cost function. Thus, we can unify both the ICP and PCA terms into a single equation and reformulate a more generic ICP problem as the minimization of the energy function of Equation (1) [36], where $Y$ is the target surface in $\mathbb{R}^3$, $X$ is the source surface, and $Z$ is a deformed version of $X$ which should be aligned with $Y$. Notice also that $C_Y(z_i)$ is the closest point on the target surface to the point $z_i$ ($i = 1, 2, \ldots, n$, where $n$ is the number of points in the source). In this equation, the first term is the point-to-plane matching error, the second term is the point-to-point matching error, and the third term is the model error (for more details about these errors, the reader is referred to [36]). The energy function can be minimized by linearizing Equation (1) and iteratively solving the resulting linear system, where $t$ is the iteration index, $z_i^0 = x_i$, $d$ contains the weights of the PCs, and $\tilde{R}$ and $\tilde{t}$ are the linear updates that we obtain for the rotation ($R$) and translation ($t$) matrices, respectively, at each iteration. Notice that $n_i$ is the normal to the surface at the point $C_Y(z_i^t)$, which is used in the point-to-plane matching error. For more details, the reader is referred to the tutorial by Bouaziz and Pauly [36].
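For orientation, an energy with the three terms described above could be written as in the sketch below, in the style of [36]. The weights $w_1, w_2, w_3$, the per-vertex block $P_i$ of the shape PCs, and the exact form of the model term are assumptions made for illustration; they are not taken from this paper.

```latex
E(Z, R, t, d) =
    w_1 \sum_{i=1}^{n} \Bigl( n_i^{\top} \bigl( z_i - C_Y(z_i) \bigr) \Bigr)^{2}
  + w_2 \sum_{i=1}^{n} \bigl\| z_i - C_Y(z_i) \bigr\|_2^{2}
  + w_3 \sum_{i=1}^{n} \bigl\| z_i - \bigl( R\,(x_i + P_i d) + t \bigr) \bigr\|_2^{2}
```

Under these assumptions, the first two terms pull the deformed surface $Z$ toward the target $Y$, while the third keeps $Z$ close to a rigidly transformed instance of the morphable model, so that the rigid pose $(R, t)$ and the PC weights $d$ are estimated together.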

3D Head Tracking and Pose Estimation
Once the model is constructed from the first depth frame, the pose (orientation alone) of the head can be calculated directly from the rotation matrix, R, in terms of roll, pitch and yaw [37]. In Section 2.3, the rotation matrix corresponding to the first depth frame of the subject is obtained during model construction. A question arises here: How can one obtain the rotation matrices for the next depth frames? This question is addressed by a 3D registration of the form of Equation (1), with some differences. The first difference is that we no longer need to capture the subject's face variations, d, because they are calculated only once for the entire procedure. Thus, the model term, $E_{\text{model}}$, is dropped from Equation (1). The other difference is that we no longer need to trim the next depth frames. The reason is that the model is already fitted to the first depth frame during model construction (see Figure 6) and we expect the system to work in tracking mode. In tracking mode, the head displacement between the current frame and the next frame is small, so the model displacement should be very small compared to the initialization mode. Thus, instead of trimming the next depth frames, one can take advantage of registration using Tukey functions, which filter out bad correspondences with large distances. The pose estimation procedure for the next depth frames is as follows: for the second depth frame, the model rotation and translation increments are calculated relative to those of the first depth frame. Next, the rotation and translation matrices for the second depth frame are obtained by applying the updates to the rotation and translation matrices of the first depth frame. This procedure is continued for the next frames. For each frame, the head pose can be directly calculated from the rotation matrix in terms of pitch, yaw and roll.
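The two operations described above, reading the angles off a rotation matrix and applying a per-frame increment to the previous pose, can be sketched as follows. The axis convention R = Rz(roll) Ry(yaw) Rx(pitch) and the left-multiplication of the increment are assumptions made for illustration; the convention of [37] or of a given dataset may differ.

```python
import numpy as np

def euler_to_matrix(pitch, yaw, roll):
    """Build R = Rz(roll) @ Ry(yaw) @ Rx(pitch); angles in radians."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def matrix_to_euler(R):
    """Recover (pitch, yaw, roll) under the same convention."""
    pitch = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
    roll = np.arctan2(R[1, 0], R[0, 0])
    return pitch, yaw, roll

def apply_increment(R_prev, t_prev, R_inc, t_inc):
    """Compose a frame-to-frame update with the previous pose:
    x -> R_inc @ (R_prev @ x + t_prev) + t_inc."""
    return R_inc @ R_prev, R_inc @ t_prev + t_inc
```

Near a yaw of ±90 degrees this parameterization becomes degenerate (gimbal lock), so the recovered pitch and roll are not unique there.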

Robustness of Registration to Outliers
As mentioned, partial overlap between the source and the target, and outliers in the data, are the most challenging problems in registration through ICP [38]. Two types of outliers exist: (i) outliers in the source point cloud; and (ii) outliers in the target point cloud. Discarding unreliable correspondences between the source and the target is the most common way to handle this problem. In Section 2.2, this goal was accomplished by trimming both the model and the depth data in the first depth frame. However, for the 3D face tracking mode, the same method cannot be used because the initialization is based on the detection of facial features. Applying facial feature detection to each frame can decrease the frame rate at which the system operates. On the other hand, a limitation of our method is that the system cannot start from an extreme pose, as the facial feature detection algorithms will fail. Fortunately, as the model is already positioned onto the face of the subject, we no longer need to trim the upcoming depth frames to perform tracking using ICP. Instead, we use Tukey functions to robustify the ICP. Tukey functions assign less weight to the bad correspondences and decrease or remove their effect on the energy function.
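The effect of a Tukey function on correspondence weights can be seen in a short sketch. It uses the standard Tukey biweight form; the cutoff `c` is a tuning parameter whose value is not specified here, and the function name is hypothetical.

```python
import numpy as np

def tukey_weights(residuals, c):
    """Tukey biweight: (1 - (r/c)^2)^2 for |r| <= c, and 0 beyond the cutoff c."""
    r = np.abs(np.asarray(residuals, dtype=float)) / c
    return np.where(r <= 1.0, (1.0 - r ** 2) ** 2, 0.0)
```

Correspondences with a small distance keep a weight close to 1, the weight decays smoothly as the distance grows, and any correspondence further than `c` is removed from the energy entirely. A smaller `c` gives a narrower function and a stricter rejection of distant correspondences.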

Robustness of Registration to Extreme Pose
A question may arise at this point: Can the method handle the case of a face with an extreme pose, where most facial parts cannot be sensed by the Kinect? In this case, the model should be registered with a partial point cloud of the subject's face. This increases the number of points in the source (the trimmed model) without good correspondences in the target (the partial point cloud of the face). As a result, such points will form bad correspondences with relatively large Euclidean distances. Fortunately, we also address this problem by using robust functions in the tracking mode, as robust functions discard or decrease the effect of such bad correspondences on the energy function. To clarify this, notice that bad correspondences inherently produce large Euclidean distances, while this is not the case for good correspondences. On the other hand, narrow robust functions act as low pass filters and discard the bad correspondences.

Robustness of Registration to Facial Expression Changes
As we use rigid ICP, facial expression changes may be considered as a challenging factor. In this context, Funes and Odobez [7] used a mask and only considered the upper part of the face in the rigid registration part of their system. Notice that we do not use such a mask. The reason is that, most of the time, the subject may not show significant facial expression changes (such as laughing or opening the mouth). On the other hand, relying on more data of the face may result in a more robust registration task. The problem becomes more challenging if we consider that self-occlusion may occur on the upper part of the face. For these reasons, we prefer not to use a mask. Instead, we rely on the robust Tukey functions to improve the robustness in the case of facial expression changes.

Head Pose Stabilization
This step is a pre-processing step for gaze estimation. First, we assume that the extrinsic parameters from the camera coordinates to the 3D depth sensor coordinates are known. As soon as the head pose is calculated, the texture of the corresponding RGB frame can be back-projected to a pose free 3D head model. This pose free head model can be visualized from any direction (e.g., see Figure 7b,c), but the ideal direction is the frontal view. The next step is to crop the eye appearances from this frontal view (e.g., see Figure 7d).

Gaze Estimation: Point of Regard
In this paper, gaze estimation refers to estimation of the point of regard (PoR). Our gaze estimation system consists of two parts: (i) training; and (ii) testing. The goal of the training part is to learn the parameters of an interpolation function which maps the pose-free eye appearance to the gaze vectors. In the training phase, the gaze vectors are known, because a computer screen is used to serve as a calibration pattern. At each time step, a moving target is displayed on the screen, where the subject's eye locations are known thanks to the model-based head tracker (Section 2.4).
On the other hand, the goal of the testing phase is estimating the PoR for a new eye-appearance. As mentioned previously, the PoR is the intersection of the line of sight with an object in the world coordinate system (WCS). The gaze estimation problem can be divided into four parts: (i) head pose estimation in the world coordinate system; (ii) line of sight estimation in the head coordinate system; (iii) line of sight calculation in the world coordinate system; and (iv) intersecting the line of sight with the object (i.e., a display in our case) in the WCS.

Training
In this section, we want to train an appearance-based gaze estimator to calculate a one-to-one mapping between the pose-free eye appearance (i.e., the eye image) and its corresponding gaze vector in the head coordinate system. As an example, Figure 8 demonstrates a pose-free eye appearance which looks directly to the front (i.e., a 0 degree gaze vector). Let us assume that [(I_1, G_1), (I_2, G_2), ..., (I_N, G_N)] are the training data, where I_i and G_i (i = 1, 2, ..., N) stand for the eye appearance images and their corresponding gaze vectors, respectively. It is possible to train an interpolation-based gaze estimator using these N training examples. For any new test appearance I_test, it is then possible to estimate the gaze G_test by interpolation. Note that, in appearance-based gaze estimators, the G_i (i = 1, 2, ..., N) vectors are calculated using a calibration pattern [7,30,34]. In this paper, the same methods explained in [7,30,34] are used for interpolation: (i) a K-NN based approach; and (ii) Adaptive Linear Regression (ALR), which is based on manifold learning. The reader is referred to [7,30,34] for further details.

Head Pose Estimation in the WCS
This step is explained in Section 2.4. As a subject-specific model is tracking the head in the depth frames, it is possible to precisely track the location of the eyes in the WCS. On the other hand, in geometry, a line can be uniquely defined by a point and a vector. Thus, the eye locations given by the head tracker can be used to calculate the line of sight passing through them. In general, two goals are accomplished in this part: (i) estimating the pose (i.e., both position and orientation) of the head and eyes; and (ii) head pose stabilization (see Section 2.4).

LoS Calculation in the WCS
The LoS is estimated in the head coordinate system. On the other hand, the head is tracked using the head tracker. Thus, we can calculate the rotation and translation of the head coordinate system in the WCS. Consequently, we can re-express the line of sight in the WCS.
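Re-expressing the LoS in the WCS amounts to applying the head pose to the eye position and the head rotation to the gaze direction. The sketch below assumes that the head pose (R_head, t_head) maps head coordinates to world coordinates; the function and argument names are hypothetical.

```python
import numpy as np

def los_to_world(R_head, t_head, eye_center_head, gaze_dir_head):
    """Map a line of sight from head coordinates to world coordinates.

    The eye center is a point, so it is rotated and translated; the gaze
    direction is a vector, so it is only rotated. Returns the ray origin
    and the unit direction in the WCS.
    """
    origin_w = R_head @ eye_center_head + t_head
    direction_w = R_head @ gaze_dir_head
    return origin_w, direction_w / np.linalg.norm(direction_w)
```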

PoR Calculation
Once the LoS is calculated in WCS, we can intersect it with the computer screen, which is a plane in the WCS (the position of the computer screen is known).
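The intersection itself is a standard ray-plane computation. In the sketch below the screen is described by one point on it and its normal, both in the WCS; these inputs and the function name are assumptions made for illustration.

```python
import numpy as np

def intersect_ray_plane(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the point where the ray origin + s * direction (s >= 0) meets
    the plane, or None if the ray is parallel to it or points away from it."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    plane_point = np.asarray(plane_point, dtype=float)
    plane_normal = np.asarray(plane_normal, dtype=float)

    denom = np.dot(plane_normal, direction)
    if abs(denom) < eps:
        return None  # line of sight is parallel to the screen
    s = np.dot(plane_normal, plane_point - origin) / denom
    if s < 0:
        return None  # screen lies behind the eye
    return origin + s * direction
```

The returned 3D point still has to be expressed in screen coordinates (for example, pixels) using the known position and size of the display.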

Experimental Evaluation
In this section, we report our empirical evaluation. We start by describing the datasets used in our experiments, follow this with an explanation of the evaluation protocol, and finish with a report of the results and their discussion.

3D Basel Face Model (BFM)
The 3D Basel Face Model (BFM) is a Morphable model calculated from registered 3D scans of 100 male and 100 female faces. The model geometry consists of 53,490 3D vertices connected by 160,470 triangles. The model is given by the following:
• The mean shape
• The 199 principal components (PCs) of shape, obtained by applying PCA to the facial shapes of the 200 subjects in the database
• The variance of shape
• The mesh topology
• The mean texture
• The 199 principal components (PCs) of texture, obtained by applying PCA to the facial textures of the 200 subjects in the database
• The texture variance
Any unknown face can be explained as a linear combination of the principal components and the mean shape/texture. In this paper, we only use the shape data (i.e., the shape principal components together with the mean shape) for the construction of a subject-specific face model (i.e., the head tracker).

Biwi Kinect Head Pose Database
We used the Biwi Kinect Head Pose Database [2,3] to evaluate the effectiveness of our method for the following reasons. Firstly, to the best of our knowledge, it is the only RGB-D database for pose estimation reported in the literature. Secondly, it provides ground truth data for comparison, and we wanted to make our results directly comparable not only to those of Fanelli et al. [2,3], but also to the recent works that have used this database. The dataset contains over 15,000 depth frames and RGB images of 20 people, six females and fourteen males, where four people were recorded twice. The head pose ranges through about 75 degrees of yaw and 60 degrees of pitch. The ground truth for head rotation is provided by commercial software.

EYEDIAP Database
We used the EYEDIAP gaze database [39] to evaluate the effectiveness of the gaze estimation part of our method for the following reasons. Firstly, to our knowledge, it is the only Kinect-based database for gaze estimation in the literature. Secondly, it provides ground truth data for comparison of gaze estimation (but not pose) results. In addition, we wanted to make our results statistically comparable to the work of Funes and Odobez [7,34]. The dataset contains over 4450 depth frames and RGB images of 16 people, among whom 14 subjects participated in a screen-based gaze estimation scenario. Each session is itself divided into two further sessions, in which the subject was asked to keep the head stationary or to move it.

Subject Specific Model Construction
We evaluated the proposed algorithm in a setting in which the first RGB frame and the first depth frame were used for learning in an unsupervised context, while the other depth frames were used for testing. Figure 11 demonstrates the registration procedure. In this figure, the blue point cloud is the (down sampled) mean shape of the BASEL data, while the red point cloud is the trimmed depth scan of the first subject in the Biwi database (the subject in Figure 3). Notice that trimming the model is not shown here, but it is considered in calculations. We wanted to register the two shapes with each other and, at the same time, capture the variation of the subject's face by minimizing Equation (1).

Pose Estimation and Tracking
After the subject's specific model was constructed from the first (trimmed) depth frame, we dropped the model term from the energy function and continued the registration of the model and depth data. The pose estimation for each frame could be directly calculated from the rotation matrix, R, obtained from registration. Figure 12a shows a sample where the model (red) was registered with the depth data (blue). The model was superimposed on the corresponding RGB frame through the Kinect calibration data (Figure 12b) for a better visualization. A summary of the key evaluation results and method features of the proposed algorithm compared to two previous works is shown in Table 1. Three criteria were considered to compare the systems according to Pauly [40]:
• Accuracy: The 3D head tracker should estimate the head pose with a high precision compared to a ground truth. Note that this ground truth was generated by applying a third-party commercial software on a public dataset, on which we performed our experiments. Thus, the term ground truth is used just to be consistent with the literature and be able to compare with the previous works. In theory, any pose estimator including the commercial software should have some inaccuracy.
• Robustness: The 3D head tracker should be able to perform well under poor lighting conditions, fast motion of the head and partial occlusion.
• Usability in real scenarios: User-specific training, system calibration and manual intervention need to be kept to a minimum. The tracker should be non-invasive. This will make the tracker widely accepted by the users. Ideally, the system should be calibration-free.
Table 1. A summary of the key evaluation results and method features of the proposed algorithm and the two previous works based on supervised learning by Fanelli et al. [2,3]. Notice that we do not compare the results with those of Funes and Odobez [7] for face pose estimation due to the lack of details in yaw, roll and pitch.
Legend: , very good; , good; , weak.

Method               Pitch        Yaw           Roll          Accuracy  Robustness  Usability
Our Proposed Method  0.1 ± 6.7°   0.25 ± 8.7°   0.26 ± 9.3°
1st report [2]       8.5 ± 9.9°   8.9 ± 13.0°   7.9 ± 8.3°
2nd report [3]       5.2 ± 7.7°   6.6 ± 12.6°   6.0 ± 7.
On the one hand, our system demonstrates better results than [2,3] in terms of average error and standard deviation. Both systems proposed by Fanelli et al. show slightly better performance than our system only in terms of the standard deviation of roll. Moreover, our system is generic and the training phase is performed with a single RGB/depth frame. On the other hand, the systems of Fanelli et al. can work on a frame-by-frame basis, while our system can only work in tracking mode (i.e., the subject's head motion in successive frames should be small). Notice that both systems of Fanelli et al. need a training phase supported by a commercial face tracker, while we propose a new face tracker in this work. As both systems of Fanelli et al. require a training phase based on positive and negative patches cropped from a database of 20 subjects, the generic aspect of their system is an issue. We also compared our system to other non-model-based approaches [8,10,13,41], and the results show the effectiveness of the proposed system. The only comparable system is the model-based work of [11], which shows very good precision too.

Gaze Estimation
Adaptive Linear Regression (ALR)
The results of testing the algorithm on the stationary head session of the EYEDIAP database with 13 subjects are summarized in Tables 2 and 3 for the left eye and the right eye, respectively, while Tables 4 and 5 show the same results for the moving head session.

K-Nearest Neighbor
We used 40 training eye images from the IDIAP database and represented the eye appearances as 15D feature vectors. These feature vectors are given to a K-NN regressor. Similar to Funes Mora and Odobez [7], we chose K = 5. The steps of K-NN regression are as follows:
1. Given a test image, find the K closest sample images, which form its neighborhood.
2. Compute a set of weights: invert the distances of the K images from the test image and normalize them.
3. Use these weights to interpolate the gaze parameters of the K neighbors to obtain the estimated parameters for the test gazing point.
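The three steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the Euclidean distance, the small constant `eps` that guards against division by zero, and the function name `knn_regress` are assumptions.

```python
import math


def knn_regress(train_features, train_targets, query, k=5, eps=1e-12):
    """Inverse-distance-weighted K-NN regression.

    train_features: list of feature vectors (e.g., 15D eye-appearance features)
    train_targets:  list of gaze-parameter vectors, one per training sample
    query:          feature vector of the test image
    """
    # Step 1: find the K training samples closest to the query.
    dists = [math.dist(f, query) for f in train_features]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]

    # Step 2: invert the distances and normalize them so they sum to one.
    inv = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(inv)
    weights = [w / total for w in inv]

    # Step 3: interpolate the neighbors' gaze parameters with these weights.
    dim = len(train_targets[0])
    return [sum(w * train_targets[i][d] for w, i in zip(weights, nearest))
            for d in range(dim)]


if __name__ == "__main__":
    # Toy 1D features with 2D targets; the query lies midway between two samples.
    feats = [[0.0], [2.0], [10.0]]
    targets = [[0.0, 0.0], [2.0, 4.0], [100.0, 100.0]]
    print(knn_regress(feats, targets, [1.0], k=2))
```

With inverse-distance weights, a training sample that coincides with the test image dominates the estimate, while equally distant neighbors contribute equally.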
The results of testing the algorithm on the stationary head session of the IDIAP database with 13 subjects are summarized in Tables 6 and 7 for the left eye and the right eye, respectively, while Tables 8 and 9 show the same results for the moving head session.

Summaries of the key evaluation results and method features of the proposed algorithm compared to the previous work are shown in Tables 10-13. Three criteria were considered to compare the systems:

• Accuracy compared to the ground truth data
• Robustness to occlusions, bad lighting and fast motions
• Usability in real scenarios, i.e., user-specific training, system calibration and manual intervention need to be kept to a minimum

Table 10. A summary of the key evaluation results and method features of the proposed algorithm and the previous work when Adaptive Linear Regression (ALR) is used and the subjects keep the head stationary: , very good; , good; , weak.

                          Left Eye   Right Eye   Accuracy  Robustness  Usability
Our Proposed Method       7.55°      6.89°
Funes and Odobez Method   9.73°      10.5°

Table 11. A summary of the key evaluation results and method features of the proposed algorithm and the previous work when K-NN is used and the subject keeps the head stationary: , very good; , good; , weak.

Our system demonstrates better results than that of Funes and Odobez in terms of average gaze estimation error. One possible reason is the high precision of our pose estimation system, which performs similarly to a commercial state-of-the-art pose estimator. The pose estimator of Funes and Odobez deviates slightly from ours, which results in imprecise texture warping on the head model and can in turn affect the gaze estimation process.

Conclusions
This work addressed the problem of automatic facial pose and gaze estimation without subject cooperation or manual intervention using low quality depth data provided by the Microsoft Kinect. Previous works on pose estimation using the Kinect are based on supervised learning or require manual intervention. In this work, we proposed a 3D pose estimator based on low quality depth data. The proposed method is generic and fully automatic. The experimental results on the Biwi head pose database confirm the efficiency of our algorithm in handling large head pose variations and partial occlusion. Our results also confirm that model-based approaches outperform other approaches in terms of precision.

We also evaluated the performance of our algorithm on the IDIAP database for 3D eye gaze estimation (i.e., point of regard) and obtained promising results. Although feature-based gaze estimators are the most accurate ones in the literature, they require high resolution images. As the Kinect has a low resolution RGB camera, we designed two appearance-based gaze estimators (i.e., manifold-based ALR and K-NN) that do not rely on local eye features. Instead, our proposed systems depend on the entire eye image content. Our gaze estimators outperformed the other appearance-based gaze estimators in several respects, thanks to the high precision of the head tracker. Moreover, the user can freely turn their head, which is not the case for most appearance-based methods, where the user has to use a chin rest.

Conflicts of Interest:
The authors declare no conflict of interest.