Compensation Method of Natural Head Movement for Gaze Tracking System Using an Ultrasonic Sensor for Distance Measurement

Most gaze tracking systems are based on the pupil center corneal reflection (PCCR) method using near infrared (NIR) illuminators. One advantage of the PCCR method is the high accuracy it achieves in gaze tracking, because it compensates for the pupil center position based on the relative position of the corneal specular reflection (SR). However, the PCCR method only works for user head movements within a limited range, and its performance is degraded by the natural movement of the user's head. To overcome this problem, we propose a gaze tracking method using an ultrasonic sensor that is robust to the natural head movement of users. Experimental results demonstrate that, with our compensation method, the gaze tracking system is more robust to natural head movements than both the same system without compensation and commercial systems.


Introduction
Gaze tracking is a technology for determining where a user is looking. It has been widely used in various applications such as neuroscience, psychology, industrial engineering, human factors, marketing, advertising, and computer interfaces [1]. There has been a lot of research on gaze tracking based on the movement of the face and eyes [2], and on the adoption of gaze tracking technology in natural input devices to replace the conventional keyboard and mouse [3][4][5][6]. With the increased adoption of gaze tracking technology in various fields, studies aiming to improve gaze tracking accuracy have been progressing actively. Most of the known gaze tracking systems use the pupil center corneal reflection (PCCR) method [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22]. In this method, the position of the corneal specular reflection (SR) produced by a near infrared (NIR) illuminator is regarded as a reference. The final gaze position on the screen is calculated using the vector from the corneal SR to the pupil center (PC), and the matrix defining the relationship between the four corner positions of the screen and the corresponding four vectors obtained during calibration. However, this matrix is obtained at one specific Z-distance between the user and the screen. Therefore, if the user's head moves after calibration, the matrix is no longer valid for calculating the gaze position, which increases the consequent gaze detection error.
To overcome this problem, there has been research focusing on compensating for the user's head movement. We can categorize the past research into two types: multiple camera-based and single camera-based methods. In multiple camera-based methods [15][16][17][18][23], two or more cameras are used

Sensors 2016, 16, 110

The remainder of this paper is organized as follows: in Section 2, our proposed gaze tracking method that is robust to head movement is described; in Section 3, the experimental results and analyses are provided; and in Section 4, we provide concluding remarks and directions for future studies in the area.

Proposed Method for Compensating User's Head Movement
In our gaze tracking system, we acquire the Z-distance data using an ultrasonic sensor, as well as the actual inter-distance of the two pupil centers, during a user calibration stage. Then, we measure the Z-distance of the user's eye from the camera based on the actual inter-distance of the two pupil centers obtained during the calibration stage. In the testing stage, our gaze tracking system calculates the difference between the Z-distance of the user's eye when capturing the current image and that obtained during calibration. If this difference is not greater than a threshold, the user's head movement does not need to be compensated, and our system calculates the gaze position without compensation. If the difference is greater than the threshold, our system checks whether a head rotation (yaw) around the vertical axis has occurred. If it has not, our system compensates for the user's head movement and then calculates the gaze position. If it has, the head movement does not need to be compensated, and our system calculates the gaze position without compensation.
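This decision flow can be sketched as a small helper function; the threshold value and the yaw flag are placeholders, since the text does not specify concrete numbers:

```python
def needs_compensation(d_calib_cm, d_now_cm, threshold_cm, yaw_detected):
    """Decide whether head-movement compensation is required, following the
    decision flow described above.

    d_calib_cm: Z-distance of the user's eye measured during calibration.
    d_now_cm: Z-distance measured for the current frame.
    threshold_cm: hypothetical tuning value (not specified in the text).
    yaw_detected: True if a head rotation (yaw) around the vertical axis
        has been detected for the current frame.
    """
    # Small Z-distance change: calculate the gaze position without compensation.
    if abs(d_now_cm - d_calib_cm) <= threshold_cm:
        return False
    # Large change caused by yaw rotation: also no compensation needed.
    if yaw_detected:
        return False
    # Large change from translation along the Z-axis: compensate.
    return True
```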

Gaze Tracking Algorithm
Our gaze tracking algorithm is based on the PCCR method [7,21], which is the most commonly used 2-D mapping-based approach for gaze tracking. Details of our gaze tracking method are provided below. As the first step, the rough position of the corneal SR is identified by finding the bright pixels in the captured image. Then, the region for detecting the pupil area is defined based on the identified position of the corneal SR, as shown in Figure 1b. In order to make the pupil boundary more distinctive, histogram stretching is applied to this region (Figure 1c), and a binarized image is obtained because the pupil area is usually darker than the surrounding regions, as shown in Figure 1d. Then, only the pupil region is left in the image by excluding the corneal SR and other noise regions through morphological operations and component labeling, as shown in Figure 1e,f. Pupil boundary points are located by a Canny edge detector, as shown in Figure 1g, and the part of the pupil boundary damaged by the corneal SR can be compensated (Figure 1h) through the convex hull procedure [24][25][26]. A binarized image of Figure 1c is obtained as shown in Figure 1i. Then, the region where Figure 1h,i overlap is removed from the image of Figure 1h, as shown in Figure 1j. With Figure 1j, an accurate pupil boundary can be detected by excluding the corneal SR. With the remaining points on the pupil boundary, the accurate pupil boundary is detected by an ellipse-fitting method, as shown in Figure 1k. The results of pupil boundary and center detection are shown in Figure 1l.
When the pupil center and boundary have been detected as shown in Figure 1l, our gaze detection algorithm defines the search area for detecting the accurate position of the corneal SR. Within this area, the accurate center of the corneal SR is detected through image binarization and by calculating the geometric center position. Based on the detected centers of the pupil and corneal SR, the pupil-corneal SR vector can be obtained as shown in Figure 2.
Based on the pupil-corneal SR vector, the user's gaze position is calculated using a geometric transform method as follows [7,21]. The pupil-corneal SR vector is used for obtaining the mapping relationship between the pupil movable sub-region and the monitor sub-region, as shown in Figure 3.
In our gaze tracking system, each user should look at nine reference points (M1, M2, ..., M9 of Figure 3) on the monitor during the user calibration stage. From that, nine pupil centers in the images are acquired, and nine pupil-corneal SR vectors are consequently obtained. In order to compensate for the corneal SR movement caused by head movement, all the starting positions of the nine pupil-corneal SR vectors are superimposed, and the nine compensated pupil centers (P1, P2, ..., P9 of Figure 3) in the images are acquired. The procedure for compensating the pupil centers works as follows. For example, the positions of the corneal SR and pupil center are assumed to be (120, 120) and (110, 110), respectively, in the first image, and (105, 134) and (130, 90), respectively, in the second image. Then, the movement between the two corneal SR positions in the first and second images becomes −15 (105 − 120) on the x-axis and +14 (134 − 120) on the y-axis. Therefore, if we set the position of the corneal SR in the first image (120, 120) to be the same as that in the second image (105, 134) (so that the corneal SR positions coincide), the pupil center position in the first image (110, 110) is changed to (95, 124), considering the amount of movement (−15, +14). This new position (95, 124) is the compensated pupil center, which is used for calculating the gaze position instead of the original pupil center position (110, 110). Then, four pupil movable sub-regions and monitor sub-regions are defined, and a geometric transform matrix that relates each pair of pupil movable sub-region and monitor sub-region is obtained, as shown in Figure 3 and Equation (1).
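The worked example above reduces to a single translation, which can be sketched as:

```python
def compensate_pupil_center(sr_own, pupil_own, sr_target):
    """Superimpose corneal SR positions: translate the pupil center of one
    image by the displacement that moves its corneal SR onto sr_target,
    as in the worked example above."""
    dx = sr_target[0] - sr_own[0]
    dy = sr_target[1] - sr_own[1]
    return (pupil_own[0] + dx, pupil_own[1] + dy)
```

For the example's numbers, the pupil center (110, 110) with corneal SR (120, 120), aligned to the corneal SR (105, 134), yields the compensated pupil center (95, 124).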
In general, the shape transform from one quadrangle to another can be mathematically defined using several unknown parameters, as shown in [27]. If one quadrangle is changed to the other only by in-plane rotation and translation (x-axis and y-axis), the transform matrix can be defined using three unknown parameters. If it is changed by in-plane rotation, translation (x-axis and y-axis), and scaling, the transform matrix can be defined using four unknown parameters. If it is changed by in-plane rotation, translation (x-axis and y-axis), scaling, and parallel inclination (x-axis and y-axis), the transform matrix can be defined using six unknown parameters. Finally, if one quadrangle is changed to the other by in-plane rotation, translation (x-axis and y-axis), scaling, parallel inclination (x-axis and y-axis), and distortion (x-axis and y-axis), the transform matrix can be defined using eight unknown parameters [27]. Because the shape transform from a pupil movable sub-region of Figure 3 to a monitor sub-region of Figure 3 can include all the factors of the last case, we define the transform matrix using eight unknown parameters.
In Equation (1), n is 1, 2, 4, and 5. The four points (P(n+i)x, P(n+i)y) and (M(n+i)x, M(n+i)y) (i is 0, 1, 3, and 4) represent the X and Y positions of the compensated pupil center (in the eye image) and the reference point (on the monitor), respectively:

$$
\begin{pmatrix} M_{nx} & M_{(n+1)x} & M_{(n+3)x} & M_{(n+4)x} \\ M_{ny} & M_{(n+1)y} & M_{(n+3)y} & M_{(n+4)y} \end{pmatrix} = \begin{pmatrix} a & b & c & d \\ e & f & g & h \end{pmatrix} \begin{pmatrix} P_{nx} & P_{(n+1)x} & P_{(n+3)x} & P_{(n+4)x} \\ P_{ny} & P_{(n+1)y} & P_{(n+3)y} & P_{(n+4)y} \\ P_{nx}P_{ny} & P_{(n+1)x}P_{(n+1)y} & P_{(n+3)x}P_{(n+3)y} & P_{(n+4)x}P_{(n+4)y} \\ 1 & 1 & 1 & 1 \end{pmatrix} \quad (1)
$$

From Equation (1), we can obtain the eight unknown parameters (a, b, c, ..., h) using matrix inversion. Once we obtain these parameters, we can calculate where the user is gazing on the monitor (M′x and M′y of Equation (2)) using the compensated pupil center in the current frame (P′x and P′y of Equation (2)):

$$
\begin{pmatrix} M'_x \\ M'_y \end{pmatrix} = \begin{pmatrix} a & b & c & d \\ e & f & g & h \end{pmatrix} \begin{pmatrix} P'_x \\ P'_y \\ P'_x P'_y \\ 1 \end{pmatrix} \quad (2)
$$
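The per-sub-region mapping of Equations (1) and (2) amounts to fitting the bilinear form M = a·Px + b·Py + c·Px·Py + d (and likewise for the y-coordinate) from four point pairs. A dependency-free sketch, using a small Gaussian-elimination solver in place of the matrix inversion:

```python
def solve4(A, b):
    """Solve a small linear system by Gaussian elimination with partial
    pivoting (stands in for the matrix inversion of Equation (1))."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_subregion(pupil_pts, monitor_pts):
    """Fit (a, b, c, d) and (e, f, g, h) for one pupil movable sub-region
    from its four calibration pairs (compensated pupil center in the image
    -> reference point on the monitor)."""
    A = [[px, py, px * py, 1.0] for px, py in pupil_pts]
    abcd = solve4(A, [mx for mx, _ in monitor_pts])
    efgh = solve4(A, [my for _, my in monitor_pts])
    return abcd, efgh

def map_gaze(params, px, py):
    """Equation (2): map a compensated pupil center to a monitor position."""
    (a, b, c, d), (e, f, g, h) = params
    return (a * px + b * py + c * px * py + d,
            e * px + f * py + g * px * py + h)
```

The coordinates used in a usage example would be illustrative only; in the actual system, the four pairs come from the calibration points of one sub-region of Figure 3.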

Analyses of the Change of Pupil-Corneal SR Vectors According to Head Movements
The geometric transform matrix of Equation (2) is obtained at one Z-distance between the user and the screen. Therefore, if the user's head moves after calibration, the matrix is no longer correct for calculating the gaze position, which increases the consequent gaze detection error. To overcome this problem, we propose a method of compensating for the user's head movement using an ultrasonic sensor. Before explaining our compensation method in detail, we define the user's head movements according to translation and rotation based on the X-, Y-, and Z-axes, respectively, in 3D space, as shown in Figure 4. In addition, we define the symbols representing the head movements in Table 1.


Table 1. Symbols representing the head movements.

Head Movements | X Axis | Y Axis | Z Axis
Translation    | T_X    | T_Y    | T_Z
Rotation       | R_X    | R_Y    | R_Z

Considering the definitions of head rotation and translation shown in Figure 4 and Table 1, we analyze the change of the pupil-corneal SR vectors according to the head movements. In Figure 5, the circular shape represents a user's eyeball, and r is the radius of the eyeball based on Gullstrand's eye model [28]. POG denotes the point of gaze, which is connected to the user's pupil center and the eyeball's center (O1 or O2). G represents the distance from the POG to the Y position of the eyeball's center. H is the distance from the camera (CAM)/near infrared (NIR) illuminator/ultrasonic sensor (USS) to the Y position of the eyeball's center. d represents the distance from the X-Y plane to the nearest surface of the eyeball. g (g1 or g2) is the distance from the center of the eyeball's surface to the line passing through the POG and the eyeball's center. h (h1 or h2) represents the distance from the center of the eyeball's surface to the line passing through the camera/NIR illuminator/ultrasonic sensor and the eyeball's center. This latter line passes through the point of the corneal SR because the NIR illuminator produces the corneal SR. Therefore, the pupil-corneal SR vector can be calculated from g and h. The eyeball of h1, g1, O1, and d1 in Figure 5 represents the case before movement, and that of h2, g2, O2, and d2 in Figure 5 shows the case after movement.
In the case of T_X movement (Figure 5a), we can obtain Equations (3) and (4) based on the similarity property of triangles. From Equations (3) and (4), we get Equation (5). The approximate value of the pupil-corneal SR vector after T_X can be obtained as (h2 + g2), as shown in Equation (6). This means that the pupil-corneal SR vector after T_X is the same as that before T_X; that is, T_X does not change the original pupil-corneal SR vector. In the case of T_Y movement (Figure 5b), we can obtain Equations (7) and (8) based on the similarity property of triangles. From Equations (7) and (8), we get Equation (9). The approximate value of the pupil-corneal SR vector after T_Y can be obtained as (h2 − g2), as shown in Equation (10). This means that the pupil-corneal SR vector after T_Y is the same as that before T_Y; that is, T_Y does not change the original pupil-corneal SR vector. Using the same method, in Figure 5c, we can obtain the pupil-corneal SR vector after T_Z as (h2 + g2), as shown in Equation (14), using Equations (11)-(13). As shown in Equation (14), the pupil-corneal SR vector after T_Z is changed according to the Z-distance ratio, considering r × ((d1 + r)/(d2 + r)), compared to the original pupil-corneal SR vector (h1 + g1):

(d1 + r) : H = r : h1,  (d2 + r) : H = r : h2   (11)
(d1 + r) : G = r : g1,  (d2 + r) : G = r : g2   (12)

In the case of R_X movement (Figure 5d), we can decompose the R_X movement into T_Z and T_Y movements.
As shown in Figure 5b and Equation (10), T_Y does not change the original pupil-corneal SR vector. In addition, as shown in Figure 5c and Equation (14), the pupil-corneal SR vector after T_Z is changed according to the Z-distance ratio, considering r × ((d1 + r)/(d2 + r)), compared to the original pupil-corneal SR vector (h1 + g1). Therefore, we can obtain the pupil-corneal SR vector after R_X as shown in Equation (18), obtained using Equations (15)-(17):

(d1 + r) : H = r : h1,  (d2 + r) : (H + Ty) = r : h2   (15)
(d1 + r) : G = r : g1,  (d2 + r) : (Ty − G) = r : g2   (16)

In the case of R_Y movement (Figure 5e), we can decompose the R_Y movement into T_Z and T_X movements. Using a method similar to that for the R_X case, T_X does not change the original pupil-corneal SR vector, as shown in Figure 5a and Equation (6). In addition, the pupil-corneal SR vector after T_Z is changed according to the Z-distance ratio, considering r × ((d1 + r)/(d2 + r)), compared to the original pupil-corneal SR vector (h1 + g1). Therefore, we can obtain the pupil-corneal SR vector after R_Y as shown in Equation (22), obtained using Equations (19)-(21):

(d1 + r) : G = r : g1,  (d2 + r) : G = r : g2   (20)

Finally, for the last case of R_Z movement (Figure 5f), we can decompose the R_Z movement into T_X and T_Y movements. As explained before, T_X and T_Y do not change the original pupil-corneal SR vector. Therefore, the pupil-corneal SR vector after R_Z is the same as that before R_Z, and R_Z does not change the original pupil-corneal SR vector, as shown in Equation (26):

(d + r) : G = r : g1,  (d + r) : (G − Ty) = r : g2   (24)

To summarize, we can conclude that the change of the pupil-corneal SR vector is affected only by the change of Z-distance (Table 2).
Table 2. Change of the pupil-corneal SR vector according to head movements.

Head Movements | X Axis                 | Y Axis                 | Z Axis
Translation    | No change              | No change              | Change (Equation (14))
Rotation       | Change (Equation (18)) | Change (Equation (22)) | No change

Although we assume that the Z-axis is orthogonal to the camera image plane, as shown in Figure 4, we do not assume that the user's head movement in the direction of the Z-axis is limited to the direction parallel to the Z-axis. That is, even when the user's head movement is not parallel to the Z-axis, this case can be handled in our research by considering the combination of T_Z and T_Y (or T_Z and T_X), as shown in Figure 5d,e.

Compensating the Head Movement
Equations (14), (18) and (22) are obtained in 3D space. Therefore, we should obtain the corresponding equations in the 2D image plane in order to compensate for the head movement, because the pupil-corneal SR vector is measured in the 2D captured image. For that, we apply a camera perspective model [24], as shown in Figure 6. fc is the camera focal length, d1 or d2 is the Z-distance, and L1 or L2 is the projection in the image of l1 or l2 in 3D space. Then, we can get Equations (27) and (28). In Equations (14), (18) and (22), h2 + g2 or h2 − g2 is l2, and h1 + g1 is l1. Therefore, we can obtain Equation (29) using Equations (14), (18), (22) and (28). Finally, Equation (30) can be obtained, where L1 and L2 are the pupil-corneal SR vectors in the image before and after head movement, respectively. We can compensate for the head movement using Equation (30).
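As a sketch of how Equation (30) is used, the following combines the perspective model (L = fc · l/d) with the T_Z ratio between l1 and l2 implied by the proportions in Equations (11) and (12). The exact published form of Equation (30) may differ slightly, so treat this as an assumption-based reconstruction; the eyeball radius value is likewise an assumption derived from Gullstrand's model:

```python
R_EYEBALL_CM = 1.2  # approximate eyeball radius (assumption based on Gullstrand's model)

def compensate_sr_vector(L2, d1, d2, r=R_EYEBALL_CM):
    """Map a pupil-corneal SR vector L2 (pixels), measured after a T_Z head
    movement to Z-distance d2, back to the calibration distance d1.

    From Equations (11) and (12): l2 = l1 * (d1 + r) / (d2 + r).
    From the perspective model:   L = fc * l / d.
    Combining both:               L1 = L2 * (d2 * (d2 + r)) / (d1 * (d1 + r)).
    """
    return L2 * (d2 * (d2 + r)) / (d1 * (d1 + r))
```

When d2 equals d1 the vector is returned unchanged, and when the head moves away from the screen (d2 > d1) the shrunken image vector is scaled back up, which matches the qualitative behavior described for T_Z.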
In order to compensate for the head movement using Equation (30), we should know the Z-distances (d1 and d2) and the eyeball radius (r). We refer to Gullstrand's eye model [28] for the eyeball radius. To get the Z-distances d1 and d2, we use a commercial ultrasonic sensor (SRF08) [29] mounted above the camera for gaze tracking. The sensor consists of two parts: a transmitter & receiver, and a control board. The transmitter & receiver part is connected to the control board via an I2C interface, and the control board is connected to a desktop computer via a universal serial bus (USB) 2.0 interface. The Z-distance data measured by the transmitter & receiver, which operates at a frequency of 40 kHz, is continuously transmitted to the desktop computer via the control board. The maximum Z-distance that can be measured is limited to 6 m. The principle of the ultrasonic sensor is to measure the Z-distance between the transmitter & receiver and the object closest to it. Therefore, it is difficult to measure the accurate Z-distance between the ultrasonic sensor and the eye; instead, the sensor measures the Z-distance to another part of the face, such as the chin (or nose). In our research, the distance between the chin (or nose) and the eye along the Z-axis is much smaller (2~3 cm) than the Z-distance (70~80 cm) between the chin (or nose) and the ultrasonic sensor. Therefore, we can use the assumption that the Z-distances of the chin (or nose) and the eye to the sensor are similar. In addition, the ultrasonic sensor can measure the Z-distance even when a user wears glasses.
However, the stability of the Z-distance measured by the ultrasonic sensor alone is not high, because the measured value can fluctuate (for example, with head rotation) even when the user's eye has not actually moved along the Z-axis. To overcome this problem, we use the inter-distance between the two eyes (pupil centers) in 3D space, which is measured during the initial stage of user calibration. The Z-axis of the head in Figure 4 is usually oriented towards the ultrasonic sensor when a user gazes at the lower-center position of the monitor (close to the position of the ultrasonic sensor) during the initial stage of user calibration. Then, we can obtain the Z-distance (d) using the ultrasonic sensor and assume that d is approximately the distance between the camera and the user's eye, because the camera is close to the ultrasonic sensor. With the known camera focal length (fc) (obtained by initial camera calibration), the measured Z-distance (d), and the inter-distance between the two pupil centers in the image (L), we can obtain the actual inter-distance between the two pupil centers in 3D space (l) using the equation l = (d · L)/fc, based on the camera perspective model of Figure 6.
This actual inter-distance between the two pupil centers in 3D space (l) does not change even when head movement occurs. Therefore, since we know the inter-distance between the two eyes in 3D space (l2 (= l) of Figure 6) and the changed inter-distance in the camera image after head movement (L2 of Figure 6), together with the camera focal length (fc), we can estimate the changed Z-distance after head movement (d2 of Figure 6) using the equation d2 = (l2 · fc)/L2.
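The two equations above can be expressed directly; the numbers in any usage would be illustrative, not values from the paper:

```python
def interpupil_distance_3d(d, L, fc):
    """l = (d * L) / fc: actual inter-pupil distance in 3D space, from the
    Z-distance d (ultrasonic sensor), the inter-pupil distance in the
    image L (pixels), and the camera focal length fc (pixels)."""
    return d * L / fc

def z_distance_after_movement(l, L2, fc):
    """d2 = (l * fc) / L2: changed Z-distance after head movement, from the
    known 3D inter-pupil distance l and the new image inter-distance L2."""
    return l * fc / L2
```

Note that the two functions are inverses of each other in the sense that feeding the calibration values back in recovers the calibration Z-distance.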
In general, there are many individual variances of the inter-distance (l) between two eyes (pupil centers) (in the 3D space). Therefore, if we use the average value (calculated from many people) of l for measuring the Z-distance, the measurement error of Z-distance is much higher than that by our method. Experimental results with 10 people showed that the measurement error of the Z-distance by using the average value of l (calculated from many people) was about 3.5 cm, while that measured by our method was less than 1 cm. However, in the case of R_Y movement, the inter-distance between two eyes and the corresponding Z-distance are measured incorrectly. This is caused by the mapping of the 3D space object into 2D. As shown in Figure 7a,b, the inter-distance between two eyes in the 2D image plane becomes smaller in both cases of R_Y and T_Z. Although the pupil-corneal SR vector is changed, and it should be compensated in case of R_Y as shown in Equation (22), our system does not need to perform the compensation. The reason for this is our use of binocular gaze tracking. As shown in Figure 7a, the left eye is closer to the monitor whereas the right one becomes farther away in the case of R_Y. Based on the Equations (22) and (30), d 2 is smaller in the case of the left eye whereas d 2 is larger in the case of the right one. Because our system uses binocular gaze tracking by obtaining the average gaze position of two eyes, the head movement by R_Y does not need to be compensated. However, in case of T_Z, the head movement must be compensated as shown in Table 2. For that, our system should discriminate the R_Y and T_Z movement, which is achieved as explained below.
When R_Y movement occurs, the amounts of movement of the two eyes along the horizontal axis differ from each other, as shown in Figure 8a, whereas they are almost the same in the case of T_Z movement, as shown in Figure 8b. Based on this, our system discriminates R_Y movement from T_Z movement, and compensates the user's head movement by correcting the pupil-corneal SR vector in the image based on Equation (30).
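The discrimination rule above can be sketched as a minimal decision function; the ratio threshold is an assumption for illustration and is not a value given in the paper:

```python
def classify_head_movement(dx_left, dx_right, ratio_threshold=2.0):
    """Discriminate rotation about the Y-axis (R_Y) from translation along
    the Z-axis (T_Z), as described above: under R_Y the two eyes move by
    clearly different amounts along the image X-axis, while under T_Z both
    eyes move by similar amounts.

    dx_left / dx_right: horizontal displacement (pixels) of the left and
    right eye between two frames."""
    eps = 1e-6  # guard against division by zero for a stationary eye
    lo = min(abs(dx_left), abs(dx_right)) + eps
    hi = max(abs(dx_left), abs(dx_right)) + eps
    return "R_Y" if hi / lo > ratio_threshold else "T_Z"

# Rotation: one eye shifts 12 px, the other only 2 px -> classified as R_Y.
# Z-translation: both eyes shift about 5 px -> classified as T_Z.
```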

Comparisons of Gaze Tracking Errors without and with Our Compensation Methods
To compare the gaze accuracies with and without the proposed head movement compensation, we performed experiments with data from ten subjects. In most previous research on gaze tracking, experiments were performed with fewer than ten subjects: in [15,16,19,20] and [23], the numbers of participating subjects were seven, nine, two, three, and six, respectively. Each subject moved his or her head according to two movement types (translation and rotation), three axes (X, Y, and Z), and two directions (plus and minus for translation, clockwise and counter-clockwise for rotation). The proposed algorithm was executed on a desktop computer with a 3.47 GHz CPU (Intel Core i7 X 990) and 12 GB RAM, with a monitor of about 48.3 cm diagonal size and 1280 × 1024 pixel resolution. As the gaze tracking camera, a conventional web-camera (Logitech C600 [30]) was used. The viewing angle of the camera lens is 10° (−5° to +5°); therefore, the size of the viewport at the typical Z-distance (about 70 cm) is approximately 12.2 cm (= 2 × 70 × tan 5°). The viewing angle of the original camera lens is much larger than 10°, but with that lens the eye region becomes too small in the captured image, which can degrade the accuracy of gaze detection. Therefore, we replaced the original lens with a zoom lens (focal length 22 mm) whose viewing angle is 10° (−5° to +5°), made by a Korean optical lens company [31]. Using this lens, we can capture a magnified eye image.
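The quoted viewport size follows directly from the lens geometry; a quick check of the arithmetic:

```python
import math

def viewport_width_cm(z_cm, half_angle_deg):
    """Width of the field captured by a lens with the given half viewing
    angle at Z-distance z_cm: 2 * z * tan(half angle)."""
    return 2.0 * z_cm * math.tan(math.radians(half_angle_deg))

# +/-5 degree lens at the typical 70 cm Z-distance: approximately 12.2 cm.
w = viewport_width_cm(70.0, 5.0)
```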
In order to capture images at a fast speed (30 frames/s), an image of 800 × 600 pixels was acquired and re-sized to 1600 × 1200 pixels by bi-linear interpolation. Although a 1600 × 1200 pixel image can be captured by the camera, its data size is four times larger than that of the 800 × 600 pixel image. Therefore, the capturing speed becomes slower (7-8 frames/s (≈ 30/4 frames/s)) due to the bandwidth limitation of the data transfer from camera to computer, and consequently fast movements of the user's head and eyes cannot be captured. In addition, the resizing by bi-linear interpolation provides additional pixel information in the eye region, which enables more accurate detection of the pupil and corneal SR and consequently higher gaze detection accuracy.
To get the NIR image, the NIR-cutting filter inside the web-camera was replaced by a long-pass filter (Wratten filter No. 89B), which passes NIR light with wavelengths longer than 700 nm [32].

The proposed algorithm was implemented in C++ using the Microsoft Foundation Class (MFC) and OpenCV (ver. 2.4.5 [33]) libraries. Figure 9 shows the experimental environment, and Figure 10 shows examples of the experimental images for each head movement.
After each user performed the user-dependent calibration by gazing at nine reference points on the monitor (calibration stage), the gaze tracking errors with and without the proposed method were measured according to the head movements, using data obtained while each user gazed at 20 reference points (testing stage). The results are shown in Table 3. Gaze tracking error is measured as the angular difference between the calculated gaze vector and the vector to the reference (ground-truth) point, and STD denotes the standard deviation of the gaze tracking error; a smaller error therefore means higher gaze detection accuracy. As shown in Table 3, the gaze detection errors achieved with our proposed method are lower than those achieved without it in all cases of head movement. In addition, the gaze detection error with our method is less than 0.9° in both cases of no head movement and head movement. Table 3. Average gaze detection errors of ten subjects with and without our method (unit: °).

In Table 4, we show the individual gaze detection errors with and without our method. As shown in Table 4, the average gaze detection error achieved with our method is lower than that achieved without it for all subjects.
In order to show that the average gaze detection error achieved using our method is statistically lower than that achieved without it, we performed a t-test [34]. When the t-test was performed using two independent samples (the gaze detection errors without our method (1.81° with an STD of 0.11°, as shown in Table 3) and with our method (0.69° with an STD of 0.06°, as shown in Table 3)), the calculated p-value was 4.43105 × 10^-16, which is smaller than the 0.01 significance level (99% confidence).
The null hypothesis of the t-test (that there is no difference between the two independent samples) is therefore rejected based on the p-value, and we can conclude that there is a significant difference between the average gaze detection errors with and without our method at the 99% significance level.
As a next step, we performed a Cohen's d analysis, which indicates the size of the difference between two groups using the effect size [35]. In general, a Cohen's d of about 0.2-0.3 is considered small, about 0.5 medium, and 0.8 or higher large. The calculated Cohen's d is 12.1765, so we can conclude that the difference between the average gaze detection errors without and with our method is large. From the p-value and Cohen's d, we conclude that there is a significant difference between the average gaze detection errors with and without our method.
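The two statistics above can be sketched from summary values only (group means, STDs, and n = 10 subjects per group). Note that the paper computed its figures from the per-subject samples, so a computation from Table 3 summary statistics alone may differ slightly from the reported 12.1765:

```python
import math

def cohens_d(mean1, std1, mean2, std2, n1, n2):
    """Cohen's d effect size for two independent samples,
    using the pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2)
                       / (n1 + n2 - 2))
    return abs(mean1 - mean2) / pooled

def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic for two independent samples
    (no equal-variance assumption)."""
    return (mean1 - mean2) / math.sqrt(std1**2 / n1 + std2**2 / n2)

# Without compensation: 1.81 deg (STD 0.11); with: 0.69 deg (STD 0.06).
d = cohens_d(1.81, 0.11, 0.69, 0.06, 10, 10)   # far above the 0.8 "large" cutoff
t = welch_t(1.81, 0.11, 10, 0.69, 0.06, 10)
```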
In the next experiment, we measured the processing time required by our method. The total processing time is 15.55 ms per frame, so our system can operate at a speed of about 64.3 frames/s. In the experiments, each participant moved his or her head within the range of head movement shown in Table 5.

Comparisons of Gaze Tracking Errors with Our Method and Commercial System
In the next experiment, we compared the gaze detection errors of our method with those of a commercial gaze tracking system, TheEyeTribe [14]. To compare the gaze accuracies, we performed experiments with data from ten subjects, as in the experiments of Section 3.1.
Each subject moved his or her head according to two movement types (translation and rotation), three axes (X, Y, and Z), and two directions (plus and minus for translation, clockwise and counter-clockwise for rotation). As in the experiments of Section 3.1, the proposed algorithm was executed on a desktop computer with a 3.47 GHz CPU (Intel Core i7 X 990) and 12 GB RAM, with a monitor of about 48.3 cm diagonal size and 1280 × 1024 pixel resolution. The experimental results showed that our system produces lower gaze detection errors than the commercial system (TheEyeTribe) in all cases of head movement, as shown in Table 6. In Figure 11, we show the t-test and the Cohen's d analysis. The measured p-value is 0.014569, which is smaller than the 0.05 significance level (95% confidence), so we can conclude that there is a significant difference between the average gaze detection errors achieved using our method and those achieved using the commercial system. In addition, the measured Cohen's d is 1.0826, so we can conclude that this difference is large. Figure 11. T-test of average gaze detection errors by our method and a commercial system (TheEyeTribe).
In the last experiment, we compared the gaze detection errors of our method with those of another commercial gaze tracking system, Tobii EyeX [36]. To compare the gaze accuracies, we performed experiments with data from ten subjects, as in the experiments of Section 3.1. Each subject moved his or her head according to two movement types (translation and rotation), three axes (X, Y, and Z), and two directions (plus and minus for translation, clockwise and counter-clockwise for rotation).
As in the experiments of Section 3.1, the proposed algorithm was executed on a desktop computer with a 3.47 GHz CPU (Intel Core i7 X 990) and 12 GB RAM, with a 19-inch monitor having 1280 × 1024 pixel resolution. The experimental results showed that our system produces lower gaze detection errors than the commercial system (Tobii EyeX) in all cases of head movement, as shown in Table 7. In Figure 12, we show the t-test and the Cohen's d analysis. The measured p-value is 0.005853, which is smaller than the 0.01 significance level (99% confidence), so we can conclude that there is a significant difference between the average gaze detection accuracies achieved using our method and those achieved using the commercial system. In addition, the measured Cohen's d is 1.32538, so we can conclude that this difference is large. Figure 12. T-test of average gaze detection accuracies by our method and a commercial system (Tobii EyeX).
We selected the two commercial systems TheEyeTribe [14] and Tobii EyeX [36] because they are among the most widely used commercial gaze detection systems, owing to their high performance and low cost. When people use a gaze tracking system, the user's head often moves away from the position where the user calibration was performed, which degrades the gaze detection accuracy of most gaze tracking systems. To overcome this problem, we proposed our head movement compensation method, which guarantees high gaze detection accuracy even when the user's natural head movement occurs after the initial user calibration. These statements are confirmed by the experimental results of Tables 3, 4, 6 and 7 and Figures 11 and 12.
In [37], Ishima and Ebisawa proposed an eye tracking method using ultrasonic sensors for head-free gaze detection. However, their method requires users to wear a head-mounted device including three ultrasonic transmitters. In addition, three ultrasonic receivers must be attached at known positions around the monitor frame. The head-mounted device and the three ultrasonic transmitters and receivers reduce user convenience and increase the size of the system. Moreover, because the three ultrasonic receivers are placed at predetermined positions around the monitor frame, additional setup time is required whenever a different monitor is used. They also did not report the accuracy of gaze detection, but rather demonstrated the robustness of continuously capturing eye images with their system. In contrast, our system uses only one ultrasonic sensor (transmitter and receiver), as shown in Figure 9. With one sensor, the system is very small and easy to set up, irrespective of the kind of monitor. In addition, our system does not require users to wear any device, which enhances user convenience. Unlike [37], we showed that our system can guarantee high gaze detection accuracy irrespective of the user's natural head movement after the initial user calibration.
Our system also has cost advantages over the two commercial systems. It is composed of one web-camera (less than $20), one ultrasonic sensor (less than $20), and an NIR illuminator with filter (less than $20). Therefore, the total cost of our system is less than $60, which is cheaper than the two commercial systems (TheEyeTribe [14] and Tobii EyeX [36]), which cost about $100-$150 each.

Conclusions
In our research, we proposed a gaze tracking method using an ultrasonic sensor that is robust to the natural head movement of a user. Through mathematical analyses of the change of the relative positions of the PC and corneal SR due to various head movements, we found that only head movements that change the Z-distance between the camera and the user's eye affect the gaze detection accuracy. We measure the change of the Z-distance of a user's eye using a small ultrasonic sensor, which does not require any complicated camera calibration and measures the Z-distance much faster than stereo cameras. The accuracy of the Z-distance measured by the ultrasonic sensor is enhanced by using the change of the inter-distance of the two eyes in the images. By using the change of eye position of both eyes along the X-axis of the image, we can discriminate R_Y from T_Z movement. Experimental results show that the gaze tracking system with our compensation method is more robust to natural head movements than systems without our method and commercial systems. These results are supported by a t-test and an analysis of the effect size.
In the future, we will evaluate the performance of our system in more varied environments. In addition, we will research a method for further enhancing the accuracy of gaze detection systems that are robust to the user's head movement by combining the proposed method with a training-based estimator of the Z-distance.