A Novel Gaze Tracking Method Based on the Generation of Virtual Calibration Points

Most conventional gaze-tracking systems require that users look at many points during the initial calibration stage, which is inconvenient for them. To avoid this requirement, we propose a new gaze-tracking method with four important characteristics. First, our gaze-tracking system uses a large screen located at a distance from the user, who wears a lightweight device. Second, our system requires that users look at only four calibration points during the initial calibration stage, during which four pupil centers are noted. Third, five additional points (virtual pupil centers) are generated with a multilayer perceptron using the four actual points (detected pupil centers) as inputs. Fourth, when a user gazes at a large screen, the shape defined by the positions of the four pupil centers is a distorted quadrangle because of the nonlinear movement of the human eyeball. The gaze-detection accuracy is reduced if we map the pupil movement area onto the screen area using a single transform function. We overcame this problem by calculating the gaze position based on multi-geometric transforms using the five virtual points and the four actual points. Experimental results show that the proposed method is more accurate than an existing method that uses the same four calibration points without virtual points.


Introduction
Gaze-tracking technology is used to detect a user's gaze position in many applications, such as computer interfaces for the disabled, medical care, rehabilitation, and virtual reality [1][2][3]. Two approaches for gaze tracking exist: the wearable type and the remote type. The wearable type requires a user to wear a device that includes a camera and a near-infrared (NIR) light illuminator. Various types of devices can be used, such as a helmet or a pair of glasses [4][5][6], which do not require adjustments for head movements, because the device follows the user's head movements. However, when calculating the gaze position on a screen, tracking the head movements requires additional NIR illuminators in the four corners of the screen or an additional camera [4][5][6]. With the remote-type method, the user does not need to wear a device, because a remote camera captures an image of the user's eye, which is more convenient for the user [7,8]. However, additional cameras or expensive pan-tilt devices are required to capture eye images when users move their head.
Previous studies of gaze tracking can be classified into 2D- or 3D-based gaze-tracking methods. The 2D-based gaze-tracking methods use a simple mapping function between the pupil's position and the gaze position on the screen [4][5][6][9][10][11]. In contrast, the 3D-based gaze-tracking methods calculate the gaze position based on a 3D eyeball model [12,13]. In general, the 3D-based method is more accurate than the 2D-based method, but it requires the complex calibration of stereo cameras or multiple light sources.
In all previous studies on gaze tracking, an initial user calibration stage was required for more accurate gaze estimation. During the user calibration, the user needs to gaze at reference positions on a screen. In general, the accuracy of a gaze-tracking system tends to increase with the number of reference points, but this can be highly inconvenient for the user. To minimize user inconvenience, some systems attach NIR illuminators to the four corners of the monitor and require users to view only one position during Kappa calibration [5,6,13]. Table 1 provides a summary of the number of calibration points required by previous gaze-tracking methods and our proposed method.

Table 1. Gaze-Tracking Method: [12] | [5], [6], [8], [10], [13] | [14], [15] | [4,7], [16], and proposed method | [17] | [18] | [19], [20], [21] | [22] | [23] | [3] | [24] | [25]

To minimize the number of calibration points while maintaining the accuracy of gaze tracking, we propose a new gaze-tracking method based on the generation of virtual calibration points. In this study, we adopted a wearable gaze-tracking method to avoid the use of bulky panning and tilting devices while allowing natural head movements with a large display. We also used a 2D-based method in this study to avoid the complex calibration of stereo cameras or multiple light sources that is required by 3D-based gaze-tracking methods.
In a previous study, Memmert used an eye-tracking system with a large screen (3.2 × 2.4 m) [26]. Agustin et al. also proposed a gaze-tracking system that used a large screen [22]. With a large screen, the shape defined by the four pupil center positions (obtained when the user gazes at the top-left, top-right, bottom-left, and bottom-right corners of the screen) is a distorted quadrangle rather than a rectangle because of the nonlinear movement of the 3D eyeball. Thus, the gaze-detection accuracy is reduced if we map the pupil movement area onto the screen area using a single transform function. We overcame this problem by calculating the final gaze position based on multi-geometric transforms. The remainder of this paper is organized as follows: in Section 2, we explain the proposed method. The experimental results and conclusions are presented in Sections 3 and 4, respectively.

Figure 1 provides an overview of the proposed gaze-tracking method. First, a user's eye image is captured with a camera using the NIR illuminator in the device, as shown in Figure 2. The image is not affected by external visible light, because an IR-passing filter attached to the camera rejects visible light [10,11]. Second, the captured NIR eye image is processed, and the pupil center is detected (Section 2.3). Third, the user is required to gaze at points near the four corners of the screen during the user-calibration stage. The eight feature values of the four detected pupil centers are then extracted (Section 2.4). Fourth, the eight extracted values are used as the inputs for training the multilayer perceptron (MLP). The MLP has a linear kernel, and it generates five additional points or virtual pupil centers as outputs (Section 2.5). Finally, the five generated points (virtual pupil centers) and the four actual points (detected pupil centers) are used to calculate the final gaze position (x, y) on a screen, based on multi-geometric transforms (Section 2.6).

Proposed Gaze-Tracking Device
In this study, we developed a gaze-tracking method that uses a wearable-type device. As shown in Figure 2, the device comprises a small universal serial bus (USB) camera and an NIR light-emitting diode (LED) [10,11]. The USB camera is a Logitech WebCam C600 [27]. The NIR-passing (visible light rejection) filter is included in the camera, and an additional zoom lens is attached to the camera's built-in lens [10,11]. Thus, the camera can capture a magnified NIR eye image that is unaffected by external visible light. The Z distance between the camera lens and the eye is about 8 cm. The detailed specifications are as follows:
• NIR LED

Detecting the Pupil Center
In Step 2 of Figure 1, circular edge detection (CED), local binarization, morphological closing, and geometric center calculation are performed sequentially to detect the pupil region in the NIR eye image, as shown in Figure 3 [6,10,11,28]. First, two scalable concentric circles (external and internal circles) are moved together within the whole image, as shown in Figure 3a. The pupil center is determined to be the position where the average pixel difference between the outer circle and the inner circle is maximized, as shown in Figure 3b. The detected pupil region is used to define a rectangular region on which local binarization is performed, as shown in Figure 3c. The threshold value for the local binarization is determined using the p-tile method [29].

The NIR LED used in our device (Figure 2) generates two types of reflections: specular reflections and Purkinje images. As shown in Figure 3b, specular reflections are produced on the corneal surface and are referred to as the first Purkinje image [11]. The specular reflections are not included in the pupil area because of the relative positions of the NIR LED and the user's eye, as shown in Figures 3b and 3c; hence, they are not used to detect the pupil center in our study. The other reflections are shown in Figure 3c. These reflections occur on the posterior surface of the cornea and the anterior and posterior surfaces of the lens, and they are referred to as the second, third, and fourth Purkinje images, respectively [11]. In our eye images, two Purkinje images were found in the pupil area and one in the iris area, as shown in Figure 3c. To determine the accurate pupil center, the Purkinje images in the pupil area were filled in using morphological closing, as shown in Figure 3d [30]. Finally, the geometric center position of the black pixels in the pupil region is calculated as the pupil center, as shown in Figure 3e, and the final detection result is shown in Figure 3f.
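The detection steps described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the sampling of 64 points per circle, the `gap` and `step` parameters, the choice of p for the p-tile threshold, and the synthetic test image are all assumptions made for illustration, and the morphological closing step is omitted because the synthetic pupil contains no Purkinje images.

```python
import numpy as np

def ring_mean(img, cx, cy, r, n_samples=64):
    """Mean gray level sampled on a circle of radius r centered at (cx, cy)."""
    t = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    xs = np.clip(np.rint(cx + r * np.cos(t)).astype(int), 0, img.shape[1] - 1)
    ys = np.clip(np.rint(cy + r * np.sin(t)).astype(int), 0, img.shape[0] - 1)
    return img[ys, xs].mean()

def circular_edge_detection(img, radii, gap=3, step=2):
    """Coarse search: maximize mean(outer circle) - mean(inner circle)."""
    h, w = img.shape
    best, best_score = None, -np.inf
    for r in radii:
        m = r + gap  # margin that keeps the outer circle inside the image
        for cy in range(m, h - m, step):
            for cx in range(m, w - m, step):
                score = (ring_mean(img, cx, cy, r + gap)
                         - ring_mean(img, cx, cy, r - gap))
                if score > best_score:
                    best, best_score = (cx, cy, r), score
    return best

def refine_center(img, cx, cy, r, gap=3):
    """Local p-tile binarization around the coarse result, then centroid."""
    half = r + 2 * gap
    y0, y1 = max(cy - half, 0), min(cy + half + 1, img.shape[0])
    x0, x1 = max(cx - half, 0), min(cx + half + 1, img.shape[1])
    window = img[y0:y1, x0:x1]
    # Assumed p (in percent): 80% of the share of the window that a pupil
    # of radius r is expected to cover.
    p = 80.0 * np.pi * r ** 2 / window.size
    dark = window <= np.percentile(window, p)
    ys, xs = np.nonzero(dark)
    return x0 + xs.mean(), y0 + ys.mean()

# Synthetic eye image (assumption): dark pupil disk on a brighter background.
img = np.full((80, 100), 150.0)
yy, xx = np.mgrid[0:80, 0:100]
img[(xx - 52) ** 2 + (yy - 37) ** 2 <= 12 ** 2] = 20.0

cx, cy, r = circular_edge_detection(img, radii=range(8, 17, 2))
pupil_x, pupil_y = refine_center(img, cx, cy, r)
print("coarse:", (cx, cy, r), "refined:", (pupil_x, pupil_y))
```

In this sketch the coarse search only needs to land near the pupil, because the centroid of the binarized window supplies the final center; this mirrors the two-stage structure of Figure 3.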

User Calibration by Gazing at the Four Corners of a Screen
Most conventional gaze-tracking systems require an initial user-dependent calibration procedure. In the present study, a user is requested to gaze at the four corners of a screen, as shown in Figure 4. When the user gazes at the four reference points, the four center positions of the user's pupil are obtained. Figure 4a-d show the four eye images and the pupil center positions when a user gazes at the top-left, top-right, bottom-left, and bottom-right reference points, respectively. These eight values [$(C_{x1}, C_{y1})$, $(C_{x2}, C_{y2})$, $(C_{x3}, C_{y3})$, and $(C_{x4}, C_{y4})$] are used in the next step to estimate the five virtual points using the MLP algorithm.
Additional user calibration points, i.e., $(C_{x5}, C_{y5})$, $(C_{x6}, C_{y6})$, $(C_{x7}, C_{y7})$, $(C_{x8}, C_{y8})$, and $(C_{x9}, C_{y9})$, are obtained, and one of the output values ($C_{x5}$) can be represented using Equations (1)-(4) [11]:

$$C_{x5} = \mathrm{func2}\left(\sum_{i=1}^{n} O_{h_i}\, w'_{i1}\right) \quad (1)$$

where $O_{h_i}$ is the output value of the hidden node ($h_i$), $w'_{i1}$ is the weight between the hidden node ($h_i$) and the output node ($o_1$), $n$ is the number of hidden nodes, and func2(·) is the kernel function of the output node. Various kernel functions can be used for the hidden and output nodes, such as linear or sigmoid functions. For a linear function, Equation (1) can be represented as follows:

$$C_{x5} = \sum_{i=1}^{n} O_{h_i}\, w'_{i1} \quad (2)$$

$O_{h_1}, O_{h_2}, O_{h_3}, \ldots, O_{h_n}$ can also be represented as follows:

$$O_{h_i} = \mathrm{func1}\left(\sum_{j=1}^{8} x_j\, w_{ji}\right), \quad i = 1, \ldots, n \quad (3)$$

where func1(·) is the kernel function of the hidden node ($h_i$), $x_j$ is the $j$-th of the eight input values, and $w_{ji}$ is the weight between the $j$-th input node and the hidden node ($h_i$). After replacing $O_{h_1}, O_{h_2}, O_{h_3}, \ldots, O_{h_n}$ in Equation (1) using Equation (3), $C_{x5}$ can be represented as follows:

$$C_{x5} = \mathrm{func2}\left(\sum_{i=1}^{n} \mathrm{func1}\left(\sum_{j=1}^{8} x_j\, w_{ji}\right) w'_{i1}\right) \quad (4)$$

Figure 5. MLP to estimate the five virtual points in the pupil center.

Figure 6 shows the mean squared error (MSE) for different numbers of hidden nodes with the training data, given eight input and ten output nodes in the MLP, as shown in Figure 5. The MSE decreased as the number of learning epochs increased during MLP training. In this experiment, we compared the MSE on the training set for 1 to 50 hidden nodes. Based on the minimum MSE in the experimental results, we selected 38 as the optimal number of hidden nodes. To simplify the graph, Figure 6 shows only the cases where the numbers of hidden nodes are 9, 17, 23, 35, 38 (optimal), and 41.

The left part of Figure 7 shows examples of the four pupil movable areas defined using the four actual pupil centers and the five virtual pupil centers. For example, Pupil Movable Area 1 is defined by $(C_{x1}, C_{y1})$, $(C_{x5}, C_{y5})$, $(C_{x6}, C_{y6})$, and $(C_{x7}, C_{y7})$. The right part of Figure 7 shows the four screen areas corresponding to each pupil movable area. For example, Pupil Movable Area 1 corresponds to Screen Area 1. Based on these relationships between the pupil movable areas and the screen areas, multi-geometric transforms are obtained, and the final gaze position is calculated. Detailed explanations are provided in the following section.
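The forward pass of such an MLP (eight inputs, n hidden nodes, ten outputs) can be sketched as follows. This is an illustrative sketch only: the weights are random placeholders rather than trained values, bias terms are omitted, the input coordinates are hypothetical, and the names `mlp_virtual_points`, `w_in`, and `w_out` are assumptions. The sketch also shows that, when both kernels are linear, the two layers collapse into a single linear map from the four actual pupil centers to the five virtual ones.

```python
import numpy as np

def mlp_virtual_points(actual_centers, w_in, w_out,
                       hidden_kernel=lambda v: v,
                       output_kernel=lambda v: v):
    """Map four actual pupil centers (8 values) to five virtual ones (10 values)."""
    x = np.asarray(actual_centers, dtype=float).reshape(8)
    hidden = hidden_kernel(x @ w_in)      # outputs of the n hidden nodes
    out = output_kernel(hidden @ w_out)   # ten output node values
    return out.reshape(5, 2)              # rows: (C_x5, C_y5) ... (C_x9, C_y9)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_hidden = 38                                        # value selected in the text
w_in = rng.normal(scale=0.1, size=(8, n_hidden))     # placeholder, untrained
w_out = rng.normal(scale=0.1, size=(n_hidden, 10))   # placeholder, untrained

# Hypothetical pupil centers for the four corner calibration points.
actual = [(100.0, 80.0), (160.0, 85.0), (95.0, 130.0), (150.0, 128.0)]

virtual_linear = mlp_virtual_points(actual, w_in, w_out)
virtual_sigmoid = mlp_virtual_points(actual, w_in, w_out, hidden_kernel=sigmoid)

# With linear kernels the network equals one 8x10 matrix applied to the input.
collapsed = (np.asarray(actual).reshape(8) @ (w_in @ w_out)).reshape(5, 2)
print(np.allclose(virtual_linear, collapsed))
```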

Calculating Final Gaze Position using Multi-geometric Transforms
As shown in Figure 7, four relationships are defined between the four pupil movable areas and the four screen areas after the user-dependent calibration stage is completed; for example, the relationship between Pupil Movable Area 1 and Screen Area 1. The pupil movable area and the screen area are a distorted quadrangle and a rectangle, respectively, as shown in Figure 7; hence, each relationship can be determined as a mapping function. In general, 1st-order or 2nd-order polynomials are used as the mapping function, as shown in Equations (5) and (6).
With the 1st-order polynomial function, the relationship between the coordinates of the pupil center $(C_x, C_y)$ and the calculated position on the screen $(S_x, S_y)$ is as follows:

$$S_x = a\,C_x + b\,C_y + c\,C_x C_y + d, \qquad S_y = e\,C_x + f\,C_y + g\,C_x C_y + h \quad (5)$$

As shown in Equation (5), the 1st-order polynomial function includes eight parameters ($a$ to $h$), which account for the 2D factors of rotation, translation, scaling, parallel inclining, and distortion between $(C_x, C_y)$ and $(S_x, S_y)$ [32]. This is referred to as a geometric transform mapping function [10,11].
As shown in Figure 8a, $T_1$ is the mapping transform matrix between Pupil Movable Area 1 and Screen Area 1. Using the training data, $T_1$ can be obtained in advance by multiplying $S_1'$ by the inverse matrix of $C_1'$, as in Equation (9) [10,11]. During the testing stage, if the position vector of the detected pupil center belongs to the quadrangle of Pupil Movable Area 1, the $T_1$ matrix in Equation (9) is selected, and the gaze position vector on the screen is calculated by multiplying $T_1$ by the position vector of the detected pupil center [10,11]. By the same method, $T_2$, $T_3$, and $T_4$ of Figure 8b-d are obtained, and the gaze position vector on the screen is calculated in the same way.
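The computation of one transform matrix and its use at test time can be sketched as follows. The sketch assumes that the position vector of a pupil center has the form $(C_x, C_y, C_x C_y, 1)$, which matches the eight parameters of the 1st-order function; the coordinates, the function names, and the convex-quadrangle test used to select the area are hypothetical choices made for illustration.

```python
import numpy as np

def basis(points):
    """Stack the assumed position vectors (C_x, C_y, C_x*C_y, 1) as columns."""
    p = np.asarray(points, dtype=float)
    return np.stack([p[:, 0], p[:, 1], p[:, 0] * p[:, 1], np.ones(len(p))])

def geometric_transform(pupil_pts, screen_pts):
    """Return T (2x4) with T @ C' = S', i.e. T = S' C'^-1 for four point pairs."""
    c = basis(pupil_pts)                          # C': 4x4
    s = np.asarray(screen_pts, dtype=float).T     # S': 2x4
    return s @ np.linalg.inv(c)

def map_gaze(t, pupil_center):
    """Screen position (S_x, S_y) for one detected pupil center."""
    return t @ basis([pupil_center])[:, 0]

def inside_convex_quad(p, quad):
    """True if p lies in a convex quadrangle whose vertices are in boundary order."""
    q = np.asarray(quad, dtype=float)
    edge = np.roll(q, -1, axis=0) - q
    rel = np.asarray(p, dtype=float) - q
    cross = edge[:, 0] * rel[:, 1] - edge[:, 1] * rel[:, 0]
    return bool(np.all(cross >= 0) or np.all(cross <= 0))

# Hypothetical Pupil Movable Area 1 (a distorted quadrangle) and Screen Area 1,
# both listed as top-left, top-right, bottom-left, bottom-right.
pupil_area = [(100.0, 80.0), (160.0, 85.0), (95.0, 130.0), (150.0, 128.0)]
screen_area = [(0.0, 0.0), (1000.0, 0.0), (0.0, 800.0), (1000.0, 800.0)]

t1 = geometric_transform(pupil_area, screen_area)

detected = (125.0, 105.0)
boundary_order = [pupil_area[0], pupil_area[1], pupil_area[3], pupil_area[2]]
if inside_convex_quad(detected, boundary_order):
    print("gaze position:", map_gaze(t1, detected))
```

With four sub-areas, the same two steps would be repeated with $T_2$, $T_3$, and $T_4$, selecting the matrix whose pupil movable area contains the detected pupil center.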

Experimental Results
The proposed gaze-tracking method was tested on a laptop computer with an Intel Core 2 Duo 1.83 GHz CPU and 1 GB RAM. The algorithm was developed in C++ using Microsoft Foundation Class (MFC), and the image capture software was produced using the DirectX 9.0 software development kit (SDK). In our experiments, each user gazed at 81 reference points on a screen, as shown in Figure 9. The screen size was 2 m × 1.6 m (horizontal and vertical), and the distance from the user to the screen was approximately 3 m. Ten subjects participated in this experiment and each subject had six trials. Half of the data were used for training, and the other half were used for testing. This procedure was repeated by switching the training data and the testing data, and the average accuracy was calculated.
From the training data, we obtained the desired output positions for the MLP training. For example, we can train the MLP with the five desired output (virtual) points [$(C_{x5}, C_{y5})$, $(C_{x6}, C_{y6})$, $(C_{x7}, C_{y7})$, $(C_{x8}, C_{y8})$, and $(C_{x9}, C_{y9})$ in Figure 5], because these five points are the data acquired when the user gazed at the upper-center, middle-left, middle-center, middle-right, and lower-center positions of the screen in Figure 9, which were among the 81 gazing points acquired during the training procedure. In the experiments, we measured the error of gaze detection (EGD) using Equation (10), where $Z$ is the distance from the user's eye to the screen, $X_e$ is the error distance between the reference position and the calculated gaze position on the x-axis on the screen, and $Y_e$ is the error distance between the reference position and the calculated gaze position on the y-axis on the screen:

$$\mathrm{EGD} = \tan^{-1}\left(\frac{\sqrt{X_e^2 + Y_e^2}}{Z}\right) \; (^\circ) \quad (10)$$

We measured the EGD with an increasing number of calibration points. In the first test, we used the 1st-order polynomial mapping function (geometric transform) in Equations (5) and (7). Figure 10 shows the performance with 4, 6, 9, 10, 15, and 25 calibration points. We applied geometric transform matrices to each subarea to map the pupil movable area onto the screen area. For example, when the number of calibration points was nine, a user actually gazed at nine calibration points. Four geometric transform matrices ($T_1$, $T_2$, $T_3$, and $T_4$ in Figure 8) were used to calculate the gaze position in each sub-region. As shown in Figure 10, the EGD generally decreased as the number of calibration points increased, if the calibration points included the screen center. The EGD was lowest when a user gazed at 25 calibration points.
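The EGD calculation can be sketched in a few lines. The sketch assumes that the EGD is the angle subtended at the eye by the Euclidean error distance on the screen, which is the form suggested by the definitions of $Z$, $X_e$, and $Y_e$; the function name and the sample error values are hypothetical.

```python
import math

def egd_degrees(x_err, y_err, z):
    """Angular gaze error in degrees; x_err, y_err, and z share one length unit."""
    return math.degrees(math.atan2(math.hypot(x_err, y_err), z))

# Viewing distance used in the experiments: approximately 3 m.
z = 3.0
example_egd = egd_degrees(0.05, 0.07, z)   # hypothetical 5 cm and 7 cm errors

# Inverse relation: on-screen error distance that corresponds to a given EGD.
error_at_1_66_deg = z * math.tan(math.radians(1.66))
print(example_egd, error_at_1_66_deg)
```

Under this assumption, an EGD of 1.66° at a distance of 3 m corresponds to an error of roughly 8.7 cm on the screen.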
In the next experiment, we measured the EGD when using the proposed method to generate the virtual points with the 1st-order polynomial function, as shown in Figure 11. For example, with nine calibration points, each user actually gazed at four calibration points (the four corners of the screen, i.e., the uncircled red points in Figure 11), and the virtual points (the red points inside blue dotted circles in Figure 11) were generated by the MLP algorithm, which used linear or sigmoid kernel functions. In Figure 11, "real calibration" refers to the results in Figure 10, where the user actually gazed at all of the calibration points without generating virtual points. In most cases, the EGD with "real calibration" was less than that with the proposed method. However, the EGD with the proposed method was less than that with an existing method when using four actual calibration points [11]. In a previous study [11], users gazed at a small viewing area and the calculated EGD was less than 1.6°. However, the larger area used in our research generated nonlinear movements of the pupil due to the greater rotation of the eyeball; therefore, the calculated EGD was greater than 4° (the extreme left bar in Figure 10) despite using the same method to calculate the gaze position [11].
When the proposed method generated five virtual points based on four actual points using MLP with a linear kernel, the EGD was less than that in other scenarios using the proposed method, as shown in Figure 11.
When the number of calibration points was ten (i.e., six virtual points and four actual points), the EGD was higher than in other cases, as shown in Figure 11. The reasons for the higher EGD are as follows. As shown in Figures 2 and 9, the user gazed at a large display while the camera captured the user's eye image from below the eye. In addition, the horizontal length (2 m) of the display was longer than the vertical length (1.6 m). Thus, the nonlinear movement of the pupil was greater when the eye was rotated in the horizontal direction (i.e., when a user gazed along the extreme upper or lower horizontal boundary of the display) than when the eye was rotated in the vertical direction (i.e., when a user gazed along the extreme left or right vertical boundary of the display). To compensate for the nonlinear movements of the pupil, points had to be generated for the extreme upper or lower boundary of the display. These points were not generated when the number of calibration points was 10, resulting in a higher EGD.

Figure 11. Error of gaze detection depending on the number of calibration points, when using the 1st-order polynomial function with "real calibration" (Figure 10) and the proposed method.
The EGD values for Figure 11 are shown in Table 2. With the proposed method (MLP with a linear kernel), the EGD was lowest (1.66°) in the scenario where the user actually gazed at four points and five additional virtual points were generated, compared to other scenarios. When the user actually gazed at nine points, the EGD was 1.36°. The EGD was lowest when a user actually gazed at 25 calibration points (0.55°). Even with a higher EGD, the proposed method is much more convenient for users, because they have to gaze at only four positions during the initial calibration stage. In addition, when the user actually gazed at four points, the EGD of the proposed method with five virtual points (1.66°) was much lower than the EGD of the existing method without virtual points (4.19°) [11].

Table 2. Comparison of EGD results in Figure 11 (1st-order polynomial function) (unit: °)
In the next test, we used the 2nd-order polynomial mapping function in Equations (6) and (8). In Figure 12, the numbers of calibration points were 6, 9, 15, and 25, and we applied the 2nd-order polynomial function to each subarea to map the pupil movable area onto the screen area. For example, when the number of calibration points was nine, the user actually gazed at nine calibration points and two 2nd-order polynomial functions were used to calculate the gaze position in two sub-regions. As shown in Equations (6) and (8), the 2nd-order polynomial function had 12 unknown parameters, and at least six calibration points were required to obtain those parameters. Therefore, when the number of calibration points was nine, only two 2nd-order polynomial functions were defined, as shown in Figure 12. In contrast, as shown in Equations (5) and (7), the 1st-order polynomial function had eight unknown parameters, and at least four calibration points were required to obtain those parameters. With nine calibration points, four 1st-order polynomial functions were defined, as shown in Figure 10.
The experiment results showed that the EGD was lowest when a user actually gazed at 15 calibration points, as shown in Figure 12. In the next experiment, we measured the EGD when using the proposed method with the 2nd-order polynomial function to generate the virtual points, as shown in Figure 13. For example, with nine calibration points, each user actually gazed at four calibration points (the four corners of the screen, i.e., the uncircled red points), and the virtual points (the red points inside the blue dotted circles) were generated with the MLP algorithm, which used linear or sigmoid kernel functions. In Figure 13, "real calibration" refers to Figure 12, where the user actually gazed at all of the calibration points without generating virtual points.
In most cases, the EGD with "real calibration" was lower than that with the proposed method. When the proposed method was used to generate five virtual points based on four actual points with the MLP using the linear kernel, the EGD was lower than that in other cases with the proposed method, as shown in Figure 13. The EGD values for Figure 13 are shown in Table 3. With the proposed method (MLP with a linear kernel), the EGD was lowest (1.75°) in the scenario where the user actually gazed at four points and five additional virtual points were generated, compared to other scenarios. The EGD was lowest when a user actually gazed at 15 calibration points (0.78°); however, the proposed method is much more convenient for the user, because they had to gaze at only four positions during the initial calibration stage. The lowest EGD of the 2nd-order polynomial function (1.75°) was higher than the lowest EGD of the 1st-order polynomial function (1.66°), as shown in Table 2. Thus, we confirmed that the accuracy was better when using the 1st-order polynomial function.
The 2nd-order polynomial-based mapping function performs worse than the 1st-order function for the following reason. The lowest EGDs of both the 1st-order and 2nd-order polynomial functions were obtained with five virtual points based on four actual (gazing) points. However, with the 2nd-order polynomial-based mapping function, two transform matrices were defined, as shown in Figure 13. On the other hand, with the 1st-order polynomial function, four transform matrices were defined, as shown in Figure 11. That is, twice as many transform matrices were used with the 1st-order polynomial function, each on a smaller pupil movement area; therefore, the correlation between the pupil movement area and the screen region can be defined more finely (Figure 8), thereby reducing the gaze-detection error.
As shown in Equations (6) and (8), six points are required to determine one 2nd-order polynomial function, because the number of unknown parameters is 12 [a, b, …, l in Equations (6) and (8)]. In contrast, only four points are required to determine one 1st-order polynomial function, because the number of unknown parameters is eight [a, b, …, h in Equations (5) and (7)]. Therefore, only two matrices are obtained for the 2nd-order polynomial function in the first case where "Number of Calib. points" is nine in Figure 13, whereas four matrices are obtained for the 1st-order polynomial function in the case where "Number of Calib. points" is nine in Figure 11.
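The parameter counting above follows from the fact that each calibration point supplies two equations, one for $S_x$ and one for $S_y$. The following sketch illustrates this for the 2nd-order case. The term set $(C_x^2, C_y^2, C_x C_y, C_x, C_y, 1)$ is an assumed form of a 12-parameter 2nd-order function, and the six pupil points and the coefficients are hypothetical values chosen so that the system is solvable.

```python
import numpy as np

def min_calibration_points(n_params):
    """Each point gives two equations (S_x and S_y), so ceil(n_params / 2) points."""
    return -(-n_params // 2)

def basis_2nd(points):
    """Rows (C_x^2, C_y^2, C_x*C_y, C_x, C_y, 1): an assumed 2nd-order term set."""
    p = np.asarray(points, dtype=float)
    cx, cy = p[:, 0], p[:, 1]
    return np.stack([cx ** 2, cy ** 2, cx * cy, cx, cy, np.ones(len(p))], axis=1)

def fit_2nd_order(pupil_pts, screen_pts):
    """Solve for the 6x2 coefficient matrix (12 parameters) by least squares."""
    a = basis_2nd(pupil_pts)
    s = np.asarray(screen_pts, dtype=float)
    coef, *_ = np.linalg.lstsq(a, s, rcond=None)
    return coef

# Six hypothetical pupil points in general position (not all on one conic).
pupil_pts = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (0, 2)]

# Hypothetical ground-truth parameters: column 0 for S_x, column 1 for S_y.
true_coef = np.array([[2.0, -1.0], [0.5, 3.0], [1.0, 0.0],
                      [4.0, 2.0], [-3.0, 1.0], [10.0, 20.0]])
screen_pts = basis_2nd(pupil_pts) @ true_coef

coef = fit_2nd_order(pupil_pts, screen_pts)
print(min_calibration_points(8), min_calibration_points(12))
print(np.allclose(coef, true_coef))
```

Six points determine the 12 parameters exactly only when the points are in general position; with more points, the same least-squares call returns the best fit.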
However, the 2nd-order and 1st-order polynomial functions were also compared under the same condition, i.e., using four transformation matrices in both cases. As shown in the two cases where "Number of Calib. points" is 15 in Figure 13, four transform matrices are used for the 2nd-order function in each case. In these cases, the EGDs of the MLP with a linear kernel are 4.68° and 4.5°, respectively, as shown in Table 3, which are larger than the EGD (1.66°) obtained by the 1st-order polynomial function using the MLP with a linear kernel and four transformation matrices, as shown in Table 2. In addition, the EGDs of the MLP with a sigmoid kernel are 4.65° and 3.9°, respectively, as shown in Table 3, which are larger than the EGD (2.04°) obtained by the 1st-order polynomial function using the MLP with a sigmoid kernel and four transformation matrices, as shown in Table 2.

Table 3. Comparison of the EGD results in Figure 13 (2nd-order polynomial function) (unit: °)

In Figure 14, the user gazed at four calibration points, and the gaze position was calculated using the existing geometric method without generating virtual points [11]. Figure 15 shows the results using the proposed method with the lowest EGD in Table 2. The same user gazed at the four calibration points, and five virtual points were generated using MLP with a linear kernel. The gaze position was calculated based on the 1st-order polynomial function. Figure 16 shows the results with the "real calibration" method using the lowest EGD in Table 2. The same user gazed at nine calibration points, and the gaze position was calculated using the multi-geometric transform method.
The proposed method ( Figure 15) was less accurate than the "real calibration" method ( Figure 16); however, the proposed method was much more convenient to use, because fewer points were needed for the initial calibration. In addition, the proposed method was more accurate than the existing method ( Figure 14).
In the final experiment, we measured the processing time with the proposed gaze-tracking method. Detecting the pupil center took 16 ms, generating new calibration points required 1 ms, and calculating the final gaze position took 20 ms. Thus, the total processing time was approximately 37 ms, and we confirmed that the processing speed with the proposed method was approximately 27 fps.

Figure 14. Example of the gaze points calculated using the existing method [11], which required the user to gaze at four calibration points.

Figure 15. Example of the gaze points calculated using the proposed method, which required the user to gaze at four calibration points and which generated five virtual points using MLP with a linear kernel and the 1st-order polynomial function.

Figure 16. Example of the gaze points calculated using the "real calibration" method (in Tables 1 and 2), which required the user to gaze at nine calibration points.

Conclusion
In this paper, we proposed a new gaze-tracking method to improve the performance of a gaze-tracking system using a large screen at a distance. The proposed device was light and wearable, and it comprised a USB camera, a zoom lens, and an NIR LED. The proposed method generated five virtual points using an MLP with a linear kernel based on four actual points (detected pupil centers) as the input. The five virtual points and four actual points were used in multi-geometric transforms to calculate the final gaze position. The proposed system is more accurate than the existing method that uses four calibration points without virtual points, and it is more convenient to use than methods that require users to gaze at more calibration points.
In future work, we will test the proposed method in various environments, such as gaze detection on the small display of a mobile device or gaze detection while driving a vehicle. In addition, we will investigate a method that hides the calibration process from users, for example, by asking a user to watch a moving target on the screen while the system acquires the data points needed for calibration.