Introduction
Recent advances in eye-tracking hardware have increased the number of available models, with improved performance and easier setup procedures. However, the main problem with these devices remains scalability: their price and the expertise required to operate them make large-scale use infeasible.
The latest commercial models provide high accuracy (between 0.1° and 1°) at high sampling rates (over 100 Hz); however, in situations where such accuracy is unnecessary and such rates are irrelevant, their high prices make them unsuitable. For example, in online advertisement, eye-tracking is used to analyze which parts of a webpage draw more attention and how the page layout directs the user's gaze. In such a use case, even 10 Hz data with an accuracy between 1° and 2° would be enough to generate the expected heatmap output.
In this work, we aim to build a cheap, open-source alternative that works on a hardware setup common in consumer environments: a camera and an electronic device display. We believe that a system providing comparable performance at an acceptable frequency will enable many applications on devices ranging from computers to tablets and smartphones.
Method
The components of the software can be seen in Figure 1. The original system requires at least 4 facial feature points chosen manually on the subject's face, and it tracks them using a combination of optical flow (OF) and 3D head pose based estimation. The image region containing one of the eyes is extracted and used in calibration and testing. During calibration, the subject is asked to look at several target locations on the display while image samples are taken; for each target, an average eye image is calculated and used as input to train a Gaussian process (GP) estimator. This estimator component maps the input images to display coordinates during testing.
Our first contribution is a programmatic point selection mechanism that automates this task. We then propose several improvements in the tracking component and implement image intensity normalization algorithms applied during and after tracking. We complete the blink detector so that its detections can be used by the other components. For calibration, we propose a procedure to assess and correct the training error. For the gaze estimation component, we evaluate a neural network method (Holland & Komogortsev, 2012). In the following subsections, we give the details of these contributions and discuss their effects on system performance in the discussion section.
Point Selection
Our contribution to the automation of the point selection mechanism aims at removing the errors caused by operation mistakes. Moreover, it provides a standardized technique, which increases the system's robustness. It employs a combination of Haar cascade detectors (Castrillón-Santana, 2013; Hameed, 2014), geometrical heuristics and a novel eye-corner detection technique. First, a cascade is used to detect the region containing both eyes, and then the novel method detects the outer eye-corner points (Figure 2(a)). The proposed method extracts all corner points inside this ROI using the Harris detector and calculates the average corner coordinates in the left and right halves of the region. These two points are taken as approximate eye centers, and the outer corner points are chosen on the line that passes through them. Since we only need a point near the eye corner that is stable enough to track, we avoid more complex calculations and simply place each eye-corner point at a predefined distance (1/3 of the distance between the two centers) outward from the corresponding center approximation.
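For illustration, a minimal Python/OpenCV sketch of this heuristic follows. It is not the actual C++ implementation; the cascade file name, parameter values and function names are our own assumptions.

```python
import cv2
import numpy as np

def locate_outer_eye_corners(gray, eyepair_cascade_path="haarcascade_mcs_eyepair_big.xml"):
    """Approximate the two outer eye-corner points from one grayscale frame.

    The cascade file name is an assumption; any detector that returns the
    region containing both eyes can be used instead."""
    cascade = cv2.CascadeClassifier(eyepair_cascade_path)
    detections = cascade.detectMultiScale(gray)
    if len(detections) == 0:
        return None
    x, y, w, h = detections[0]                       # region containing both eyes
    roi = gray[y:y + h, x:x + w]

    # All corner points inside the ROI, found with the Harris detector
    corners = cv2.goodFeaturesToTrack(roi, maxCorners=100, qualityLevel=0.01,
                                      minDistance=3, useHarrisDetector=True)
    if corners is None:
        return None
    corners = corners.reshape(-1, 2)

    # Average corner coordinates in the left / right halves = approximate eye centers
    left_center = corners[corners[:, 0] < w / 2].mean(axis=0)
    right_center = corners[corners[:, 0] >= w / 2].mean(axis=0)

    # Outer corners lie on the line through the two centers, 1/3 of the
    # inter-center distance outward from each center
    delta = right_center - left_center
    direction = delta / np.linalg.norm(delta)
    offset = np.linalg.norm(delta) / 3.0
    left_corner = left_center - direction * offset + np.array([x, y])
    right_corner = right_center + direction * offset + np.array([x, y])
    return left_corner, right_corner
```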
After the eye corners are selected, we search for the nose in a square region below them. When the Haar cascade returns a valid detection, shown as the inner rectangle in Figure 2(b), the two nasal points are selected at fixed locations inside this area. The algorithm continues in a similar way for the mouth and eyebrow feature points.
Point Tracking
The point tracking component of the original system uses a combination of optical flow (OF) and 3D head pose based estimation. Optical flow calculations are done between the current camera image and the previous image. This methodology results in the accumulation of small tracking errors and causes the feature points to deviate vastly from their original positions after blinking, for instance. In order to make our eye-tracker more robust to these problems, we modified the tracking component so that OF is only calculated against the initial image saved while choosing the feature points. Moreover, if we still lose track of any point, we directly use the estimate calculated using the 3D head pose and correctly tracked points’ locations.
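A minimal sketch of this anchored tracking strategy is given below (Python/OpenCV; the function and variable names are ours, and the 3D head pose fallback is only stubbed since it depends on the head-pose module of the application).

```python
import cv2
import numpy as np

def track_points(initial_gray, current_gray, initial_points, pose_estimates):
    """Track facial feature points with optical flow computed against the
    *initial* frame saved at point-selection time, not the previous frame.

    `pose_estimates` stands for the per-point positions predicted from the
    3D head pose and the correctly tracked points; how they are obtained is
    outside the scope of this sketch."""
    pts = np.asarray(initial_points, dtype=np.float32).reshape(-1, 1, 2)
    tracked, status, _err = cv2.calcOpticalFlowPyrLK(
        initial_gray, current_gray, pts, None,
        winSize=(21, 21), maxLevel=3)

    tracked = tracked.reshape(-1, 2)
    status = status.reshape(-1)

    # If a point is lost, fall back to the estimate from the 3D head pose
    for i in range(len(tracked)):
        if status[i] == 0:
            tracked[i] = pose_estimates[i]
    return tracked
```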
Image Normalization
During eye-tracker usage, the ambient light may change depending on the sun or other external light sources. Moreover, the computer display itself acts as a source of frontal illumination, so the shades and shadows on the face change as images of different intensities are shown on the screen. As the gaze estimation component of our eye-tracker uses intensity images for training and testing, changes in the light level are reflected in increased error rates.
Normalization Techniques. In order to tackle the varying lighting conditions, we incorporate two image normalization techniques to standardize the intensities over time (Gonzalez & Woods, 2008):

1. Standard pixel intensity mean and variance: In this technique, we first calculate the mean ($\mu_{orig}$) and the standard deviation ($\sigma_{orig}$) of the original image ($I_{orig}$) pixels. In the next step, the scale factor ($S$) is calculated as:

$$S = \frac{\sigma_{norm}}{\sigma_{orig}}$$

where $\sigma_{norm}$ is the desired standard deviation of intensity for the normalized image ($I_{norm}$). Finally, the normalized image pixels are calculated with the formula:

$$I_{norm} = (I_{orig} - \mu_{orig}) \cdot S + \mu_{norm}$$

Here, the equation first scales the image pixels to have the desired standard deviation, then shifts the mean intensity to the desired value ($\mu_{norm}$).
2. Standard minimum and maximum intensity: The second method aims at normalizing the images so that the minimum and maximum intensity values are the same among all the images. We start by calculating the minimum ($min_{orig}$) and maximum ($max_{orig}$) pixel intensities in the original image. Then, the scale factor is calculated as:

$$S = \frac{max_{norm} - min_{norm}}{max_{orig} - min_{orig}}$$

which is basically the ratio between the pixel intensity interval of the desired normalized image and that of the original image. Lastly, the normalized image pixels are calculated as:

$$I_{norm} = (I_{orig} - min_{orig}) \cdot S + min_{norm}$$

where the image pixels are mapped from the range $[min_{orig}, max_{orig}]$ to $[min_{norm}, max_{norm}]$.
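Both techniques reduce to an affine remapping of pixel intensities. The sketch below (NumPy; function names are ours) implements them as described above; the default parameter values correspond to the best performing settings reported in the Results ($\mu_{norm} = 127$, $\sigma_{norm} = 50$).

```python
import numpy as np

def normalize_mean_std(image, mean_norm=127.0, std_norm=50.0):
    """Technique 1: fix the intensity mean and standard deviation."""
    img = image.astype(np.float32)
    scale = std_norm / img.std()                      # S = sigma_norm / sigma_orig
    normalized = (img - img.mean()) * scale + mean_norm
    return np.clip(normalized, 0, 255).astype(np.uint8)

def normalize_min_max(image, min_norm=0.0, max_norm=255.0):
    """Technique 2: map [min_orig, max_orig] to [min_norm, max_norm]."""
    img = image.astype(np.float32)
    scale = (max_norm - min_norm) / (img.max() - img.min())
    normalized = (img - img.min()) * scale + min_norm
    return np.clip(normalized, 0, 255).astype(np.uint8)
```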
Variations in Usage. Having these two normalization techniques at hand, we continue by incorporating them into the eye-tracker. Normalization takes into account the distribution of gray levels in a given region. In our particular context, it can be applied in a pyramidal fashion to: 1) the region containing the eye, 2) the region containing the face, or 3) the whole image. Since the statistics of each region differ, normalization is expected to give different results depending on the region of application. In addition, normalizing different regions has an impact on a number of system modules, as explained next:
Eye-region normalization: Only the extracted eye regions are used for the normalization. By applying normalization to the eye-regions we guarantee that the gaze estimation component always receives images with similar intensity distributions.
Face-region normalization: Normalization parameters are derived from and applied to the face region. We make use of the facial feature points' positions for a fast region segmentation: the bounding box of these points is calculated and expanded by 80% horizontally and 100% vertically so that the whole face is contained (see the sketch after this list). By normalizing within the face bounding box, we aim at improving point tracking by removing the effects of intensity variations.
Whole-image normalization: By using the whole image, we adapt the normalization to the average light conditions. However, variations in the background can affect the final result. Potentially, changes in the frontal illumination provided by the display can affect stability of the facial feature points detection.
Combined normalization: Lastly, we apply the eye-region normalization on top of the face-region or whole-image normalization. By combining both methodologies, we expect to address both the tracking problems and the problems caused by unnormalized eye-region images.
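The face region referred to above is obtained by expanding the bounding box of the tracked feature points. A minimal sketch follows (NumPy; splitting the expansion equally on both sides of the box is our assumption):

```python
import numpy as np

def face_bounding_box(points, image_shape, expand_x=0.8, expand_y=1.0):
    """Bounding box of the facial feature points, expanded by 80% horizontally
    and 100% vertically so that the whole face is contained.

    The expansion is split equally between the two sides (an assumption)."""
    points = np.asarray(points, dtype=np.float32)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    w, h = x_max - x_min, y_max - y_min

    x_min = int(max(0, x_min - w * expand_x / 2))
    x_max = int(min(image_shape[1], x_max + w * expand_x / 2))
    y_min = int(max(0, y_min - h * expand_y / 2))
    y_max = int(min(image_shape[0], y_max + h * expand_y / 2))
    return x_min, y_min, x_max, y_max
```

Face-region normalization applies one of the techniques above to this cropped box; whole-image normalization uses the full frame instead, and the combined variant re-normalizes the extracted eye regions afterwards.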
Blink Detection
The blink detector is an unfinished component of Opengazer; we analyze it and make the necessary modifications to get it running. We believe that blinks have an effect on performance and that by skipping them during training we can remove the errors they introduce.
The blink detector is designed as a state machine with initial, blinking and double blinking states. The system switches between these, depending on the differences in eye images that are extracted as described in the previous section. These differences are calculated as the L2 norm between the eye images in consecutive frames. When the difference threshold for switching states is exceeded during several frames, the state is switched to the next state and a blink is detected.
We built on this structure and completed the rules for the state-switching mechanism. Moreover, we added a reset rule that returns the system to the initial state whenever the threshold criterion is not met at a given frame.
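A simplified sketch of this state machine is shown below (Python; the threshold and frame-count values are illustrative placeholders, not the values used in the application):

```python
import numpy as np

INITIAL, BLINKING, DOUBLE_BLINKING = range(3)

class BlinkDetector:
    """State machine driven by the L2 distance between consecutive eye images."""

    def __init__(self, diff_threshold=1000.0, min_frames=2):
        self.state = INITIAL
        self.frames_over_threshold = 0
        self.previous_eye = None
        self.diff_threshold = diff_threshold   # placeholder value
        self.min_frames = min_frames           # frames required to switch state

    def update(self, eye_image):
        """Returns True while a blink (or double blink) is being detected."""
        eye = eye_image.astype(np.float32)
        if self.previous_eye is None:
            self.previous_eye = eye
            return False

        difference = np.linalg.norm(eye - self.previous_eye)  # L2 norm
        self.previous_eye = eye

        if difference > self.diff_threshold:
            self.frames_over_threshold += 1
            if self.frames_over_threshold >= self.min_frames:
                # advance: initial -> blinking -> double blinking
                self.state = min(self.state + 1, DOUBLE_BLINKING)
                self.frames_over_threshold = 0
        else:
            # reset rule: fall back to the initial state
            self.state = INITIAL
            self.frames_over_threshold = 0

        return self.state != INITIAL
```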
Calibration
The original system uses all the images acquired during the calibration step. We propose a modification to the calibration procedure that uses our blink detector, so that images acquired during blinks are no longer included. This is crucial because these frames can alter the average eye images calculated during calibration and are therefore reflected as noise in the calibration procedure. However, as these frames are no longer available for calibration, we have to increase the time each target point is displayed on the screen in order to provide the system with enough samples.
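In code, the change amounts to filtering the per-target samples before averaging; a minimal sketch, assuming a per-frame blink flag such as the one produced by the detector above:

```python
import numpy as np

def average_eye_image(eye_images, blink_flags):
    """Average eye image for one calibration target, excluding blink frames.

    `eye_images` is a list of grayscale eye crops captured while the target
    was shown; `blink_flags` marks the frames detected as blinks."""
    kept = [img.astype(np.float32)
            for img, blinking in zip(eye_images, blink_flags) if not blinking]
    if not kept:          # if every frame was a blink, fall back to all frames
        kept = [img.astype(np.float32) for img in eye_images]
    return np.mean(kept, axis=0)
```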
Another improvement that we propose is the correction of calibration errors, as illustrated in Figure 3.
Here, the red triangles on the left side correspond to a target point displayed on the screen and the corresponding gaze estimations of our system, one for each camera frame. The larger symbol denotes the actual target, whereas the smaller ones are the estimates. The shorter line connects the average estimation and the target location; therefore, its length and direction give us the magnitude and direction of the average testing error. Apart from these symbols, the longer line that starts from the target denotes the direction of the calibration error. It should be noted that, in order to make the direction easier to observe, the magnitude of the calibration error is scaled up by a factor of 5. In this figure, we can see the correlation between the calibration error and the average testing error; based on this observation, we propose a correction method. The final effect of this technique can be seen on the right side, where the estimates are moved closer to the actual target point.
To calculate the calibration errors, we store the grayscale images which are used to calculate the average eye images during calibration. Therefore, we save several images corresponding to different frames for each target point. After calibration is finished, the gaze estimations for these images are calculated to obtain the average gaze estimation for each target. The difference between these and the actual target locations gives the calibration error.
After the calibration errors are calculated, we continue with correcting these errors during testing. We employ two multivariate interpolators (Wang, Moin, & Iaccarino, 2010; MIR, 2014) which receive the average gaze estimations for each target point as inputs and are trained to output the actual target x and y coordinates they belong to. The parameters that we chose for the interpolators are: approximation space dimension ndim = 2, Taylor order parameter N = 6, polynomial exactness parameter P = 1, and safety factor safety = 50. After the interpolator is trained, we use it during testing to remove the effects of calibration errors. We pass the currently calculated gaze estimate to the trained interpolators and use the x and y outputs as the corrected gaze point estimation.
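We do not reproduce the MIR interpolator here. Purely to illustrate the role of this correction step, the sketch below substitutes SciPy's radial basis function interpolator as a stand-in that performs the same mapping from average per-target estimates to true target coordinates:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def train_correction(average_estimates, target_points):
    """Fit a correction mapping each target's average calibration-time gaze
    estimate to the target's true screen coordinates.

    The actual system uses the MIR multivariate interpolator (ndim = 2, N = 6,
    P = 1, safety = 50); RBFInterpolator is only a stand-in for this sketch."""
    average_estimates = np.asarray(average_estimates, dtype=float)  # (n_targets, 2)
    target_points = np.asarray(target_points, dtype=float)          # (n_targets, 2)
    return RBFInterpolator(average_estimates, target_points)

def correct_estimate(corrector, gaze_estimate):
    """Apply the trained correction to a single (x, y) gaze estimate."""
    return corrector(np.asarray(gaze_estimate, dtype=float).reshape(1, 2))[0]
```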
Gaze Estimation
Originally, gaze estimates are calculated using the image of only one eye. We propose to use both of the extracted eye images to calculate two estimates. Then, we combine these estimations by averaging.
We also consider the case where the GP interpolator is completely substituted, in order to see whether other approaches can perform better in this particular setup. Neural network (NN) methods constitute a popular alternative for this purpose, and recent implementations of this technique exist (Holland & Komogortsev, 2012). In the aforementioned work, an eye-tracker using NNs to map the eye image to gaze point coordinates is implemented and made available (Komogortsev, 2014).
We incorporated the NN method into our system by making use of the Fast Artificial Neural Network (FANN) library (Nissen, 2003), creating a network structure and an input-output scheme similar to the original work. Our neural network had 2 layers: the first layer contained 128 nodes (one for each pixel of the 16 × 8 eye image) and the second layer contained 2 nodes (one each for the x and y coordinates). We scaled the pixel intensities to the interval [0, 1] because of the chosen sigmoid activation function.
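The sketch below reproduces only the input-output shape of this network in plain NumPy (a forward pass with sigmoid outputs); the actual training in our system is done through FANN, and the weights shown here are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prepare_input(eye_image):
    """Flatten a 16x8 grayscale eye image and scale intensities to [0, 1]."""
    return eye_image.astype(np.float32).reshape(-1) / 255.0   # 128 values

# 128 inputs -> 2 sigmoid outputs (normalized x and y gaze coordinates).
# Random weights stand in for the values learned during calibration.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(2, 128))
bias = np.zeros(2)

def estimate_gaze(eye_image):
    x = prepare_input(eye_image)
    return sigmoid(weights @ x + bias)   # (x, y) in [0, 1], later mapped to pixels
```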
Experimental Setup
In this section, we give the details of the experimental setup we created to test the performance of our application. Variations in the setup are introduced to create separate experiments which allow us to see how the system performs in different conditions.
Figure 4 shows how the components of the experimental setup are placed in the environment.
The stimuli display faces the subject and is raised by a support that enables the subject to face the center of the display directly. The camera is placed at the top center of this display (A), and it has an alternative location 19.5 cm to the left of the central position (B). An optional chinrest is placed at a distance of 80 cm from the display, acting as a stabilizing factor for one of the experiments.
By introducing variations in this placement, we achieve several setups for several experiments which test different aspects of the system. These setups are:
Standard setup: Only the optional chinrest is removed from the setup shown in Figure 4. The subject's face is 80 cm away from the display. The whole screen is used to display the 15 target points one by one.
Extreme camera placement setup: This setup is similar to the previous one. The only difference is that the camera is placed at its alternative location, shifted 19.5 cm towards the left. The purpose of this setup is to test how the position of the camera affects the results.
Chinrest setup: A setup similar to the first one. The only difference is that the chinrest is employed. This experiment is aimed at testing the effect of head pose stability on performance.
iPad setup: This setup is used to test the performance of our system by simulating the layout of an iPad on the stimuli display. The background image contains an iPad image whose screen corresponds to the active area of the experiment and is shown in a different color (see Figure 5(a)). The distance of the subject is decreased to 40 cm in order to simulate the use case of an actual iPad. The camera stays in the central position and is tilted down as necessary to center the subject's face in the camera image.
We also analyze the effect of different camera resolutions in these setups. This is done in an offline manner by resizing the original 1280 × 720 image to 640 × 480.
The error in degrees is calculated with the formula:

$$Err(x, x') = \left| \arctan\left(\frac{D_{xC}}{D_{EC}}\right) - \arctan\left(\frac{D_{x'C}}{D_{EC}}\right) \right|$$

where $x$ is the target, $x'$ is the estimate, $C$ is the display center and $E$ is the face center point. The variables $D_{xC}$, $D_{EC}$ and so on denote the distances between the specified points. They are converted from pixel values to cm using the dimensions and resolution of the display.
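Taking the arctangent form reconstructed above at face value, the conversion can be sketched as follows (the display width, resolution and viewing distance are illustrative values, not the experimental ones):

```python
import numpy as np

def angular_error_deg(target_px, estimate_px, display_center_px,
                      eye_to_display_cm=80.0, cm_per_px=51.0 / 1920.0):
    """Angular gaze error between a target and an estimate.

    Distances from the display center are converted from pixels to cm and the
    angles they subtend at the face center are compared. The display width
    (51 cm over 1920 px) and the viewing distance are example values only."""
    d_target = np.linalg.norm(np.subtract(target_px, display_center_px)) * cm_per_px
    d_estimate = np.linalg.norm(np.subtract(estimate_px, display_center_px)) * cm_per_px
    angle_target = np.degrees(np.arctan2(d_target, eye_to_display_cm))
    angle_estimate = np.degrees(np.arctan2(d_estimate, eye_to_display_cm))
    return abs(angle_target - angle_estimate)
```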
For the evaluation of normalization techniques, the videos recorded for the standard setup are processed again with one of the normalization techniques incorporated into the eye-tracker at a time.
Results
In this section, we present the results which show the effects of the proposed changes on the performance. To achieve this, we apply our changes to the original Opengazer code one by one and compare the results for all four experiments. We compare 6 different versions of the system, which correspond to successive phases of its development:
[ORIG] Original Opengazer application + automatic point selection
[2-EYE] Previous case + average estimate of 2 eyes
[TRACK] Previous case + tracking changes
[BLINK] Previous case + excluding blinks during calibration
[CORR] Previous case + training error correction
[NN] Previous case + neural network estimator
In all versions, the facial feature points are selected automatically by the method described in previous sections and gaze is not estimated during blinks. For each experiment, average horizontal and vertical errors for all subjects and all frames are given in degrees and the standard deviation is supplied in parentheses.
Table 1 shows the progressive results of our eye-tracker's performance for different versions of the system. Each result column denotes the horizontal or vertical errors for a different experimental setup. Moving from top to bottom in each column, the effects of our changes can be seen for a single error measure of an experimental setup. Along each row, the comparison of errors for different setups can be observed.
Table 2 shows the performance values of the system in the standard setup with the lower resolution camera. These results can be compared to the high resolution camera's results in Table 1 to see how the camera resolution affects the errors in the standard setup. The original application's results (ORIG) and our final version's results (CORR) are shown in boldface to enable fast comparison.
As for the normalization contributions, the results are grouped in Figure 6. Figure 6(a) and 6(d) show the results for the two techniques applied on the eye images. The black baseline shows the best results of the system without any normalization (1.37° horizontal, 1.48° vertical error). For the standard pixel intensity mean and variance normalization (NORM 1), the standard intensity mean parameter is fixed to 127 and several values are evaluated for the standard deviation parameter (the main parameter). This choice was made because, once the standard deviation parameter is selected, the mean parameter does not affect the gaze estimations unless it results in the trimming of pixel intensities (mapping to intensities outside the range [0, 255]). In the case of the standard minimum and maximum intensity normalization (NORM 2), the minimum intensity parameter is considered the main parameter and the maximum intensity is set to $max_{norm} = 255 - min_{norm}$.
In Figure 6(b) and 6(e), the second set of normalization results is shown. Here, the better performing NORM 1 technique is applied to the whole camera image (WHOLE) or to the face region (FACE), and the results are given.
Lastly, Figure 6(c) and 6(f) show the results when the eye-region normalization is combined with the whole-image or face-region normalizations. As the eye-region normalization is independent of the others, its parameters are fixed to the best performing values ($\mu_{norm} = 127$, $\sigma_{norm} = 50$), and the black baseline shows the best results achieved with eye-region normalization only (1.22° horizontal, 1.36° vertical error).
Discussion
Considering the 1.47° horizontal and 1.35° vertical errors of the final system in the standard experimental setup, we conclude that we have improved the original system by 18% horizontally and 8% vertically. As seen in Table 2, the performance difference in the same experiment done with VGA cameras (13% horizontally, 3% vertically) is comparably lower than in the first case, which shows that our contributions exhibit more robust performance with increased image quality. Seen from another angle, it means that better cameras will favor the methods we proposed in terms of robustness.
One interesting aspect of these results is that, with the increased camera resolution, the original application shows worse performance. We believe this is caused by the optical flow algorithm used in the tracking component. The increased detail in the images affects the tracking, and the position of the tracked point may vary more than in the lower resolution image. This, combined with the accumulated tracking error of the original application, results in a higher error rate. However, the final version of our eye-tracker (CORR) recovers most of this error.
From the extreme camera placement setup results in Table 1, we see that shifting the camera away from the top center of the display decreased the performance by 52% horizontally and 44% vertically. Here, the performance loss is mainly caused by the point tracking component: from such a camera angle, the farther eye-corner point may lie on the face boundary, making it hard to detect and track. When we compare the errors before and after the error correction is applied (BLINK and CORR), we see that this change introduced a large amount of error in this case. We can say that the unreliable tracking also hinders the error correction component, because the correction relies on the calibration being as good as possible. In order to tackle these problems, a 3D model based face tracking algorithm may be employed.
In the 3rd experimental setup, we show that the chinrest improves the performance by 27% horizontally compared to the standard setup. This setup proves to be more reliable for experimental purposes.
The results for the iPad setup may be misleading: the errors in pixels are actually lower; however, since the subject's distance enters the calculation of errors in degrees, the angular errors are higher. Each 1° error in the other setups corresponds to around twice as many pixels on the screen as a 1° error in the iPad setup. Using this rule of thumb, we can see that the iPad case results in a lower error rate in pixels than even the chinrest setup.
One of the major problems with the original system lies in the tracking component. As tracking is handled by means of optical flow (OF) calculations between subsequent frames, tracking errors accumulate and the facial feature points end up far away from their original locations. To tackle this problem, we proposed to compare the current camera frame with the initial frame saved during facial feature point selection. Comparing the 2-EYE and TRACK results in Table 1, we see that the tracking changes increased the system's accuracy in the iPad setup. However, in the first two setups we observe just the opposite. This is probably because, when using both eyes for gaze estimation, the tracking problems of the second eye have a larger effect on the averaged gaze estimation.
We observe that excluding the blink frames from the calibration process (application version labeled BLINK) does not have a perceivable effect on the performance. We argue that the averaging step in the calibration procedure already takes care of the outlier images. The neural network estimator does not provide an improvement over the Gaussian process estimator and performs similarly to its reported accuracy (4.42°). We believe this is due to the eye images extracted by our system: the feature point selection and tracking mechanism currently allows small shifts in point locations, so the extracted eye images vary among samples. The GP estimator resolves this issue during the image averaging step; however, the NN estimator may have problems when the images vary slightly in the testing phase. In order to resolve this problem, a detection algorithm with higher accuracy may be used to better estimate the eye locations.
Analyzing the eye image normalization results in Figure 6(a) and 6(d), we see that both approaches improve the results. For the first normalization technique, the parameter value giving the best results is $\sigma_{norm} = 50$, which decreases the errors by 11% horizontally and 8% vertically. We can say that the eye image normalization does just what the Gaussian process estimator needs and helps compare eye images from different time periods more accurately. The results for the second normalization method lag behind, especially in the horizontal direction.
Figure 6(b) and 6(e) show the results for intensity normalization at the large scale, either applied to the whole camera image or around the face region. Here, we see that large scale normalization applied on top of eye image normalization does not improve the results at all. Our expectation that face normalization would perform better than whole-image normalization is not verified, either. We observe that face normalization performs especially worse in the vertical direction.
As can be seen in Figure 6(c) and 6(f), combining the two normalization methodologies does not increase the system's accuracy, either. From these results, we can conclude that face-region or whole-image normalization causes the tracking component to perform worse, and thus decreases the performance.
Conclusion
Our contribution provides significant improvements in a number of modules for the state of the art of low-cost eye-trackers using natural light. Our automatic point selection technique enabled us to create an easy-to-use application, removing errors caused by wrong operation. The experiments showed that the final system is more reliable in a variety of scenarios. The blink detection component is mostly aimed at preparing the eye-tracker for real-world scenarios, where incorrect estimations during blinks should be separated from meaningful estimates. The proposed error correction algorithm helped the system better estimate gazes around the borders of the monitor. Lastly, the eye image normalization technique improved the performance of the system and made it more robust to lighting conditions that change slightly over time. The final code can be downloaded from the project page (Ferhat & Vilariño, 2014a). Apart from these experimental performance assessments, our work resulted in additional valuable outputs: