Article

Gaze Point Estimation via Joint Learning of Facial Features and Screen Projection

Control Engineering Program, College of Electronic Information Engineering, Changchun University, Changchun 130022, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12475; https://doi.org/10.3390/app152312475
Submission received: 10 October 2025 / Revised: 17 November 2025 / Accepted: 20 November 2025 / Published: 25 November 2025
(This article belongs to the Special Issue AI Technologies for eHealth and mHealth, 2nd Edition)

Abstract

In recent years, gaze estimation has attracted considerable interest in areas including human–computer interaction, virtual reality, and user engagement analysis. Despite significant advances in convolutional neural network (CNN) techniques, directly and effectively predicting the point of gaze (PoG) in unconstrained settings remains difficult. This study proposes a gaze point estimation network (L1fcs-Net) that combines facial features with positional features derived from a two-dimensional array obtained by projecting the face relative to the screen. Our approach incorporates a face-grid branch to enhance the network’s ability to extract features such as the relative position and distance of the face to the screen. Additionally, independent fully connected layers regress the x and y coordinates separately, enabling the model to better capture gaze movement characteristics in both horizontal and vertical directions. Furthermore, we employ a multi-loss approach, balancing classification and regression losses to reduce gaze point prediction errors and improve overall gaze performance. To evaluate our model, we conducted experiments on the MPIIFaceGaze dataset, which was collected under unconstrained settings. The proposed model achieves state-of-the-art performance on this dataset with a gaze point prediction error of 2.05 cm, demonstrating its superior capability in gaze estimation.

1. Introduction

Humans obtain 80–90% of external information through their eyes. Visual perception of information can be captured via eye tracking [1]. Point of Gaze (PoG) estimation, as one of the core tasks in eye tracking, aims to predict the user’s gaze location from facial or eye images. PoG estimation not only facilitates more natural and efficient interaction methods [2,3] but also plays a vital role across diverse fields, including healthcare [4,5], psychological research [6,7], virtual reality [8], and assistive technology [9,10]. Consequently, researchers have developed various techniques and methods to accurately estimate the PoG. These approaches fall into two categories: model-based methods and appearance-based methods. Model-based methods typically require specialized hardware, limiting their use in unconstrained environments. Appearance-based methods directly regress human gaze from images captured by inexpensive off-the-shelf cameras, enabling easy deployment across diverse locations with minimal setup constraints.
Recently, methods based on convolutional neural networks (CNNs) have become the most widely used gaze estimation techniques. Table 1 summarizes the usage and error metrics (the geometric distance between the actual and predicted gaze points) for common gaze point estimation models. According to the existing literature, there are two common approaches to gaze point estimation. The structure of the first is shown in Figure 1. Most related work focuses on developing novel CNN architectures to extract gaze features and output gaze direction [11,12,13]. For example, Abdelrahman et al. [14] proposed a fine-grained gaze estimation method that can directly predict pitch and yaw angles from images.
This method estimates the gaze point on a two-dimensional plane by calculating the intersection point between the extended line of the gaze direction and the gaze plane [15]. However, methods using gaze direction to estimate the fixation point accumulate errors from both the detection phase and the gaze line estimation phase. These errors stem not only from deviations in the gaze direction estimation algorithm but also from inaccuracies in calculating the starting point of the gaze direction (the center point of the outer canthi of both eyes). This starting point is primarily derived from the interocular distance between the outer canthi. Consequently, the detection accuracy of the outer canthus keypoints and the interocular distance set under reference conditions both introduce computational errors. Moreover, implementing gaze point estimation via gaze direction is relatively complex. When the acquisition device’s position or orientation relative to the gaze plane changes, recalibration of camera parameters is required to obtain the transformation matrix between the altered camera coordinate system and the world coordinate system. During facial yaw movements, detection errors of the outer canthi increase, leading to larger gaze point estimation errors.
The second method, structured as shown in Figure 2, enables gaze point estimation by training a deep neural network to learn the mapping relationship between input images and the position of the gaze point on a 2D plane [16]. Krafka et al. [17] proposed the iTracker network, which uses facial images, left/right eye images, facial grids, and binocular images as input signals for the gaze estimation algorithm to directly obtain a 2D gaze point. However, it lacks the precision required for fine-grained gaze point localization on the screen.
Kim et al. [18] further investigated classification-regression functions to enhance the precision of fixation points, but the issue of complex network structures remains an ongoing area of research.
Therefore, a review of existing gaze estimation literature reveals two key limitations: (1) End-to-end methods that directly output gaze points require multiple input streams, resulting in high computational complexity and insufficient accuracy; (2) Methods deriving gaze points from gaze direction necessitate manual setting of the distance between the face and screen, introducing errors, and require additional modeling to map gaze points to the screen, thereby increasing computational complexity.
Table 1. Gaze Point Estimation Model Methods and Experimental Error/cm.
| Model | Primary Methods | Parameters (M) | Inference Time (ms) | FPS | Absolute Error/cm |
|---|---|---|---|---|---|
| iTracker [17] | Feature fusion across multiple facial inputs | 6.78 | None | 10–15 | 7.25 |
| AFF-Net [19] | Adaptive fusion of left- and right-eye features based on similarity | None | None | None | 4.21 |
| CA-Net [20] | Adds the base direction and residual to obtain the final gaze direction | None | None | None | 4.50 |
| GazeNets [21] | Integrates ocular image features and head-pose information | None | 138 | None | 6.42 |
| FullFace [22] | Full-face appearance-based gaze estimation using deep convolutional networks | 196.6 | 50 | None | 5.54 |
| Ours | Fuses facial features with the face-to-screen projection position | 28.16 | 12 | 100 | 2.05 |
As shown in Table 1, existing gaze point estimation methods struggle to balance accuracy, speed, and parameter efficiency. Specifically: (1) high-parameter methods such as FullFace employ a deep convolutional network with 196.6 million parameters, yet this parameter redundancy yields only 5.54 cm accuracy at a 50 ms inference time; (2) multimodal fusion methods such as GazeNets integrate eye images and head-pose information, but a 138 ms inference latency renders them unsuitable for real-time interactive scenarios; (3) lightweight methods such as iTracker have only 6.78 M parameters and relatively fast speed (10–15 FPS), yet a 7.25 cm error falls far short of high-precision requirements; (4) even the most accurate methods, AFF-Net (4.21 cm) and CA-Net (4.50 cm), lack reported inference times and parameter counts, casting doubt on their engineering practicality.
To address these issues, this paper proposes a method that uses facial images containing gaze information and a facial grid representing the face’s relative position on the screen as input, simplifying the complex structure otherwise required to obtain the gaze point. The approach employs independent fully connected layers to regress the x and y coordinates separately, enabling more precise extraction of the relevant features in each direction. We propose a classification-based gaze-point localization scheme that divides the screen into 69 × 39 regions for coarse gaze-point category estimation, thereby reducing computational cost. Two independent loss functions (each containing classification and regression components) predict the gaze-point position, effectively enhancing model stability.

2. Related Work

According to the literature, gaze estimation algorithms can be categorized into traditional function-mapping methods and deep learning-based methods.

2.1. Gaze Estimation Methods for Traditional Function Mappings

Traditional gaze estimation methods employ regression functions to create specific mappings to human gaze, such as adaptive linear regression and Gaussian process regression [23,24,25]. These methods achieve reasonable accuracy in constrained settings (e.g., subject-specific calibration with fixed head pose and lighting); however, their performance degrades significantly when tested in unconstrained settings.

2.2. Deep Learning-Based Gaze Estimation Methods

Deep learning-based methods can simulate highly nonlinear mapping functions between images and gaze point locations. Zhang et al. first proposed a simple CNN-based architecture using monocular images to predict gaze, while subsequent studies demonstrated that combining features from both eyes improves the accuracy of gaze estimation. Fischer et al. employed two VGG-16 networks to extract individual features from both eye images, then concatenated these features for regression analysis. However, simply combining binocular features into a new feature vector yielded only marginal improvements in gaze estimation accuracy. Wang et al. [26] proposed an adversarial learning method to extract invariant features from eye images. This approach feeds features into an additional classifier and designs an adversarial loss function to handle variations in appearance across subjects. Kim et al. [27] employed GANs to convert low-light images to high-light images; Rangesh et al. [28] used GANs to remove eyeglasses. Beyond supervised feature extraction, unlabeled eye images can also be utilized for feature extraction. Yu et al. [29] employed unsupervised learning [30] using unlabeled eye images, inputting the differences between both eyes into the network. Subsequent research revealed that facial images contain head pose information, aiding gaze estimation. Several studies directly utilized full-face images as input, achieving significant performance improvements over methods relying solely on eye images.
Recently, Abdelrahman et al. proposed a fine-grained gaze estimation method that directly predicts pitch and yaw angles from images. It employs a multi-loss network combining an angle classification loss and a regression loss: the classification loss assigns gaze angles to discrete classes, while the regression loss predicts the angular deviation from the target. By weighting the classification and regression losses, the network learns to predict precise gaze angles. However, the coarse classification leads to quantization loss. Hu et al. [31] proposed the HG-Net coarse-fine hybrid classification framework, which not only enables finer angle classification but also improves prediction performance. Methods including [32,33,34] have advanced the field of gaze estimation, but they primarily focus on 3D gaze direction estimation; obtaining specific 2D gaze point coordinates from them requires substantial computational resources.
Krafka et al. proposed the iTracker network, which combines inputs from left and right eye images, facial images, and facial meshes containing positional information to directly obtain gaze point locations. However, it lacks the precision required for fine gaze point localization on screens. Building upon this, Kim S. et al. investigated classification-regression functions to enhance gaze point accuracy, though the issue of complex network structures remains an ongoing research topic.
Therefore, this paper proposes a gaze point estimation network (L1fcs-Net) that integrates the face with the projected gaze position on the screen, aiming to improve gaze point prediction accuracy while conserving computational resources and simplifying the network structure. This network utilizes both the facial image and a facial mesh containing relative position information as inputs. Independent fully connected layers regress each coordinate separately, enabling the model to better capture gaze movement features in both horizontal and vertical directions. A classifier module predicts the horizontal and vertical positions of the gaze point, respectively, while a loss function achieves a coarse estimate of the gaze point. Simultaneously, the predicted classification is refined through regression functions to achieve precise gaze point estimation. This classification-and-regression-weighted approach mitigates large-scale errors inherent in direct regression models, enhancing robustness.

3. L1fcs-Net Network

Figure 3 depicts the L1fcs-Net model architecture, a novel network for 2D gaze point estimation. The facial image branch of this network employs ResNet-50 [35,36] as its backbone for extracting gaze features. The face-grid branch utilizes two fully connected layers to extract the relative position of the face on the screen. By fusing the features from both branches and employing two independent fully connected layers to regress each gaze coordinate (x, y), the model directly obtains the x and y coordinates of the gaze point on the screen. A multi-loss combination function is employed for backpropagation and network weight adjustment, enhancing both the generalization capability and prediction accuracy of this network architecture.

3.1. Model

The proposed L1fcs-Net model takes face images (224 × 224) and face meshes (25 × 25) as inputs. The face image branch employs ResNet-50 to extract 2048-dimensional features containing gaze information, while the face mesh branch utilizes two fully connected layers to extract 625-dimensional relative position features. These features are concatenated into a 2673-dimensional representation, which is then processed by two independent fully connected branches to predict x and y coordinates.
To achieve multi-task learning, the gaze pixel coordinates are discretized into classification units: at 1920 × 1080 resolution, the horizontal direction is divided into 69 intervals and the vertical direction into 39 intervals. The model achieves collaborative learning by jointly optimizing classification and regression losses: logit values from the fully connected layer undergo softmax transformation to generate classification probability distributions, while smooth L1 loss handles the regression task. This multi-objective loss strategy integrates cross-entropy classification loss with smooth L1 regression loss, constructing an end-to-end framework that significantly enhances model robustness and training stability.
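For clarity, the following is a minimal PyTorch sketch of this dual-branch layout, fusing the 2048-dimensional face features with the 625-dimensional grid features and attaching independent per-axis classification and regression heads. The layer names, the exact head wiring, and the tanh-bounded offsets (anticipated from Section 3.2) are our illustrative assumptions, not the authors’ released code:

```python
import torch
import torch.nn as nn
from torchvision import models

class L1fcsNetSketch(nn.Module):
    """Minimal sketch of the dual-branch L1fcs-Net layout (not the authors' exact code)."""

    def __init__(self, grid_size=25, n_bins_x=69, n_bins_y=39):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Face branch: ResNet-50 trunk without its classifier -> 2048-d feature vector.
        self.face_branch = nn.Sequential(*list(backbone.children())[:-1])
        # Face-grid branch: two fully connected layers over the flattened 25 x 25 mask -> 625-d.
        self.grid_branch = nn.Sequential(
            nn.Linear(grid_size * grid_size, 625), nn.ReLU(),
            nn.Linear(625, 625), nn.ReLU(),
        )
        fused = 2048 + 625  # 2673-d fused representation
        # Independent per-axis heads: classification logits over bins + one regression offset each.
        self.cls_x = nn.Linear(fused, n_bins_x)
        self.cls_y = nn.Linear(fused, n_bins_y)
        self.reg_x = nn.Linear(fused, 1)
        self.reg_y = nn.Linear(fused, 1)

    def forward(self, face_img, face_grid):
        f = self.face_branch(face_img).flatten(1)      # (B, 2048)
        g = self.grid_branch(face_grid.flatten(1))     # (B, 625)
        z = torch.cat([f, g], dim=1)                   # (B, 2673) fused features
        # Offsets are bounded to [-0.9, 0.9] with tanh, as described in Section 3.2.
        off_x = 0.9 * torch.tanh(self.reg_x(z)).squeeze(1)
        off_y = 0.9 * torch.tanh(self.reg_y(z)).squeeze(1)
        return self.cls_x(z), off_x, self.cls_y(z), off_y
```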

3.2. Classification and Regression Fusion Output

To obtain the final gaze point coordinates, L1fcs-Net employs a coarse-to-fine fusion strategy. The classification branch, responsible for coarse localization, outputs logit values for the x and y dimensions, which are then converted into probability distributions via softmax. The grid index corresponding to the maximum probability is selected as the coarse gaze point location:
$\mathrm{bin}_x = \arg\max\big(\mathrm{softmax}(X_{cls})\big), \quad \mathrm{bin}_y = \arg\max\big(\mathrm{softmax}(Y_{cls})\big).$ (1)
where $\mathrm{bin}_x \in [0, 68]$ and $\mathrm{bin}_y \in [0, 38]$ denote the predicted x- and y-coordinate grid indices, respectively. This classification step reduces the search space from 2,073,600 pixels (1920 × 1080) to 2691 grid cells (69 × 39), lowering computational complexity by approximately 770 times.
The discrete grid indices are then converted to continuous coordinates within the normalized space [−1, 1]:
$\mathrm{center}_x = -1 + \dfrac{2(\mathrm{bin}_x + 0.5)}{69}, \quad \mathrm{center}_y = -1 + \dfrac{2(\mathrm{bin}_y + 0.5)}{39}.$ (2)
Adding 0.5 aligns coordinates to grid midpoints rather than boundaries, providing baseline estimates for subsequent optimization. The regression branch outputs continuous offset values constrained within the range [−0.9, 0.9]. The Tanh activation function ensures adjustments remain confined to a single grid cell.
Finally, the normalized coordinates are mapped to screen pixel coordinates. This fusion mechanism fully leverages the complementary strengths of classification and regression: the classification branch achieves stable global localization through discrete grid predictions, while the regression branch attains sub-pixel accuracy via continuous offset adjustments. This coarse-to-fine strategy effectively balances localization precision and robustness, enabling the model to maintain stable performance across diverse scenarios. Compared to pure classification (error of 2.19 cm) or pure regression (error of 2.23 cm), the proposed fusion strategy achieves an error of 2.05 cm (see Table 3), representing improvements of 6.39% and 8.07%, respectively, validating the effectiveness of the two-stage fusion approach.
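The decoding described above can be summarized in a short sketch (a simplified illustration; the exact scaling of the tanh-bounded offset to a fraction of a bin width is our assumption, chosen so that the refinement stays inside the predicted cell):

```python
import torch

def decode_gaze_point(logits_x, off_x, logits_y, off_y,
                      n_bins_x=69, n_bins_y=39, screen_w=1920, screen_h=1080):
    """Coarse-to-fine decoding: argmax bin -> bin centre in [-1, 1] -> offset -> screen pixels."""
    bin_x = torch.softmax(logits_x, dim=1).argmax(dim=1)      # coarse column index, 0..68 (Eq. 1)
    bin_y = torch.softmax(logits_y, dim=1).argmax(dim=1)      # coarse row index, 0..38
    center_x = -1 + 2 * (bin_x.float() + 0.5) / n_bins_x      # bin centres in [-1, 1]     (Eq. 2)
    center_y = -1 + 2 * (bin_y.float() + 0.5) / n_bins_y
    # Assumed offset scaling: the tanh-bounded offset ([-0.9, 0.9]) moves the point by at most
    # 0.9 of half a bin width, so the refinement stays inside the predicted cell.
    norm_x = center_x + off_x * (1.0 / n_bins_x)
    norm_y = center_y + off_y * (1.0 / n_bins_y)
    # Map normalized [-1, 1] coordinates to screen pixel coordinates.
    px = (norm_x + 1) / 2 * screen_w
    py = (norm_y + 1) / 2 * screen_h
    return px, py
```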

3.3. Loss Functions

To enhance model performance, this paper adopts a multi-task loss function strategy, imposing independent loss constraints on the horizontal and vertical coordinates, respectively, to decouple the mutual interference between the two dimensions. Based on reference [15], this paper applies cross-entropy loss for the classification task and adopts Smooth L1 Loss instead of mean squared error (MSE) for the regression task, effectively mitigating gradient oscillations caused by samples with large errors during training, thereby significantly improving the stability of model training. The overall loss function for each coordinate axis is defined as in Equation (3):
$L_x = L_{x\_cls} + \lambda L_{x\_reg}, \quad L_y = L_{y\_cls} + \lambda L_{y\_reg}, \quad L_{total} = L_x + L_y.$ (3)
Here, λ is a hyperparameter weighting the regression loss (set to λ = 1 in the experiments); $L_{x\_cls}$ and $L_{y\_cls}$ denote the classification losses for the x and y coordinates, and $L_{x\_reg}$ and $L_{y\_reg}$ the corresponding regression losses.
In addition, this work discretizes the continuous gaze coordinates into a classification task: the x coordinate is divided into $N_x = 69$ intervals and the y coordinate into $N_y = 39$ intervals. The classification loss is calculated using label-smoothed cross-entropy, as shown in Equation (4):
$L_{x\_cls} = -\sum_{j=1}^{N_x} q_j^x \log p_j^x, \quad L_{y\_cls} = -\sum_{j=1}^{N_y} q_j^y \log p_j^y.$ (4)
Here, $p_j^x$ denotes the softmax probability of the $j$-th interval in the x direction (and $p_j^y$ analogously for y), as given in Formula (5):
$p_j^x = \dfrac{\exp(z_j^x)}{\sum_{k=1}^{N_x} \exp(z_k^x)}.$ (5)
Here, $z_j^x$ denotes the $j$-th logit value output by the fully connected layer. The label-smoothing strategy converts hard labels into a soft label distribution, as shown in Equation (6):
$q_{ij}^x = y_{ij}^x (1 - \varepsilon) + \dfrac{\varepsilon}{B_x}.$ (6)
Here, $y_{ij}^x$ indicates whether the true x-coordinate of sample $i$ falls in interval $j$, $q_{ij}^x$ is the resulting soft label, and $\varepsilon$ is the smoothing factor (set to 0.1 in this paper). Label smoothing assigns small probability values ($\varepsilon / B_x$) to the non-true classes, effectively preventing the model from becoming overconfident in a single class and reducing the risk of overfitting. In particular, when gaze point data may contain annotation noise or an uneven distribution, label smoothing can significantly enhance the model’s robustness and generalization ability. In addition, the smoothed softmax output stabilizes the expected coordinates predicted by the classification branch, indirectly reducing the consistency loss with the regression branch and promoting the collaborative optimization of both branches.
The regression branch in this study uses the Smooth L1 loss function, which is used for fine localization within the coarse intervals determined by classification, as shown in Equation (7):
$L_{x\_reg} = \dfrac{1}{N}\sum_{i=1}^{N} \mathrm{SmoothL1}\!\left(x_i^{gt} - x_i^{reg}\right), \quad L_{y\_reg} = \dfrac{1}{N}\sum_{i=1}^{N} \mathrm{SmoothL1}\!\left(y_i^{gt} - y_i^{reg}\right),$
$\mathrm{SmoothL1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < \delta \\ |x| - 0.5\delta, & \text{otherwise.} \end{cases}$ (7)
Here, $x_i^{gt}$ and $y_i^{gt}$ are the ground-truth coordinates, $x_i^{reg}$ and $y_i^{reg}$ are the continuous coordinates predicted by the regression branch, the argument of SmoothL1 is the prediction error (e.g., $x_i^{gt} - x_i^{reg}$), and $\delta$ is the threshold separating the quadratic and linear regimes. The Smooth L1 loss combines sensitivity to small errors with robustness to large errors: it uses a quadratic function to smooth the gradient when the error is small and a linear function to avoid gradient explosion when the error is large, thereby improving training stability.
The final loss is obtained by weighted fusion of the classification and regression losses. During training, the classification branch first locates the gaze points in coarse intervals, providing global positioning information, while the regression branch subsequently performs sub-pixel fine adjustments within that interval. The two branches complement each other through shared feature representations and joint optimization: the classification loss guides the model to learn discriminative regional features, and the regression loss enhances local precision. This coarse-to-fine strategy effectively balances localization accuracy and robustness, enabling the model to maintain stable predictive performance in complex scenarios.
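A hedged PyTorch sketch of this per-axis multi-task loss follows; it uses the built-in label_smoothing option of cross_entropy and smooth_l1_loss as stand-ins for Equations (3)–(7), and assumes targets are provided as bin indices plus continuous coordinates (helper and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def gaze_loss(logits_x, logits_y, x_pred, y_pred,
              bin_x_gt, bin_y_gt, x_gt, y_gt, lam=1.0, eps=0.1):
    """Per-axis multi-task loss: label-smoothed cross-entropy + Smooth L1 (Eqs. 3-7)."""
    # Coarse classification losses over the 69/39 bins, with label smoothing (Eqs. 4-6).
    # bin_x_gt / bin_y_gt are LongTensors of ground-truth bin indices.
    l_x_cls = F.cross_entropy(logits_x, bin_x_gt, label_smoothing=eps)
    l_y_cls = F.cross_entropy(logits_y, bin_y_gt, label_smoothing=eps)
    # Fine regression losses on the continuous predicted coordinates (Eq. 7).
    l_x_reg = F.smooth_l1_loss(x_pred, x_gt)
    l_y_reg = F.smooth_l1_loss(y_pred, y_gt)
    # Eq. 3: L_total = (L_x_cls + lam * L_x_reg) + (L_y_cls + lam * L_y_reg), with lam = 1.
    return (l_x_cls + lam * l_x_reg) + (l_y_cls + lam * l_y_reg)
```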

4. Experiments

4.1. Datasets

With the growing demand for deep learning-based methods to enhance gaze estimation accuracy, large-scale datasets have emerged. These datasets span diverse environments, ranging from controlled laboratory settings to unconstrained outdoor scenarios. In this paper, the MPII-FaceGaze dataset [37] is employed to train the model, aiming to improve the versatility of the network architecture in terms of precision.

4.1.1. MPII-FaceGaze Dataset Preprocessing

This dataset is one of the most widely used datasets for appearance-based gaze estimation algorithms. Data was collected in uncontrolled environments, featuring images under varying backgrounds and lighting conditions, along with natural head movements during acquisition. Each of the 15 participants has an independent folder containing daily facial images, totaling 213,659 images. As shown in Figure 4, the data cover 100% of the screen-view range and 78.24% of the head-pose range. However, when eye tracking was interrupted due to blinking, prolonged eye closure, head pose exceeding the tracking range, temporary failure of facial feature recognition, or occlusion of the face or eyes by hands, hair, or objects, the annotation system recorded the default value (0, 0) as a placeholder. This occurred in 2564 samples, accounting for 12% of the dataset.
Therefore, all (0, 0) samples were removed during experimentation, and the dataset was reconstructed to achieve robust gaze estimation across diverse environments. Using five-fold cross-validation, the dataset was divided into five subsets. In each training iteration, one subset is held out as the test set, while the remaining four serve as the training set; the model is trained on the training set and evaluated on the held-out subset. This process is repeated five times. Averaging the results mitigates the inherent randomness of a single split, reduces overfitting risks, and enables a more reliable assessment of model performance.
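A small sketch of this filtering and five-fold split, assuming the samples are available as parallel arrays of image paths and pixel gaze labels (the helper name and the use of scikit-learn’s KFold are our assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_splits(paths, labels, seed=0):
    """Drop (0, 0) placeholder annotations, then yield five train/test splits."""
    paths, labels = np.asarray(paths), np.asarray(labels)   # labels: (N, 2) pixel gaze points
    valid = ~np.all(labels == 0, axis=1)                    # remove interrupted-tracking samples
    paths, labels = paths[valid], labels[valid]
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(paths):
        # Four folds for training, one held-out fold for evaluation.
        yield (paths[train_idx], labels[train_idx]), (paths[test_idx], labels[test_idx])
```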

4.1.2. Label Classification Preprocessing

The gaze point estimation method in this paper directly estimates the landing point of the user’s gaze on the screen. However, since human gaze perceives not just a single point but an area encompassing the target and its surroundings, the gaze point estimation task inherently involves some error. Observations indicate that at a typical viewing distance of approximately 50 cm from the screen, the human eye can simultaneously perceive 3 to 5 characters (spaced roughly 2 cm apart horizontally) within the same line of sight, and a typical software interface button occupies a square region with a side length of approximately 0.5 cm. Therefore, in this study’s two-dimensional gaze point estimation task, the target screen is divided into 5 mm intervals in both the horizontal and vertical directions; the 15.6-inch laptop screen used in the experiments (1920 × 1080 resolution) is thus divided into 69 horizontal and 39 vertical intervals. Since the dataset labels specify gaze point positions in pixel coordinates, the labels require preprocessing before use in 2D gaze point estimation: using the target screen size and resolution information provided in the dataset, the annotated gaze points were converted from pixel coordinates to physical (Euclidean) screen coordinates, and the horizontal and vertical categories were numbered starting from 1.
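Under the stated 1920 × 1080 resolution and 69 × 39 grid, the label conversion can be sketched as follows (the helper name is illustrative; the paper numbers categories from 1, which the sketch reproduces):

```python
import numpy as np

def pixel_to_bins(px, py, screen_w=1920, screen_h=1080, n_bins_x=69, n_bins_y=39):
    """Convert a gaze point in pixel coordinates to coarse (column, row) bin categories."""
    bin_x = np.clip(np.floor(px / screen_w * n_bins_x), 0, n_bins_x - 1).astype(int)
    bin_y = np.clip(np.floor(py / screen_h * n_bins_y), 0, n_bins_y - 1).astype(int)
    return bin_x + 1, bin_y + 1   # categories numbered starting from 1, as in the paper

# Example: a gaze point near the screen centre falls in bin (35, 20).
print(pixel_to_bins(960, 540))
```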

4.1.3. Face Grid Preprocessing

The facial mesh generation is illustrated in Figure 5. This mesh is a two-dimensional array representing the projection of the face relative to the screen, providing positional and distance features of the face with respect to the screen. The method utilizes the dlib facial landmark detection algorithm to detect the facial region within the input image and obtain its bounding box coordinates ($x$, $y$, $w$, $h$).
Let the original image dimensions be $W \times H$ and the target grid dimensions be $G_w \times G_h$ (set to 25 × 25 in the experiments). The scaling factors are computed by Formula (8):
$scale_x = G_w / W, \quad scale_y = G_h / H.$ (8)
The face bounding box $B = (x, y, w, h)$ is then translated to grid coordinates:
$x_{grid} = \lfloor x \cdot scale_x \rfloor, \quad y_{grid} = \lfloor y \cdot scale_y \rfloor, \quad w_{grid} = \lfloor w \cdot scale_x \rfloor, \quad h_{grid} = \lfloor h \cdot scale_y \rfloor.$ (9)
Simultaneously, the grid coordinates are clipped according to Formula (10) to ensure they remain within the grid’s valid range:
$x_{grid} = \max\!\big(0, \min(x_{grid}, G_w - 1)\big), \quad y_{grid} = \max\!\big(0, \min(y_{grid}, G_h - 1)\big).$ (10)
This constrains the initial coordinates $x_{grid}$ and $y_{grid}$ to the grid range: negative coordinates are set to 0, and coordinates exceeding the grid boundaries are clipped to $G_w - 1$ or $G_h - 1$ (the maximum index of the right/bottom grid boundary), guaranteeing $0 \le x_{grid} \le G_w - 1$ and $0 \le y_{grid} \le G_h - 1$.
The cropping-area dimensions are calculated according to Formula (11):
$w_{grid} = \max\!\big(1, \min(w_{grid}, G_w - x_{grid})\big), \quad h_{grid} = \max\!\big(1, \min(h_{grid}, G_h - y_{grid})\big).$ (11)
This keeps the area dimensions reasonable and within the grid boundaries: the area must contain at least one grid cell (non-empty), with its maximum size bounded by the distance from the starting position to the grid boundary, guaranteeing $1 \le w_{grid} \le G_w - x_{grid}$ and $1 \le h_{grid} \le G_h - y_{grid}$.
The ending coordinates of the region (exclusive upper bounds) are calculated using Formula (12):
$x_{hi} = \min(G_w, x_{grid} + w_{grid}), \quad y_{hi} = \min(G_h, y_{grid} + h_{grid}).$ (12)
These end coordinates cannot exceed the total grid width and height $G_w$ and $G_h$. A $G_h \times G_w$ two-dimensional matrix is created with all values initialized to 0, and all cells within the detected face bounding box are set to 1, producing a binary mask in which $G_{i,j} = 1.0$ if cell $(i, j)$ lies within the detected facial region and $G_{i,j} = 0.0$ otherwise. The binary grid is then flattened into a one-dimensional vector and used as input to the network.
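A sketch of this face-grid construction, following Equations (8)–(12) and assuming the bounding box comes from a detector such as dlib (function and variable names are illustrative):

```python
import numpy as np

def make_face_grid(x, y, w, h, img_w, img_h, gw=25, gh=25):
    """Project the face bounding box (x, y, w, h) onto a gw x gh binary grid (Eqs. 8-12)."""
    scale_x, scale_y = gw / img_w, gh / img_h                         # Eq. (8)
    xg, yg = int(np.floor(x * scale_x)), int(np.floor(y * scale_y))   # Eq. (9)
    wg, hg = int(np.floor(w * scale_x)), int(np.floor(h * scale_y))
    xg, yg = max(0, min(xg, gw - 1)), max(0, min(yg, gh - 1))         # Eq. (10): clip to grid
    wg = max(1, min(wg, gw - xg))                                     # Eq. (11): at least one cell
    hg = max(1, min(hg, gh - yg))
    x_hi, y_hi = min(gw, xg + wg), min(gh, yg + hg)                   # Eq. (12): exclusive ends
    grid = np.zeros((gh, gw), dtype=np.float32)
    grid[yg:y_hi, xg:x_hi] = 1.0                                      # face region -> 1, rest -> 0
    return grid.reshape(-1)                                           # flatten to a 625-d vector

# Example: a 300 x 300 face box at (600, 200) in a 1280 x 720 frame.
vec = make_face_grid(600, 200, 300, 300, 1280, 720)
```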

4.2. Training Details

This paper implements the method using PyTorch 2.1.0, employing an ImageNet-pretrained ResNet-50 as the backbone network. Training uses the Adam optimizer with a learning rate of 0.0001. The network is trained for 100 epochs with a batch size of 32. Performance is evaluated using the absolute Euclidean distance between the predicted and ground-truth gaze points on the screen, reported in centimeters.
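The stated configuration corresponds roughly to the following self-contained training skeleton (the tiny linear model and random tensors are placeholders so the snippet runs on its own; the real pipeline would plug in L1fcs-Net, the MPII-FaceGaze data loader, and the multi-task loss from Section 3.3):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stated configuration: Adam optimizer, learning rate 1e-4, 100 epochs, batch size 32.
model = nn.Linear(2673, 110)              # placeholder standing in for the full L1fcs-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 2673), torch.randn(64, 110)),
    batch_size=32, shuffle=True)

for epoch in range(100):
    for features, target in loader:
        optimizer.zero_grad()
        loss = F.smooth_l1_loss(model(features), target)   # stand-in for the multi-task loss
        loss.backward()
        optimizer.step()
```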

4.3. Results and Comparison

The experiment compares the loss-function configuration of Reference [15] with the improved model proposed in this paper. The results are shown in Figure 6.
Experimental results demonstrate that the combination of cross-entropy and Smooth L1 loss functions exhibits superior convergence characteristics during training. Compared to the cross-entropy + MSE combination, cross-entropy + Smooth L1 achieved a lower final error of 2.02 cm after 100 training epochs while displaying a smoother and more stable convergence trajectory. The cross-entropy + MSE combination exhibited significant numerical fluctuations during the early training phase, attributable to the quadratic penalty that the MSE loss applies to outliers, whereas the piecewise-linear behavior of the Smooth L1 loss effectively mitigates gradient explosion. This result indicates that the Smooth L1 loss selected in this paper reduces abrupt changes in model outputs, ensuring consistent predictions between adjacent frames, and thus enables stable, smooth convergence even for data featuring extreme head rotations and lighting conditions, enhancing the model’s robustness.
To validate the effectiveness of our coarse-to-fine fusion strategy, we first evaluate the classification branch’s ability to predict discrete grid bins. Figure 7 presents confusion matrices for both coordinate dimensions.
The confusion matrix indicates that the classification branches successfully achieve coarse localization with an accuracy of 0.765 for the horizontal x category and 0.832 for the vertical y category, with prediction results highly concentrated along the diagonal. This demonstrates that the model accurately identifies the correct grid cell in most cases. Off-diagonal activations primarily occur in adjacent cells, a phenomenon consistent with expectations given the continuity of gaze movements and the 5-mm discretization interval.
Based on coarse classification results, the regression branch performs fine coordinate optimization within predicted grid cells. Figure 8 displays scatter plots comparing final predicted coordinates with actual values after coarse-fine fusion.
The results show that the model’s predictions on both coordinate axes closely approach the ideal diagonal line x = y. The X-axis coordinate prediction achieved a correlation coefficient of 0.987, while the Y-axis coordinate prediction obtained a correlation coefficient of 0.967. Data points clustered tightly around the prediction line, indicating a strong linear correlation between predicted and actual values. The model maintained high accuracy across the entire coordinate prediction range, showing no significant systematic bias or prediction inaccuracies.
Additionally, we included several example prediction images in Figure 9. The images demonstrate that our improved model also delivers stable predictions for challenging examples (extreme lighting, dark environments), maintaining an error margin of approximately 2 cm.

4.4. Computational Efficiency Analysis

To evaluate the practical deployment feasibility of L1fcs-Net, we conducted a comprehensive computational cost analysis. The model architecture exhibits balanced complexity characteristics, making it suitable for real-time applications. The L1fcs-Net model contains a total of 28.16 million parameters, distributed across its components as shown in Table 2.
This study conducted comprehensive performance evaluations on Alibaba Cloud’s GPU computing instance (configured with 8 vCPUs, 30 GB memory, and an NVIDIA A10 GPU). Experimental results demonstrate that the proposed model achieves a single-frame inference time of 12 milliseconds, corresponding to a processing speed of 100 frames per second, fully meeting real-time processing requirements.
Furthermore, compared to other models in Table 1, our model achieves the best overall performance with a 12-millisecond inference time and 100 frames per second processing capability. This represents a 6.7–10× improvement over iTracker’s 10–15 fps, an 11.5× acceleration compared to GazeNets’ 138 ms inference speed, and a 4.17× increase over FullFace’s 50 ms processing speed. Moreover, the facial mesh branch introduced in Table 2 incurs only a minimal overhead of 0.08 GFLOPs (approximately 1.85% of the total computational load) while delivering significant accuracy gains, fully validating the efficiency of the dual-branch architecture.
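Parameter counts such as those in Table 2 can be checked with a generic PyTorch one-liner (a standard pattern, not a tool used in the paper):

```python
from torchvision import models

backbone = models.resnet50(weights=None)
n_params = sum(p.numel() for p in backbone.parameters())
print(f"ResNet-50 parameters: {n_params / 1e6:.1f} M")   # ~25.6 M, matching Table 2
```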

4.5. Ablation Studies

To validate the effectiveness of each key component in the L1fcs-Net model, this paper conducted a comprehensive ablation study on the MPII-FaceGaze dataset, as shown in Table 3. These studies systematically analyzed the impact of the facial mesh branch, independent coordinate regression, independent classification estimation, and their combinations on model performance.
Experiments revealed that model (e) exhibited the smallest error, while model (b) showed the largest error. Additionally, all models incorporating facial mesh branches demonstrated superior accuracy compared to others. This indicates that the facial mesh branch enhances the network’s prediction of the face’s relative position to the gaze plane, thereby increasing the accuracy of gaze point prediction. Simultaneously, the experimental results demonstrate that adding a coarse classification estimation component significantly improves the network’s prediction precision.

5. Conclusions

Most current gaze point estimation models overly focus on gaze direction prediction while neglecting gaze point localization accuracy. Furthermore, methods mapping gaze direction to gaze points introduce cumulative errors from facial detection inaccuracies, and rigidly defined screen-to-face distances further degrade precision. These mapping computations also exhibit high computational complexity. To overcome gaze point estimation challenges, this paper proposes a gaze point estimation model that eliminates the need for additional mappings. By introducing facial mesh branches, the model projects the face relative to the screen, thereby obtaining distance and position information between the face and screen. Performance comparisons using classification and regression loss functions from the original model validate the improved performance, demonstrating that the proposed model effectively addresses the issue of insufficient gaze point estimation accuracy.
Furthermore, ablation experiments demonstrate that incorporating coarse-grained classification components significantly enhances network prediction accuracy. Consequently, further research on optimizing classification methods to improve gaze point estimation precision holds significant importance.
Future research will extend beyond computer-monitor environments to mobile devices such as smartphones and tablets. To account for scenarios in which the gaze deviates from the screen, the training phase will incorporate gaze samples that fall outside the screen, a binary classifier will determine whether the gaze lies within the screen boundary, and a post-processing threshold based on prediction confidence will be applied. As most modern mobile devices feature RGB cameras suitable for eye tracking, gaze estimation algorithms will serve a broader user base and more diverse scenarios.

Author Contributions

Conceptualization: Y.Z. and F.X.; Methodology: F.X.; Software: F.X.; Validation: F.X., Y.Z. and Y.Y.; Formal Analysis: F.X. and Y.Y.; Investigation: F.X.; Resources: Y.Z.; Data Curation: F.X. and Y.Y.; Writing—Original Draft Preparation: F.X.; Writing—Review and Editing: Y.Z. and F.X.; Visualization: F.X.; Supervision: Y.Z.; Project Administration: Y.Z.; Funding Acquisition: Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Science and Technology Development Project: Key Technologies for Monocular Retinal Imaging Suitable for Individuals with Low Vision, Project Number YDZJ202401527ZYTS. Publication fees for this article were funded by the Jilin Provincial Science and Technology Development Project: Key Technologies for Monocular Retinal Imaging Suitable for Individuals with Low Vision, Project Number YDZJ202401527ZYTS.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy/ethical restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, J.; Zhang, G.; Shi, J. 2D gaze estimation based on pupil-glint vector using an artificial neural network. Appl. Sci. 2016, 6, 174. [Google Scholar] [CrossRef]
  2. Hui, H.; Junhao, H. Human-Computer Interaction Applications Based on Eye Tracking. J. Shandong Univ. (Eng. Ed.) 2021, 51, 1–8. [Google Scholar]
  3. Lei, Y.; He, S.; Khamis, M.; Ye, J. An end-to-end review of gaze estimation and its interactive applications on handheld mobile devices. ACM Comput. Surv. 2023, 56, 1–38. [Google Scholar] [CrossRef]
  4. Adler, M.; Ziglio, E. Gazing into the Oracle: The Delphi Method and Its Application to Social Policy and Public Health; Jessica Kingsley Publishers: London, UK, 1996. [Google Scholar]
  5. Selaskowski, B.; Asché, L.M.; Wiebe, A.; Kannen, K.; Aslan, B.; Gerding, T.M.; Sanchez, D.; Ettinger, U.; Kölle, M.; Lux, S.; et al. Gaze-based attention refocusing training in virtual reality for adult attention-deficit/hyperactivity disorder. BMC Psychiatry 2023, 23, 74. [Google Scholar] [CrossRef]
  6. Jia, S.J.; Jing, J.Q.; Yang, C.J. A review on autism spectrum disorder screening by artificial intelligence methods. J. Autism Dev. Disord. 2024, 55, 3011–3027. [Google Scholar] [CrossRef]
  7. Svoranu, A.M.; Epskamp, S. Which estimation method to choose in network psychometrics: Deriving guidelines for applied researchers. Psychol. Methods 2023, 28, 925. [Google Scholar] [CrossRef]
  8. Voinescu, A.; Petrini, K.; Fraser, D.S.; Lazarovicz, R.-A.; Papavă, I.; Fodor, L.A.; David, D. The effectiveness of a virtual reality attention task to predict depression and anxiety in comparison with current clinical measures. Virtual Real. 2023, 27, 119–140. [Google Scholar] [CrossRef]
  9. Eid, M.A.; Giakoumidis, N.; El Saddik, A. A novel eye-gaze-controlled wheelchair system for navigating unknown environments: Case study with a person with ALS. IEEE Access 2016, 4, 558–573. [Google Scholar] [CrossRef]
  10. Xu, J.; Huang, Z.; Liu, L.; Li, X.; Wei, K. Eye-Gaze-controlled wheelchair based on deep learning. Sensors 2023, 23, 6239. [Google Scholar] [CrossRef]
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  12. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  13. Fischer, T.; Chang, J.H.; Demiris, Y. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 339–357. [Google Scholar]
  14. Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Al-Hamadi, A. Fine-grained gaze estimation based on the combination of regression and classification losses. Appl. Intell. 2024, 54, 10982–10994. [Google Scholar] [CrossRef]
  15. Huang, Q.; Veeraraghavan, A.; Sabharwal, A. TabletGaze: Dataset and Analysis for Unconstrained Appearance-based Gaze Estimation in Mobile Tablets. Mach. Vis. Appl. 2017, 28, 445–461. [Google Scholar] [CrossRef]
  16. Bao, J.; Liu, B.; Yu, J. The Story in Your Eyes: An Individual-difference-aware Model for Cross-person Gaze Estimation. arXiv 2021, arXiv:2106.14183. [Google Scholar]
  17. Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.; Matusik, W.; Torralba, A. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2176–2184. [Google Scholar]
  18. Kim, S.; Lee, S.; Lee, E.C. Advancements in Gaze Coordinate Prediction Using Deep Learning: A Novel Ensemble Loss Approach. Appl. Sci. 2024, 14, 5334. [Google Scholar] [CrossRef]
  19. Bao, Y.; Cheng, Y.; Liu, Y.; Lu, F. Adaptive Feature Fusion Network for Gaze Tracking in Mobile Tablets. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Nashville, TN, USA, 2021; pp. 9936–9943. [Google Scholar]
  20. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A Coarse-to-fine Adaptive Network for Appearance-based Gaze Estimation. AAAI Conf. Artif. Intell. 2020, 34, 10623–10630. [Google Scholar] [CrossRef]
  21. Zhang, X.C.; Sugano, Y.; Fritz, M. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  22. Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7509–7528. [Google Scholar] [CrossRef]
  23. Lu, F.; Sugano, Y.; Okabe, T.; Sato, Y. Adaptive linear regression for appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2033–2046. [Google Scholar] [CrossRef]
  24. Williams, O.; Blake, A.; Cipolla, R. Sparse and Semi-supervised Visual Mapping with the S^3GP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; Volume 1, pp. 230–237. [Google Scholar]
  25. Tan, K.H.; Kriegman, D.J.; Ahuja, N. Appearance-based eye gaze estimation. In Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, 2002. (WACV 2002). Proceedings, Orlando, FL, USA, 4 December 2002; pp. 191–195. [Google Scholar]
  26. Wang, K.; Zhao, R.; Su, H.; Ji, Q. Generalizing eye tracking with Bayesian adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE Press: Piscataway, NJ, USA, 2019; pp. 11907–11916. [Google Scholar]
  27. Kim, J.; Jeong, J. Gaze estimation in the dark with generative adversarial networks. In Proceedings of the ACM Symposium on Eye Tracking Research & Applications, Stuttgart, Germany, 2–5 June 2020; ACM Press: New York, NY, USA, 2020; pp. 1–3. [Google Scholar]
  28. Rangesh, A.; Zhang, B.; Trivedi, M. Driver gaze estimation in the real world: Overcoming the eyeglass challenge. In Proceedings of the IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 19 October–13 November 2020; IEEE Press: Piscataway, NJ, USA, 2020; pp. 1054–1059. [Google Scholar]
  29. Yu, Y.; Odobez, J. Unsupervised representation learning for gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE Press: Piscataway, NJ, USA, 2020; pp. 7312–7322. [Google Scholar]
  30. Ruigang, Y.; Shuai, W.; Han, L. Review of Unsupervised Learning Methods in Deep Learning. Comput. Syst. Appl. 2016, 25, 1–7. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Yang, H.; Yang, Z.; Liu, J.; Chi, J. A new appearance-based gaze estimation via multi-modal fusion. In Proceedings of the 2023 3rd International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 24–26 February 2023; pp. 498–502. [Google Scholar]
  33. Bandi, C.; Thomas, U. Face-Based Gaze Estimation Using Residual Attention Pooling Network. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Lisbon, Portugal, 19–21 February 2023; pp. 541–549. [Google Scholar]
  34. Huang, L.; Li, Y.; Wang, X.; Wang, H.; Bouridane, A.; Chaddad, A. Gaze Estimation Approach Using Deep Differential Residual Network. Sensors 2022, 22, 5462. [Google Scholar] [CrossRef]
  35. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  36. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4511–4520. [Google Scholar]
  37. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 162–175. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Gaze Point Structure Diagram Obtained Based on Gaze Direction [14].
Figure 2. Directly Obtained Gaze Point Structure Diagram [17].
Figure 3. L1fcs-Net Network Architecture.
Figure 4. Heatmaps of Dataset Collections. (a) Heatmap of Head Pose Collections; (b) Heatmap of Gaze Point Collections.
Figure 5. Face Grid Generation Diagram. (Face image shown for demonstration. Experimental data from MPII-FaceGaze dataset [37]).
Figure 6. Experimental comparison between the model in Reference [15] and the model in this paper.
Figure 7. Confusion Matrix Heatmap. (a) X-dimension: 69 × 69 matrix showing horizontal bin predictions; (b) Y-dimension: 39 × 39 matrix showing vertical bin predictions.
Figure 8. Correlation Scatter Plots of Final Predicted Coordinates vs. Actual Values After Coarse-Fine Fusion. (a) x-axis, (b) y-axis.
Figure 9. Predicted gaze points of the visual model. Red dots indicate actual gaze locations, green triangles represent predicted points, and yellow lines connect their respective centers. Data from MPII-FaceGaze dataset [37].
Table 2. Parameter Distribution.
| Structure | Parameters (M) | FLOPs (G) |
|---|---|---|
| ResNet-50 | 25.6 | 4.1 |
| Face grid branch | 1.6 | 0.08 |
| Regression branch | 2.4 | 0.14 |
| Classification branch | 3.8 | 0.14 |
Table 3. Ablation Experiment Results.
| Model | Initial Absolute Error/cm | Final Absolute Error/cm |
|---|---|---|
| Reg+Class (a) | 4.19 | 2.17 |
| Reg (b) | 4.56 | 2.23 |
| Class (c) | 6.08 | 2.19 |
| Grid+Reg (d) | 4.01 | 2.10 |
| Grid+Class (e) | 4.75 | 2.07 |
(a) Reg+Class (Gaze Point Classification and Gaze Point Regression) denotes a model incorporating both coarse gaze point classification and fine-grained gaze point regression components; (b) Reg (Gaze Point Regression) denotes a model containing only the fine-grained gaze point regression component; (c) Class (Gaze Point Classification) denotes a model containing only the coarse gaze point classification component; (d) Grid+Reg (Face Grids and Gaze Point Regression) is a model incorporating both the face grid branch and the gaze point fine regression component. (e) Grid+Class (Face Grids and Gaze Point Classification) is a model incorporating both the face grid branch and the classification coarse estimation component.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
