1. Introduction
Gaze direction is a critical source of information for assessing individuals’ intentions and behaviors; it allows for the real-time inference of human intent, particularly regarding attention and engagement. For example, observing an individual’s gaze direction can provide valuable insights into their level of fatigue [1], support gaze prediction in contexts such as driving [2,3], or, in the medical domain, help to evaluate levels of intentionality in patients with neurological or other disorders [4]. Moreover, beyond the analysis of spontaneous human behaviors, gaze estimation also plays a significant role in controlled interaction settings, such as wheelchair navigation [5], activation of mechatronic systems [6], and medical robotics, notably in the control of robotic arms [7].
In some applications, gaze-related information is simultaneously essential for interpreting spontaneous intention and for enabling action control; for example, in immersive environments such as VR, the use of gaze-based interaction techniques offers a significant enhancement of user–environment engagement [8,9]. Despite the broad range of applications, accurately predicting gaze direction remains a fundamental and persistent challenge. One of the primary difficulties lies in the variability and complexity of the natural environment. Gaze estimation systems must operate reliably under diverse lighting conditions, across a wide range of head poses, and in the presence of partial occlusions, such as eyeglasses, eyelashes, or motion blur. These factors can significantly degrade the precision of gaze prediction models.
Broadly, existing gaze estimation methods can be divided into two main categories [10]: model-based approaches, which rely on geometric modeling of eye anatomy, and appearance-based approaches, which directly learn mappings from facial or eye images using machine learning or deep learning techniques. While model-based methods typically achieve high accuracy under controlled conditions, they require specialized equipment and are sensitive to lighting and head pose variations. Appearance-based methods, on the other hand, offer greater flexibility but often struggle to capture complex three-dimensional head–gaze relationships and remain vulnerable to occlusions.
To overcome these limitations, hybrid approaches have emerged, combining geometric modeling with appearance-based learning to enhance robustness and generalization. Recent work, such as [11], demonstrates that integrating geometric constraints into learning pipelines can significantly improve gaze estimation performance.
However, several significant challenges remain. The first is the limited diversity of available data. Many existing datasets, such as Columbia Gaze [12] and the UT Multiview Gaze Dataset [13], have limited variability in terms of lighting or extreme head poses, which reduces the models’ ability to generalize and adapt to diverse natural environments. Another major issue is the lack of precise 3D information in existing datasets. While some datasets include 3D annotations—such as the gaze target position or inferred gaze based on fixed target points—they often do not provide an exact 3D gaze direction vector within a metric reference frame. As a result, the actual gaze direction is not explicitly represented with the accuracy required for fine-grained 3D estimation. In this context, the authors of [14] proposed a 3D gaze estimation method based on a four-camera tracking system.
To address these limitations, combining eye images with head orientation vectors along with precise 3D annotations in a hybrid model appears to be a promising solution. Integrating these two sources of information could lead to a more reliable and robust estimation of gaze direction, even in uncontrolled conditions.
In this work, we propose an innovative approach for gaze direction estimation by leveraging 3D information. Our method combines a geometric model-based approach with an appearance-based approach, integrating a CNN and an MLP to extract visual features from the eyes and to incorporate head orientation data. Specifically, our model uses eye images together with a 3D head orientation vector to predict gaze direction within a metric reference frame. This approach ensures accurate gaze estimation without the need for invasive equipment to capture 3D head orientations or expensive techniques to detect the ocular region, thus simplifying and accelerating the estimation process.
We introduce the EyeLis dataset [15], which includes annotated images of eyes, as well as head orientation and gaze direction data in 3D. The annotations were performed using the methodology described in [16], which relies on non-invasive gaze estimation within a metric reference frame, with a tolerance of 2° for annotation errors. This dataset contains a total of 5206 annotated images and provides a valuable resource for the development and evaluation of 3D gaze estimation models under diverse and natural environments.
In summary, the main contributions of this paper are as follows: (a) a hybrid CNN-MLP architecture that combines visual and orientation data for 3D gaze estimation; (b) the demonstration of improved robustness and accuracy through 3D integration; and (c) the public release of the EyeLis dataset to the community. Experimental results on EyeLis, as well as on other public datasets such as MPIIFaceGaze [17], RT-GENE [18], and UnityEye [19], demonstrate that our approach outperforms existing methods, achieving average angular errors of 1.64° for EyeLis, 3.24° for MPIIFaceGaze, 1.73° for RT-GENE, and 2.39° for UnityEye. These results validate the effectiveness of our method and pave the way for new applications in fields such as AR, healthcare, and contactless human–computer interaction.
The rest of the paper is organized as follows: Section 2 presents a review of the related works in the field. Section 3 details the EyeLis dataset and its creation process. Section 4 presents the proposed Dual Focus-3D methodology, including architecture and training. In Section 5, we provide an analysis of the experimental results, including the performance of the proposed method on several benchmark datasets. Finally, Section 6 discusses the limitations and practical considerations, as well as future research directions, and Section 7 concludes the paper.
2. Related Work
Gaze estimation has attracted considerable research interest due to its wide range of applications in fields such as human–computer interaction, autonomous systems, and medical diagnostics. Early methods primarily relied on precise 3D geometric models of the eye to estimate gaze direction. However, with the advancement of machine learning, the focus has progressively shifted toward appearance-based techniques, particularly deep learning models, which have demonstrated significant improvements in robustness and adaptability to complex, unconstrained environments.
More recently, hybrid approaches combining geometric modeling with appearance-based learning have emerged, aiming to leverage the strengths of both strategies to enhance accuracy, generalization, and usability across diverse natural environments. In the following, we review the main categories of gaze estimation methods—model-based, appearance-based, and hybrid techniques—as well as the key datasets that have driven progress in this area.
2.1. Model-Based Methods
Model-based gaze estimation methods aim to predict gaze direction by relying on the geometric modeling of the eye. Early approaches typically used 2D anatomical features extracted from images, such as the center of the pupil or the boundary of the iris, to infer gaze direction. These methods assumed simplified geometric relationships between visible eye landmarks and gaze vectors, often estimating the point of regard based on pupil center displacement or corneal reflection patterns. For example, certain systems extracted these features and related them to screen coordinates or simple angular displacements [20]. While such 2D model-based methods offered good performance in controlled laboratory conditions, they faced notable limitations. They generally required specialized equipment such as high-resolution cameras or infrared illumination [21] to enhance feature visibility, and their accuracy degraded significantly in the presence of lighting variations, eyelid occlusions, or head pose changes.
To overcome these limitations, more recent model-based approaches introduced the construction of full 3D eye models. In these methods, a three-dimensional geometric representation of the eyeball is created, allowing for a more direct and interpretable mapping between eye features and gaze direction. Typically, these techniques detect key eye landmarks in 2D images, such as the pupil center and iris contour, and align them with a calibrated 3D model of the eye to reconstruct the 3D gaze vector in world coordinates [22,23]. The use of a 3D model provides greater robustness to variations in individual eye anatomy, as well as improved tolerance to small lighting changes and minor occlusions. Moreover, it allows the estimation of gaze not only on a 2D plane but within the full three-dimensional space, which is essential for many real-world applications such as robotics or AR.
However, 3D model-based methods are not without challenges. Their performance is highly dependent on the accuracy of the modeled eye parameters (e.g., eyeball radius, corneal curvature), which may vary significantly across individuals. Inaccurate modeling can introduce systematic errors. Additionally, they often require precise calibration procedures or controlled acquisition settings to ensure reliable landmark detection and model fitting. In environments with uncontrolled lighting, head motion, or partial occlusions, feature extraction can become unreliable, leading to degraded gaze estimation. Furthermore, methods relying on structured light patterns to detect corneal reflections can be perceived as intrusive or uncomfortable for users, limiting their usability in naturalistic settings.
In summary, model-based methods, evolving from simple 2D geometric assumptions to sophisticated 3D reconstructions, have demonstrated strong theoretical appeal and physical interpretability. Nevertheless, their practical application remains constrained by the complexity of accurately modeling and capturing individual eye characteristics under diverse natural environments.
2.2. Appearance-Based Methods
Appearance-based methods for gaze estimation aim to learn a direct mapping between visual inputs—typically facial or eye images—and gaze direction, without explicitly modeling anatomical features. Early approaches used classical machine learning techniques, such as support vector regression (SVR) and Gaussian process regression (GPR) [24], relying on handcrafted features extracted from eye regions.
The emergence of deep learning profoundly transformed this field. CNNs became the dominant paradigm, thanks to their ability to automatically extract hierarchical visual features. One of the first applications of CNNs to gaze estimation was proposed by [25], who adapted a simplified LeNet architecture [26] to process single-channel eye images and directly infer gaze direction. By introducing dynamic weighting mechanisms, their network could prioritize the most informative regions of the input, enhancing prediction robustness.
Later work developed more sophisticated CNN-based frameworks. For instance, the Pictorial Gaze model introduced in [27] relied on an intermediate representation, termed “Gazemap”, capturing the spatial relationship between the eyeball and iris. The system performed a two-stage regression: first estimating the Gazemap from the input image, then predicting the gaze vector from this abstract representation. To better handle extreme head poses and lighting variability, asymmetric regression strategies were also explored, assuming that gaze directions between the two eyes remain correlated despite appearance differences [28].
Other studies have explored multi-stream architectures to integrate richer information. For example, the MAFI-Gaze model [29] compared single-stream approaches—using either head or eye images separately—to multi-stream models combining both modalities. The results highlighted the benefit of incorporating multiple visual cues to enhance gaze estimation, particularly under varied conditions.
While CNNs excel with high-resolution inputs, their performance degrades significantly for low-resolution images due to lost texture details. To address this, the authors of [30] proposed FSKT-GE, a feature-similarity knowledge transfer framework that aligns intermediate representations between high- and low-resolution networks via cosine similarity. By avoiding explicit super-resolution and instead enforcing distributional consistency, their method achieved state-of-the-art results on downsampled images from Gaze360 and RT-GENE (6.73°–13.61° mean angular error), demonstrating the viability of knowledge transfer for low-resolution gaze estimation.
Most CNN-based methods tend to focus on localized features without explicitly modeling head pose variations. Furthermore, appearance-based systems are sensitive to occlusions, such as those caused by eyeglasses, eyelashes, or shadows. Techniques such as radial symmetry detection for the pupil center have been proposed to mitigate these issues [31], but robustness remains a significant concern, especially in unconstrained real-world settings.
Beyond static image-based models, researchers have explored temporal modeling to capture dynamic aspects of gaze behavior. RNNs, particularly LSTM networks [32,33], have been employed to model sequential eye movement patterns over time. More recently, Transformer architectures [34,35] have been introduced, offering the ability to simultaneously model spatial and temporal dependencies, thus improving performance under highly variable conditions.
The success of appearance-based methods is closely tied to the availability of large and diverse annotated datasets. Early datasets such as EyeDiap [36] combined controlled and spontaneous gaze behaviors with 3D head pose annotations, while GazeCapture [37] enabled large-scale data collection in unconstrained mobile environments. More recent datasets like ETH-XGaze [38] introduced significant variations in head pose and gaze direction, further promoting generalization across natural environments.
Building on these advances, hybrid models have emerged, combining different network types to leverage complementary strengths. For instance, Gaze-Swin [39] integrates CNN-based feature extraction with Swin Transformer modules to simultaneously capture local details and global context, leading to further improvements in gaze estimation accuracy and robustness.
3. Gaze Dataset Generation
One of the main challenges in gaze estimation is the scarcity of highly accurate annotated data, particularly for gaze direction. To address this, we utilized the EyeLis-DB database, which was developed using a novel, non-intrusive method to simultaneously capture gaze direction and head orientation via the Kinect Azure camera in real-world 3D environments. The database was constructed from real-time facial recordings under a fixed multi-target setup, with orientation vectors computed per frame across over 83,000 annotated frames collected at 30 FPS from 8 participants. The full details of the acquisition protocol, annotation procedure, dataset characteristics, and image resolution are described in [16].
The EyeLis-DB database enables users to select a predefined annotation error threshold when generating samples, where each image is annotated based on real-world data and validated against 3D gaze target positions. The annotations include the error between the actual and labeled gaze directions, offering a flexible process for generating datasets with controlled accuracy levels. This makes it possible to train and evaluate gaze estimation models under varying error conditions.
In our work, we selected a threshold of 2° and derived the EyeLis dataset accordingly. This curated subset ensures high annotation accuracy and was used in all experiments presented in this paper.
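To illustrate how such a subset can be derived in practice, the sketch below filters an exported annotation table by the chosen error threshold. The CSV file name and the annotation_error_deg column are hypothetical placeholders, not the released database schema.

```python
import pandas as pd

# Hypothetical export: one row per frame, with the angular gap (in degrees)
# between the measured and labeled gaze directions stored for each sample.
annotations = pd.read_csv("eyelis_db_annotations.csv")

# Keep only samples annotated within the chosen error threshold (2 degrees here).
eyelis_subset = annotations[annotations["annotation_error_deg"] <= 2.0]
eyelis_subset.to_csv("eyelis_2deg_subset.csv", index=False)
```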
4. Proposed Methodology
Our gaze estimation framework (Figure 1) uses a dual-branch architecture combining a CNN and an MLP for accurate 3D gaze prediction. As shown in Figure 1a, input RGB images are first processed using a Haar cascade-based method [40] with OpenCV’s pre-trained eye detector (version 4.11.0) [41], which has shown reliable performance in real-time applications [42].
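As an illustration of this preprocessing step, the snippet below applies OpenCV's bundled Haar cascade eye detector to a frame and returns cropped grayscale eye patches; the detection parameters shown are common defaults rather than values tuned for our pipeline.

```python
import cv2

# OpenCV ships a pre-trained Haar cascade for eye detection.
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def extract_eye_patches(frame_bgr):
    """Detect eye regions in a BGR frame and return cropped grayscale eye patches."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in eyes]
```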
Figure 1b shows the architecture of the proposed system, which includes two parallel branches: a residual CNN for visual feature extraction from grayscale eye images, which are normalized and resized to a fixed input size, and an MLP for processing 3D head orientation vectors. The extracted features are concatenated and refined through dense layers with dropout regularization. Figure 2 provides a detailed breakdown of this architecture; the CNN branch includes successive residual blocks with increasing channel sizes (64, 128, and 256), each followed by max-pooling layers. The MLP branch consists of fully connected layers with ReLU activations. The final fused features are processed through dense layers, effectively fusing spatial and positional information for robust gaze vector regression. This hybrid design achieves computational efficiency without sacrificing accuracy, as the residual connections enhance feature propagation while dropout prevents overfitting, making the system suitable for real-time applications in human–computer interaction.
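The overall wiring can be sketched with the Keras functional API as follows. The eye-patch input size, MLP widths, joint layer width, and dropout rate are illustrative placeholders (the exact values follow Figure 2), and the CNN branch is shown here as a plain convolutional stack; the residual blocks are detailed in the Section 4.1 sketch.

```python
from tensorflow.keras import layers, Model

# Illustrative input size; the actual resize target follows Figure 2.
eye_in = layers.Input(shape=(64, 64, 1), name="eye_patch")
head_in = layers.Input(shape=(3,), name="head_orientation")

# CNN branch (simplified stand-in; residual blocks are detailed in Section 4.1).
x = eye_in
for channels in (64, 128, 256):
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
eye_feat = layers.Dense(512, activation="relu")(layers.Flatten()(x))

# MLP branch for the normalized 3D head orientation vector.
h = layers.Dense(64, activation="relu")(head_in)
head_feat = layers.Dense(64, activation="relu")(h)

# Fusion, dropout regularization, and 3D gaze regression head.
fused = layers.Concatenate()([eye_feat, head_feat])
fused = layers.Dense(256, activation="relu")(fused)
fused = layers.Dropout(0.3)(fused)
gaze_out = layers.Dense(3, name="gaze_vector")(fused)

model = Model(inputs=[eye_in, head_in], outputs=gaze_out)
```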
4.1. Eye Feature Extraction
The CNN branch processes a grayscale eye image. The first step involves a convolution layer (Conv2D-64) that extracts edges and textures from the image. This operation is defined by the following equation:

$$ F_1 = W_1 * I + b_1 $$

where $F_1$ is the output of the first convolutional layer, $W_1$ is the convolution filter, $I$ is the input image (eye patch), $*$ denotes the convolution operation, and $b_1$ is the bias.
Next, a max-pooling operation is applied to reduce the spatial dimensions of the image by a factor of 2. Feature extraction then continues through three hierarchical residual blocks with 64, 128, and 256 channels, respectively. These blocks capture intermediate and high-level features. Each block is defined as

$$ y = \mathcal{F}(x) + x $$

where $x$ is the input to the residual block, $\mathcal{F}(x)$ represents a transformation function consisting of two convolutional layers followed by batch normalization, and $y$ is the output. These residual connections help to mitigate the vanishing gradient problem.
The extracted features are then flattened and projected into a 512-dimensional space as follows:

$$ f_e = W_e z + b_e $$

where $f_e$ is the eye feature vector, $W_e$ and $b_e$ are the weights and bias of the dense layer, and $z$ denotes the flattened output from the third residual block.
The goal of this step is to produce a compact feature vector from the eye image. This approach supports real-time processing by avoiding the complexity and computational overhead associated with multi-head attention mechanisms used in previous works, as discussed in Section 1.
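A more faithful sketch of this branch, including the residual blocks defined above, is given below. The kernel sizes, activation placement, and shortcut projection are assumptions where the text does not specify them, and the helper names are ours.

```python
from tensorflow.keras import layers

def residual_block(x, channels):
    """y = F(x) + x, where F is two convolutional layers followed by batch normalization."""
    shortcut = x
    if x.shape[-1] != channels:
        # 1x1 projection so the shortcut matches the block's channel count (assumed).
        shortcut = layers.Conv2D(channels, 1, padding="same")(shortcut)
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def build_eye_branch(eye_in):
    """Conv2D-64, max-pooling, then residual blocks of 64/128/256 channels, ending in f_e."""
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(eye_in)
    x = layers.MaxPooling2D()(x)
    for channels in (64, 128, 256):
        x = residual_block(x, channels)
        x = layers.MaxPooling2D()(x)
    z = layers.Flatten()(x)
    return layers.Dense(512)(z)  # f_e = W_e z + b_e, the 512-d eye feature vector
```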
4.2. Head Pose Feature Extraction
Head pose feature extraction starts with a 3D vector representing head orientation (yaw, pitch, roll). This vector is normalized and used as input to the model. It is then non-linearly projected through two dense layers, defined as

$$ f_h = \mathrm{ReLU}\big(W_{h_2}\,\mathrm{ReLU}(W_{h_1} h + b_{h_1}) + b_{h_2}\big) $$

where $f_h$ is the head feature vector, $h$ is the input head pose vector, $W_{h_1}$ and $W_{h_2}$ are the weight matrices, and $b_{h_1}$ and $b_{h_2}$ are the corresponding biases. The head pose input $h$ helps resolve gaze direction ambiguity.
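A corresponding sketch of the head-pose branch is shown below; the hidden width of 64 units is an assumption, since the text only specifies two dense layers with ReLU activations, and the function name is ours.

```python
from tensorflow.keras import layers

def build_head_branch(head_in, units=64):
    """Two dense layers with ReLU mapping the normalized (yaw, pitch, roll) vector to f_h."""
    h1 = layers.Dense(units, activation="relu")(head_in)   # ReLU(W_h1 h + b_h1)
    return layers.Dense(units, activation="relu")(h1)      # f_h = ReLU(W_h2 (.) + b_h2)
```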
4.3. Feature Fusion
The features extracted from the eyes and head are concatenated into a single vector:

$$ f = [\, f_e \, ; \, f_h \,] $$

This fused vector combines information from both modalities. A joint representation is then learned through a dense layer:

$$ z_f = \mathrm{ReLU}(W_f f + b_f) $$

where $W_f$ is the weight matrix and $b_f$ is the bias term. A dropout with probability $p$ is applied for regularization. Alignment is implicitly performed using a dot product:

$$ g = f_e^{\top} M f_h $$

where $g$ is the gaze–head alignment score, and $M$ is a learned metric space matrix.
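The fusion step and the implicit alignment score can be sketched as follows. The joint layer width, the dropout probability, and the realization of $g = f_e^{\top} M f_h$ as a trainable bias-free projection of $f_h$ followed by a dot product with $f_e$ are assumptions consistent with the description above; the function name is ours.

```python
from tensorflow.keras import layers

def fuse_features(eye_feat, head_feat, joint_units=256, dropout_p=0.3):
    """Concatenate f_e and f_h, learn a joint representation, and compute g = f_e^T M f_h."""
    f = layers.Concatenate()([eye_feat, head_feat])          # f = [f_e ; f_h]
    z_f = layers.Dense(joint_units, activation="relu")(f)    # z_f = ReLU(W_f f + b_f)
    z_f = layers.Dropout(dropout_p)(z_f)

    m_fh = layers.Dense(eye_feat.shape[-1], use_bias=False)(head_feat)  # M f_h
    g = layers.Dot(axes=1)([eye_feat, m_fh])                            # f_e . (M f_h)
    return z_f, g
```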
4.4. Gaze Regression Head
The model returns a 3D vector representing the gaze direction in space, computed as follows:

$$ \hat{g} = W_g z_f + b_g $$

where $\hat{g}$ is the predicted gaze vector, $z_f$ is the fused representation, and $W_g$, $b_g$ are the weights and bias of the final dense layer.
The loss function used to train the model minimizes the angular error between the true gaze vector $g$ and the predicted one $\hat{g}$:

$$ \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \arccos\!\left( \frac{g_i \cdot \hat{g}_i}{\lVert g_i \rVert \, \lVert \hat{g}_i \rVert} \right) $$

This loss function allows for precise prediction of gaze direction in 3D space, which is crucial for 3D gaze estimation tasks where only the vector’s orientation is important.
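A TensorFlow sketch of such an angular loss is given below; the clipping constant and the mean reduction are implementation choices on our side rather than details stated above.

```python
import tensorflow as tf

def angular_loss(y_true, y_pred):
    """Mean angle (in radians) between the true and predicted gaze vectors."""
    y_true = tf.math.l2_normalize(y_true, axis=-1)
    y_pred = tf.math.l2_normalize(y_pred, axis=-1)
    cos_sim = tf.reduce_sum(y_true * y_pred, axis=-1)
    # Clip for numerical stability before taking the arccosine.
    cos_sim = tf.clip_by_value(cos_sim, -1.0 + 1e-7, 1.0 - 1e-7)
    return tf.reduce_mean(tf.acos(cos_sim))
```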
5. Experiments
In this section, we present the experimental setup and evaluation of the proposed Dual Focus-3D model. We describe the datasets used, detail the training protocols, and report quantitative and qualitative results under various evaluation scenarios. These include both intra-dataset and cross-dataset experiments, designed to assess the model’s accuracy, robustness, and generalization across diverse conditions.
5.1. Datasets
As shown in Table 1, the datasets used differ in terms of acquisition hardware, lighting conditions, and participant count.
MPIIFaceGaze comprises 213,659 images collected from 15 individuals using a laptop webcam in everyday settings over an extended period. The data covers a wide range of illumination scenarios and does not impose constraints on head orientation. Among the full dataset, 37,667 images are annotated with accurate gaze vectors, pupil locations, and facial landmarks, providing a solid base for supervised learning tasks in gaze estimation.
RT-GENE includes multimodal recordings acquired with a Kinect v2 sensor combined with a motion capture system. It contains data from 15 participants recorded over 17 sessions. The dataset offers synchronized RGB, depth, and infrared streams along with head pose and gaze direction annotations. Data was collected in naturally lit indoor environments, enabling diverse head orientations and natural conditions.
UnityEye is a synthetic dataset generated using a 3D graphics engine, simulating eye images with precise control over pose, lighting, and gaze direction. It supports the creation of unlimited samples with detailed ground truth. For our experiments, we generated a custom subset of 66,316 synthetic images, facilitating model pre-training and evaluation under fully controlled conditions.
The EyeLis-DB was recorded by the Azure Kinect v3 sensor in a lab environment. It includes recordings from 8 participants, each one captured in 5 predefined head positions, resulting in 40 total sessions. Sessions were conducted under both natural daylight and uniform artificial lighting, introducing a mix of conditions to evaluate model robustness. The dataset is designed to provide consistent pose variation while maintaining real-world variability in lighting.
5.2. Experimental Results Analysis
The model was developed using the TensorFlow framework version 2.19.0, utilizing the Adam optimizer with an initial learning rate of 0.0005 and a batch size of 32. The training process was set for 100 epochs. Early stopping was applied to optimize training, monitoring the validation loss and halting when no improvement was observed over 10 consecutive epochs. Additionally, the learning rate was reduced to 0.0001 after 5 epochs without improvement.
To prevent overfitting, a dropout rate was applied, especially in the fully connected layers. A custom angular loss function was employed to minimize the angular error between the predicted and actual gaze vectors.
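This training configuration can be reproduced with standard Keras callbacks, as sketched below. The model and angular_loss are assumed to come from the earlier sketches, the training arrays (eye_train, head_train, gaze_train and their validation counterparts) are placeholders, and restoring the best weights as well as the reduction factor of 0.2 (0.0005 to 0.0001) are our inferences.

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4), loss=angular_loss)

callbacks = [
    # Stop when the validation loss has not improved for 10 consecutive epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Reduce the learning rate to 1e-4 after 5 epochs without improvement.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=5,
                                         factor=0.2, min_lr=1e-4),
]

model.fit([eye_train, head_train], gaze_train,
          validation_data=([eye_val, head_val], gaze_val),
          epochs=100, batch_size=32, callbacks=callbacks)
```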
Our model, trained on the EyeLis dataset with a 70%/30% train/test split, achieves a MAE of 1.64°.
In Table 2, we evaluate the model on the MPIIFaceGaze dataset, where it achieves a MAE of 3.24°, showcasing the model’s effectiveness in controlled environments with artificial lighting. In Table 3, the RT-GENE dataset yields a MAE of 1.73°, demonstrating that our model can handle natural lighting conditions with high precision. These results confirm that our architecture generalizes well across different datasets and show that the integration of 3D features significantly enhances the training process, improving accuracy and robustness. The architecture also includes a regression component that can detect head pose variations, even when trained on different datasets. Table 4 presents results on the UnityEye synthetic dataset, where the model achieves a MAE of 2.39°. This helps confirm that our model is resistant to overfitting and performs consistently well even on synthetic data, further validating its generalizability.
These tables summarize the performance of the Dual Focus-3D model on the MPIIFaceGaze, RT-GENE, and UnityEye datasets, demonstrating its ability to achieve accurate gaze estimation across various conditions.
Each table includes baseline methods that were specifically developed and evaluated on the corresponding dataset, to ensure a fair and meaningful comparison.
As observed, the Dual Focus-3D approach outperforms other state-of-the-art methods by providing more accurate gaze estimations while maintaining a compact and efficient architecture. This makes our model a reliable and precise solution for gaze tracking applications.
Figure 3 shows a 2D heatmap of the MAE relative to the ground-truth gaze direction, represented in spherical coordinates (pitch and yaw), on the MPIIFaceGaze dataset. The heatmap reveals low prediction errors across the gaze space, indicating that the model achieves high accuracy, even for extreme gaze directions.
The results confirm that, across the EyeLis, UnityEye, and RT-GENE datasets, as shown in Figure A1, the model accurately predicts gaze direction even when the gaze direction is extreme.
Figure 4 shows the distribution of the MAE as a function of gaze direction using the MPIIFaceGaze dataset.
Analyzing the results in Figure A2 for the EyeLis, RT-GENE, and UnityEye datasets shows that the model accurately predicts gaze direction, even when the head orientation is extreme.
Figure 5 and Figure A3 show the distribution of angular errors relative to the angle between the head orientation vector and the ground-truth gaze direction for the MPIIFaceGaze, EyeLis, RT-GENE, and UnityEye datasets. The horizontal axis represents the angle between the head and gaze, divided into four quantile-based bins, while the vertical axis shows the corresponding angular error in degrees.
We observe that the predictions of our Dual Focus-3D model remain consistent across different head–gaze angles. The values for mean and median angular errors are nearly identical across the four bins, showing that the model is well-trained and effectively predicts the true gaze direction based on the image and head orientation.
These results confirm that the Dual Focus-3D model performs accurately even in extreme head–gaze conditions, showing its ability to effectively link image features with 3D head vectors. This emphasizes the value of integrating 3D information for reliable gaze estimation.
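The quantile-based binning behind this analysis can be reproduced in a few lines of pandas, as sketched below; the function and column names are ours.

```python
import pandas as pd

def error_by_head_gaze_angle(head_gaze_angle_deg, angular_error_deg, n_bins=4):
    """Group angular errors into quantile bins of the head-gaze angle and
    report the mean and median error per bin."""
    df = pd.DataFrame({
        "head_gaze_bin": pd.qcut(head_gaze_angle_deg, q=n_bins),
        "error_deg": angular_error_deg,
    })
    return df.groupby("head_gaze_bin", observed=True)["error_deg"].agg(["mean", "median"])
```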
5.3. Cross-Dataset Generalization
We observe variations in extreme gaze angles, head orientations, and the difference between gaze direction and head orientation angles across the datasets.
For the EyeLis dataset, the gaze direction pitch and yaw ranges follow the target positions defined in the dataset (see Section 3), with head orientation varying over corresponding pitch and yaw intervals. In the RT-GENE dataset, the gaze direction covers its own pitch and yaw ranges, and two head orientation zones are defined, each around a specific pitch value with its own yaw interval. In the UnityEye dataset, the gaze direction covers a wider range of pitch and yaw angles, and two synthetic head orientation zones were created at the same pitch value with different yaw intervals.
Considering the data coverage for head and gaze angle distributions across all datasets, MPIIFaceGaze is the most suitable for training the model to evaluate its generalization ability. This choice is further motivated by the closer alignment of MPIIFaceGaze’s annotation and acquisition methods with those of other datasets, facilitating superior cross-dataset generalization. Consequently, we used the MPIIFaceGaze dataset for training and evaluated the model on the complete EyeLis, RT-GENE, and UnityEye datasets. As an additional experiment, we also trained the model on RT-GENE and tested it on the full MPIIFaceGaze dataset to further assess its robustness.
The results in Table 5 demonstrate that the Dual Focus-3D model generalizes well, even though the annotation methods and tools differ across datasets. These results confirm that our model remains reliable even under real-world variations in eye appearance and head pose annotations. It is important to highlight that the annotation method employed in EyeLis follows a rigorous procedure combining facial landmark extraction and depth estimation to accurately reconstruct gaze direction within a real-world metric coordinate system, independently of supervised learning. Additionally, an angular error threshold filtering—set, for example, at 2°—ensures that only gaze vectors with guaranteed precision, corresponding to exact fixations, are retained. This careful selection process enables the creation of coherent and usable subsets from a larger dataset, providing EyeLis with significant flexibility to generate datasets tailored to specific precision requirements.
Figure 6 presents qualitative results obtained by applying the trained Dual Focus-3D model to the EyeLis-DB database. Each example includes three vectors: the blue vector is the ground-truth gaze direction, the green vector is the ground-truth head orientation, and the black vector is the predicted gaze direction. All vectors are projected from 3D to 2D using the camera calibration parameters. The results on the test set indicate that the predicted gaze direction closely matches the ground truth, demonstrating the model’s accuracy in estimating gaze based on head orientation and image features.
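For reference, projecting a 3D gaze or head vector onto the image plane with the camera calibration can be done with OpenCV as sketched below; the ray length, the assumption that vectors are already expressed in the camera frame, and the function name are ours.

```python
import cv2
import numpy as np

def project_ray(origin_3d, direction_3d, camera_matrix, dist_coeffs, length_m=0.1):
    """Project a 3D ray (origin plus a scaled direction vector) onto the image plane."""
    origin = np.asarray(origin_3d, dtype=np.float32)
    end = origin + length_m * np.asarray(direction_3d, dtype=np.float32)
    pts = np.stack([origin, end]).reshape(-1, 1, 3)
    rvec = np.zeros(3, dtype=np.float32)  # no extra rotation: points are in the camera frame
    tvec = np.zeros(3, dtype=np.float32)
    img_pts, _ = cv2.projectPoints(pts, rvec, tvec, camera_matrix, dist_coeffs)
    start_px, end_px = img_pts.reshape(-1, 2)
    return tuple(map(int, start_px)), tuple(map(int, end_px))  # endpoints for cv2.arrowedLine
```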
6. Discussion and Limitations
Dual Focus-3D combines eye image features with 3D head orientation vectors to improve accuracy in 3D gaze estimation. The experimental results on both real (EyeLis, MPIIFaceGaze, RT-GENE) and synthetic (UnityEye) datasets show that including head orientation significantly reduces gaze prediction errors, especially in challenging conditions. The model maintains good performance even with large variations in head pose and gaze direction, confirming the importance of head orientation in this task.
However, the method has limitations. It requires both the eye image and the 3D head orientation vector as inputs. Although our results show that head orientation is the most informative input, obtaining it currently depends on external hardware for pose capture, which limits the method’s practical deployment.
Another limitation is related to the eye detection method used during preprocessing. We currently use a Haar cascade detector, which is fast but not always reliable under occlusion, blur, or lighting variation. The choice of detector should be adapted to the specific use case, especially when real-time performance is required.
The dual-branch architecture increases computational load, but this remains acceptable with modern hardware. With a suitable system setup, the pipeline can run in real time. For deployment on embedded or low-power devices, further optimization will be necessary.
In future work, we aim to develop an end-to-end model capable of accurately estimating 3D head orientation directly from facial images [50], thereby eliminating the need for external hardware and enhancing system portability. This model will produce precisely annotated 3D head orientation vectors based on well-calibrated facial data.
7. Conclusions
In this work, we introduce Dual Focus-3D, a hybrid deep learning framework for accurate 3D gaze estimation in natural environments. The proposed model combines appearance-based eye image features with 3D head orientation vectors using a dual-branch architecture. A residual CNN extracts spatial features from grayscale eye images, while a parallel MLP processes head pose information. These features are fused through a dense integration module optimized for gaze vector regression using an angular loss function. We also introduced EyeLis, a novel dataset containing annotated eye images and corresponding 3D gaze and head pose data. Experimental results on EyeLis and three public benchmarks—MPIIFaceGaze, RT-GENE, and UnityEye—demonstrate that Dual Focus-3D achieves state-of-the-art performance across various conditions. Through cross-dataset analysis, and considering that each dataset presents specific acquisition conditions (e.g., lighting, head orientation, image quality), we demonstrate that our model is robust under diverse real-world scenarios. Furthermore, using EyeLis, we show that the model is trained on annotations that deviate from the true gaze direction by at most 2°, which further supports its precision and reliability. We observe that combining visual features with head pose data yields improved accuracy and stability across different datasets, positioning this model as a strong candidate for real-time applications in HCI, AR/VR, and assistive technologies. Future work will explore extending the model to dynamic scenarios and incorporating additional temporal information.