Article

Gaze Estimation Based on a Multi-Stream Adaptive Feature Fusion Network

1 School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 State Key Laboratory of Precision Blasting, Jianghan University, Wuhan 430056, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3684; https://doi.org/10.3390/app15073684
Submission received: 3 March 2025 / Revised: 18 March 2025 / Accepted: 26 March 2025 / Published: 27 March 2025

Abstract

Recently, with the widespread application of deep learning networks, appearance-based gaze estimation has made breakthrough progress. However, most methods focus on feature extraction from the facial region while neglecting the critical role of the eye region in gaze estimation, leading to insufficient representation of eye detail. To address this issue, this paper proposes an appearance-based multi-stream, multi-input network architecture (MSMI-Net). The model consists of two independent streams: one extracts high-dimensional eye features, and the other extracts low-dimensional features that integrate both eye and facial information. A parallel channel and spatial attention mechanism is employed to fuse low-dimensional eye and facial features, while an adaptive weight adjustment mechanism (AWAM) dynamically determines the contribution ratio of eye and facial features. The concatenated high-dimensional and fused low-dimensional features are processed through fully connected layers to predict the final gaze direction. Extensive experiments on the EYEDIAP, MPIIFaceGaze, and Gaze360 datasets validate the superiority of the proposed method.

1. Introduction

Gaze is the act of looking at something, and as a non-verbal interaction cue, gaze direction conveys meaningful attention information. Gaze estimation aims to infer the direction of the line of sight. Gaze estimation technology is widely applied in fields such as human-computer interaction [1,2], augmented reality and virtual reality [3,4], medical diagnosis [5,6], and assisted driving [7]. The importance of gaze estimation across these fields has therefore made it a widely studied topic in computer vision in recent years.
Gaze estimation methods can be broadly divided into model-based methods [8,9,10] and appearance-based methods [11,12,13,14]. Model-based methods, as traditional approaches, primarily involve geometric modeling of the eye, utilizing the geometric relationships between different facial and ocular features (facial landmarks, cornea, pupil, etc.) to calculate the three-dimensional gaze direction. They rely on external light sources to detect ocular features, or observe the pupil center and iris edge, to infer the direction of gaze. Model-based methods can achieve high precision and tolerate head movements, and are therefore used in commercial eye trackers. However, they require personal calibration to achieve good accuracy, making eye-tracking systems less user-friendly and reducing the user experience [15].
Appearance-based gaze estimation methods directly extract features from ocular images and predict gaze direction. These methods do not require complex 3D eye models or expensive specialized equipment but rely on extracting features from 2D images for gaze direction prediction. Early studies typically employed handcrafted features, such as Local Binary Patterns (LBP) or Histograms of Oriented Gradients (HOG), to describe ocular image characteristics [16]. However, these methods often encounter performance bottlenecks when handling complex scenarios, such as low resolution, occlusion, or variations in lighting conditions [17,18]. With the rapid advancement of deep learning, appearance-based gaze estimation methods have leveraged the powerful feature extraction capabilities of convolutional neural networks (CNNs) to learn the complex mapping between gaze direction and visual features from large-scale ocular image datasets. Zhang et al. [19] first proposed a CNN-based gaze estimation method, significantly improving the accuracy of appearance-based gaze estimation. Building on this, Cheng et al. [20] introduced an asymmetric regression network (AR-Net) that models the asymmetry between the two eyes, further optimizing estimation performance. To address the limitations of CNNs in global feature modeling, Cheng et al. [21] integrated transformers with CNNs to extract richer gaze features from facial images, achieving superior estimation results. More recently, Shi et al. [22] exploited the asymmetry in appearance and feature space by extracting low-level semantic features through the main branch while utilizing agent tasks to capture high-level asymmetric information; this approach optimized eye feature learning and significantly improved the accuracy and generalization ability of gaze estimation. The FSIGaze model [23] has further advanced this direction by incorporating both spatial and frequency domain information and introducing residual modules and a Frequency-Spatial Synergistic (FSS) module, effectively enhancing gaze estimation accuracy. Despite the significant progress of appearance-based gaze estimation algorithms in diverse scenarios, most existing methods rely on a single input modality (such as eye or facial images) and fail to fully exploit the intrinsic correlation between the two. Additionally, while some eye-face fusion models consider multimodal feature integration, they often overlook the synergistic effect between independent ocular features and fused features, limiting further improvements in estimation performance.
To comprehensively integrate the contributions of facial and ocular information to gaze estimation and effectively extract critical features from eye images for gaze direction prediction, we propose a MSMI-Net. This network architecture can accommodate multiple image inputs and process them through two specifically designed streams: one for extracting high-dimensional features from eye images and the other for fusing eye-face correlation features. This multi-stream design leverages the complementary information between the eye and face regions, providing more precise information for gaze estimation and significantly enhancing prediction accuracy.
In summary, the contributions of this paper are as follows:
  • A novel MSMI-Net is proposed, which extracts high-dimensional eye features and low-dimensional facial features through a multi-stream design. The first stream focuses on eye feature extraction, while the second stream utilizes a shallow convolutional network to extract eye-face features. Both streams work collaboratively to optimize gaze direction estimation.
  • In the feature fusion stream, a Joint Convolutional Block Attention Module (JointCBAM) is introduced to efficiently integrate binocular and facial features. Furthermore, to adequately account for the weight distribution of eye and facial features during fusion, an AWAM is designed. This mechanism dynamically optimizes feature weight distribution during training, increasing the proportion of effective features and further enhancing gaze estimation accuracy.
  • Experiments on three benchmark datasets, MPIIFaceGaze, EYEDIAP, and Gaze360, demonstrate that MSMI-Net achieves state-of-the-art performance in gaze estimation. Ablation studies further validate the effectiveness of our contributions.
The remainder of this paper is organized as follows: Section 2 reviews the related work. Section 3 presents the network design and provides a detailed description of our MSMI-Net. Section 4 introduces the experimental setup and results. Section 5 concludes the paper.

2. Related Work

In the field of gaze estimation, researchers have proposed a variety of methods to address this challenge, with model-based methods and appearance-based methods forming the two major research directions.

2.1. Model-Based Gaze Estimation

In the field of model-based gaze estimation, researchers have developed various techniques to accurately predict visual focus, which typically rely on high-resolution images to fit and track local features used for estimating geometric parameters [24]. By analyzing corneal reflections and pupil center localization, researchers can estimate the gaze vector of the eyes [25]. Sei et al. [26] proposed an innovative model-based deep gaze estimation method, which improves estimation accuracy by iteratively updating individual facial shape parameters, advancing personalized gaze estimation techniques. Li et al. [27] introduced the EasyGaze3D framework, which utilizes a single RGB camera to estimate gaze direction by detecting 2D facial landmarks and reconstructing 3D facial shapes. Shen et al. [28] proposed a model-based 3D gaze estimation method that employs a Time-of-Flight (TOF) camera to capture low-resolution 3D images and computes the 3D eyeball center using single-eye corner points. Although model-based methods have achieved remarkable accuracy, their reliance on specialized hardware (e.g., structured-light scanners, depth cameras, or high-resolution infrared cameras) limits their applicability, increasing system complexity and cost while posing technical and environmental challenges for real-world deployment. Moreover, model-based methods exhibit limited generalization capability, often performing well in controlled environments or specific datasets but struggling under varying demographics, environments, or imaging conditions.

2.2. Appearance-Based Gaze Estimation

Appearance-based gaze estimation methods primarily rely on extracting information from the appearance features of the eyes or face to estimate an individual’s gaze direction. These methods typically use a single RGB camera to capture images and apply deep learning techniques to learn the mapping relationship between image data and gaze direction. Early work by Baluja et al. [29] proposed a gaze tracking system based on artificial neural networks, which uses a head-mounted device to capture individual eye images and employs neural networks to map gaze functions. To enhance the generalization capability of appearance-based methods, Sugano et al. [30] introduced a gaze estimation method based on visual saliency maps, which does not require explicit personal calibration. By using Gaussian process regression to establish a mapping between eye images and gaze points, and employing a feedback loop from the gaze estimator to the gaze probability map, this approach improves gaze estimation accuracy, making appearance-based methods applicable to different populations. However, early appearance-based gaze estimation techniques faced a series of challenges, primarily due to their high dependence on the sampling environment, posture, and individual differences; as a result, they struggled to adapt to different users, environments, and postures, and had difficulty achieving stable and accurate gaze tracking in diverse application scenarios.
With the continuous advancement of deep learning technologies, CNNs have been widely applied in the field of computer vision. In 2016, Krafka et al. [31] collected and released a large-scale eye-tracking dataset named GazeCapture and developed an eye-tracking model based on CNN called iTracker. This model performed exceptionally well in comprehensive evaluations on the GazeCapture dataset, significantly outperforming existing methods at the time. Subsequently, Zhang et al. [32] proposed another important eye-tracking dataset, MPIIGaze, which was collected in real-world environments. They constructed a GazeNet model using a 16-layer VGG deep convolutional neural network to learn the mapping from head pose and eye images to 3D gaze direction. To overcome the limitations of using only eye or face images as input for gaze estimation, Cheng et al. [33] proposed the CA-Net model, which adopts a coarse-to-fine strategy to improve gaze direction estimation accuracy. CA-Net extracts coarse-grained features from facial images using Face-Net to estimate the basic gaze direction, while Eye-Net estimates gaze residuals from left and right eye images to refine the base gaze direction. However, this method processes eye and face images separately during feature extraction, without fully considering the potential feature correlations between them, thus limiting the model’s performance. To address this issue, Ren et al. [34] introduced a feature fusion method based on multi-level information element integration. This method extracts multi-level features from the raw images and constructs a multi-level information element matrix while incorporating the motion transmission principles of human eye gaze behavior to design a combination pattern for information elements. This approach improves the interpretability of appearance-based gaze estimation models while enhancing their practicality. However, although this method strengthens the correlation between features, it gives limited consideration to the specific impact of eye features and fused features on gaze estimation results, leading to certain limitations in specific cases. Additionally, 3DGazeNet [35] detects five key points on the face, crops the full face image along with the left and right eye images, and stacks these three images along the channel dimension as input. Using a parallel network structure and a special loss function, it completes the gaze estimation task. Although this method uses a multi-input structure to simultaneously utilize both eye and face information, it does not model the independence and effectiveness of facial and eye image features separately, which may affect the rationality of feature fusion and the final estimation accuracy. In summary, existing methods have explored multi-input structures and feature fusion strategies but still lack sufficient exploration of the synergistic relationships between eye and face features. Therefore, this paper proposes a novel framework that fully utilizes independent eye features and eye-face fused features to more accurately perform gaze estimation tasks.

3. Proposed Method

3.1. Network Overview

The MSMI-Net architecture, as shown in Figure 1, comprises two independent streams designed to enhance gaze estimation performance by comprehensively processing eye and face images. The eye image stream takes the left and right eye images as input, while the fusion stream combines the left and right eye images with the full face image, for a total of five image inputs across the two streams. To enable comparison with benchmark models and accommodate the smaller eye image size in the dataset, all inputs are cropped to 96 × 96. In the eye image stream, the input images first pass through a convolutional layer with a kernel size of 4 and a stride of 2, followed by normalization. The images then enter a ConvNext module [36] consisting of four stages, where the ConvNext Blocks are configured as [3,3,27,3] and the channel dimension is gradually expanded from 128 to 1024. Downsampling operations are applied before each stage to reduce the feature map size, thereby constructing a deep CNN specifically designed for high-dimensional feature extraction from the eye region. Meanwhile, the fusion stream adopts a lightweight convolutional structure to efficiently extract low-dimensional features from both the eyes and the face while avoiding excessive computational overhead. This stream utilizes a 5-layer 3 × 3 convolutional network to gradually increase the feature channel dimension from 3 to 256, and introduces residual connections during feature extraction to preserve the original eye feature information, thereby enhancing the impact of local features on gaze estimation. Additionally, two max-pooling operations are incorporated into the five convolutional layers to adapt to input image size variations while improving the model’s robustness to features at different scales. Given the varying importance of eye and facial images in gaze estimation tasks, MSMI-Net includes an AWAM that dynamically optimizes the fusion weights of the eye and facial features. Furthermore, an improved convolutional block attention module (CBAM) [37] employs parallel spatial and channel attention mechanisms to achieve feature compression, information interaction, and adaptive weight adjustment, thereby enhancing the fusion of eye and facial features. Finally, the high-dimensional features extracted by the eye stream are combined with the integrated features from the fusion stream and processed through two fully connected layers to complete the gaze estimation task.
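For concreteness, the following is a minimal PyTorch sketch of the two-stream forward pass described above. It is not the authors' implementation: the submodule classes, the hidden size of the prediction head, and the summation used to merge the two pooled eye features before the fully connected layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MSMINetSketch(nn.Module):
    """Minimal sketch of the two-stream forward pass (submodules are placeholders)."""
    def __init__(self, eye_backbone, fusion_stream):
        super().__init__()
        self.eye_backbone = eye_backbone      # deep ConvNeXt-style stream -> (B, 1024, h, w)
        self.fusion_stream = fusion_stream    # shallow eye-face stream    -> (B, 256, h', w')
        self.gap = nn.AdaptiveAvgPool2d(1)    # global average pooling to 1 x 1
        self.fc_eye = nn.Linear(1024, 256)    # reduce eye-stream channels 1024 -> 256
        self.fc_fuse = nn.Linear(256, 256)    # fusion-stream channels stay at 256
        self.head = nn.Sequential(            # two FC layers predict the 2D gaze vector
            nn.Linear(512, 128), nn.ReLU(inplace=True), nn.Linear(128, 2))

    def forward(self, left_eye, right_eye, face):
        # weight sharing: the same backbone processes both eye crops
        f_left = self.gap(self.eye_backbone(left_eye)).flatten(1)
        f_right = self.gap(self.eye_backbone(right_eye)).flatten(1)
        eye_feat = self.fc_eye(f_left + f_right)       # merging by summation is an assumption
        fused = self.gap(self.fusion_stream(left_eye, right_eye, face)).flatten(1)
        fused_feat = self.fc_fuse(fused)
        return self.head(torch.cat([eye_feat, fused_feat], dim=1))  # (B, 2) gaze angles
```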

3.2. Eye Image Stream

In the field of gaze estimation, eye feature extraction is crucial. Traditional methods often rely on single-eye image inputs and use shallow convolutional modules to construct feature extraction networks, aiming to improve computational efficiency, simplify training, and alleviate overfitting. However, shallow convolutional networks have limitations in capturing complex patterns and higher-order features, making them ill-suited for gaze estimation tasks in complex environments. Compared to deep networks, their feature extraction capabilities are limited, making it difficult to effectively mine the deep structure and high-dimensional features of the data, which impacts the accuracy of gaze direction prediction. Considering the limitations of shallow convolutional modules, this study adopts an efficient stack of ConvNext modules in the eye image stream to build a deep convolutional neural network structure. Task-specific improvements have been made to the ConvNext Block, as shown in Figure 2. The ConvNext module is inspired by the optimization of the ResNet structure by the Swin Transformer, maintaining the simplicity of the standard ConvNext while avoiding the introduction of attention mechanisms. Its design combines macro-structural optimizations, the grouped convolution technique from ResNeXt, the inverted bottleneck structure, the use of large convolutional kernels, and diversified strategies at the micro level. To validate the advantages of ConvNext in eye feature extraction, this paper compares the performance of different backbone networks through experiments, demonstrating the superiority of ConvNext.
In the structural design of the improved ConvNext Block, this paper takes into account that eye images not only contain the eyeball itself but also the contour information around the eye, which is of great value for the gaze estimation task. Therefore, the ConvNext Block adopts a 7 × 7 large convolutional kernel to expand the receptive field at the same level in order to capture a wider range of contextual information while reducing the model parameters without sacrificing performance. Following the large convolutional kernel, two 1 × 1 convolutional kernels are specifically used to extract local features, which helps the model capture detailed information in the eye images and enhances the model’s ability to express features of different sizes, providing richer visual cues for gaze estimation. Moreover, due to individual differences in eye appearance, the combination of large and small convolutional kernels in the ConvNext Block can effectively improve the model’s adaptability and generalization ability to these differences. This multi-scale feature extraction strategy enables the network to better handle eye images from different individuals.
In the ConvNext network, inspired by the downsampling design of the Swin Transformer, the Down Sample module introduces grouped convolution and effectively reduces the spatial size of feature maps using a 2 × 2 convolution with a stride of 2. This design optimizes computational efficiency and enhances model stability and performance by incorporating Layer Normalization (LN). This approach is particularly crucial for gaze estimation tasks, as it expands the receptive field while reducing computational overhead, enabling better capture of key features in eye images. The independent downsampling module not only decreases computational complexity but also enhances the model’s generalization ability to complex visual scenes, thereby improving the accuracy and robustness of gaze estimation.
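As a reference for the block structure described in this subsection, the sketch below shows a ConvNeXt-style block (7 × 7 depthwise convolution followed by two 1 × 1 convolutions) and a normalization-plus-strided-convolution downsampling layer. The expansion ratio of 4, the use of GroupNorm(1, ·) as a channel-wise LayerNorm substitute for NCHW tensors, and the ungrouped downsampling convolution follow the standard ConvNeXt design and are assumptions rather than the exact task-specific modifications made in this paper.

```python
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """7x7 depthwise conv for context, then two 1x1 convs for channel mixing,
    with a residual connection (expansion ratio of 4 assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.GroupNorm(1, dim)             # LayerNorm-like normalization for NCHW
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        y = self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))
        return x + y                                  # residual connection

def downsample_layer(in_dim, out_dim):
    """Normalization followed by a 2x2 convolution with stride 2 to halve the feature map."""
    return nn.Sequential(nn.GroupNorm(1, in_dim),
                         nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2))
```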
When constructing the eye image feature extraction network, we observed that the left and right eye images are not only structurally symmetrical but also share correlations in feature representation. Therefore, we introduced a weight-sharing strategy for left and right eye feature extraction in the network to enhance the capability of extracting binocular features. This strategy leverages symmetry and correlation to improve feature consistency and complementarity, reducing model parameters, increasing training efficiency, and helping the network better learn the common characteristics of both eyes, thereby improving the accuracy of gaze estimation.
In Section 4, within the experimental section, we evaluate the effectiveness of the eye image stream in extracting high-dimensional features. The experimental results indicate that while this stream significantly enhances feature extraction accuracy, it also reveals the potential risk of overfitting, which could negatively impact the performance of the gaze estimation task. To address this issue, we designed a fusion stream that integrates eye features with facial attribute information, enriching the network’s feature representation and mitigating the overfitting problem caused by a single-stream approach.

3.3. Fusion Stream

Previous studies have demonstrated the importance of facial images in gaze estimation tasks. However, directly extracting features from facial images does not always provide accurate guidance for gaze estimation and may even introduce misleading information. Given the high correlation between eye images and facial images, this study proposes leveraging eye features to guide the effective extraction of facial features, thereby enhancing their representation in gaze estimation. Considering the limitations of high-dimensional eye features in terms of interpretability and guidance, as well as the potential overfitting issues caused by a single high-dimensional feature stream, we introduce a dedicated new stream in addition to the eye image stream to further optimize feature extraction and fusion. This stream focuses on extracting low-dimensional features from both eye and facial images and employs the JointCBAM to fuse these features. This fusion strategy aims to leverage eye features to better guide the extraction of critical facial features relevant to the gaze estimation task. With this design, the fusion stream effectively integrates information from both eye and facial regions, enhancing feature representation quality and improving the model’s generalization ability. As shown in Figure 3, the structural design of the fusion stream aims to optimize the feature extraction process, ensuring higher accuracy and robustness in the gaze estimation task.
During the binocular feature extraction stage, this study adopts an independent input approach for left and right eye images and applies 3 × 3 convolution kernels to extract key visual features. Subsequently, data normalization techniques are employed to process the convolutional outputs, aiming to stabilize the training process and enhance the model’s generalization ability. Additionally, the ReLU activation function is selected to introduce non-linearity, further strengthening feature representation. To mitigate potential bias errors caused by convolutional layer parameter deviations while preserving image texture information, a 2 × 2 max pooling layer is incorporated. This design helps reduce the spatial dimensions of feature maps while maintaining sensitivity to fine details. Moreover, inspired by the design principles of Residual Networks (ResNet) [38], we construct residual modules to enhance the representation of original features within the network. Within the residual module, a 7 × 7 convolution kernel is specifically employed, contrasting with the 3 × 3 convolution kernels used in the main network. The primary purpose of this design is to expand the receptive field, allowing the network to capture a broader range of contextual information, thereby providing richer feature representations. Through this comprehensive network structure, the feature extraction process not only precisely captures local details of eye images but also effectively integrates more contextual information using larger convolutional kernels and residual connections.
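The following sketch illustrates the kind of shallow eye branch described above: a stack of 3 × 3 convolutions with normalization, ReLU, and 2 × 2 max pooling, plus a residual path that uses a 7 × 7 convolution for a larger receptive field. The channel widths, the use of BatchNorm, and the exact placement of the residual connection are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ShallowEyeBranchSketch(nn.Module):
    """Five 3x3 convolutions (3 -> 256 channels) with two 2x2 max-pooling steps,
    plus a 7x7 residual path that preserves the original eye information."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )
        # residual path: a 7x7 conv with stride 4 matches the body's output resolution
        self.shortcut = nn.Sequential(
            nn.Conv2d(3, 256, kernel_size=7, stride=4, padding=3), nn.BatchNorm2d(256))

    def forward(self, x):                     # x: (B, 3, 96, 96) eye crop
        return F.relu(self.body(x) + self.shortcut(x))
```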
Before fusing facial and eye features, considering the differences in acquisition conditions between facial and eye images in the dataset, this study designs an AWAM. As shown in Figure 4, this mechanism consists of three gating mechanisms, which adjust the features processed by the gating operations using a weight matrix initialized to 1. During training, the gating mechanism’s weights and the initially set weight matrix are adaptively optimized through backpropagation, enabling the adaptive adjustment of weights for the three feature types. Through network training, this mechanism dynamically adjusts the weight distribution between eye and facial features, allowing the model to adapt to different image characteristics and improve the overall accuracy of gaze estimation.
After passing through the shallow feature extraction network, the extracted facial and eye features are fed into the gating mechanism. The gating mechanism utilizes the sigmoid function to obtain the self-attention weight of each feature. These weights are then applied to the original input features through matrix multiplication to achieve adaptive self-weight adjustment. The specific function for computing the self-gating weights is as follows:
$$\omega = \sigma(x) = \frac{1}{1 + e^{-x}}$$
where $x$ is the input feature map and $\sigma(x)$ is the sigmoid function applied to the input $x$, generating a weight $\omega$ between 0 and 1.
After calculating the self-gating, the self-gated weights are applied to the input features. The specific function is as follows:
$$\hat{x} = x \cdot \sigma(x)$$
where $\sigma(x)$ is the sigmoid function, $x$ is the input feature, and $\hat{x}$ is the weighted output.
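A minimal sketch of this self-gating step is given below, assuming one learnable weight per input branch (the text describes a weight matrix initialized to 1; a per-branch scalar is used here for brevity).

```python
import torch
import torch.nn as nn

class AWAMSketch(nn.Module):
    """Self-gates each feature map with x * sigmoid(x) and scales it by a
    learnable weight initialized to 1 (one weight per input branch)."""
    def __init__(self, num_branches=3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_branches))

    def forward(self, feats):
        # feats: sequence of feature maps, e.g. [left_eye, right_eye, face]
        return [w * (f * torch.sigmoid(f)) for w, f in zip(self.weights, feats)]
```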
In addition, to further enhance the model’s generalization capability and prevent overfitting of the weight parameters, we introduced an L2 regularization term into the loss function. L2 regularization effectively controls the model’s complexity by imposing a penalty on the model parameters, thereby improving the model’s performance on unseen data. The total cost function is as follows:
$$L_{total} = L_{MSE} + \lambda L$$
Here, $\lambda$ is the regularization parameter, kept at its default value of 0.0001. $L_{MSE}$ denotes the mean squared error loss, and $L$ is the regularization term, calculated as follows:
$$L = \sum_{i} \omega_i^{2}$$
where $\omega_i$ denotes the model parameters.
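As a sketch, the regularized objective can be written as below; whether the penalty is taken over all model parameters or only over the adaptive weights is left open in the text, so the function simply penalizes whatever parameters are passed to it.

```python
import torch
import torch.nn.functional as F

def total_cost(pred, target, params, lam=1e-4):
    """MSE term plus an L2 penalty on the given parameters (lambda defaults to 0.0001)."""
    mse = F.mse_loss(pred, target)
    l2 = sum((p ** 2).sum() for p in params)
    return mse + lam * l2
```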
After the AWAM, we concatenate the eye and face features along the dimension and feed the concatenated features into the JointCBAM for more effective feature fusion. The traditional CBAM module evaluates attention weights on the channel dimension first, then adjusts the weights on the spatial dimension through concatenation. However, the improved JointCBAM module proposed in this study processes the input features in parallel by dividing them into two parallel branches, which handle channel and spatial attention separately. Finally, the attention information from both branches is interacted with to achieve more efficient feature fusion. As shown in Figure 5, in the compression part, the input features are divided into two branches, with max pooling and average pooling used to obtain channel descriptors and spatial descriptors, respectively. In the interaction part, the channel and spatial descriptors are concatenated and processed by a multi-layer perceptron (MLP) to model the correlation between global spatial and global channel features. After sufficient interaction between the global spatial and channel features, they are split and restored to their original dimensions. The weights are calculated using a sigmoid function and applied to the input features. Finally, the weighted features are fused through element-wise summation. Compared to the traditional CBAM module, the JointCBAM module not only avoids separately analyzing the spatial and channel modeling process but also effectively merges channel attention and spatial attention information to achieve better space-channel interaction and generate more comprehensive features. To validate the effectiveness of our improved module, we conducted related comparative experiments in Section 4.
The JointCBAM channel attention module is shown in Figure 5. The input feature map F undergoes global average pooling and global max pooling along the spatial dimension, and after passing through a multi-layer perceptron, the results are summed to obtain the channel feature map. The detailed calculation formula is as follows:
$$M_c(F) = \mathrm{MLP}(\mathrm{AvgPool}_s(F)) + \mathrm{MLP}(\mathrm{MaxPool}_s(F))$$
where $M_c(F)$ denotes the channel features, $\mathrm{MLP}$ is the multi-layer perceptron operation, and $\mathrm{AvgPool}_s$ and $\mathrm{MaxPool}_s$ are the global average pooling and global max pooling operations over the spatial dimension.
The JointCBAM spatial attention operation, as shown in Figure 5, applies global average pooling and global max pooling along the channel dimension of the input feature map $F$. These operations produce two feature maps, each with the same spatial size as $F$ but with only one channel. The output feature maps are then passed through a multi-layer perceptron (MLP) and summed to obtain the spatial feature map. The detailed computation formula is as follows:
$$M_s(F) = \mathrm{MLP}(\mathrm{AvgPool}_c(F)) + \mathrm{MLP}(\mathrm{MaxPool}_c(F))$$
where $M_s(F)$ denotes the spatial features, $\mathrm{MLP}$ is the multi-layer perceptron operation, and $\mathrm{AvgPool}_c$ and $\mathrm{MaxPool}_c$ are the global average pooling and global max pooling operations over the channel dimension, respectively.
The channel features $M_c(F)$ and spatial features $M_s(F)$ are reshaped and concatenated along the first dimension to obtain a combined descriptor containing both spatial and channel information. A multi-layer perceptron is then applied to this combined descriptor to achieve effective interaction between the spatial and channel features, after which the result is split back into its channel and spatial components. The separated channel and spatial features are passed through a sigmoid activation function to obtain the channel and spatial weights, which are then used to adjust the original input features. Finally, the adjusted spatial and channel features are summed to obtain the fused features. The detailed formulas are as follows:
$$M_s' = \sigma(M_s(F)), \qquad M_c' = \sigma(M_c(F)), \qquad F_f = M_s' \cdot F + M_c' \cdot F$$
where $M_s'$ and $M_c'$ represent the spatial weight and channel weight, respectively, $\sigma$ is the sigmoid function, and $F_f$ is the final fused feature.
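The sketch below follows the JointCBAM description above: channel and spatial descriptors are computed in parallel, concatenated, mixed by a joint MLP, split back, passed through a sigmoid, and used to reweight the input before element-wise summation. The hidden width, the sharing of the descriptor MLPs between average- and max-pooled inputs, and the fixed spatial size are assumptions.

```python
import torch
import torch.nn as nn

class JointCBAMSketch(nn.Module):
    """Parallel channel/spatial attention with a joint interaction MLP."""
    def __init__(self, channels, spatial_size, hidden=128):
        super().__init__()
        # descriptor MLPs (channel descriptor has length C, spatial descriptor has length H*W)
        self.mlp_c = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True),
                                   nn.Linear(hidden, channels))
        self.mlp_s = nn.Sequential(nn.Linear(spatial_size, hidden), nn.ReLU(inplace=True),
                                   nn.Linear(hidden, spatial_size))
        # interaction MLP over the concatenated (channel + spatial) descriptor
        joint = channels + spatial_size
        self.mlp_joint = nn.Sequential(nn.Linear(joint, hidden), nn.ReLU(inplace=True),
                                       nn.Linear(hidden, joint))

    def forward(self, x):                     # x: (B, C, H, W) with H*W == spatial_size
        b, c, h, w = x.shape
        # channel descriptor: avg/max over the spatial dims, MLP, then sum
        m_c = self.mlp_c(x.mean(dim=(2, 3))) + self.mlp_c(x.amax(dim=(2, 3)))
        # spatial descriptor: avg/max over the channel dim, flattened, MLP, then sum
        m_s = self.mlp_s(x.mean(dim=1).flatten(1)) + self.mlp_s(x.amax(dim=1).flatten(1))
        # interaction: concatenate, mix with the joint MLP, then split back
        mixed = self.mlp_joint(torch.cat([m_c, m_s], dim=1))
        w_c = torch.sigmoid(mixed[:, :c]).view(b, c, 1, 1)
        w_s = torch.sigmoid(mixed[:, c:]).view(b, 1, h, w)
        # apply both weights to the input and fuse by element-wise summation
        return w_c * x + w_s * x
```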

3.4. Gaze Result Evaluation

During the model’s output process, the features extracted from both the eye stream and the fusion stream are first subjected to global average pooling, reducing the feature map size to 1 × 1, followed by processing through a fully connected layer. The fully connected features are then concatenated into a comprehensive feature vector, which is subsequently passed through two fully connected layers to predict the two-dimensional gaze direction vector. In this process, the channel dimension of the eye stream is reduced from 1024 to 256, while the channel dimension of the fusion stream remains unchanged. Finally, a specific angular computation method is used to convert the output two-dimensional vector into a three-dimensional gaze direction. This transformation process is implemented using the following angular computation function:
$$gaze = (x, y, z), \qquad x = \cos\theta \cdot \sin\phi, \qquad y = \sin\theta, \qquad z = \cos\theta \cdot \cos\phi$$
where $x$ represents the horizontal component, $y$ the vertical component, and $z$ the depth component; $\phi$ is the yaw angle and $\theta$ is the pitch angle predicted from the 2D vector. Through this transformation, the 2D vector is converted into a 3D gaze direction.
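A short sketch of this conversion, implementing the equation above as reconstructed here (pitch $\theta$, yaw $\phi$); note that some dataset toolchains negate all three components, so the sign convention should be checked against the labels in use.

```python
import numpy as np

def gaze_2d_to_3d(pitch, yaw):
    """Convert a predicted (pitch, yaw) pair into a 3D gaze direction vector."""
    x = np.cos(pitch) * np.sin(yaw)   # horizontal component
    y = np.sin(pitch)                 # vertical component
    z = np.cos(pitch) * np.cos(yaw)   # depth component
    return np.array([x, y, z])
```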
The angular error between the predicted and ground truth values is calculated using the cosine similarity to determine the angle between them. The specific formula is as follows:
$$\theta = \arccos\left(\min\left(\frac{g \cdot l}{\|g\|\,\|l\|},\ 0.9999999\right)\right) \cdot \frac{180}{\pi}$$
where $g$ is the predicted gaze vector obtained through angle conversion and $l$ is the ground-truth label. The $\min(\cdot, 0.9999999)$ operation prevents precision issues caused by floating-point arithmetic, ensuring that the input to the $\arccos$ function remains within its valid range $[-1, 1]$.
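The angular-error formula above translates directly into a few lines of NumPy; the clamp mirrors the $\min(\cdot, 0.9999999)$ term.

```python
import numpy as np

def angular_error_deg(g, l):
    """Angle (in degrees) between predicted gaze g and ground-truth gaze l."""
    cos_sim = np.dot(g, l) / (np.linalg.norm(g) * np.linalg.norm(l))
    cos_sim = min(cos_sim, 0.9999999)          # guard against floating-point overshoot
    return np.degrees(np.arccos(cos_sim))
```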

4. Experiments

4.1. Settings

Dataset: In this study, we selected three datasets containing images of multiple individuals along with corresponding 3D gaze labels to validate the model, namely EyeDiap [39], MPIIFaceGaze [19], and Gaze360 [40]. These datasets provide a wealth of visual information and accurate gaze annotations for evaluating the model’s performance in handling multiple subjects and complex scenes, thereby ensuring the reliability of the experimental results and the generalization ability of the model.
EyeDiap: The EyeDiap dataset is designed for gaze estimation using RGB and RGB-D cameras, featuring 94 video clips from 16 participants under varied lighting conditions and head poses, simulating real-world scenarios. A key strength of EyeDiap is its high-precision gaze annotations, including eye center positions and ping-pong ball locations, with 2D-to-3D coordinate conversion for more accurate gaze direction computation. It also provides RGB-D sensor calibration, screen calibration, camera synchronization, and head pose tracking, ensuring reliable training data.
MPIIFaceGaze: MPIIFaceGaze is an extended version of the MPIIGaze dataset, containing 213,659 images from 15 participants. It enhances the original eye image dataset by incorporating full-face images and point of regard (PoG) annotations, providing richer visual information. The inclusion of full-face images helps integrate eye features, facial expressions, and head poses, which influence gaze direction, thereby improving the model’s adaptability to complex scenes. For multi-stream architectures, this fusion of eye and facial features further enhances gaze estimation accuracy. Additionally, the large-scale dataset of over 210,000 images provides ample training samples for deep learning models, improving their generalization ability and ensuring stable performance across diverse environments.
Gaze360: Gaze360 is a large-scale 3D gaze estimation dataset designed to support robust gaze direction estimation in natural scenes. The dataset includes 238 subjects with a wide range of gender, age, and ethnicity, captured using a Ladybug5 360° camera, providing precise 3D gaze annotations with an average error of 2.9°. It contains over 2 million images, covering 360° head poses, with 129K training images, 17K validation images, and 26K test images. The dataset supports extreme head pose variations, partial occlusion, and dynamic video data. The diversity and real-world conditions of Gaze360 help improve the generalization ability of models, enabling them to adapt to complex environments. It is widely used in applications such as augmented reality, remote interaction, and driver monitoring.
Baseline: In this experiment, to validate the rationality and effectiveness of our model design, we selected the Dilated-Net [41] gaze estimation model, proposed in previous research, as the baseline for performance comparison. Dilated-Net effectively enhances the ability to capture appearance features by introducing dilated convolutions and a multi-region network design, thereby improving gaze estimation accuracy based on appearance. The model performs especially well when handling small angular changes. Additionally, Dilated-Net shares the same input data composition as the MSMI-Net proposed in this study, using left and right eye images along with facial images as input features. Therefore, performing a performance comparison under the same input conditions ensures fairness in the experiment and allows for a more accurate assessment of the improvements made by MSMI-Net in feature fusion and information extraction. As a model that achieves good performance in existing research, Dilated-Net is highly representative. Its network structure shares similarities with MSMI-Net but also contains key design differences. Thus, selecting Dilated-Net as the baseline model not only provides an effective performance comparison but also enables a deeper analysis of the advantages and improvements of MSMI-Net in feature representation, information fusion, and angular adaptability.
Dataset Processing: In this study, based on the review published by Cheng et al. [42], the EyeDiap, MPIIFaceGaze, and Gaze360 datasets were processed in a unified manner. For the EyeDiap video dataset, we extracted one image every 15 frames from the VGA resolution videos as potential gaze images to ensure consistency in the number of images across different videos. Additionally, we performed pixel value normalization on the extracted images, adjusting their range to [0, 1], to facilitate subsequent computational processing. When processing the MPIIFaceGaze and Gaze360 datasets, we noted that each eye image had a corresponding facial image. Therefore, we first normalized the full-face images to facilitate the cropping of eye images from them. For both the EyeDiap and MPIIFaceGaze datasets, we used the same evaluation strategy as in previous studies, namely leave-one-subject-out cross-validation, and computed the average results to assess the performance of gaze estimation. For Gaze360, we directly performed evaluation on the test set. This approach ensures the accuracy and reliability of the evaluation results and makes our research findings comparable to other methods in the existing literature.
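For illustration, the frame sampling and pixel normalization described for the EyeDiap videos could look like the sketch below (OpenCV-based; the path and sampling step are placeholders, and the actual preprocessing in this work follows the toolchain of Cheng et al. [42]).

```python
import cv2
import numpy as np

def sample_frames(video_path, step=15):
    """Keep one frame every `step` frames and rescale pixel values to [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame.astype(np.float32) / 255.0)
        idx += 1
    cap.release()
    return frames
```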
Training Details: The proposed method was implemented using the PyTorch framework, version Torch-1.8.1. The experiments were run on a Windows 10 operating system with hardware consisting of an Intel Core i5-14600KF CPU@3.50GHz and an NVIDIA RTX 4080 GPU (16,384 MiB of VRAM). During the training phase, the model’s input data consisted of three parts: left eye image, right eye image, and facial image. The original left and right eye images had a size of 60 × 36, while the facial images had a size of 224 × 224. To standardize the input image sizes and ensure a fair comparison with the baseline method, all input images were cropped to 96 × 96 pixels. Additionally, the batch size for each input to the network during training was set to 64 to improve training efficiency. For different dataset sizes, we designed training strategies with 60, 100, and 200 epochs to ensure the model could learn effectively under different data conditions. The loss function used was L1 loss to measure the difference between predicted and true values, and the Adam optimizer [43] was employed for parameter updates, with the optimizer’s settings left at their default configuration. The model’s initial learning rate was set to 0.001, and it decayed to one-tenth of the original value every 30 epochs to accelerate convergence and improve model stability.
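A condensed sketch of this training setup is shown below; the model, data loader, and epoch count are passed in as placeholders, and only the loss, optimizer, and learning-rate schedule follow the configuration described above.

```python
import torch

def train(model, train_loader, num_epochs):
    """L1 loss, Adam with lr = 0.001, and a decay to one-tenth every 30 epochs."""
    criterion = torch.nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    for _ in range(num_epochs):                                  # 60 / 100 / 200 per dataset
        for left_eye, right_eye, face, label in train_loader:    # batch size 64
            optimizer.zero_grad()
            loss = criterion(model(left_eye, right_eye, face), label)
            loss.backward()
            optimizer.step()
        scheduler.step()
```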

4.2. Performance

To evaluate the effectiveness of the proposed method, multiple methods were compared on three commonly used standard datasets: EyeDiap, MPIIFaceGaze, and Gaze360. Eleven state-of-the-art methods were selected as a control group, including AFF-Net [44], Dilated-Net [41], CA-Net [33], VGE-Net [45], FullFace [19], MSGazeNet [46], RT-GENE [47], ARE-Net [20], GazeCaps [48], UniGaze-H [49], and GazeSymCAT [50]; to validate the performance and effectiveness of the proposed method more comprehensively, Dilated-Net was selected as the experimental baseline in this study. The experimental results are shown in Table 1, demonstrating the advantages of the proposed method over other methods from recent years. On the three datasets, the proposed method reduces the error by 0.59°, 0.93°, and 2.61°, respectively, compared to Dilated-Net. This indicates that the proposed network not only retains the advantage of handling small angular variations but also significantly improves the accuracy of large head angle prediction.
The MPIIFaceGaze dataset is a recognized benchmark in the field of 3D gaze estimation. Compared to the MPIIGaze dataset, it includes full-face images, making it more suitable for studying the relationship between facial features and eye gaze direction. Additionally, the dataset has a smaller range of gaze angle variations, making it effective for evaluating model performance under subtle gaze changes. Experimental results show that the proposed method outperforms the earlier FullFace method on the MPIIFaceGaze dataset, reducing error by 1.08° and improving performance by 21%, which strongly demonstrates its advantages in gaze estimation tasks. Although the VGE-Net method has not been tested on the MPIIFaceGaze dataset, it achieved an accuracy of 3.9° on the MPIIGaze dataset, surpassing other methods and exhibiting strong performance. Further comparative analysis indicates that the proposed method achieves the best results on the MPIIFaceGaze dataset compared to GazeSymCAT and UniGaze-H, demonstrating strong generalization ability and robustness. Given the particularity of the MPIIFaceGaze dataset, this experiment employs a leave-one-out validation strategy to ensure a thorough evaluation of the model’s generalization ability across different subjects. Additionally, detailed statistics and analyses of training errors for each participant were conducted to explore the model’s adaptability and potential for optimization across individuals. The comparison results with baseline models can be found in Figure 6. Overall, the experimental results indicate that the proposed method has significant advantages in gaze estimation tasks, maintaining stable performance across different individuals while exhibiting high robustness and generalization ability.
To further enhance the model’s performance, this study conducts an in-depth training and comparative analysis on the EyeDiap dataset. This dataset encompasses various complex scenarios, including different head movement patterns, gaze deviations, and varying lighting conditions, making it a highly challenging testing platform. In particular, the complex lighting environment imposes higher robustness requirements on the model, facilitating a comprehensive evaluation of its adaptability in real-world applications. Therefore, training and testing on the EyeDiap dataset provide a more thorough verification of the proposed method’s stability and performance advantages under complex conditions.
As shown in Table 1, the proposed method achieves significant performance improvements on the EyeDiap dataset, demonstrating strong competitiveness. Specifically, compared to VGE-Net, the proposed method reduces the error by approximately 1.36°, achieving a performance improvement of about 20%. As a baseline method, Dilated-Net, which uses left eye, right eye, and face images as input, has an error of 6.17° on the EyeDiap dataset, while the proposed method further reduces it by 0.93°, highlighting its advantage among similar network architectures. Although there remains a performance gap between the proposed method and GazeCaps on the EyeDiap dataset, experiments on the MPIIFaceGaze dataset indicate that the proposed method excels in accuracy and maintains strong adaptability across different datasets. Overall, the method not only exhibits good generalization ability and stable gaze estimation performance but also maintains reliable estimation results even under complex lighting conditions. Additionally, compared to the recently proposed UniGaze-H method, our approach also demonstrates superior performance. To provide a more intuitive analysis of the proposed method’s performance across different individuals, we adopt a leave-one-out validation strategy to ensure the reliability of the evaluation results. Figure 7 further illustrates the error distribution of the proposed method and baseline methods when trained separately for each individual. From the overall trend, the proposed method exhibits relatively small error fluctuations across different individuals, indicating strong stability. This suggests that the method is not only effective for specific individuals but also capable of achieving accurate gaze estimation at the group level.
The Gaze360 dataset extends the scope of gaze estimation research by providing multi-view facial images over a 360° range, enabling the model to learn richer angular information and thus improving its prediction ability under different viewpoints. Experiments on the Gaze360 dataset therefore validate not only the prediction ability of the proposed method for small angle variations but also its adaptability and generalization to a wide range of angles. The experimental results are shown in Table 1: the proposed method improves performance by 2.61° on the Gaze360 dataset compared to the baseline method, significantly increasing the accuracy of multi-angle gaze estimation. However, there is still a performance gap of 1.08° compared with the recently proposed GazeCaps method. GazeCaps is optimized for hierarchical representation of gaze direction features and for modeling spatial relationships; with the help of capsule networks, it effectively captures nonlinear changes in gaze direction and shows stronger generalization ability, especially in large-angle gaze estimation. This gap mainly stems from the advantage of GazeCaps in processing 3D structural information. In comparison, the proposed method outperforms the UniGaze-H method by 1.25°. The results on Gaze360 show that the proposed method not only maintains accurate prediction for small angles but also significantly enhances prediction over a wide range of angles, although a certain performance gap remains relative to some state-of-the-art methods.
To validate the rationality of using ConvNext as the backbone, Figure 8 presents the experimental results under different network architectures. Compared to models using ResNet as the backbone, the proposed model achieves error reductions of 0.24°, 0.33°, and 0.61° across the three datasets, demonstrating the advantages of ConvNext in feature extraction and overall model performance. Similarly, when compared to models using MobileNet [51] as the backbone, the proposed model reduces errors by 0.27°, 0.39°, and 0.94°, further highlighting the superior performance of ConvNext across multiple datasets. These comparative experiments provide strong evidence that selecting ConvNext as the backbone significantly enhances model accuracy and robustness across different datasets.
To validate the effectiveness of the improved JointCBAM module, we compared it with different feature fusion modules in the experiment to fully demonstrate the improvements. We selected the classic SE [52] module and the original CBAM module as baselines. The experimental results, as shown in Table 2, indicate that the improved JointCBAM module outperforms others in gaze estimation tasks.
Figure 9 presents the gaze estimation visualization results on facial images from different datasets, demonstrating the model’s performance across various scenarios and highlighting its ability to generalize across individual differences, lighting conditions, and head pose variations. By comparing the predicted gaze directions with the ground truth annotations, the figure visually demonstrates the model’s accuracy and robustness, further proving that the method can effectively adapt to changes in facial features and viewing angles, enhancing its effectiveness in real-world applications.

4.3. Ablation Study

In this section, we conduct a series of ablation experiments to verify the effectiveness and advantages of each stream and module in our method. Our design employs a multi-stream and multi-input approach. For the gaze estimation task, we tested the performance of two independent streams and the improved module on the MPIIFaceGaze dataset and compared the results with existing techniques to confirm that the design concept of our model is both reasonable and effective.
First, in the gaze estimation task, we conducted experiments using only the eye stream. Specifically, only the left and right eye images were used as inputs, and deep ConvNeXt modules were employed to extract deep features. The experimental results are shown in Table 3. Data analysis indicates that by stacking deep convolutional modules and utilizing a large receptive field, the eye stream demonstrates significant improvements in ocular feature extraction. However, compared to baseline methods and other advanced approaches, there is still room for further optimization. This suggests that deep network structures are indeed advantageous for ocular feature extraction but may also lead to overfitting issues, potentially affecting the model’s optimal performance.
Subsequently, we conducted experiments on the fusion stream, using both eye images and facial images as inputs. The ablation study results show that incorporating facial image features can significantly enhance the accuracy of gaze estimation. This experimental finding validates the importance of facial information in gaze estimation tasks.
Ablation experiments were conducted to validate the contribution and effectiveness of each stream in the multi-stream, multi-input approach for gaze estimation. When using only the eye stream for gaze estimation, the error compared to MSMI-Net was 1.24°. However, when employing the fusion stream, which incorporates both left and right eye images along with facial images, the error difference decreased to 0.47° compared to MSMI-Net. These results indicate that the multi-stream, multi-input strategy, which integrates high-dimensional eye features and low-dimensional combined eye-face features, significantly enhances gaze estimation accuracy. This approach effectively leverages key information from both the eye and facial regions while capturing subtle variations, thereby improving gaze direction prediction capability. Additionally, the multi-stream framework increases model flexibility, allowing it to dynamically adjust its decision-making process to handle gaze estimation tasks of varying difficulty.
To evaluate the contribution of each module to the overall model performance, a series of detailed ablation experiments were conducted. The experimental results, as shown in Table 4, indicate that the model’s performance significantly declined after removing the AWAM module. Specifically, when this strategy module was eliminated and feature fusion was directly performed using the JointCBAM module, the performance noticeably deteriorated, demonstrating the crucial role of AWAM in the feature fusion process. This mechanism dynamically adjusts the weights of different features, allowing the model to prioritize the most discriminative features during fusion. As a result, it effectively enhances the accuracy of feature extraction and fusion, providing a more precise feature representation and ultimately improving gaze estimation accuracy.
Further ablation experiments revealed that the JointCBAM module plays a vital role in effectively integrating eye and facial features. By simultaneously focusing on both eye and facial information, the JointCBAM module fully captures the correlations among multidimensional features, strengthening the model’s ability to perceive variations in gaze direction. This effective fusion not only improves gaze estimation accuracy but also enhances the model’s robustness against variations in head pose and lighting conditions, further demonstrating the superiority of this module in gaze estimation tasks.

4.4. Results Analysis

In this section, we further analyze the experimental performance of MSMI-Net on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets and validate the effectiveness of different stream designs and individual modules through ablation experiments. By comparing with other related methods, we further demonstrate the rationality and optimization of the stream design. Specifically, MSMI-Net achieved the best performance on the MPIIFaceGaze dataset and also showed significant performance improvement on the EyeDiap dataset. However, although our network performed excellently on the EyeDiap dataset, it still did not reach the optimal level. Additionally, the performance on the Gaze360 dataset showed a noticeable gap compared to the EyeDiap and MPIIFaceGaze datasets. A further analysis of the feature distributions across the three datasets, as shown in Figure 10, reveals that, compared to the MPIIFaceGaze and EyeDiap datasets, the head shift in the Gaze360 dataset is more significant. Particularly, on the Gaze360 dataset, our model shows a weaker ability to extract gaze features from images with large head shifts compared to images with small head shifts. This gap may stem from the network’s limitations in extracting global contextual information, failing to fully capture subtle features under large head pose variations. Therefore, although our network performs well under small head shifts, there is still a need for further improvement in global feature extraction and the robustness of the model in large head shift tasks. This analysis provides valuable insights for future improvements in gaze estimation tasks, especially in enhancing the network’s ability to extract features under large head movement.

5. Conclusions

In this work, we propose MSMI-Net, a novel multi-stream, multi-input gaze estimation model that effectively extracts both high-dimensional and low-dimensional features from eye images and fused eye-face images through two independent streams. To enhance feature fusion, we introduce the JointCBAM mechanism with adaptive weights, improving the quality of eye and face feature integration and thereby enhancing gaze estimation accuracy.
Through ablation studies, we validate the effectiveness of the dual-stream design and demonstrate the contribution of each module to improving model performance. Furthermore, we conduct extensive evaluations on EyeDiap, MPIIFaceGaze, and Gaze360, showing that MSMI-Net achieves strong robustness in complex scenes and multi-angle tasks. However, our experiments also reveal that the model struggles with large head pose variations, particularly in the Gaze360 dataset, highlighting a limitation in its ability to capture extreme head pose shifts.
To address this, future work will focus on enhancing head pose estimation capabilities and increasing dataset diversity, enabling better generalization in extreme gaze scenarios. Additionally, we aim to explore more advanced feature fusion mechanisms and leverage self-supervised learning to further improve model robustness and accuracy in real-world applications.

Author Contributions

Conceptualization, C.L.; Methodology, C.L.; Software, E.T.; Validation, E.T.; Formal analysis, K.Z.; Investigation, E.T., K.Z., N.C., Z.L. and Z.P.; Data curation, N.C.; Writing—original draft, E.T.; Writing—review & editing, C.L.; Supervision, C.L.; Project administration, Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available at https://gaze360.csail.mit.edu/ (accessed on 28 December 2023), https://www.idiap.ch/en/scientific-research/data/eyediap (accessed on 27 December 2023), and https://www.perceptualui.org/research/datasets/MPIIFaceGaze/ (accessed on 19 January 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rajanna, V.; Hammond, T. A gaze-assisted multimodal approach to rich and accessible human-computer interaction. arXiv 2018, arXiv:1803.04713. [Google Scholar]
  2. Andrist, S.; Tan, X.Z.; Gleicher, M.; Mutlu, B. Conversational gaze aversion for humanlike robots. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany, 3–6 March 2014; pp. 25–32. [Google Scholar]
  3. Clay, V.; König, P.; Koenig, S. Eye tracking in virtual reality. J. Eye Mov. Res. 2019, 12, 10–16910. [Google Scholar] [CrossRef] [PubMed]
  4. Patney, A.; Kim, J.; Salvi, M.; Kaplanyan, A.; Wyman, C.; Benty, N.; Lefohn, A.; Luebke, D. Perceptually-based foveated virtual reality. In Proceedings of the ACM SIGGRAPH 2016 Emerging Technologies, Anaheim, CA, USA, 24–28 July 2016; pp. 1–2. [Google Scholar]
  5. Chatelain, P.; Sharma, H.; Drukker, L.; Papageorghiou, A.T.; Noble, J.A. Evaluation of gaze tracking calibration for longitudinal biomedical imaging studies. IEEE Trans. Cybern. 2018, 50, 153–163. [Google Scholar] [PubMed]
  6. Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372. [Google Scholar]
  7. Martin, S.; Vora, S.; Yuen, K.; Trivedi, M.M. Dynamics of driver’s gaze: Explorations in behavior modeling and maneuver prediction. IEEE Trans. Intell. Veh. 2018, 3, 141–150. [Google Scholar]
  8. Zhu, Z.; Ji, Q. Eye gaze tracking under natural head movements. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 918–923. [Google Scholar]
  9. Wang, K.; Ji, Q. Real time eye gaze tracking with 3d deformable eye-face model. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1003–1011. [Google Scholar]
  10. Kaur, H.; Jindal, S.; Manduchi, R. Rethinking model-based gaze estimation. In Proceedings of the ACM on Computer Graphics and Interactive Techniques, Vancouver, BC, Canada, 7–11 August 2022; Volume 5, pp. 1–17. [Google Scholar]
  11. Lu, F.; Okabe, T.; Sugano, Y.; Sato, Y. Learning gaze biases with head motion for head pose-free gaze estimation. Image Vis. Comput. 2014, 32, 169–179. [Google Scholar]
  12. Li, J.; Chen, Z.; Zhong, Y.; Lam, H.K.; Han, J.; Ouyang, G.; Li, X.; Liu, H. Appearance-based gaze estimation for ASD diagnosis. IEEE Trans. Cybern. 2022, 52, 6504–6517. [Google Scholar]
  13. Elfares, M.; Hu, Z.; Reisert, P.; Bulling, A.; Küsters, R. Federated learning for appearance-based gaze estimation in the wild. In Proceedings of the Gaze Meets Machine Learning Workshop, PMLR, New Orleans, LA, USA, 16 December 2023; pp. 20–36. [Google Scholar]
  14. Elfares, M.; Reisert, P.; Hu, Z.; Tang, W.; Küsters, R.; Bulling, A. PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation. In Proceedings of the ACM on Human-Computer Interaction, O‘ahu, HI, USA, 11–16 May 2024; pp. 1–23. [Google Scholar]
  15. Wang, K.; Ji, Q. 3D gaze estimation without explicit personal calibration. Pattern Recognit. 2018, 79, 216–227. [Google Scholar]
  16. Sugano, Y.; Matsushita, Y.; Sato, Y. Learning-by-synthesis for appearance-based 3d gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1821–1828. [Google Scholar]
  17. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  18. Sugano, Y.; Matsushita, Y.; Sato, Y. Generalizing eye tracking with bayesian adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11907–11916. [Google Scholar]
  19. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4511–4520. [Google Scholar]
  20. Cheng, Y.; Lu, F.; Zhang, X. Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 100–115. [Google Scholar]
  21. Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 3341–3347. [Google Scholar]
  22. Shi, Y.; Zhang, F.; Yang, W.; Wang, G.; Su, N. Agent-guided gaze estimation network by two-eye asymmetry exploration. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 2320–2326. [Google Scholar]
  23. Jia, Y.; Liu, Z.; Lv, Y.; Lu, X.; Liu, X.; Chen, J. Frequency-spatial interaction network for gaze estimation. Displays 2025, 86, 102878. [Google Scholar]
  24. Valenti, R.; Sebe, N.; Gevers, T. Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 2011, 21, 802–815. [Google Scholar] [CrossRef] [PubMed]
  25. Nakazawa, A.; Nitschke, C. Point of gaze estimation through corneal surface reflection in an active illumination environment. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012, Proceedings, Part II 12; Springer: Berlin/Heidelberg, Germany, 2012; pp. 159–172. [Google Scholar]
  26. Sei, M.; Utsumi, A.; Yamazoe, H.; Lee, J. Model-based deep gaze estimation using incrementally updated face-shape parameters. In Proceedings of the 2023 Symposium on Eye Tracking Research and Applications, Tübingen, Germany, 30 May–2 June 2023; pp. 1–2. [Google Scholar]
  27. Li, J.; Yang, J.; Liu, Y.; Li, Z.; Yang, G.Z.; Guo, Y. EasyGaze3D: Towards effective and flexible 3D gaze estimation from a single RGB camera. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 6537–6543. [Google Scholar]
  28. Shen, K.; Li, Y.; Guo, Z.; Gao, J.; Wu, Y. Model-Based 3D Gaze Estimation Using a TOF Camera. Sensors 2024, 24, 1070. [Google Scholar] [CrossRef] [PubMed]
  29. Baluja, S.; Pomerleau, D. Non-intrusive gaze tracking using artificial neural networks. Adv. Neural Inf. Process. Syst. 1993, 6, 153–156. [Google Scholar]
  30. Sugano, Y.; Matsushita, Y.; Sato, Y. Appearance-based gaze estimation using visual saliency. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 329–341. [Google Scholar] [CrossRef]
  31. Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.; Matusik, W.; Torralba, A. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2176–2184. [Google Scholar]
  32. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 162–175. [Google Scholar] [CrossRef]
  33. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 7–12 February 2020; Volume 34, pp. 10623–10630. [Google Scholar]
  34. Ren, Z.; Fang, F.; Hou, G.; Li, Z.; Niu, R. Appearance-based gaze estimation with feature fusion of multi-level information elements. J. Comput. Des. Eng. 2023, 10, 1080–1109. [Google Scholar] [CrossRef]
  35. Ververas, E.; Gkagkos, P.; Deng, J.; Doukas, M.C.; Guo, J.; Zafeiriou, S. 3DGazeNet: Generalizing 3D Gaze Estimation with Weak-Supervision from Synthetic Views. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 387–404. [Google Scholar]
  36. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2022; pp. 11976–11986. [Google Scholar]
  37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Funes Mora, K.A.; Monay, F.; Odobez, J.M. Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, Safety Harbor, FL, USA, 26–28 March 2014; pp. 255–258. [Google Scholar]
  40. Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6912–6921. [Google Scholar]
  41. Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 309–324. [Google Scholar]
  42. Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7509–7528. [Google Scholar] [CrossRef]
  43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  44. Bao, Y.; Cheng, Y.; Liu, Y.; Lu, F. Adaptive feature fusion network for gaze tracking in mobile tablets. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 9936–9943. [Google Scholar]
  45. Huang, G.; Shi, J.; Xu, J.; Li, J.; Chen, S.; Du, Y.; Zhen, X.; Liu, H. Gaze estimation by attention-induced hierarchical variational auto-encoder. IEEE Trans. Cybern. 2023, 54, 2592–2605. [Google Scholar] [CrossRef]
  46. Mahmud, Z.; Hungler, P.; Etemad, A. Multistream gaze estimation with anatomical eye region isolation by synthetic to real transfer learning. IEEE Trans. Artif. Intell. 2024, 5, 4232–4246. [Google Scholar] [CrossRef]
  47. Fischer, T.; Chang, H.J.; Demiris, Y. Rt-gene: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–352. [Google Scholar]
  48. Wang, H.; Oh, J.O.; Chang, H.J.; Na, J.H.; Tae, M.; Zhang, Z.; Choi, S.I. Gazecaps: Gaze estimation with self-attention-routed capsules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2669–2677. [Google Scholar]
  49. Qin, J.; Zhang, X.; Sugano, Y. UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training. arXiv 2025, arXiv:2502.02307. [Google Scholar]
  50. Zhong, Y.; Lee, S.H. GazeSymCAT: A Symmetric Cross-Attention Transformer for Robust Gaze Estimation under Extreme Head Poses and Gaze Variations. J. Comput. Des. Eng. 2025, 12, qwaf017. [Google Scholar]
  51. Sinha, D.; El-Sharkawy, M. Thin mobilenet: An enhanced mobilenet architecture. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; pp. 0280–0285. [Google Scholar]
  52. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
Figure 1. The overall network structure of MSMI-Net consists of an eye image stream, where features are extracted through multiple layers of ConvNext modules, and a fusion stream, where low-dimensional features are extracted through convolution and feature fusion.
Figure 2. Detailed structure of the eye image stream. Fl and Fr denote the left- and right-eye image features learned through the eye stream.
Figure 3. Fusion stream architecture and the JointCBAM feature fusion module.
Figure 4. Adaptive Weight Adjustment Mechanism.
Figure 5. Joint Convolutional Block Attention Module.
Figure 6. Leave-one-out evaluation of the baseline, FullFace, and our model on the MPIIFaceGaze dataset.
Figure 7. Leave-one-out validation of the baseline and our model on the EyeDiap dataset.
Figure 8. Comparison of backbone networks: MSMI-Net(R) uses ResNet as the backbone, and MSMI-Net(M) uses MobileNet as the backbone.
Figure 9. Visualization results. The first row shows the input images, while the second and third rows display the ground truth and the estimated results from the proposed method, respectively.
Figure 10. Data distributions of the EyeDiap, MPIIFaceGaze, and Gaze360 datasets.
Table 1. Experimental errors of related methods. We conduct experimental validation on the MPIIFaceGaze, EyeDiap, and Gaze360 datasets. Dilated-Net is selected as the baseline.

| Model       | MPIIFaceGaze (°) | EyeDiap (°) | Gaze360 (°) |
|-------------|------------------|-------------|-------------|
| AFF-Net     | 4.34             | 6.41        | -           |
| Dilated-Net | 4.44             | 6.17        | 13.73       |
| CA-Net      | 4.27             | 5.27        | 11.20       |
| RT-GENE     | 4.66             | 6.02        | 12.26       |
| FullFace    | 4.93             | 6.53        | 14.90       |
| ARE-Net     | 5.00             | 6.1         | -           |
| MSGazeNet   | 4.64             | 5.86        | -           |
| VGE-Net     | 3.90             | 6.6         | -           |
| GazeCaps    | 4.06             | **5.10**    | **10.04**   |
| UniGaze-H   | 4.51             | 5.88        | 12.37       |
| GazeSymCAT  | 4.11             | 5.13        | -           |
| MSMI-Net    | **3.85**         | 5.24        | 11.12       |

Bold text indicates the best performance among the current comparison methods.
Table 3. Ablation Experiments on Streams.

| Model                 | MPIIFaceGaze (°) |
|-----------------------|------------------|
| Without Fusion-Stream | 5.09             |
| Without Eye-Stream    | 4.32             |
| MSMI-Net              | 3.85             |
Table 4. Ablation Study of the Fusion Stream Module.

| Model              | MPIIFaceGaze (°) |
|--------------------|------------------|
| Without JointCBAM  | 4.67             |
| Without AWAM       | 4.43             |
| MSMI-Net           | 3.85             |
Table 2. Feature fusion module comparison experiment: MSMI-SE uses SE as the feature fusion module; MSMI-CBAM uses CBAM as the feature fusion module.

| Model     | MPIIFaceGaze (°) |
|-----------|------------------|
| MSMI-SE   | 4.19             |
| MSMI-CBAM | 3.97             |
| MSMI-Net  | 3.85             |
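For context on the Table 2 comparison, the MSMI-SE variant replaces the fusion attention with a squeeze-and-excitation block in the spirit of [52]. The sketch below shows a minimal channel recalibration of concatenated eye and face features; the concatenation scheme and channel counts are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Minimal squeeze-and-excitation style fusion (cf. Hu et al. [52]):
    concatenate eye and face features along channels, then re-weight channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        fused_c = 2 * channels
        self.excite = nn.Sequential(
            nn.Linear(fused_c, fused_c // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused_c // reduction, fused_c),
            nn.Sigmoid(),
        )

    def forward(self, eye_feat: torch.Tensor, face_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([eye_feat, face_feat], dim=1)       # (B, 2C, H, W)
        squeezed = x.mean(dim=(2, 3))                     # global average pooling
        scale = self.excite(squeezed)[:, :, None, None]   # per-channel weights in (0, 1)
        return x * scale                                  # channel-recalibrated fusion
```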
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
