DGAGaze: Gaze Estimation with Dual-Stream Differential Attention and Geometry-Aware Temporal Alignment

Zhang, Wei; Li, Pengcheng

doi:10.3390/app16073298


by Wei Zhang and Pengcheng Li *

School of Computer Science, University of South China, Hengyang 421001, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3298; https://doi.org/10.3390/app16073298

Submission received: 7 March 2026 / Revised: 22 March 2026 / Accepted: 23 March 2026 / Published: 29 March 2026

(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)


Abstract

Gaze estimation plays a crucial role in human-computer interaction and behavior analysis. However, in dynamic scenes, rigid head movements and rapid gaze shifts pose significant challenges to accurate gaze prediction. Most existing methods either process single-frame images independently or rely on long video sequences, making it difficult to simultaneously achieve strong performance and high computational efficiency. To address this issue, we propose DGAGaze, a gaze estimation framework based on a difference-driven spatiotemporal attention mechanism. This framework uses a geometry-aware temporal alignment module to mitigate interference from rigid head movements, compensating for them through pose estimation and affine feature warping, thereby achieving explicit decoupling between global head motion and local eye motion. Based on the aligned features, inter-frame differences are used to adjust spatial and channel attention weights, enhancing motion-sensitive representations without introducing an additional temporal modeling layer. Extensive experiments on the EyeDiap and Gaze360 datasets demonstrate the effectiveness of the proposed approach. DGAGaze achieves improved gaze estimation accuracy while maintaining a lightweight architecture based on a ResNet-18 backbone, outperforming existing state-of-the-art methods.

Keywords: differential features; gaze estimation; ResNet-18; lightweight network; spatial and channel attention

1. Introduction

The direction of gaze is an important signal that reveals human intentions [1] and attention. Accurate gaze estimation can greatly advance fields such as human-computer interaction [2], assistive technology [3], cognitive science [4], and Virtual/Augmented Reality (VR/AR) [5]. Early gaze estimation methods primarily relied on geometric models [6], which used specialized hardware systems such as infrared light sources and multiple cameras to track the pupil center and corneal reflections [7] to estimate gaze direction.

With the recent advances in convolutional neural networks (CNNs), appearance-based methods have become the mainstream approach. These methods use end-to-end learning to directly map the appearance of the eyes or face in a single RGB image to gaze direction, enabling gaze estimation in unconstrained settings. Early works [8,9,10] mainly used eye images for estimation, as eyes contain the most direct gaze information. Subsequent studies [9,11,12,13,14] integrated full-face and eye inputs to utilize contextual information such as head pose but at the cost of increased computational complexity. To reduce computational overhead, recent studies [15,16,17] have increasingly explored more compact facial representations for gaze estimation. Although CNN-based methods are effective at extracting local appearance patterns, they often remain limited in their ability to capture global contextual relationships across the face. This limitation becomes more evident in gaze estimation, where the eye region occupies only a small portion of the full-face image and the relevant appearance variations are often subtle. As a result, CNN-based models may struggle to sufficiently emphasize gaze-related cues while suppressing irrelevant facial information under complex backgrounds [18,19].

To address this issue, several recent studies have introduced Transformer-based architectures into gaze estimation [20,21,22,23]. By modeling long-range dependencies across the whole face, these methods can better capture contextual interactions among different facial regions. However, they still exhibit several limitations. Partitioning facial images into non-overlapping patches may disrupt structural continuity and weaken the representation of fine-grained cues, especially in the eye region, where subtle appearance details are critical for accurate gaze estimation. In addition, the self-attention mechanism in Transformers typically incurs substantially higher computational cost. Therefore, although both CNN-based and Transformer-based methods have achieved encouraging progress, they still face challenges in balancing local detail preservation, global context modeling, and computational efficiency. More importantly, most existing methods mainly operate on single-frame inputs and thus overlook the temporal continuity of gaze behavior. In real-world scenarios, gaze usually changes in a smooth and continuous manner. Incorporating temporal information from consecutive frames can help suppress frame-level noise, capture motion dynamics, and improve the robustness and accuracy of gaze estimation.

To better exploit such temporal cues, we estimate gaze from facial video streams rather than isolated single frames. However, existing temporal methods typically rely on recurrent neural networks (RNNs) [24,25], Long Short-Term Memory networks (LSTMs) [15,25], or 3D convolutional neural networks (3D-CNNs) [26,27,28,29] to model video sequences. Although these methods can improve temporal representation, they usually increase model complexity and inference latency, which limits their applicability in real-time and resource-constrained scenarios. Furthermore, naive inter-frame differencing is highly sensitive to rigid head motion, leading to spatial misalignment and making it difficult to disentangle eye-specific motion from global head motion.

Motivated by this, we propose DGAGaze, a lightweight and efficient gaze estimation framework for real-time applications. To balance efficiency and representation capability, DGAGaze adopts ResNet-18 as the backbone to extract features from two consecutive facial frames and reformulates short-term temporal modeling as a two-frame dynamic estimation problem. Specifically, a geometry-aware temporal alignment module is introduced to compensate for rigid head motion before inter-frame differencing, thereby reducing motion contamination from head displacement. In addition to the aligned features, a dual-stream differential attention module is designed to enhance motion-sensitive channels and spatial responses, enabling the network to integrate static appearance cues with short-term dynamic information effectively. In this way, DGAGaze captures informative gaze variation with low computational overhead, without relying on long temporal windows or heavy sequence modeling modules. Our main contributions are as follows:

We propose a geometry-aware temporal alignment module that explicitly compensates for rigid head motion via pose estimation and affine warping. This pre-processing step ensures that subsequent inter-frame differences predominantly capture non-rigid eye movements, effectively decoupling head pose variations from eye motion analysis.
We introduce a novel dual-stream spatiotemporal differential attention module (DSDA) that integrates differential-enhanced channel attention with 3D spatial-channel attention within a phased hybrid attention flow. This design enables the model to capture both fine-grained local details and spatiotemporal context from consecutive frames, thereby improving representation learning for gaze-related motion.
We present DGAGaze as a lightweight dynamic gaze estimation framework that achieves competitive performance on both EyeDiap and Gaze360 while maintaining low parameter complexity and computational cost, demonstrating a favorable balance between temporal modeling capability and deployment efficiency.

The remainder of this paper is organized as follows. Section 2 reviews the related work on gaze estimation and spatiotemporal modeling. Section 3 presents the proposed DGAGaze framework. Section 4 describes the experimental settings and reports the quantitative and qualitative results on public benchmark datasets. Finally, Section 5 concludes this paper and discusses possible future research directions.

2. Related Work

With the rapid development of deep learning, gaze estimation methods have continued to evolve. Currently, appearance-based gaze estimation has become the mainstream paradigm because of its practicality and adaptability to unconstrained environments. Based on the temporal structure of the input, gaze estimation methods can be divided into image-based and video-based methods: the former relies on analyzing single-frame images, while the latter analyzes dynamic dependencies across continuous frames.

2.1. Image-Based Gaze Estimation Methods

In gaze estimation, image-based methods mainly rely on architectures such as convolutional neural networks (CNNs) and Transformers. These methods remain dominant in the literature and can generally be divided into two feature-extraction strategies: extraction from complete facial images and extraction from the eye area. Zhang [17] was the first to apply a CNN to full-face images for gaze estimation, using spatial weighting on the feature maps to enhance or suppress regions associated with the eyes and head pose. The facial features are encoded using convolutional layers to strengthen gaze-related regions.

To further enhance feature representation capabilities, subsequent studies have explored more complex mechanisms for feature interactions. For example, Yang [30] proposed a gaze-target detection method based on head-local and global collaboration. This method integrates head, local, and global features through a cross-view consistency mechanism. Wang [31] used EfficientNet as the main architecture, extracted multi-scale features using dilated convolutions, and performed frequency analysis using Fast Fourier Transform (FFT). The feature relocalization module further improved the extraction effect of global features. Nagpure [32] explored the optimal architecture on the ETH-XGaze dataset [33]. They used a multi-resolution feature extractor and a Transformer to extract global and local facial features for gaze regression.

In addition, to achieve a better balance between speed and accuracy, many studies have explored how attention mechanisms can be more effectively integrated into lightweight network designs. For example, Oh [34] improved the efficiency of self-attention by combining convolution and deconvolution: they used convolution to project face images to capture local context and then deconvolution to recover spatial resolution. Han [35] proposed a collaborative alignment domain adaptation method in which a hybrid network comprising local convolutional layers and self-attention layers generates sophisticated pseudo-labels to capture local and global gaze-related features. Cheng [36] replaced depthwise convolution with unidirectional convolution and added spatial channel attention to highlight key features. Abdelrahman et al. [37] proposed MobGazeNet, which adopts a hierarchical architecture and integrates a Squeeze-and-Excitation unit, a convolutional block attention mechanism, and a channel attention structure in an incremental manner. This model emphasizes strengthening the nuances of the periocular region and accounts for local spatial entanglement and extensive global spatial interrelationships at multiple representation scales. Chen [38] proposed a gaze estimation method for multi-scale attention using sequence-mask decoupling, which distinguishes information related to eye-movement intention from redundant data independent of visual focus.

In parallel with these CNN-based and hybrid designs, Transformer-based architectures have recently demonstrated powerful global modeling capabilities in various computer vision tasks. Cheng [20] initially proposed a hybrid CNN and Transformer architecture whose test results surpassed those of pure CNN or pure Transformer architectures, demonstrating the potential of Transformers for gaze estimation. Zhao [39] proposed a binary model topology that utilizes Transformer branches for long-range contextual association of interdependent features and uses CNN modules to maintain the fidelity of local texture construction. To reduce subject-specific bias, Chen [40] proposed a decomposition strategy that decomposes the task into a gaze-independent component predicted by a deep convolutional neural network (CNN) and a gaze-related offset corrected using a small number of data samples. Gaze estimation is then performed by jointly optimizing both components.

Overall, while these image-based methods perform well in controlled environments, they are often limited in real-world scenarios with motion variations. This limitation mainly arises because such methods do not explicitly model the temporal continuity of gaze behavior.

2.2. Video-Based Temporal Gaze Estimation

Video-based methods model the temporal sequence of images to capture dynamic changes in gaze, thereby improving the model’s robustness to sudden occlusions or head movements. These approaches typically incorporate recurrent or attention mechanisms to capture dependencies between consecutive frames. Kellnhofer [15] was the first to introduce temporal modeling (a bidirectional LSTM) for single-frame gaze direction estimation, improving robustness to large head poses by fusing multi-frame contextual information and by introducing an uncertainty estimation mechanism to predict model confidence.

With the popularization of the Transformer architecture, subsequent temporal models have also begun to leverage its powerful sequence modeling capabilities. For example, Vuillecard [41] combined 3D video datasets with gaze-following labels to generate pseudo-3D annotations and used a Gaze Transformer (GaT) to process both images and videos. Melnyk [42] evaluated performance on different eye movement types (gaze and saccades) using Transformer-based temporal sequence representation learning (TST), Kalman-filter-embedded eye movement models, and LSTMs. Guan [43] captured head-face-eye spatiotemporal interactions, using a query-based learning method to extract per-frame features into video tensors, enabling intra-frame feature exchange and temporal interaction for gaze regression. Jindal [25] tracked spatial dynamics in videos with spatial attention, combined with LSTMs to model temporal dependencies, and used Gaussian processes for personalized gaze estimation.

Although video-based methods can provide stronger temporal modeling, they often rely on long input sequences and relatively complex architectures, which increase computational cost and inference latency. This limitation makes them less suitable for real-time or resource-constrained deployment. In contrast, our work focuses on a lightweight temporal modeling strategy based on only two consecutive frames. The key idea is to capture short-term gaze-related motion through geometry-aware alignment and difference-guided attention, thereby preserving useful temporal cues without introducing the overhead of long-range sequence modeling.

3. Methods

This section presents the proposed DGAGaze framework for appearance-based gaze estimation. Section 3.1 provides an overview of the overall framework. Section 3.2 introduces the feature-extraction module for generating spatial representations from adjacent frames. Section 3.3 describes the geometry-aware temporal alignment module, which compensates for rigid head motion before inter-frame differencing. Section 3.4 presents the dual-stream spatiotemporal difference attention module for motion-aware feature enhancement. Section 3.5 details the output layer and the loss formulation for gaze regression. Finally, Section 3.6 summarizes the overall forward procedure of DGAGaze in the form of pseudocode for clarity.

The methodological design of DGAGaze is built upon both classical and recent studies in gaze estimation. Earlier references primarily support foundational aspects of this work, including benchmark datasets, baseline formulations, and optimization and feature modeling strategies that remain widely adopted in current research. Meanwhile, the design of the proposed framework and its comparative analysis are grounded in the latest advances in this field.

3.1. Architecture Overview

This framework employs a geometrically normalized differential attention mechanism to construct transient gaze trajectories. This mechanism takes two adjacent facial image frames as input to capture subtle motion cues related to gaze shifts.

As shown in Figure 1, this architecture consists of four components: the feature extraction module, the geometry-aware temporal alignment module, the dual-stream spatio-temporal differential attention module, and the regression head that outputs the final gaze prediction.

The model processes two consecutive frames through a shared backbone network to obtain a compact spatial feature representation, providing an appearance foundation for subsequent temporal inference. However, directly comparing these features may capture differences primarily driven by global head motion rather than gaze-related dynamic changes. To address this issue, a geometry-aware temporal alignment module is introduced before computing the difference features. This module estimates relative changes in head pose between adjacent frames and applies feature-level affine transformations to compensate for rigid-body motion. After this alignment step, the remaining inter-frame differences are more likely to reflect subtle gaze-specific motions rather than large-scale head rotations or translations.

Refined dynamic features are aggregated and projected onto the gaze angle using a lightweight regression head. Under gaze supervision, the framework is trained end-to-end, with head pose supervision to ensure stable geometric alignment. The architecture integrating geometric alignment and differential attention captures basic gaze dynamics from just two consecutive frames. This design achieves excellent predictive accuracy while remaining lightweight.

The proposed framework is intentionally built around two consecutive frames rather than a longer sequence. Our motivation is that short-term gaze variation is usually sufficient to reveal meaningful motion cues for appearance-based gaze estimation. At the same time, longer temporal windows often introduce redundant information, higher computational cost, and increased sensitivity to frame misalignment. Experimental results show that extending the temporal input from 2 to 3 or 5 frames does not consistently improve accuracy but substantially increases model size and FLOPs. Therefore, the two-frame design is adopted as a practical trade-off between temporal sensitivity and lightweight deployment.

3.2. Feature Extraction Module

Feature extraction is fundamental to gaze estimation. Since DGAGaze is designed to model dynamic gaze cues efficiently, a more compact backbone is better suited to the overall framework. Compared with deeper backbones such as ResNet-50, ResNet-18 maintains the model’s lightweight nature while still providing sufficient representational capacity for facial appearance feature extraction. Therefore, ResNet-18 is adopted as the backbone network in this work. We initialize ResNet-18 with GazeTR-H-ETH [20] pretrained weights. Furthermore, instead of directly using the standard fully connected layer at the end of ResNet-18 as the feature output, this study introduces a 1 × 1 projection layer after the last residual block to remap the high-dimensional feature maps produced by the backbone network, reducing the channel dimension from 512 to 32. This design not only lowers the feature dimensionality but also preserves the two-dimensional spatial structure, thereby providing a suitable basis for subsequent spatial alignment, difference modeling, and attention enhancement. In contrast, a standard fully connected layer typically requires flattening the feature map into a one-dimensional vector first, which may destroy the original spatial layout information and is therefore unfavorable for modeling local region relationships and inter-frame spatial correspondences. By comparison, the 1 × 1 convolution can perform linear combination and remapping across channels without changing the spatial resolution.

We select two consecutive frames, $X_{t-1}$ and $X_{t}$, and feed them into the same feature extractor with shared weights, yielding two spatial feature maps, $F_{t-1}$ and $F_{t}$, of size $7 \times 7 \times 32$. These feature maps serve as the shared input to the subsequent temporal modeling pipeline. In particular, the previous-frame feature is first geometrically aligned with the current frame to suppress rigid head-motion interference, and the aligned pair is then forwarded to the differential attention module for motion-aware feature enhancement.
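The behavior of the 1 × 1 projection described above can be illustrated with a minimal pure-Python sketch. It is an illustration rather than the reference implementation: the channels-first layout, the presence of a bias term, the toy sizes (4 input channels, 2 output channels, a 2 × 2 grid standing in for the 512 → 32 projection on the 7 × 7 map), and the name `project_1x1` are all assumptions.

```python
def project_1x1(feat, weight, bias):
    """Apply a 1x1 convolution to a feature map.

    feat:   [C_in][H][W] nested lists (channels-first layout, an assumption).
    weight: [C_out][C_in] channel-mixing matrix; bias: [C_out].
    Every spatial position is mapped independently, so H and W are unchanged.
    """
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    out = []
    for o, row in enumerate(weight):
        plane = [[bias[o] + sum(row[c] * feat[c][i][j] for c in range(c_in))
                  for j in range(w)]
                 for i in range(h)]
        out.append(plane)
    return out


# Toy stand-in for the 512 -> 32 projection: 4 channels -> 2 channels on a 2x2 grid.
feat = [[[float(c + i + j) for j in range(2)] for i in range(2)] for c in range(4)]
weight = [[0.25, 0.25, 0.25, 0.25],   # output channel 0: mean over input channels
          [1.0, 0.0, 0.0, -1.0]]      # output channel 1: channel 0 minus channel 3
bias = [0.0, 0.0]
proj = project_1x1(feat, weight, bias)

# Parameter count of a 512 -> 32 projection with bias (weights + biases).
params_1x1 = 512 * 32 + 32
```

The output keeps the 2 × 2 spatial grid while the channel count drops from 4 to 2, which is the property the alignment and differencing steps rely on. Under the stated bias assumption, the full-size projection would hold 16,416 parameters.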

3.3. Geometry-Aware Temporal Alignment

In dynamic gaze estimation, inter-frame variations contain not only gaze-related information from subtle eye movements but also global motion components introduced by head rotation. To extract gaze-related dynamic information more accurately, the proposed method first estimates the relative head pose between adjacent frames using geometric information. Then it performs an affine transformation at the feature level to achieve temporal alignment, as illustrated in Figure 2. This process effectively suppresses the influence of global displacement caused by head movement. As a result, the subsequently computed feature differences no longer primarily reflect the overall raw inter-frame motion but are instead transformed into geometrically normalized dynamic representations. This enables the model to capture eye movement patterns more accurately and provides more reliable, discriminative dynamic cues to the subsequent attention module.

Given the two feature maps $F_{t-1}, F_{t} \in \mathbb{R}^{H' \times W' \times C}$, we first estimate their corresponding head pose parameters through a lightweight pose regression branch. The PoseHead is a small multilayer perceptron (MLP) consisting of two fully connected layers and ReLU activations. It takes globally pooled feature vectors as input and outputs yaw, pitch, and roll angles. Formally,

$$p_{t} = \mathrm{PoseHead}(F_{t}), \tag{1}$$

$$p_{t-1} = \mathrm{PoseHead}(F_{t-1}). \tag{2}$$

The relative pose variation between the two frames is then computed as the difference between their pose parameters:

$$\Delta p = p_{t} - p_{t-1}. \tag{3}$$

Based on the estimated pose difference, we generate an affine transformation matrix that aligns the previous frame’s feature map to the current frame. This is achieved by another lightweight MLP, AffineHead, which projects the 3D pose difference $\Delta p$ into a 6-dimensional parameter vector that defines a 2D affine transformation, including translation, rotation, scaling, and shear. The affine matrix $T \in \mathbb{R}^{2 \times 3}$ is then constructed from these parameters:

$$T = \mathrm{AffineHead}(\Delta p). \tag{4}$$

Using $T$, we warp the previous feature map to align it with the current frame. The warping operation applies a spatial transformation to each channel independently using bilinear interpolation, which is differentiable and allows gradients to flow back to the pose estimation heads:

$$F_{t-1}^{\mathrm{aligned}} = \mathrm{Warp}(F_{t-1}, T). \tag{5}$$

After alignment, the motion-aware differential feature is computed as the element-wise difference between the current feature map and the aligned previous feature map:

$$F_{\mathrm{diff}} = F_{t} - F_{t-1}^{\mathrm{aligned}}. \tag{6}$$
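Equations (5) and (6) can be illustrated on a single-channel map with a small pure-Python sketch. It is a sketch under stated assumptions, not the reference implementation: the text does not specify the coordinate convention of the warp, so normalized coordinates in [-1, 1] with pixel-center sampling and zero padding outside the map are assumed here, and the names `warp_affine` and `subtract` as well as the toy data are hypothetical. In the full model, $T$ would come from the AffineHead rather than being set by hand.

```python
from math import floor


def warp_affine(feat, T):
    """Bilinearly resample a 2D map with a 2x3 affine matrix T.

    T maps normalized output coordinates (x, y, 1) to normalized source
    coordinates; locations outside the map contribute zero (assumed padding).
    """
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Normalized coordinates of the output pixel centre.
            x = (2 * j + 1) / W - 1
            y = (2 * i + 1) / H - 1
            # Source location, first normalized, then in pixel units.
            xs = T[0][0] * x + T[0][1] * y + T[0][2]
            ys = T[1][0] * x + T[1][1] * y + T[1][2]
            px = ((xs + 1) * W - 1) / 2
            py = ((ys + 1) * H - 1) / 2
            x0, y0 = floor(px), floor(py)
            ax, ay = px - x0, py - y0
            val = 0.0
            # Bilinear blend of the four neighbouring source pixels.
            for yy, wy in ((y0, 1 - ay), (y0 + 1, ay)):
                for xx, wx in ((x0, 1 - ax), (x0 + 1, ax)):
                    if 0 <= yy < H and 0 <= xx < W:
                        val += wy * wx * feat[yy][xx]
            out[i][j] = val
    return out


def subtract(a, b):
    """Element-wise difference of two equally sized 2D maps."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


prev = [[float(4 * i + j) for j in range(4)] for i in range(4)]
# Current frame: the same content shifted one pixel to the left (rigid motion) ...
cur = [row[1:] + [0.0] for row in prev]
# ... plus one local change standing in for eye motion.
cur[1][1] += 1.0

# A translation of 0.5 in normalized units equals one pixel when W = 4.
T_shift = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
aligned = warp_affine(prev, T_shift)
naive_diff = subtract(cur, prev)        # differencing without alignment
aligned_diff = subtract(cur, aligned)   # Equation (6)
```

In this toy case the unaligned difference is nonzero at every position because the rigid shift dominates, whereas the aligned difference is nonzero only at the single locally changed position.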

The resulting differential representation primarily captures non-rigid changes, such as eye movements and subtle facial variations, while suppressing variations caused by global head rotation. In this way, the alignment module decouples rigid motion from local ocular dynamics, providing better-aligned inputs to the dual-stream attention module. Concretely, the aligned previous-frame feature and the current-frame feature are jointly used to construct the residual differential representation, which subsequently guides both channel-wise and spatial attention refinement in the DSDA module. The effectiveness of this design is further validated by the ablation results in Table 1; detailed discussion is provided in Section 4.3.

Moreover, we supervise the learning of head pose features using pre-generated pseudo-labels from 6DRepNet [44] to enhance its robustness. In practical settings, only the yaw and pitch angles are considered, as the roll angle contributes minimally to gaze direction estimation.

3.4. Dual-Stream Spatiotemporal Difference Attention Module

The dual-stream spatiotemporal difference attention module (DSDA) is the core component of this framework. It models gaze dynamics from a difference-driven perspective to obtain motion-sensitive cues, thereby avoiding the computational overhead of long-sequence modeling. The physiological characteristics of gaze dynamics inform this design: gaze behavior is characterized mainly by continuous, low-frequency changes, while eye movements, although frequent, are typically brief. A 33-millisecond sampling interval is sufficient to capture meaningful spatiotemporal differences [45] without the need for modeling long-range dependencies in long sequences. Based on this, we construct a two-stream structure, as shown in Figure 3a. One stream retains the static appearance information of the current frame, while the other stream uses difference cues to emphasize motion-sensitive features. The inter-frame differential representation is computed as:

$$F_{\mathrm{diff}}' = F_{t}' - F_{t-1}^{\mathrm{aligned}\,\prime}. \tag{7}$$

The differential feature provides an informative signal for attention modulation. To emphasize channels sensitive to gaze-related motion, a difference-aware channel modulation mechanism is introduced, as shown in Figure 3b. Guided by $F_{\mathrm{diff}}'$, channel-wise weights $w_{\mathrm{diff}}$ are computed to amplify motion-sensitive features while suppressing redundant or static responses. The process is formulated as:

$$w = w_{\mathrm{orig}} \times (1 + \alpha \cdot w_{\mathrm{diff}}). \tag{8}$$

Here, $w_{\mathrm{orig}}$ denotes the original channel attention weights, $\times$ denotes element-wise multiplication, and $\alpha = 0.1$ is a tunable hyperparameter that controls the strength of the multiplicative modulation imposed by the differential feature on the original channel attention weights. In this design, the differential branch is not intended to replace the static appearance stream but rather to serve as a lightweight motion-guidance signal that refines the appearance-driven channel responses. Therefore, $\alpha$ is set to $0.1$ to prevent the differential feature from overwhelming the original attention distribution. Empirically, using a small modulation coefficient leads to more stable optimization and reduces the risk of over-amplifying noisy inter-frame variations. Under this mechanism, when a channel exhibits more pronounced temporal changes, $w_{\mathrm{diff}}$ becomes larger, thereby increasing the corresponding channel weight and highlighting more informative dynamic responses.
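The modulation in Equation (8) can be made concrete with a short sketch. Only the combination rule is taken from the text; how $w_{orig}$ and $w_{diff}$ are produced is not reproduced here, so the example values, the assumption that the difference weights lie in [0, 1], and the function name `modulate_channel_weights` are hypothetical.

```python
def modulate_channel_weights(w_orig, w_diff, alpha=0.1):
    """Equation (8): w = w_orig * (1 + alpha * w_diff), applied per channel."""
    if len(w_orig) != len(w_diff):
        raise ValueError("w_orig and w_diff must have one entry per channel")
    return [wo * (1.0 + alpha * wd) for wo, wd in zip(w_orig, w_diff)]


# Three channels with identical appearance-driven weights but different motion scores.
w_orig = [0.6, 0.6, 0.6]
w_diff = [0.0, 0.5, 1.0]   # hypothetical scores, assumed to lie in [0, 1]
w = modulate_channel_weights(w_orig, w_diff)
```

A channel with no temporal change keeps its original weight, while the most motion-sensitive channel is scaled by a factor of 1.1. If the difference weights are indeed bounded by 1, then with $\alpha = 0.1$ no channel weight can grow by more than 10%, which is consistent with the stated goal of keeping the differential branch a light guidance signal.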

Following channel-wise modulation, spatial attention refinement is further guided by differential cues. Compared with generic self-attention mechanisms, gaze estimation typically relies on subtle local variations around the eyes and adjacent facial regions rather than long-range semantic interactions across the entire face. Therefore, a lightweight neuron-level importance modeling strategy is better suited to this task. The adopted energy formulation highlights neurons that deviate from the local statistical context, which is consistent with the observation that gaze-related responses are often sparse and localized. This design enables the model to enhance fine-grained motion-sensitive regions while adaptively emphasizing spatial locations undergoing meaningful changes, without incurring the substantial computational overhead of full self-attention.

Based on energy function theory in neuroscience, the module evaluates the relationships among neurons within the same channel to determine their importance. For each channel, an energy function $E_{t}$ is calculated to assess the interaction between the target neuron and all other neurons $x_{i}$ in that channel. For a given feature map, the energy function is defined as:

$$E_{t}(w_{t}, b_{t}, x_{i}) = \frac{1}{M-1} \sum_{i=1}^{M-1} \left(-1 - (w_{t} x_{i} + b_{t})\right)^{2} + \left(1 - (w_{t} t + b_{t})\right)^{2} + \lambda w_{t}^{2}. \tag{9}$$

By minimizing the above energy function, an analytical solution can be obtained, greatly simplifying computation. The minimum energy $E_{t}^{*}$ can be expressed as:

$$E_{t}^{*} = \frac{4(\hat{\sigma}^{2} + \lambda)}{(\hat{t} - \hat{\mu})^{2} + 2\hat{\sigma}^{2} + 2\lambda}. \tag{10}$$

Here, $\hat{t}$ represents the value of the target neuron $t$, and $\hat{\mu}$ and $\hat{\sigma}^{2}$ are the mean and variance of all other neurons in the same channel, excluding the target neuron $t$. $M$ is the total number of neurons in that channel, and $\lambda$ is a regularization hyperparameter. According to neuroscience theory, neurons with lower energy values have greater differences from their surrounding neurons and are therefore more important. Accordingly, a neuron’s importance is inversely proportional to its energy, i.e., $1 / E_{t}^{*}$. To simplify the computation of the energy-based attention, we adopt a channel-wise approximation in which the mean $\mu$ and variance $\sigma^{2}$ of the entire channel are used to approximate $\hat{\mu}$ and $\hat{\sigma}^{2}$. To quantify the effect of this approximation, we further compared the approximate and exact formulations in terms of both accuracy and efficiency under the current experimental setting. The results show that the average absolute difference between the resulting attention maps is only $2.46 \times 10^{-3}$, while the final angular error changes by only approximately $0.01^{\circ}$ to $0.03^{\circ}$. In addition, under the current module-level benchmark, the approximation reduces the runtime of the SimAM attention computation by 45.8% and the overall DSDA module runtime by 29.5%. Although this approximation does not change the asymptotic complexity order, it simplifies the leave-one-out statistical computation and provides clear practical benefits in computational efficiency. With this approximation, the energy function is simplified as:

$$e(t) = \frac{(t - \mu)^{2}}{4(\sigma^{2} + \lambda)} + 0.5. \tag{11}$$
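The constant 0.5 in Equation (11) can be traced back to Equation (10). The short derivation below takes the reciprocal of Equation (10), which is the importance measure $1 / E_{t}^{*}$ named above, and assumes only the channel-wise substitution of $t$, $\mu$, and $\sigma^{2}$ for $\hat{t}$, $\hat{\mu}$, and $\hat{\sigma}^{2}$:

```latex
\begin{aligned}
\frac{1}{E_{t}^{*}}
  &= \frac{(t - \mu)^{2} + 2\sigma^{2} + 2\lambda}{4(\sigma^{2} + \lambda)} \\
  &= \frac{(t - \mu)^{2}}{4(\sigma^{2} + \lambda)}
     + \frac{2(\sigma^{2} + \lambda)}{4(\sigma^{2} + \lambda)} \\
  &= \frac{(t - \mu)^{2}}{4(\sigma^{2} + \lambda)} + 0.5 = e(t).
\end{aligned}
```

Under this substitution, $e(t)$ grows with the squared deviation of a neuron from its channel mean and is never smaller than 0.5.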

Finally, the attention weights are obtained using a Sigmoid activation, $A = \mathrm{sigmoid}(-e(t))$. These weights are then applied element-wise to the input feature map, allowing the network to suppress redundant information and enhance key feature representations. The attention-refined feature maps from the two streams, $F_{t-1}'$ and $F_{t}'$, are then globally average-pooled across spatial dimensions to produce feature vectors $V_{t-1}$ and $V_{t}$. The differential vector is computed as $V_{\mathrm{diff}} = V_{t} - V_{t-1}$. Finally, these vectors are concatenated to form the dynamic feature vector:

$$V_{\mathrm{dynamic}} = \mathrm{concat}(V_{t-1}, V_{\mathrm{diff}}, V_{t}). \tag{12}$$

The dynamic feature vector preserves the static appearance information of two adjacent frames while explicitly encoding their differential motion cues. It provides the regression head with both appearance context and short-term dynamic information, thereby supporting more precise gaze estimation.
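Two of the steps above, the per-neuron term of Equation (11) and the construction of the dynamic vector in Equation (12), can be sketched in pure Python. This is an illustrative sketch rather than the reference implementation: the value of `lam`, the use of the population variance, the channels-first layout, and all function names are assumptions, and the Sigmoid weighting step between the two is deliberately left out.

```python
def energy_term(plane, lam=1e-4):
    """Equation (11) for every neuron t of one channel.

    plane: [H][W] nested lists. lam is an assumed value; the text does not state it.
    The mean and (population) variance are taken over the whole channel,
    following the channel-wise approximation.
    """
    vals = [v for row in plane for v in row]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)
    return [[(v - mu) ** 2 / (4.0 * (var + lam)) + 0.5 for v in row] for row in plane]


def global_avg_pool(feat):
    """[C][H][W] -> [C] by averaging each channel over its spatial positions."""
    return [sum(v for row in plane for v in row) / (len(plane) * len(plane[0]))
            for plane in feat]


def dynamic_vector(f_prev, f_cur):
    """Equation (12): concat(V_{t-1}, V_diff, V_t) with V_diff = V_t - V_{t-1}."""
    v_prev, v_cur = global_avg_pool(f_prev), global_avg_pool(f_cur)
    v_diff = [c - p for c, p in zip(v_cur, v_prev)]
    return v_prev + v_diff + v_cur


# A channel with a single neuron that deviates from its surroundings.
plane = [[0.0, 0.0, 0.0],
         [0.0, 8.0, 0.0],
         [0.0, 0.0, 0.0]]
e = energy_term(plane)

# Two toy 2-channel maps standing in for the attention-refined features.
f_prev = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
f_cur = [[[1.0, 3.0], [1.0, 3.0]], [[2.0, 2.0], [2.0, 2.0]]]
v_dynamic = dynamic_vector(f_prev, f_cur)
```

The deviating neuron receives the largest value of $e(t)$ in its channel, and the dynamic vector has three times as many entries as there are channels. If the refined maps keep the 32 channels of the projected backbone features, $V_{dynamic}$ would therefore have 96 entries.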

3.5. Output Layer

We use a structured multilayer perceptron (MLP) to regress the high-dimensional dynamic feature vector

V_{dynamic}

from the DSDA module into pitch (

\hat{p}

) and yaw (

\hat{y}

) angles. The MLP maps the input features to a new dimension and applies normalization and a nonlinear activation after the mapping, helping the network learn features more effectively and accelerating convergence. A hyperbolic tangent (Tanh) function is used as the final activation, constraining the output values to the range [−1,1]. Finally, the output of the Tanh function is multiplied by

π / 2

to linearly map it to the gaze angle range

[- π / 2, π / 2]

in radians, producing the final pitch and yaw predictions

\hat{g} = (\hat{p}, \hat{y})

. During training, we employ the L1 loss to measure the absolute difference between the prediction and the target [46]. The standard L1 loss is defined as:

L_{1} = \frac{1}{N} \sum_{i = 1}^{N} |{\hat{z}}_{i} - z_{i}| .

(13)

Based on this formulation, the overall training objective is constructed as a weighted sum of the gaze regression loss and the auxiliary head pose supervision loss. The overall loss is given by:

L = \frac{1}{N} \sum_{i = 1}^{N} {∥{\hat{g}}_{i} - g_{i}∥}_{1} + λ \frac{1}{N} \sum_{i = 1}^{N} {∥{\hat{h}}_{i} - h_{i}∥}_{1} .

(14)

Here,

λ

represents the weight assigned to the head pose loss function and is set to 0.1 empirically. This relatively small weight allows head-pose supervision to regularize the geometry-aware alignment module without overwhelming the primary gaze regression objective. In this way, pose estimation serves as an auxiliary geometric guidance signal rather than a competing task. The gradient of the loss function is computed via backpropagation, and the network weights are updated using the Adam optimizer [47]. The detailed optimization settings are provided in Section 4.1.
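The output scaling and the objective in Equation (14) can be illustrated with a short Python sketch. It is a sketch under stated assumptions: the function names are hypothetical, the batch values are made up, and the head pose is represented here by two angles purely for illustration, since the pose parameterization is not specified in this section.

```python
import math


def scale_to_angles(raw_pitch, raw_yaw):
    """Squash raw MLP outputs with tanh, then scale by pi/2 (radians)."""
    half_pi = math.pi / 2.0
    return (half_pi * math.tanh(raw_pitch), half_pi * math.tanh(raw_yaw))


def l1_norm_mean(pred, target):
    """Mean over samples of the L1 norm of (pred_i - target_i)."""
    total = 0.0
    for p, t in zip(pred, target):
        total += sum(abs(a - b) for a, b in zip(p, t))
    return total / len(pred)


def total_loss(gaze_pred, gaze_gt, pose_pred, pose_gt, lam=0.1):
    """Eq. (14): gaze L1 term plus lam times the head-pose L1 term."""
    return l1_norm_mean(gaze_pred, gaze_gt) + lam * l1_norm_mean(pose_pred, pose_gt)


# Made-up batch of two samples; all angles are in radians.
gaze_pred = [scale_to_angles(0.3, -0.2), scale_to_angles(-1.0, 0.5)]
gaze_gt = [(0.40, -0.30), (-1.10, 0.70)]
pose_pred = [(0.10, 0.00), (0.20, -0.10)]  # two pose angles: an assumption
pose_gt = [(0.00, 0.00), (0.20, 0.10)]
print(total_loss(gaze_pred, gaze_gt, pose_pred, pose_gt))
```

Because the tanh output is bounded, the predicted angles can never leave the interval from −π/2 to π/2, regardless of the magnitude of the raw MLP outputs.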

3.6. Overall Forward Procedure

To further clarify the execution pipeline of the proposed DGAGaze framework, the overall forward procedure is summarized in Algorithm 1. Specifically, given two consecutive face frames, a shared ResNet-18 backbone with a

1 \times 1

projection layer is first used to extract compact spatial feature maps. The geometry-aware temporal alignment module then estimates the relative head motion. It warps the previous-frame feature to the current-frame feature space, so that rigid motion can be reduced before temporal differencing. Based on the aligned feature pair, the dual-stream differential attention module performs motion-guided feature enhancement and constructs a compact dynamic representation. The final gaze angles are then obtained through the regression head.

Algorithm 1 Overall Forward Procedure of DGAGaze

Require: Two consecutive face frames $X_{t - 1}, X_{t}$
Ensure: Predicted gaze angles $\hat{g} = (\hat{p}, \hat{y})$
1: Shared feature extraction:
2: $F_{t - 1} \leftarrow ϕ (X_{t - 1}), F_{t} \leftarrow ϕ (X_{t})$
3: Geometry-aware temporal alignment:
4: $p_{t - 1} \leftarrow PoseHead (F_{t - 1}), p_{t} \leftarrow PoseHead (F_{t})$
5: $Δ p \leftarrow p_{t} - p_{t - 1}$
6: $T \leftarrow AffineHead (Δ p)$
7: $F_{t - 1}^{aligned} \leftarrow Warp (F_{t - 1}, T)$
8: Difference-guided dual-stream attention:
9: $F_{diff} \leftarrow F_{t} - F_{t - 1}^{aligned}$
10: Compute difference-aware channel modulation using $F_{diff}$
11: Refine spatial responses of $F_{t - 1}^{aligned}$ and $F_{t}$ with SimAM-guided attention
12: Spatial aggregation and dynamic feature construction:
13: $V_{t - 1} \leftarrow Pool (F_{t - 1}^{aligned}, A_{before})$
14: $V_{t} \leftarrow Pool (F_{t}, A_{after})$
15: $V_{diff} \leftarrow V_{t} - V_{t - 1}$
16: $V_{dynamic} \leftarrow concat (V_{t - 1}, V_{diff}, V_{t})$
17: Gaze regression:
18: $\hat{g} \leftarrow MLP (V_{dynamic})$
19: return $\hat{g}$
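The data flow of Algorithm 1 can also be written as a small Python function in which every component is passed in as a callable. This is only a structural sketch: the function name `dgagaze_forward`, its parameter names, and the toy stand-in components below are hypothetical and are not the modules of the released implementation. Features are represented as flat lists of floats for simplicity.

```python
def dgagaze_forward(x_prev, x_cur, backbone, pose_head, affine_head, warp,
                    attend_and_pool, regress):
    """Data flow of Algorithm 1; all components are supplied by the caller."""
    # Steps 1-2: shared feature extraction (same backbone for both frames).
    f_prev, f_cur = backbone(x_prev), backbone(x_cur)
    # Steps 3-7: estimate relative head motion and warp the previous features.
    delta_pose = [c - p for c, p in zip(pose_head(f_cur), pose_head(f_prev))]
    f_prev_aligned = warp(f_prev, affine_head(delta_pose))
    # Steps 8-11: the aligned difference guides the attention.
    f_diff = [c - p for c, p in zip(f_cur, f_prev_aligned)]
    # Steps 12-16: pool both streams and build the dynamic vector.
    v_prev = attend_and_pool(f_prev_aligned, f_diff)
    v_cur = attend_and_pool(f_cur, f_diff)
    v_diff = [c - p for c, p in zip(v_cur, v_prev)]
    v_dynamic = v_prev + v_diff + v_cur
    # Steps 17-19: regress (pitch, yaw) from the dynamic vector.
    return regress(v_dynamic)


# Toy stand-ins so the sketch runs; none of these is the paper's module.
g_hat = dgagaze_forward(
    [1, 2, 3], [2, 3, 4],
    backbone=lambda x: [float(v) for v in x],
    pose_head=lambda f: [sum(f) / len(f)],          # one fake pose value
    affine_head=lambda dp: dp[0],                   # fake transform: a shift
    warp=lambda f, shift: [v + shift for v in f],   # fake warp: add the shift
    attend_and_pool=lambda f, d: [sum(f) / len(f)], # ignores d; plain mean
    regress=lambda v: (v[1], v[1]),                 # echoes V_diff as both angles
)
print(g_hat)
```

In this toy setting the second frame is a uniformly shifted copy of the first, which plays the role of a purely rigid motion. The fake alignment step removes the shift, so the differential part of the dynamic vector is zero, which mirrors the intent of aligning features before temporal differencing.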

4. Experiments

This section presents the experimental evaluation of the proposed DGAGaze model on public benchmark datasets. Section 4.1 first describes the implementation details, including the training settings, optimization strategy, and evaluation metric. Section 4.2 then introduces the Gaze360 and EyeDiap datasets used in this work. Section 4.3 reports the ablation studies, including component analysis, temporal-frame comparison, and validation of statistical significance. Section 4.4 compares the proposed method with existing state-of-the-art gaze estimation approaches in terms of both accuracy and efficiency.

4.1. Implementation Details

All experiments were conducted on Windows 10. The hardware platform consisted of an NVIDIA TITAN RTX GPU, an AMD Ryzen 3960X CPU, and 64 GB RAM. The proposed model was implemented using PyTorch [48] 2.4.1, together with CUDA 11.8, torchvision 0.19.1, and OpenCV 4.13.0 for training and evaluation. Data preprocessing resized all input images to

224 \times 224

pixels. The batch sizes for the Gaze360 and EyeDiap datasets were 256 and 128, respectively. End-to-end training was performed for 60 epochs using the Adam optimizer. The initial learning rate was set to

5.0 \times 10^{- 4}

. The weight decay was set to

1.0 \times 10^{- 4}

. The momentum parameters

β_{1}

and

β_{2}

were fixed at 0.9 and 0.999, respectively. The first 5 epochs were used for stabilization via learning rate warm-up. Afterward, a stepwise decay was used, decreasing the learning rate to 0.1 times its original value every 20 epochs. For standard evaluation in 3D angular space, the predicted pitch

\hat{p}

and yaw

\hat{y}

are converted into a unit-length 3D gaze vector

\hat{g} = (g_{x}, g_{y}, g_{z})

, which is then directly compared with the ground-truth gaze vector in the same 3D coordinate system. The conversion from spherical to Cartesian coordinates is as follows:

g_{x} = cos (\hat{p}) sin (\hat{y})

(15)

g_{y} = sin (\hat{p})

(16)

g_{z} = cos (\hat{p}) cos (\hat{y})

(17)
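Equations (15)–(17) translate directly into code. The following Python sketch assumes that pitch and yaw are given in radians; the function name is hypothetical.

```python
import math


def angles_to_vector(pitch, yaw):
    """Eqs. (15)-(17): pitch/yaw in radians -> unit 3D gaze vector."""
    gx = math.cos(pitch) * math.sin(yaw)
    gy = math.sin(pitch)
    gz = math.cos(pitch) * math.cos(yaw)
    return (gx, gy, gz)


g = angles_to_vector(math.radians(10.0), math.radians(-20.0))
norm = math.sqrt(sum(c * c for c in g))
print(g, norm)  # the norm equals 1 up to floating-point rounding
```

The result has unit length by construction, since cos²(p)·sin²(y) + sin²(p) + cos²(p)·cos²(y) = 1 for any pitch p and yaw y.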

We use the mean angular error (MAE) to evaluate the model’s performance. This metric measures the average angle between the predicted gaze vector and the ground-truth gaze vector, as shown in the following formula:

MAE = \frac{1}{N} \sum_{i = 1}^{N} \left(\frac{180}{π} \arccos \left(\frac{g_{i} \cdot g_{gt, i}}{∥ g_{i} ∥ ∥ g_{gt, i} ∥}\right)\right)

(18)

Here, N is the total number of test samples, and

g_{i}

and

g_{gt, i}

represent the predicted and ground-truth 3D gaze vectors for the i-th sample, respectively. The formula computes the dot product of the two vectors and normalizes it by their magnitudes to obtain the cosine of the angle between them. The angle is then calculated using the arccosine function and converted to degrees.
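A minimal Python sketch of Equation (18) is given below. The function names and the example vectors are illustrative, and the clamping of the cosine to [−1, 1] is an added numerical safeguard that is not part of the formula itself.

```python
import math


def angular_error_deg(g_pred, g_gt):
    """Angle in degrees between two 3D vectors (one term of Eq. (18))."""
    dot = sum(a * b for a, b in zip(g_pred, g_gt))
    norm_pred = math.sqrt(sum(a * a for a in g_pred))
    norm_gt = math.sqrt(sum(b * b for b in g_gt))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_pred * norm_gt)))
    return math.degrees(math.acos(cosine))


def mean_angular_error(preds, gts):
    """Eq. (18): average of the per-sample angular errors."""
    errors = [angular_error_deg(p, g) for p, g in zip(preds, gts)]
    return sum(errors) / len(errors)


# One perfect prediction and one that is orthogonal to the ground truth.
preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gts = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
print(mean_angular_error(preds, gts))  # errors of 0 and 90 degrees average to about 45
```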

4.2. Datasets

To evaluate the performance and generalization ability of DGAGaze under different conditions, we conducted experiments on two gaze estimation benchmark datasets, Gaze360 [15] and EyeDiap [49]. The Gaze360 dataset, collected in unrestricted real-world scenes, contains over 172,000 visual instances from 238 subjects, making it one of the largest and most authoritative datasets in gaze estimation research. Its acquisition scenes cover enclosed indoor buildings and open outdoor environments, and the data includes a wide range of head pose variations and varying incident lighting. EyeDiap is a multimodal gaze estimation dataset collected entirely in a tightly controlled laboratory environment using a Microsoft Kinect v2 sensor developed by Microsoft Corporation (Redmond, WA, USA). It contains a 237-min video sequence of 16 participants performing a specific task, providing synchronized RGB, depth, and infrared images. These datasets are widely recognized for their complexity and diversity in gaze estimation tasks. As shown in Figure 4, the gaze direction distribution maps reveal that the samples in the EyeDiap and Gaze360 datasets are mainly concentrated within angular intervals of approximately

\pm 25^{\circ}

and

\pm 50^{\circ}

, respectively, exhibiting a clear non-uniform distribution. These two distributional characteristics together support the evaluation of the model’s robustness and generalization ability under different gaze-variation conditions. Both datasets were preprocessed following the method in [50].

4.3. Ablation Study

To analyze the individual contribution of each component in the proposed DGAGaze, we perform a series of ablation studies on the Gaze360 and EyeDiap datasets. All variants are trained with the same settings as in the primary experiments, and results are reported as mean angular error (MAE) in degrees. All results are summarized in Table 1.

(1) DGAGaze (Ours): The complete model incorporating two-frame temporal input, differential feature modeling, geometry-aware temporal alignment, difference-enhanced channel attention, and SimAM spatial refinement.

(2) w/o Alignment: Removing the geometry-aware temporal alignment module and reverting to direct feature differencing without pose compensation. This causes MAE increases of 0.13° on EyeDiap and 0.13° on Gaze360, confirming that compensating for rigid head motion is essential for obtaining clean motion cues.

(3) w/o SE-diff: Substituting the difference-modulated channel attention with a standard SE block. MAE rises by 0.15° on EyeDiap and 0.12° on Gaze360, indicating that motion-guided channel weighting significantly enhances gaze-relevant features.

(4) w/o Simam: Disabling the SimAM-based spatial refinement after channel modulation. This leads to performance drops of 0.22° and 0.16°, respectively, highlighting the importance of neuron-level spatial refinement.

(5) w/o Differential Feature: To evaluate the contribution of explicit temporal differential coding, we replace the differential-based data stream with feature concatenation of two consecutive frames along the channel dimension, i.e.,

Concat (F_{t}, F_{t - 1})

. Here, the previous frame

F_{t - 1}

is used in its original, unaligned form because the geometry-aware temporal alignment is primarily designed to correct pose differences for the subtraction operation; its effect on naive concatenation is negligible. This baseline discards the differential signal while retaining all static information from both frames. On the EyeDiap and Gaze360 datasets, the MAE increases by 0.27° and 0.25°, respectively.

We also performed qualitative comparisons on differential features. Figure 5 and Figure 6 illustrate the visualizations of the Gaze360 and EyeDiap datasets, respectively. On the Gaze360 dataset (Figure 5b) and the EyeDiap dataset (Figure 6b), the model equipped with differential features shows significantly closer attention to the eye region, indicating that the geometry-aware temporal alignment and differential-guided attention mechanism effectively highlight gaze-related regions. However, as shown in Figure 5c and Figure 6c, models that rely on feature stacking tend to produce less accurate gaze estimates and show weaker attention to gaze-critical regions. These observations suggest that explicitly encoding inter-frame motion cues provides more informative dynamic representations than stacking features from consecutive frames, thereby improving the robustness and accuracy of gaze estimation.

Table 1. Ablation study on model components (MAE in degrees; lower is better). Source: contribution by the authors.

Method	EyeDiap MAE (°)	Gaze360 MAE (°)
w/o Alignment	5.16	10.45
w/o SE-diff	5.18	10.44
w/o Simam	5.25	10.48
w/o Differential Feature	5.30	10.57
Ours (DGAGaze)	5.03	10.32
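The MAE increases quoted for each ablated variant can be recomputed from Table 1 with a few lines of Python. The values are copied from the table; the variable names are illustrative.

```python
# MAE values in degrees copied from Table 1: (EyeDiap, Gaze360).
table1 = {
    "w/o Alignment": (5.16, 10.45),
    "w/o SE-diff": (5.18, 10.44),
    "w/o Simam": (5.25, 10.48),
    "w/o Differential Feature": (5.30, 10.57),
}
full_model = (5.03, 10.32)

# Increase in MAE of each variant relative to the full model.
deltas = {
    name: tuple(round(v - f, 2) for v, f in zip(values, full_model))
    for name, values in table1.items()
}
for name, (d_eyediap, d_gaze360) in deltas.items():
    print(f"{name}: +{d_eyediap:.2f} (EyeDiap), +{d_gaze360:.2f} (Gaze360)")
```

Removing the differential feature produces the largest increase on both datasets, which agrees with the ordering reported in the text.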

To evaluate whether two-frame input is sufficient for modeling gaze dynamics, we extended DGAGaze to three-frame and five-frame versions while keeping the backbone architecture unchanged. For inputs exceeding two frames, inter-frame differences were computed sequentially and integrated using the same difference modeling strategy as the proposed module. The results are summarized in Table 2. Using three frames resulted in only a slight change in performance compared to the two-frame setting, with negligible changes in mean angular error (MAE) across datasets. When the input was further increased to five frames, performance on both benchmarks decreased, with MAE increasing to 5.15° on EyeDiap and 10.41° on Gaze360. At the same time, increasing the number of input frames led to substantial increases in model parameters and floating-point operations (FLOPs), thereby increasing model complexity. These findings suggest that two consecutive frames are sufficient to capture the key temporal cues required for gaze estimation. Adding more frames does not consistently improve accuracy but may introduce redundant temporal information and significantly increase computational costs. Therefore, the dual-frame design achieves an effective balance between performance and efficiency.

Overall, the ablation results demonstrate that each component of the proposed DGAGaze framework plays a meaningful role in improving performance. In particular, integrating geometry-normalized differential modeling, motion-guided channel modulation, and lightweight spatial refinement consistently enhances gaze estimation accuracy. At the same time, the model maintains low computational complexity by relying on minimal temporal input.

To assess whether the performance gains of DGAGaze are statistically reliable rather than caused by incidental variation, we further perform paired statistical significance analysis against two internal baselines, namely w/o Alignment and w/o Differential Feature, on the same test samples. The per-sample angular error is used as the analysis unit, and the Wilcoxon signed-rank test is adopted because the underlying error distribution may not strictly satisfy the normality assumption.

As shown in Table 3, DGAGaze consistently achieves lower mean angular error than both internal baselines on EyeDiap and Gaze360. Compared with w/o Alignment, DGAGaze reduces the MAE from 5.16° to 5.03° on EyeDiap and from 10.45° to 10.32° on Gaze360, with corresponding p-values of 0.021 and 0.009, respectively. Compared with w/o Differential Feature, DGAGaze reduces the MAE from 5.30° to 5.03° on EyeDiap and from 10.57° to 10.32° on Gaze360, with corresponding p-values of 0.002 and <0.001, respectively. Since all comparisons yield p-values below 0.05, the observed improvements are statistically significant. These results indicate that the gains introduced by the geometry-aware temporal alignment and explicit differential modeling are stable and reproducible rather than incidental.
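For readers unfamiliar with the paired test used here, the following Python sketch implements a two-sided Wilcoxon signed-rank test with a normal approximation. It is an illustration only: the per-sample errors are made up, the function name is hypothetical, and no tie or continuity correction is applied, so the p-value is approximate. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be used, and the exact procedure behind Table 3 is not specified in this section.

```python
import math


def wilcoxon_signed_rank(errors_a, errors_b):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied |differences| get average ranks.
    Returns (W_plus, p_value); p_value is approximate for small samples.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no informative pairs
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return w_plus, p_value


# Made-up per-sample angular errors (degrees) on the same eight test samples.
baseline = [5.4, 4.9, 6.1, 5.0, 5.8, 4.7, 5.6, 5.2]
proposed = [5.1, 4.8, 5.7, 5.1, 5.3, 4.6, 5.2, 5.0]
w_plus, p = wilcoxon_signed_rank(baseline, proposed)
print(w_plus, p)
```

The test uses only the signs and ranks of the paired differences, which is why it does not require the per-sample errors to be normally distributed.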

4.4. Experimental Results

We evaluated DGAGaze against state-of-the-art gaze estimation methods on two public benchmarks: Gaze360 (in-the-wild) and EyeDiap (controlled). As shown in Table 4, DGAGaze achieves mean angular errors of 5.03° on EyeDiap and 10.32° on Gaze360, demonstrating competitive performance against both recent Transformer-based methods and representative CNN/RNN-based approaches. Compared with classical CNN-based designs such as CA-Net [33] and RT-Gene [12], the proposed method consistently achieves lower error on both benchmarks. It also remains competitive with more recent Transformer-based models, including GazeTR [20], CADSE [34], SUGE [51], and GazeSymCAT [23].

Notably, DGAGaze performs comparably to the recent Transformer-based SUGE [51] on the controlled EyeDiap benchmark but achieves a 1.9% improvement in accuracy on the more challenging Gaze360 dataset. This result suggests that the proposed strategy of explicit head-motion compensation and short-term differential cue modeling adapts well to unconstrained scenarios.

In addition to accuracy improvements, DGAGaze maintains a compact design with only 11.38 M parameters and 7.31 GFLOPs (under 224 × 224 input). As shown in Table 5, this corresponds to roughly 9.6× fewer parameters than AGE-Net (109.00 M) [56] and 6.6× fewer than CADSE (74.80 M) [34], and to about 4.9× fewer FLOPs than AGE-Net (7.31 vs. 35.7 GFLOPs). Even when compared to efficient CNN baselines such as L2CS-Net (23.52 M, 16.53 GFLOPs) [57] and CA-Net (34.00 M, 15.6 GFLOPs) [33], DGAGaze offers a favorable balance, achieving higher accuracy with fewer parameters and less than half the FLOPs.
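The relative model sizes and costs can be recomputed from the figures quoted in the paragraph above. The short Python sketch below uses only those quoted numbers; the variable names are illustrative.

```python
# (parameters in millions, GFLOPs) as quoted in the text above.
models = {
    "AGE-Net": (109.00, 35.7),
    "CADSE": (74.80, None),  # no FLOPs figure is quoted for CADSE here
    "L2CS-Net": (23.52, 16.53),
    "CA-Net": (34.00, 15.6),
}
dgagaze_params, dgagaze_gflops = 11.38, 7.31

# How many times larger or costlier each model is than DGAGaze.
param_ratio = {name: p / dgagaze_params for name, (p, _) in models.items()}
flop_ratio = {
    name: g / dgagaze_gflops for name, (_, g) in models.items() if g is not None
}

for name in models:
    line = f"{name}: {param_ratio[name]:.2f}x parameters"
    if name in flop_ratio:
        line += f", {flop_ratio[name]:.2f}x FLOPs"
    print(line)
```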

These results support the choice of ResNet-18 as the backbone in DGAGaze. Since the proposed method is designed for lightweight dynamic gaze estimation, a deeper backbone such as ResNet-50 would conflict with its design objectives of low computational complexity and a compact model size. The experimental results show that DGAGaze based on ResNet-18 achieves strong performance on both EyeDiap and Gaze360 while maintaining a relatively small model size and low computational cost. This indicates that the proposed geometry-aware temporal alignment module and differential attention mechanism can operate effectively with a lightweight backbone, thereby achieving a favorable balance between accuracy and efficiency.

To qualitatively evaluate the generalization capability of DGAGaze, several prediction examples from the two datasets are presented in Figure 7. The results show that the model achieves good gaze estimation performance under different head poses, lighting conditions, and background environments.

Nevertheless, the overall experimental results still indicate that the proposed method has certain limitations under complex conditions. In particular, gaze estimation remains challenging under extreme head poses and partial occlusions around the eye region. This is mainly because, in such scenarios, the reliability of visible appearance cues may decrease, thereby affecting the effectiveness of geometry-aware alignment and differential modeling. Therefore, although DGAGaze achieves a good balance between robustness and lightweight design, there is still room for further improvement in more complex real-world scenarios.

5. Conclusions

We introduced DGAGaze, a lightweight gaze estimation framework that captures temporal dynamics through difference-guided attention using only two consecutive frames. The geometry-aware alignment module explicitly compensates for rigid head motion, ensuring that inter-frame differences primarily reflect non-rigid eye movements. Building on this, the dual-stream spatiotemporal attention module dynamically enhances channel and spatial features modulated by differential cues, enabling the model to efficiently capture critical dynamic cues caused by subtle facial movements. Extensive experiments demonstrate the superiority of our approach. On the challenging in-the-wild Gaze360 dataset and the controlled EyeDiap dataset, DGAGaze achieves highly competitive results, reducing the mean angular error to 10.32° and 5.03°, respectively. Furthermore, our model relies only on a lightweight ResNet-18 backbone and two-frame input, achieving substantially higher computational efficiency than complex Transformer architectures or models that require long sequence inputs.

Although the proposed method has achieved promising results, there is still room for further improvement in more complex real-world scenarios. For example, the robustness of the current framework could be further enhanced under conditions such as severe occlusion, extreme head poses, and rapid motion changes. In addition, existing methods primarily focus on visual appearance cues, whereas integrating facial landmarks, depth cues, or multimodal contextual signals may further improve estimation accuracy and generalization performance. These directions will be the focus of our future work.

Author Contributions

Conceptualization, W.Z. and P.L.; methodology, W.Z.; software, W.Z.; resources, W.Z. and P.L.; data curation, W.Z.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z. and P.L.; visualization, W.Z.; supervision, P.L.; project administration, P.L.; funding acquisition, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of Hunan Province, project number 2025JJ50418.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

These data are subject to restrictions on availability. The datasets used in this study were obtained from third-party sources, including Gaze360 and EYEDIAP. Gaze360 was originally released by its authors for non-commercial research use and can be accessed via the dataset authors’ repository at https://github.com/erkil1452/gaze360 (accessed on 15 November 2024). EYEDIAP is available from its data providers at https://www.idiap.ch/en/scientific-research/data/eyediap (accessed on 10 December 2024), subject to permission and the corresponding access conditions and license terms. The implementation code for this study is publicly available at https://github.com/bobo5349016-stack/DGAGaze (accessed on 15 November 2024).

Acknowledgments

We sincerely thank everyone who provided us with valuable support, encouragement, and guidance throughout the research process.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rahal, R.M.; Fiedler, S. Understanding cognitive and affective mechanisms in social psychology through eye-tracking. J. Exp. Soc. Psychol. 2019, 85, 103842.
2. Göktaş, O.; Ergin, E.; Çetin, G.; Özkoç, H.H.; Firat, A.; Gazel, G.G. Investigation of user-product interaction by determining the focal points of visual interest in different types of kitchen furniture: An eye-tracking study. Displays 2024, 83, 102745.
3. Jiang, M.; Zhao, Q. Learning visual attention to identify people with autism spectrum disorder. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 3267–3276.
4. Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372.
5. McAnally, K.; Grove, P.; Wallis, G. Vergence eye movements in virtual reality. Displays 2024, 83, 102683.
6. Valenti, R.; Sebe, N.; Gevers, T. Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 2011, 21, 802–815.
7. Huang, M.X.; Li, J.; Ngai, G.; Leong, H.V. Screenglint: Practical, in-situ gaze estimation on smartphones. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; ACM Digital Library: New York, NY, USA, 2017; pp. 2546–2557.
8. Lian, D.; Hu, L.; Luo, W.; Xu, Y.; Duan, L.; Yu, J.; Gao, S. Multiview multitask gaze estimation with deep convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3010–3023.
9. Park, S.; Spurr, A.; Hilliges, O. Deep pictorial gaze estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 741–757.
10. Park, S.; Zhang, X.; Bulling, A.; Hilliges, O. Learning to find eye region landmarks for remote gaze estimation in unconstrained settings. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, Warsaw, Poland, 14–17 June 2018; ACM Digital Library: New York, NY, USA, 2018; pp. 1–10.
11. Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272.
12. Fischer, T.; Chang, H.J.; Demiris, Y. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 339–357.
13. Wang, Z.; Zhao, J.; Lu, C.; Huang, H.; Yang, F.; Li, L.; Guo, Y. Learning to detect head movement in unconstrained remote gaze estimation in the wild. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; IEEE: New York, NY, USA, 2020; pp. 3432–3441.
14. Yu, Z.; Huang, X.; Zhang, X.; Shen, H.; Li, Q.; Deng, W.; Tang, J.; Yang, Y.; Ye, J. A multi-modal approach for driver gaze prediction to remove identity bias. In Proceedings of the 2020 International Conference on Multimodal Interaction, Virtual Event, Netherlands, 25–29 October 2020; ACM Digital Library: New York, NY, USA, 2020; pp. 768–776.
15. Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6911–6920.
16. Zhang, X.; Park, S.; Beeler, T.; Bradley, D.; Tang, S.; Hilliges, O. ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual Event, 23–28 August 2020; ACM Digital Library: New York, NY, USA, 2020; pp. 365–381.
17. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 2299–2308.
18. Chen, Z.; Shi, B.E. Towards high performance low complexity calibration in appearance based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1174–1188.
19. Zhu, Z.; Zhang, D.; Chi, C.; Li, M.; Lee, D.J. A complementary dual-branch network for appearance-based gaze estimation from low-resolution facial image. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 1323–1334.
20. Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; IEEE: New York, NY, USA, 2022; pp. 3341–3347.
21. Karmi, R.; Mastouri, R.; Rahmany, I.; Khlifa, N. An Appearance-based VisionTransformer Network for Enhanced Gaze Estimation. Signal Image Video Process. 2025, 19, 742.
22. Wu, L.; Shi, B.E. Merging multiple datasets for improved appearance-based gaze estimation. In Pattern Recognition. ICPR 2024; Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, C.L., Bhattacharya, S., Pal, U., Eds.; Springer: Cham, Switzerland, 2025; pp. 77–90.
23. Zhong, Y.; Lee, S.H. GazeSymCAT: A symmetric cross-attention transformer for robust gaze estimation under extreme head poses and gaze variations. J. Comput. Des. Eng. 2025, 12, 115–129.
24. Palmero, C.; Selva, J.; Bagheri, M.A.; Escalera, S. Recurrent CNN for 3D gaze estimation using appearance and shape cues. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; BMVA Press: Surrey, UK, 2018; p. 251.
25. Jindal, S.; Yadav, M.; Manduchi, R. Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; IEEE: New York, NY, USA, 2024; pp. 604–614.
26. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 4724–4733.
27. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6201–6210.
28. Li, J.; Liu, X.; Zhang, M.; Wang, D. Spatio-temporal deformable 3D ConvNets with attention for action recognition. Pattern Recognit. 2020, 98, 107037.
29. Wang, X.; Gao, L.; Wang, P.; Sun, X.; Liu, X. Two-stream 3-D ConvNet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 2017, 20, 634–644.
30. Yang, Y.; Lu, F. Gaze Target Detection Based on Head-Local-Global Coordination. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 305–322.
31. Wang, Y.; Xia, G. EfficientNet-Gaze: Integrating Multi-Scale Feature Extraction with Frequency Domain Analysis for Efficient Gaze Estimation. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5.
32. Nagpure, V.; Okuma, K. Searching efficient neural architecture with multi-resolution fusion transformer for appearance-based gaze estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; IEEE: New York, NY, USA, 2023; pp. 890–899.
33. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10623–10630.
34. Oh, J.; Chang, H.J.; Choi, S.I. Self-attention with convolution and deconvolution for efficient eye gaze estimation from a full face image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; IEEE: New York, NY, USA, 2022; pp. 4988–4996.
35. Han, Y.; Ying, H.; Zhu, H.; Gao, F.; Zhou, W. Synergistic alignment-based domain adaptation for gaze estimation. In Biometric Recognition; Springer Nature Singapore: Singapore, 2025; pp. 254–263.
36. Cheng, Z.; Wang, Y. Multi-task Gaze Estimation Via Unidirectional Convolution. arXiv 2024, arXiv:2411.18061.
37. Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Strazdas, D.; Al-Hamadi, A. MobGazeNet: Robust gaze estimation mobile network based on progressive attention mechanisms. Mach. Vis. Appl. 2025, 36, 76.
38. Chen, H.; Liu, H.; Lan, S.; Wang, W.; Qiao, Y.; Li, Y.; Deng, G. DMAGaze: Gaze estimation based on feature disentanglement and multi-scale attention. arXiv 2025, arXiv:2504.11160.
39. Zhao, R.; Wang, Y.; Luo, S.; Shou, S.; Tang, P. Gaze-Swin: Enhancing gaze estimation with a hybrid CNN-transformer network and dropkey mechanism. Electronics 2024, 13, 328.
40. Chen, Z.; Shi, B.E. Offset calibration for appearance-based gaze estimation via gaze decomposition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; IEEE: New York, NY, USA, 2020; pp. 259–268.
41. Vuillecard, P.; Odobez, J.M. Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; IEEE: New York, NY, USA, 2025; pp. 13508–13518.
42. Melnyk, K.; Friedman, L.; Katrychuk, D.; Komogortsev, O. Gaze prediction as a function of eye movement type and individual differences. In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications, Tokyo, Japan, 26–29 May 2025; ACM Digital Library: New York, NY, USA, 2025; pp. 7:1–7:11.
43. Guan, Y.; Chen, Z.; Zeng, W.; Cao, Z.; Xiao, Y. End-to-end video gaze estimation via capturing head-face-eye spatial-temporal interaction context. IEEE Signal Process. Lett. 2023, 30, 1687–1691.
44. Hempel, T.; Abdelrahman, A.A.; Al-Hamadi, A. 6D Rotation Representation for Unconstrained Head Pose Estimation. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 2496–2500.
45. Duchowski, A.T. Eye Tracking Methodology: Theory and Practice, 3rd ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 20–22.
46. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
47. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
48. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
49. Mora, K.A.F.; Monay, F.; Odobez, J.M. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, Safety Harbor, FL, USA, 26–28 March 2014; ACM Digital Library: New York, NY, USA, 2014; pp. 255–258.
50. Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7509–7528.
51. Wang, S.; Huang, Y. Suppressing uncertainty in gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20–27 February 2024; AAAI Press: Washington, DC, USA, 2024; pp. 5581–5589.
52. Farkhondeh, A.; Palmero, C.; Scardapane, S.; Escalera, S. Towards self-supervised gaze estimation. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 21–24 November 2022; p. 549.
Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision (ACCV), Perth, Australia, 2-6 December 2018; ACM Digital Library: New York, NY, USA, 2018; pp. 309–324. [Google Scholar] [CrossRef]
Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 4511–4520. [Google Scholar] [CrossRef]
Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 162–175. [Google Scholar] [CrossRef]
Biswas, P. Appearance-based gaze estimation using attention and difference mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19-25 June 2021; IEEE: New York, NY, USA, 2021; pp. 3137–3146. [Google Scholar] [CrossRef]
Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Al-Hamadi, A.; Dinges, L. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. In Proceedings of the 2023 8th International Conference on Frontiers of Signal Processing (ICFSP), Corfu, Greece, 23-25 October 2023; IEEE: New York, NY, USA, 2023; pp. 98–102. [Google Scholar] [CrossRef]

Figure 1. Overview of our proposed gaze estimation architecture. Arrows indicate the direction of feature propagation and supervision flow during training and inference. Source: contribution by the authors.

Figure 2. Framework of the geometry-aware temporal alignment module. Source: contribution by the authors.

Figure 3. Framework of the dual-stream spatiotemporal difference attention module. Source: contribution by the authors.

Figure 4. Distribution of gaze directions in the benchmark datasets. Different colors indicate different concentrations of gaze labels, where brighter colors denote regions with a higher density of samples. The figure shows the distribution characteristics of gaze angles in Gaze360 and EyeDiap, highlighting the non-uniformity and dataset-specific variability of gaze directions. Source: statistical analysis based on the datasets, with visualization generated by the authors.

Figure 5. Visualization of the predicted gaze of the DGAGaze model on the Gaze360 dataset, using differential features. Warmer colors (e.g., red and yellow) denote regions with stronger attention activation, whereas cooler colors (e.g., blue and green) denote regions with weaker attention activation. (a) Input image from Gaze360, (b) with differential features, (c) without differential features. Source: example images are from the Gaze360 dataset, and the visualization is generated by the authors.

Figure 6. Visualization of the predicted gaze of the DGAGaze model on the EyeDiap dataset, using differential features. Warmer colors (e.g., red and yellow) denote regions with stronger attention activation, whereas cooler colors (e.g., blue and green) denote regions with weaker attention activation. (a) Input image from EyeDiap, (b) with differential features, (c) without differential features. Source: example images are from the EyeDiap dataset, and the visualization is generated by the authors.

Figure 7. Qualitative results on dataset images. The first row shows the input images, while the second and third rows show the ground truth (GT) and the predicted gaze, respectively. Green arrows denote the ground-truth gaze directions, and red arrows denote the predicted gaze directions. Source: input images and GT annotations are from the benchmark dataset; predicted gaze is generated by the proposed model, and the visualization is produced by the authors.

Table 2. Ablation on Number of Temporal Frames (lightweight variants). Source: contribution by the authors.

Method	EyeDiap (MAE)	Gaze360 (MAE)	Params (M)	FLOPs (G)
Ours (2 frames)	5.03	10.32	11.38	7.31
3-frame variant	5.02	10.33	15.13	10.52
5-frame variant	5.15	10.41	21.05	14.03

Table 3. Wilcoxon signed-rank test results comparing DGAGaze with two internal ablation baselines. Source: contribution by the authors.

Comparison	Dataset	Baseline MAE	DGAGaze MAE	p-Value
w/o Alignment	EyeDiap	5.16	5.03	0.021
w/o Alignment	Gaze360	10.45	10.32	0.009
w/o Differential Feature	EyeDiap	5.30	5.03	0.002
w/o Differential Feature	Gaze360	10.57	10.32	<0.001
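For readers unfamiliar with the test behind Table 3, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test on paired per-sample errors. It is not the authors' evaluation code: the function name `wilcoxon_signed_rank`, the synthetic error lists, and the use of the normal approximation without tie or continuity correction are all assumptions made for illustration. In practice a library routine such as `scipy.stats.wilcoxon` would normally be preferred.

```python
import math
import random


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences share their average rank. No tie or continuity
    correction is applied to the variance, so this is only a sketch.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


# Synthetic, hypothetical per-sample errors (NOT data from the paper).
rng = random.Random(0)
baseline_err = [rng.uniform(2.0, 9.0) for _ in range(200)]
# Assumed effect: the full model lowers each error by about 0.3 on average.
model_err = [e - 0.3 + rng.gauss(0.0, 0.5) for e in baseline_err]

w_plus, p = wilcoxon_signed_rank(baseline_err, model_err)
print(f"W+ = {w_plus:.1f}, p = {p:.3g}")
```

The test is paired and rank-based, so it compares the two models sample by sample without assuming that the error differences are normally distributed.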

Table 4. A comparison of state-of-the-art gaze estimation methods on the EyeDiap and Gaze360 datasets. Above the horizontal line are gaze estimation models based on Transformer, and below the horizontal line are gaze estimation models based on CNN or RNN. Source: the results of the proposed method are obtained by the authors, while the compared results are collected from the relevant literature, including originally reported or subsequently reproduced results.

Methods	EyeDiap	Gaze360
GazeTR-Pure [20]	5.72	13.58
GazeTR-Hybrid [20]	5.33	11.00
CADSE [34]	5.25	10.70
swAT [52]	–	11.60
SUGE [51]	5.04	10.51
GazesymCAT [23]	5.13	–
FullFace [17]	6.53	14.99
Dilated-Net [53]	6.19	13.73
RT-Gene [12]	6.02	12.26
Mnist [54]	7.37	–
GazeNet [55]	6.79	–
RCNN [24]	5.31	11.23
Gaze360 [15]	5.36	11.04
CA-Net [33]	5.27	11.20
MobGazeNet [37]	–	10.48
Ours	5.03	10.32

Table 5. Comparison of model parameters and computational complexity. Source: the result of the proposed method is produced by the authors, while the compared results are taken from the corresponding references.

Methods	Backbone	Params (M)	FLOPs (G)
AGE-Net [56]	Transformer	109.00	35.7
CADSE [34]	Transformer	74.80	19.75
Gaze360 [15]	RNN	14.60	12.70
RT-Gene [12]	CNN	30.00	30.8
CA-Net [33]	CNN	34.00	15.6
L2CS-Net [57]	CNN	23.52	16.53
Ours	CNN	11.38	7.31
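The relative savings implied by Table 5 can be checked with a few lines of arithmetic. The sketch below copies the parameter counts and GFLOPs from the table and computes, for each compared method, the fraction by which the proposed model is smaller. The names `TABLE_5`, `OURS`, `relative_reduction`, and `savings` are illustrative and do not come from the paper.

```python
# Values copied from Table 5: (parameters in millions, GFLOPs).
TABLE_5 = {
    "AGE-Net": (109.00, 35.7),
    "CADSE": (74.80, 19.75),
    "Gaze360": (14.60, 12.70),
    "RT-Gene": (30.00, 30.8),
    "CA-Net": (34.00, 15.6),
    "L2CS-Net": (23.52, 16.53),
}
OURS = (11.38, 7.31)


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


def savings(table=TABLE_5, ours=OURS):
    """Map each method to (parameter reduction, FLOPs reduction)."""
    return {
        name: (relative_reduction(params, ours[0]),
               relative_reduction(flops, ours[1]))
        for name, (params, flops) in table.items()
    }


for name, (d_params, d_flops) in savings().items():
    print(f"{name:10s} params -{d_params:6.1%}  FLOPs -{d_flops:6.1%}")
```

Against the closest competitor in size, Gaze360 [15], this works out to roughly 22% fewer parameters and 42% fewer FLOPs.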

