DGAGaze: Gaze Estimation with Dual-Stream Differential Attention and Geometry-Aware Temporal Alignment
Abstract
1. Introduction
- We propose a geometry-aware temporal alignment module that explicitly compensates for rigid head motion via pose estimation and affine warping. This pre-processing step ensures that subsequent inter-frame differences predominantly capture non-rigid eye movements, effectively decoupling head pose variations from eye motion analysis.
- We introduce a novel dual-stream spatiotemporal differential attention module (DSDA) that integrates differential-enhanced channel attention with 3D spatial-channel attention within a phased hybrid attention flow. This design enables the model to capture both fine-grained local details and spatiotemporal context from consecutive frames, thereby improving representation learning for gaze-related motion.
- We present DGAGaze as a lightweight dynamic gaze estimation framework that achieves competitive performance on both EyeDiap and Gaze360 while maintaining low parameter complexity and computational cost, demonstrating a favorable balance between temporal modeling capability and deployment efficiency.
2. Related Work
2.1. Image-Based Gaze Estimation Methods
2.2. Video-Based Temporal Gaze Estimation
3. Methods
3.1. Architecture Overview
3.2. Feature Extraction Module
3.3. Geometry-Aware Temporal Alignment
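
The introduction describes this module as compensating for rigid head motion through pose estimation and affine warping, so that the remaining inter-frame difference mostly reflects non-rigid eye movement. The following NumPy sketch illustrates that idea only and should not be read as the authors' implementation. Matched 2D landmark coordinates stand in for the pose estimator (an assumption), nearest-neighbour sampling is used for brevity, and the names `estimate_affine`, `warp_affine`, and `aligned_difference` are hypothetical.

```python
import numpy as np


def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix A with dst ~= A @ [x, y, 1].

    Assumption: rigid head motion between frames is summarised by matched
    (x, y) landmark pairs rather than by a head pose network.
    """
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    homogeneous = np.hstack([src, np.ones((src.shape[0], 1))])  # (N, 3)
    # Solve homogeneous @ A.T = dst in the least-squares sense.
    a_transposed, *_ = np.linalg.lstsq(homogeneous, dst, rcond=None)
    return a_transposed.T  # (2, 3)


def warp_affine(image, affine, fill=0.0):
    """Warp a (H, W) image so that input pixel p lands at affine @ p.

    Inverse mapping: every output pixel looks up its source location.
    """
    h, w = image.shape
    inverse = np.linalg.inv(np.vstack([affine, [0.0, 0.0, 1.0]]))
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # (3, H*W)
    source = inverse @ coords
    sx = np.rint(source[0]).astype(int)
    sy = np.rint(source[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.full(h * w, fill, dtype=image.dtype)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w)


def aligned_difference(prev_frame, curr_frame, prev_pts, curr_pts):
    """Difference between the current frame and the motion-compensated
    previous frame; residual values are what alignment could not explain."""
    affine = estimate_affine(prev_pts, curr_pts)
    return curr_frame - warp_affine(prev_frame, affine)
```

When the second frame is a pure translation of the first, the aligned difference in this sketch vanishes, whereas the raw difference `curr_frame - prev_frame` does not. That is the decoupling effect the contribution statement refers to.
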
3.4. Dual-Stream Spatiotemporal Differential Attention Module
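
The introduction characterises this module as combining differential-enhanced channel attention with 3D spatial-channel attention in a phased flow, and the ablation study names the two components SE-diff and SimAM. The sketch below is one plausible reading under stated assumptions, not the published architecture. The channel gate is an SE-style squeeze-and-excitation computed from the inter-frame feature difference, the 3D attention follows the standard parameter-free SimAM formulation, and "phased" is interpreted as applying the channel gate first and SimAM second. The function names and the weight shapes `w1` and `w2` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_diff_gate(diff, w1, w2):
    """SE-style channel gate driven by the feature difference (assumed design).

    diff: (C, H, W) difference between consecutive feature maps
    w1:   (C // r, C) reduction weights, w2: (C, C // r) expansion weights
    """
    squeezed = diff.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # ReLU bottleneck
    return sigmoid(w2 @ hidden)                # per-channel gate in (0, 1)


def simam(x, lam=1e-4):
    """Parameter-free SimAM attention over a (C, H, W) feature map."""
    n = x.shape[1] * x.shape[2] - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    sq_dev = (x - mu) ** 2
    var = sq_dev.sum(axis=(1, 2), keepdims=True) / n
    inverse_energy = sq_dev / (4.0 * (var + lam)) + 0.5
    return x * sigmoid(inverse_energy)         # one weight per activation


def dsda_sketch(feat_prev, feat_curr, w1, w2):
    """Phase 1: difference-driven channel gating. Phase 2: SimAM refinement."""
    diff = feat_curr - feat_prev
    gate = se_diff_gate(diff, w1, w2)
    gated = feat_curr * gate[:, None, None]
    return simam(gated)
```

Both stages only rescale activations by factors in (0, 1), so the output keeps the input shape and can be passed to a regression head unchanged.
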
3.5. Output Layer
3.6. Overall Forward Procedure
| Algorithm 1. Overall Forward Procedure of DGAGaze |
|---|
| Require: Two consecutive face frames |
| Ensure: Predicted gaze angles |
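
Algorithm 1 specifies only its input (two consecutive face frames) and its output (predicted gaze angles). The skeleton below makes the data flow between the stages named in Sections 3.2 to 3.5 explicit. It is a sketch under assumptions: the stage order (alignment before feature extraction) is inferred from the description of alignment as a pre-processing step, and the four callables `align`, `extract`, `attend`, and `regress` are hypothetical placeholders rather than the authors' components.

```python
from typing import Any, Callable, Tuple


def forward_sketch(
    prev_frame: Any,
    curr_frame: Any,
    align: Callable[[Any, Any], Any],
    extract: Callable[[Any], Any],
    attend: Callable[[Any, Any], Any],
    regress: Callable[[Any], Tuple[float, float]],
) -> Tuple[float, float]:
    """Two frames in, one (pitch, yaw) pair out."""
    # Assumed step 1: compensate rigid head motion in the previous frame.
    prev_aligned = align(prev_frame, curr_frame)
    # Assumed step 2: shared feature extractor applied to both frames.
    feat_prev = extract(prev_aligned)
    feat_curr = extract(curr_frame)
    # Assumed step 3: differential attention over the two feature maps.
    fused = attend(feat_prev, feat_curr)
    # Assumed step 4: output layer mapping fused features to gaze angles.
    return regress(fused)
```

Passing the stages as arguments keeps the skeleton independent of any particular backbone or attention design.
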
4. Experiments
4.1. Implementation Details
4.2. Datasets
4.3. Ablation Study
| Method | EyeDiap (MAE) | Gaze360 (MAE) |
|---|---|---|
| w/o Alignment | 5.16 | 10.45 |
| w/o SE-diff | 5.18 | 10.44 |
| w/o SimAM | 5.25 | 10.48 |
| w/o Differential Feature | 5.30 | 10.57 |
| Ours (DGAGaze) | 5.03 | 10.32 |
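
The contribution of each component can be read off the table as the error increase observed when that component is removed. The short script below performs that subtraction using the values from the ablation table; the variable and function names are illustrative only.

```python
# (EyeDiap MAE, Gaze360 MAE) copied from the ablation table.
ablation = {
    "w/o Alignment": (5.16, 10.45),
    "w/o SE-diff": (5.18, 10.44),
    "w/o SimAM": (5.25, 10.48),
    "w/o Differential Feature": (5.30, 10.57),
}
full_model = (5.03, 10.32)


def error_increase(variant, full):
    """Error added by removing a component, rounded to table precision."""
    return tuple(round(v - f, 2) for v, f in zip(variant, full))


deltas = {name: error_increase(vals, full_model) for name, vals in ablation.items()}

# Largest EyeDiap degradation first.
for name, (d_eyediap, d_gaze360) in sorted(deltas.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:26s} EyeDiap +{d_eyediap:.2f}  Gaze360 +{d_gaze360:.2f}")
```

On both datasets, removing the differential feature produces the largest increase (0.27 on EyeDiap and 0.25 on Gaze360), while removing alignment adds 0.13 on each.
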
4.4. Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Rahal, R.M.; Fiedler, S. Understanding cognitive and affective mechanisms in social psychology through eye-tracking. J. Exp. Soc. Psychol. 2019, 85, 103842.
2. Göktaş, O.; Ergin, E.; Çetin, G.; Özkoç, H.H.; Firat, A.; Gazel, G.G. Investigation of user-product interaction by determining the focal points of visual interest in different types of kitchen furniture: An eye-tracking study. Displays 2024, 83, 102745.
3. Jiang, M.; Zhao, Q. Learning visual attention to identify people with autism spectrum disorder. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 3267–3276.
4. Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372.
5. McAnally, K.; Grove, P.; Wallis, G. Vergence eye movements in virtual reality. Displays 2024, 83, 102683.
6. Valenti, R.; Sebe, N.; Gevers, T. Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 2011, 21, 802–815.
7. Huang, M.X.; Li, J.; Ngai, G.; Leong, H.V. Screenglint: Practical, in-situ gaze estimation on smartphones. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; ACM Digital Library: New York, NY, USA, 2017; pp. 2546–2557.
8. Lian, D.; Hu, L.; Luo, W.; Xu, Y.; Duan, L.; Yu, J.; Gao, S. Multiview multitask gaze estimation with deep convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3010–3023.
9. Park, S.; Spurr, A.; Hilliges, O. Deep pictorial gaze estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 741–757.
10. Park, S.; Zhang, X.; Bulling, A.; Hilliges, O. Learning to find eye region landmarks for remote gaze estimation in unconstrained settings. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, Warsaw, Poland, 14–17 June 2018; ACM Digital Library: New York, NY, USA, 2018; pp. 1–10.
11. Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272.
12. Fischer, T.; Chang, H.J.; Demiris, Y. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 339–357.
13. Wang, Z.; Zhao, J.; Lu, C.; Huang, H.; Yang, F.; Li, L.; Guo, Y. Learning to detect head movement in unconstrained remote gaze estimation in the wild. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; IEEE: New York, NY, USA, 2020; pp. 3432–3441.
14. Yu, Z.; Huang, X.; Zhang, X.; Shen, H.; Li, Q.; Deng, W.; Tang, J.; Yang, Y.; Ye, J. A multi-modal approach for driver gaze prediction to remove identity bias. In Proceedings of the 2020 International Conference on Multimodal Interaction, Virtual Event, Netherlands, 25–29 October 2020; ACM Digital Library: New York, NY, USA, 2020; pp. 768–776.
15. Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6911–6920.
16. Zhang, X.; Park, S.; Beeler, T.; Bradley, D.; Tang, S.; Hilliges, O. ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual Event, 23–28 August 2020; ACM Digital Library: New York, NY, USA, 2020; pp. 365–381.
17. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 2299–2308.
18. Chen, Z.; Shi, B.E. Towards high performance low complexity calibration in appearance based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1174–1188.
19. Zhu, Z.; Zhang, D.; Chi, C.; Li, M.; Lee, D.J. A complementary dual-branch network for appearance-based gaze estimation from low-resolution facial image. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 1323–1334.
20. Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; IEEE: New York, NY, USA, 2022; pp. 3341–3347.
21. Karmi, R.; Mastouri, R.; Rahmany, I.; Khlifa, N. An Appearance-based VisionTransformer Network for Enhanced Gaze Estimation. Signal Image Video Process. 2025, 19, 742.
22. Wu, L.; Shi, B.E. Merging multiple datasets for improved appearance-based gaze estimation. In Pattern Recognition. ICPR 2024; Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, C.L., Bhattacharya, S., Pal, U., Eds.; Springer: Cham, Switzerland, 2025; pp. 77–90.
23. Zhong, Y.; Lee, S.H. GazeSymCAT: A symmetric cross-attention transformer for robust gaze estimation under extreme head poses and gaze variations. J. Comput. Des. Eng. 2025, 12, 115–129.
24. Palmero, C.; Selva, J.; Bagheri, M.A.; Escalera, S. Recurrent CNN for 3D gaze estimation using appearance and shape cues. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; BMVA Press: Surrey, UK, 2018; p. 251.
25. Jindal, S.; Yadav, M.; Manduchi, R. Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; IEEE: New York, NY, USA, 2024; pp. 604–614.
26. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 4724–4733.
27. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6201–6210.
28. Li, J.; Liu, X.; Zhang, M.; Wang, D. Spatio-temporal deformable 3D ConvNets with attention for action recognition. Pattern Recognit. 2020, 98, 107037.
29. Wang, X.; Gao, L.; Wang, P.; Sun, X.; Liu, X. Two-stream 3-D ConvNet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 2017, 20, 634–644.
30. Yang, Y.; Lu, F. Gaze Target Detection Based on Head-Local-Global Coordination. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 305–322.
31. Wang, Y.; Xia, G. EfficientNet-Gaze: Integrating Multi-Scale Feature Extraction with Frequency Domain Analysis for Efficient Gaze Estimation. In Proceedings of the ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5.
32. Nagpure, V.; Okuma, K. Searching efficient neural architecture with multi-resolution fusion transformer for appearance-based gaze estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; IEEE: New York, NY, USA, 2023; pp. 890–899.
33. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10623–10630.
34. Oh, J.; Chang, H.J.; Choi, S.I. Self-attention with convolution and deconvolution for efficient eye gaze estimation from a full face image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; IEEE: New York, NY, USA, 2022; pp. 4988–4996.
35. Han, Y.; Ying, H.; Zhu, H.; Gao, F.; Zhou, W. Synergistic alignment-based domain adaptation for gaze estimation. In Biometric Recognition; Springer Nature Singapore: Singapore, 2025; pp. 254–263.
36. Cheng, Z.; Wang, Y. Multi-task Gaze Estimation Via Unidirectional Convolution. arXiv 2024, arXiv:2411.18061.
37. Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Strazdas, D.; Al-Hamadi, A. MobGazeNet: Robust gaze estimation mobile network based on progressive attention mechanisms. Mach. Vis. Appl. 2025, 36, 76.
38. Chen, H.; Liu, H.; Lan, S.; Wang, W.; Qiao, Y.; Li, Y.; Deng, G. DMAGaze: Gaze estimation based on feature disentanglement and multi-scale attention. arXiv 2025, arXiv:2504.11160.
39. Zhao, R.; Wang, Y.; Luo, S.; Shou, S.; Tang, P. Gaze-Swin: Enhancing gaze estimation with a hybrid CNN-transformer network and dropkey mechanism. Electronics 2024, 13, 328.
40. Chen, Z.; Shi, B.E. Offset calibration for appearance-based gaze estimation via gaze decomposition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; IEEE: New York, NY, USA, 2020; pp. 259–268.
41. Vuillecard, P.; Odobez, J.M. Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; IEEE: New York, NY, USA, 2025; pp. 13508–13518.
42. Melnyk, K.; Friedman, L.; Katrychuk, D.; Komogortsev, O. Gaze prediction as a function of eye movement type and individual differences. In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications, Tokyo, Japan, 26–29 May 2025; ACM Digital Library: New York, NY, USA, 2025; pp. 7:1–7:11.
43. Guan, Y.; Chen, Z.; Zeng, W.; Cao, Z.; Xiao, Y. End-to-end video gaze estimation via capturing head-face-eye spatial-temporal interaction context. IEEE Signal Process. Lett. 2023, 30, 1687–1691.
44. Hempel, T.; Abdelrahman, A.A.; Al-Hamadi, A. 6D Rotation Representation for Unconstrained Head Pose Estimation. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 2496–2500.
45. Duchowski, A.T. Eye Tracking Methodology: Theory and Practice, 3rd ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 20–22.
46. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
47. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
48. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
49. Mora, K.A.F.; Monay, F.; Odobez, J.M. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, Safety Harbor, FL, USA, 26–28 March 2014; ACM Digital Library: New York, NY, USA, 2014; pp. 255–258.
50. Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7509–7528.
51. Wang, S.; Huang, Y. Suppressing uncertainty in gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20–27 February 2024; AAAI Press: Washington, DC, USA, 2024; pp. 5581–5589.
52. Farkhondeh, A.; Palmero, C.; Scardapane, S.; Escalera, S. Towards self-supervised gaze estimation. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 21–24 November 2022; p. 549.
53. Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision (ACCV), Perth, Australia, 2–6 December 2018; ACM Digital Library: New York, NY, USA, 2018; pp. 309–324.
54. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 4511–4520.
55. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 162–175.
56. Biswas, P. Appearance-based gaze estimation using attention and difference mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 3137–3146.
57. Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Al-Hamadi, A.; Dinges, L. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. In Proceedings of the 2023 8th International Conference on Frontiers of Signal Processing (ICFSP), Corfu, Greece, 23–25 October 2023; IEEE: New York, NY, USA, 2023; pp. 98–102.







| Method | EyeDiap (MAE) | Gaze360 (MAE) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| Ours (2 frames) | 5.03 | 10.32 | 11.38 | 7.31 |
| 3-frame variant | 5.02 | 10.33 | 15.13 | 10.52 |
| 5-frame variant | 5.15 | 10.41 | 21.05 | 14.03 |

| Comparison | Dataset | Baseline MAE | DGAGaze MAE | p-Value |
|---|---|---|---|---|
| w/o Alignment | EyeDiap | 5.16 | 5.03 | 0.021 |
| w/o Alignment | Gaze360 | 10.45 | 10.32 | 0.009 |
| w/o Differential Feature | EyeDiap | 5.30 | 5.03 | 0.002 |
| w/o Differential Feature | Gaze360 | 10.57 | 10.32 | <0.001 |
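
The table reports p-values for the ablated variants against the full model, but the statistical test is not named in this section. As an illustration only, the sketch below shows one common choice for comparing two models evaluated on the same samples: a two-sided paired sign-flip permutation test on per-sample errors. The function name, the number of resamples, and the use of per-sample errors are assumptions, and the procedure is not claimed to reproduce the reported values.

```python
import random


def paired_sign_flip_test(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided paired permutation test on the mean per-sample difference.

    Under the null hypothesis the two models are exchangeable on each sample,
    so the sign of every paired difference can be flipped at random.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A call such as `paired_sign_flip_test(baseline_errors, dgagaze_errors)` would take two equal-length lists of per-sample errors, where both list names are placeholders. A small returned value indicates that the mean difference is unlikely under the null hypothesis.
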

| Methods | EyeDiap (MAE) | Gaze360 (MAE) |
|---|---|---|
| GazeTR-Pure [20] | 5.72 | 13.58 |
| GazeTR-Hybrid [20] | 5.33 | 11.00 |
| CADSE [34] | 5.25 | 10.70 |
| SwAT [52] | – | 11.60 |
| SUGE [51] | 5.04 | 10.51 |
| GazeSymCAT [23] | 5.13 | – |
| FullFace [17] | 6.53 | 14.99 |
| Dilated-Net [53] | 6.19 | 13.73 |
| RT-GENE [12] | 6.02 | 12.26 |
| Mnist [54] | 7.37 | – |
| GazeNet [55] | 6.79 | – |
| RCNN [24] | 5.31 | 11.23 |
| Gaze360 [15] | 5.36 | 11.04 |
| CA-Net [33] | 5.27 | 11.20 |
| MobGazeNet [37] | – | 10.48 |
| Ours | 5.03 | 10.32 |
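
The metric behind these numbers is not defined in this section. On EyeDiap and Gaze360, errors are customarily reported as the angle in degrees between the predicted and ground-truth 3D gaze directions. Assuming that convention applies here, the sketch below converts (pitch, yaw) predictions to unit vectors and averages the angular error. The axis convention in `gaze_to_vector` is one common choice and is an assumption, as are all function names.

```python
import math


def gaze_to_vector(pitch, yaw):
    """Unit 3D gaze direction from pitch and yaw in radians.

    Assumed convention: the camera looks along -z, so (0, 0) maps to (0, 0, -1).
    """
    x = -math.cos(pitch) * math.sin(yaw)
    y = -math.sin(pitch)
    z = -math.cos(pitch) * math.cos(yaw)
    return (x, y, z)


def angular_error_deg(pred, target):
    """Angle in degrees between two (pitch, yaw) gaze directions."""
    a = gaze_to_vector(*pred)
    b = gaze_to_vector(*target)
    dot = sum(p * q for p, q in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding drift
    return math.degrees(math.acos(dot))


def mean_angular_error(preds, targets):
    """Average angular error over paired predictions and labels."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Measuring the angle between vectors rather than averaging pitch and yaw differences separately keeps the error independent of how the direction is parameterised.
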

| Methods | Backbone | Params (M) | FLOPs (G) |
|---|---|---|---|
| AGE-Net [56] | Transformer | 109.00 | 35.7 |
| CADSE [34] | Transformer | 74.80 | 19.75 |
| Gaze360 [15] | RNN | 14.60 | 12.70 |
| RT-GENE [12] | CNN | 30.00 | 30.8 |
| CA-Net [33] | CNN | 34.00 | 15.6 |
| L2CS-Net [57] | CNN | 23.52 | 16.53 |
| Ours | CNN | 11.38 | 7.31 |
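
To put the efficiency figures in relative terms, the script below computes the percentage reduction in parameters and FLOPs of the proposed model against each baseline. The numbers are taken from the table; the function and variable names are illustrative.

```python
# (Params in millions, FLOPs in G) copied from the efficiency table.
baselines = {
    "AGE-Net": (109.00, 35.7),
    "CADSE": (74.80, 19.75),
    "Gaze360": (14.60, 12.70),
    "RT-GENE": (30.00, 30.8),
    "CA-Net": (34.00, 15.6),
    "L2CS-Net": (23.52, 16.53),
}
ours = (11.38, 7.31)


def reduction_percent(baseline_value, ours_value):
    """Relative saving of the proposed model with respect to a baseline."""
    return 100.0 * (1.0 - ours_value / baseline_value)


for name, (params, flops) in baselines.items():
    p = reduction_percent(params, ours[0])
    f = reduction_percent(flops, ours[1])
    print(f"{name:9s} params -{p:.1f}%  FLOPs -{f:.1f}%")
```

Against the smallest baseline in the table, Gaze360, this corresponds to roughly 22% fewer parameters and 42% fewer FLOPs.
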
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, W.; Li, P. DGAGaze: Gaze Estimation with Dual-Stream Differential Attention and Geometry-Aware Temporal Alignment. Appl. Sci. 2026, 16, 3298. https://doi.org/10.3390/app16073298

