HFGAD: Hierarchical Fine-Grained Attention Decoder for Gaze Estimation
Abstract
1. Introduction
- We propose HFGAD, which introduces a multi-scale channel-spatial attention mechanism and a selective feature fusion module to address the challenges of fine-grained feature extraction and effective feature aggregation, respectively (an illustrative sketch of these two ideas follows this list).
- HFGAD is designed as a lightweight, plug-and-play decoder, achieving a strong balance between performance and computational efficiency.
- Extensive experimental results demonstrate the effectiveness of HFGAD, which achieves competitive performance on the MPIIFaceGaze, Gaze360, and IVGaze datasets.
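To make the first contribution more concrete, the following is a deliberately simplified PyTorch sketch of a channel-spatial attention block and a gated selective fusion step. It is illustrative only: the class names, kernel sizes, and reduction ratio are our assumptions, the attention shown operates at a single scale, and the actual MSCSA and SFM designs (Sections 3.2 and 3.3) differ in detail.

```python
# Illustrative sketch (not the authors' implementation) of two building blocks
# in the spirit of MSCSA and SFM: channel-spatial attention and gated fusion.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """Channel re-weighting (squeeze-excite style) followed by spatial re-weighting."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention from pooled channel statistics (average + max maps).
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                      # channel re-weighting
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        spatial = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        return x * spatial                               # spatial re-weighting


class SelectiveFusion(nn.Module):
    """Fuse two feature maps with learned per-channel selection weights."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        w = self.gate(shallow + deep)                    # selection weights in [0, 1]
        return w * shallow + (1.0 - w) * deep


if __name__ == "__main__":
    feats = torch.randn(2, 64, 28, 28)
    attn = ChannelSpatialAttention(64)
    fuse = SelectiveFusion(64)
    print(fuse(attn(feats), feats).shape)  # torch.Size([2, 64, 28, 28])
```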
2. Related Work
2.1. Geometry-Based Gaze Estimation
2.2. Appearance-Based Gaze Estimation
3. Methodology
3.1. Overview
3.2. Multi-Scale Channel-Spatial Attention
3.3. Selective Feature Fusion Module (SFM)
3.4. Efficient Convolutional Downsample (ECD)
3.5. Estimation Head and Loss Function
4. Experiments
4.1. Evaluation Metrics
4.1.1. Gaze360 [9]
4.1.2. MPIIFaceGaze [7]
4.1.3. IVGaze [16]
4.2. Implementation Details
4.3. Comparison with State-of-the-Art
4.4. Visualization Analysis
4.5. Ablation Study
4.5.1. Ablation Study of Each Module
4.5.2. Generalization of HFGAD Across Different Backbones
4.6. Performance Distribution
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
DMS | Driver Monitoring Systems |
CNN | Convolutional Neural Network |
HFGAD | Hierarchical Fine-Grained Attention Decoder |
MSCSA | Multi-Scale Channel-Spatial Attention |
SFM | Selective Feature Fusion Module |
ECD | Efficient Convolutional Downsample |
EH | Estimation Head |
GC | Group Convolution |
GN | Group Normalization |
References
- Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7509–7528. [Google Scholar] [CrossRef] [PubMed]
- Steil, J.; Huang, M.X.; Bulling, A. Fixation detection for head-mounted eye tracking based on visual similarity of gaze targets. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, Warsaw, Poland, 14–17 June 2018; pp. 1–9. [Google Scholar]
- Palazzi, A.; Abati, D.; Solera, F.; Cucchiara, R. Predicting the driver’s focus of attention: The DR(eye)VE project. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1720–1733. [Google Scholar] [CrossRef] [PubMed]
- Martin, S.; Vora, S.; Yuen, K.; Trivedi, M.M. Dynamics of Driver’s Gaze: Explorations in Behavior Modeling and Maneuver Prediction. IEEE Trans. Intell. Veh. 2018, 3, 141–150. [Google Scholar] [CrossRef]
- Tan, D.; Tian, W.; Wang, C.; Chen, L.; Xiong, L. Driver Distraction Behavior Recognition for Autonomous Driving: Approaches, Datasets and Challenges. IEEE Trans. Intell. Veh. 2024, 9, 8000–8026. [Google Scholar] [CrossRef]
- Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10623–10630. [Google Scholar]
- Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 51–60. [Google Scholar]
- Fischer, T.; Chang, H.J.; Demiris, Y. Rt-gene: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–352. [Google Scholar]
- Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6912–6921. [Google Scholar]
- Cheng, Y.; Lu, F.; Zhang, X. Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 100–115. [Google Scholar]
- Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272. [Google Scholar] [CrossRef] [PubMed]
- Biswas, P. Appearance-based gaze estimation using attention and difference mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3143–3152. [Google Scholar]
- Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montréal, QC, Canada, 21–25 August 2022; IEEE: New York, NY, USA, 2022; pp. 3341–3347. [Google Scholar]
- Li, T.; Zhang, Y.; Li, Q. Appearance-Based Driver 3D Gaze Estimation Using GRM and Mixed Loss Strategies. IEEE Internet Things J. 2024, 11, 38410–38424. [Google Scholar] [CrossRef]
- Huang, Q.; Veeraraghavan, A.; Sabharwal, A. Tabletgaze: Dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Mach. Vis. Appl. 2017, 28, 445–461. [Google Scholar] [CrossRef]
- Cheng, Y.; Zhu, Y.; Wang, Z.; Hao, H.; Liu, Y.; Cheng, S.; Wang, X.; Chang, H.J. What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 1556–1565. [Google Scholar]
- Morimoto, C.H.; Mimica, M.R. Eye gaze tracking techniques for interactive applications. Comput. Vis. Image Underst. 2005, 98, 4–24. [Google Scholar] [CrossRef]
- Stampe, D.M. Heuristic filtering and reliable calibration methods for video-based pupil-tracking systems. Behav. Res. Methods Instrum. Comput. 1993, 25, 137–142. [Google Scholar] [CrossRef]
- Ji, Q.; Yang, X. Real-time eye, gaze, and face pose tracking for monitoring driver vigilance. Real-Time Imaging 2002, 8, 357–377. [Google Scholar] [CrossRef]
- Guestrin, E.D.; Eizenman, M. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Trans. Biomed. Eng. 2006, 53, 1124–1133. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Z.; Ji, Q. Novel eye gaze tracking techniques under natural head movement. IEEE Trans. Biomed. Eng. 2007, 54, 2246–2260. [Google Scholar] [CrossRef] [PubMed]
- Valenti, R.; Sebe, N.; Gevers, T. Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 2011, 21, 802–815. [Google Scholar] [CrossRef] [PubMed]
- Alberto Funes Mora, K.; Odobez, J.M. Geometric generative gaze estimation (g3e) for remote rgb-d cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1773–1780. [Google Scholar]
- Park, S.; Spurr, A.; Hilliges, O. Deep pictorial gaze estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 721–738. [Google Scholar]
- Cai, X.; Zeng, J.; Shan, S.; Chen, X. Source-free adaptive gaze estimation by uncertainty reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22035–22045. [Google Scholar]
- Cheng, Y.; Bao, Y.; Lu, F. Puregaze: Purifying gaze feature for generalizable gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 436–443. [Google Scholar]
- Cheng, Y.; Lu, F. Dvgaze: Dual-view gaze estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 20632–20641. [Google Scholar]
- Li, F.; Yan, H.; Shi, L. Multi-scale coupled attention for visual object detection. Sci. Rep. 2024, 14, 11191. [Google Scholar] [CrossRef] [PubMed]
- Shang, C.; Wang, Z.; Wang, H.; Meng, X. SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 13051–13060. [Google Scholar]
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
- Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123. [Google Scholar] [CrossRef]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
Method | Params | FLOPs | Gaze360 | MPII | IVGaze |
---|---|---|---|---|---|
FullFace | 196.6 M | 2.99 G | 14.99° | 4.93° | 13.67° |
RT-Gene | 82.0 M | 30.81 G | 12.26° | 4.66° | - |
Gaze360 | 14.6 M | 12.78 G | 11.04° | 4.06° | 8.15° |
CA-Net | 34.1 M | 15.6 G | 11.20° | 4.27° | - |
GazeTR | 11.4 M | 1.82 G | 10.62° | 4.00° | 7.33° |
GazePTR | 12.1 M | 3.75 G | 10.59° | 3.98° | 7.04° |
DGE-GM | 87.7 M | 15.16 G | 10.62° | 3.76° | - |
Ours | 12.0 M | 1.95 G | 10.46° | 3.88° | 7.02° |
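For context, the Gaze360/MPII/IVGaze columns above (and in the tables that follow) report mean angular error in degrees. Assuming the standard gaze-estimation metric, i.e., the angle between the predicted and ground-truth 3D gaze directions, it can be computed as in the following sketch; the helper name is ours, not from the paper.

```python
# Minimal sketch of the standard angular-error metric used in gaze estimation:
# the angle (in degrees) between predicted and ground-truth 3D gaze vectors.
import numpy as np


def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """pred, gt: arrays of shape (N, 3) holding 3D gaze direction vectors."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos_sim = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_sim))


# Example: nearly aligned vectors give a small error, orthogonal ones give 90°.
p = np.array([[0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
g = np.array([[0.05, 0.0, -1.0], [0.0, 1.0, 0.0]])
print(angular_error_deg(p, g))  # approx. [2.86, 90.0]
```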
Model | MPII (224 × 224) | MPII (448 × 448) | Gaze360 (224 × 224) | Gaze360 (448 × 448) | IVGaze (224 × 224) | IVGaze (448 × 448)
---|---|---|---|---|---|---
ResNet-18 | 4.20° ± 1.25° 1 [3.82, 4.58] 2 | 3.95° ± 1.18° [3.61, 4.29] (5.9%↓) 3 | 10.91° ± 2.35° [10.12, 11.70] | 10.42° ± 2.20° [9.68, 11.16] (4.4%↓) | 8.56° ± 1.85° [7.92, 9.20] | 8.02° ± 1.72° [7.42, 8.62] (6.3%↓) |
Gaze360 | 4.06° ± 1.20° [3.70, 4.42] | 3.84° ± 1.12° [3.52, 4.16] (5.4%↓) | 11.32° ± 2.40° [10.52, 12.12] | 10.88° ± 2.25° [10.13, 11.63] (3.8%↓) | 8.15° ± 1.78° [7.52, 8.78] | 7.74° ± 1.65° [7.16, 8.32] (5.0%↓) |
GazeTR | 4.00° ± 1.15° [3.66, 4.34] | 3.82° ± 1.08° [3.52, 4.12] (4.5%↓) | 10.63° ± 2.20° [9.88, 11.38] | 10.18° ± 2.08° [9.47, 10.89] (4.2%↓) | 7.33° ± 1.65° [6.78, 7.88] | 6.97° ± 1.55° [6.45, 7.49] (4.9%↓) |
Ours | 3.88° ± 1.08° [3.58, 4.18] | 3.52° ± 0.98° [3.25, 3.79] (9.2%↓) | 10.46° ± 2.05° [9.76, 11.16] | 9.81° ± 1.88° [9.16, 10.46] (6.2%↓) | 7.02° ± 1.52° [6.49, 7.55] | 6.48° ± 1.40° [6.00, 6.96] (7.6%↓) |
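A reading note for the table above (the markers 1–3 in the first data row refer to table footnotes that are not reproduced here): our interpretation is that each cell reports the mean angular error ± standard deviation, a bracketed interval estimate, and, in parentheses, the relative error reduction obtained by moving from 224 × 224 to 448 × 448 input. The parenthesized values are consistent with computing the reduction from the two mean errors, e.g., for our model on Gaze360:

```latex
\text{reduction} = \frac{e_{224} - e_{448}}{e_{224}}
                 = \frac{10.46^{\circ} - 9.81^{\circ}}{10.46^{\circ}} \approx 6.2\%
```

Small discrepancies in the last digit of other cells are attributable to rounding of the underlying, unrounded errors.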
Model | Params (M) | FLOPs (G) | Gaze360 | MPII | IVGaze |
---|---|---|---|---|---|
ResNet-18 | 11.203 | 1.827 | 10.91° | 4.20° | 8.56°
Gaze360 | 14.642 | 12.782 | 11.32° | 4.06° | 8.15° |
GazeTR | 11.448 | 1.824 | 10.63° | 4.00° | 7.33° |
ResNet-18 + SFM | 11.409 | 1.945 | 10.74° | 4.07° | 7.67°
ResNet-18 + MSCSA | 11.617 | 1.932 | 10.65° | 3.97° | 7.41°
Ours (ResNet-18 + SFM + MSCSA) | 12.041 | 1.953 | 10.46° | 3.88° | 7.02°
Model | Params (M) | Error (°) |
---|---|---|
ResNet-18 [35] | 11.2 | 10.91° |
PVTv2-B0 [36] | 3.4 | 11.32° |
PVTv2-B2 [36] | 24.9 | 10.63° |
ResNet-18 + HFGAD | 12.0 | 10.46° |
PVTv2-B0 + HFGAD | 3.7 | 11.00° |
PVTv2-B2 + HFGAD | 25.8 | 10.24° |
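To illustrate what "plug-and-play" means in the backbone study above, here is a minimal, hypothetical sketch of attaching a small decoder head to an interchangeable backbone's final feature map. The GazeDecoder class is a placeholder of ours and is far simpler than HFGAD; it only shows the attachment pattern. ResNet-18 is used because it is readily available in torchvision; a PVTv2 backbone would plug in the same way given its output feature maps.

```python
# Hypothetical "plug-and-play" usage pattern: the same decoder head attached to
# an interchangeable backbone trunk. Names are illustrative, not the authors' API.
import torch
import torch.nn as nn
import torchvision.models as models


class GazeDecoder(nn.Module):
    """Toy decoder head: pools backbone features and regresses 2D gaze angles."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 2),  # (pitch, yaw)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.head(feat)


# Backbone: ResNet-18 trunk without its classification head (512-channel final stage).
resnet = models.resnet18()
backbone = nn.Sequential(*list(resnet.children())[:-2])
decoder = GazeDecoder(in_channels=512)

x = torch.randn(1, 3, 224, 224)
print(decoder(backbone(x)).shape)  # torch.Size([1, 2])
```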