A Gaze Estimation Method Based on Spatial and Channel Reconstructed ResNet Combined with Multi-Clue Fusion
Abstract
1. Introduction
- A feature extraction backbone for gaze estimation, ResNetSC, is proposed by combining ResNet with SCConv. Replacing the conventional 3 × 3 convolution with SCConv not only strengthens the model's ability to extract important features but also markedly reduces spatial and channel redundancy, thereby lowering the number of model parameters (a schematic sketch of this substitution is given after this list).
- The ResNetSC backbone is combined with joint localization of the head, face, and eyes to jointly optimize the video gaze estimation model, yielding a new gaze estimation method for online learning scenarios with improved performance and accuracy.
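A minimal sketch of how a ResNet bottleneck block could have its 3 × 3 convolution replaced by an SCConv-style module is shown below. This is not the authors' implementation: the `SimplifiedSCConv` class only imitates the spatial-reconstruction idea (GroupNorm-derived channel weighting) and omits the full SRU + CRU design of SCConv; all class names and hyperparameters are illustrative assumptions.

```python
# Sketch only: a ResNet bottleneck whose 3x3 conv is swapped for an SCConv-style module.
# "SimplifiedSCConv" is a schematic stand-in, NOT the SCConv of Li et al. (CVPR 2023).
import torch
import torch.nn as nn


class SimplifiedSCConv(nn.Module):
    """Stand-in for SCConv: GroupNorm-derived channel weights gate the feature map
    before a 3x3 convolution, suppressing less informative channels."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the GroupNorm scale factors and use them as per-channel importance.
        w = self.gn.weight.abs() / self.gn.weight.abs().sum()
        gate = torch.sigmoid(self.gn(x) * w.view(1, -1, 1, 1))
        return self.conv(x * gate)


class BottleneckSC(nn.Module):
    """ResNet bottleneck block with the middle 3x3 convolution replaced by SCConv-style conv."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.scconv = SimplifiedSCConv(mid_ch)  # <- replaces the usual 3x3 convolution
        self.expand = nn.Sequential(nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                                    nn.BatchNorm2d(out_ch))
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.expand(self.scconv(self.reduce(x))) + self.shortcut(x))


if __name__ == "__main__":
    block = BottleneckSC(in_ch=256, mid_ch=64, out_ch=256)
    y = block(torch.randn(2, 256, 56, 56))
    print(y.shape)  # torch.Size([2, 256, 56, 56])
```

Because only the middle 3 × 3 convolution is swapped, the 1 × 1 reduce/expand convolutions and the residual shortcut of the ResNet block remain untouched, which is what keeps such a substitution lightweight.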
2. Related Work
3. Methodology
3.1. Spatial-Channel Reconstruction Convolution
3.2. Spatiotemporal Query Interaction
3.3. Clue Localization Heads and Gaze Fusion Heads
3.4. Loss Function
4. Experimental Results and Analysis
4.1. Datasets
4.2. Experimental Setup
4.3. Experiments on the Detectable Face Subset of Gaze360
4.4. Experiments on the Entire Gaze360 Dataset
4.5. Experimental Results Analysis
- (1) The proposed model, RSP-MCGaze, was evaluated on the detectable-face subset of Gaze360 and achieved the lowest angular errors on all three metrics (360°, 180°, and front face), demonstrating the superiority of RSP-MCGaze. (The angular-error metric is sketched after this list.)
- (2) Compared with MCGaze, RSP-MCGaze achieves lower angular errors on the key front face, 360°, and 180° metrics. This indicates that the ResNet backbone optimized with SCConv, through spatial and channel reconstruction, focuses more effectively on the important features of the head, face, and eyes, and therefore attains lower errors and better performance on the gaze estimation task.
- (3) Experiments on the full Gaze360 dataset show that the model's predictions are weaker on the portion of the data in which no face can be detected. The reason is that RSP-MCGaze relies on the interrelationships among the head, face, and eyes, so clear images or video of the face, head, and eyes are required to maintain its gaze estimation performance.
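The angular error used throughout the experiments is the angle between the predicted and ground-truth 3D gaze direction vectors, reported in degrees. A minimal sketch of this standard metric follows; the function name and example vectors are illustrative only and do not come from the paper.

```python
# Standard angular-error metric for 3D gaze estimation: the angle (degrees)
# between the predicted and ground-truth gaze direction vectors.
import numpy as np


def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Angle in degrees between two 3D gaze direction vectors."""
    pred = pred / np.linalg.norm(pred)
    gt = gt / np.linalg.norm(gt)
    cos = np.clip(np.dot(pred, gt), -1.0, 1.0)  # clip to avoid NaN from rounding
    return float(np.degrees(np.arccos(cos)))


# Example: a prediction tilted ~10 degrees away from the ground-truth direction.
gt = np.array([0.0, 0.0, -1.0])
pred = np.array([np.sin(np.radians(10.0)), 0.0, -np.cos(np.radians(10.0))])
print(round(angular_error_deg(pred, gt), 2))  # 10.0
```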
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272.
- Nonaka, S.; Nobuhara, S.; Nishino, K. Dynamic 3D gaze from afar: Deep gaze estimation from temporal eye-head-body coordination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2192–2201.
- Bao, Y.; Cheng, Y.; Liu, Y.; Lu, F. Adaptive feature fusion network for gaze tracking in mobile tablets. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 9936–9943.
- Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10623–10630.
- Bao, J.; Liu, B.; Yu, J. An individual-difference-aware model for cross-person gaze estimation. IEEE Trans. Image Process. 2022, 31, 3322–3333.
- Guan, Y.; Chen, Z.; Zeng, W.; Cao, Z.; Xiao, Y. End-to-end video gaze estimation via capturing head-face-eye spatial-temporal interaction context. IEEE Signal Process. Lett. 2023, 30, 1687–1691.
- Huang, G.; Shi, J.; Xu, J.; Li, J.; Chen, S.; Du, Y.; Zhen, X.; Liu, H. Gaze estimation by attention-induced hierarchical variational auto-encoder. IEEE Trans. Cybern. 2023, 54, 2592–2605.
- Xu, M.; Wang, H.; Lu, F. Learning a generalized gaze estimator from gaze-consistent feature. Proc. AAAI Conf. Artif. Intell. 2023, 37, 3027–3035.
- Hisadome, Y.; Wu, T.; Qin, J.; Sugano, Y. Rotation-constrained cross-view feature fusion for multi-view appearance-based gaze estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5985–5994.
- Yin, P.; Zeng, G.; Wang, J.; Xie, D. CLIP-Gaze: Towards general gaze estimation via visual-linguistic model. Proc. AAAI Conf. Artif. Intell. 2024, 38, 6729–6737.
- Oh, J.O.; Chang, H.J.; Choi, S.I. Self-attention with convolution and deconvolution for efficient eye gaze estimation from a full face image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4992–5000.
- Biswas, P. Appearance-based gaze estimation using attention and difference mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Li, Y.; Huang, L.; Chen, J.; Tan, B. Appearance-based gaze estimation method using static transformer temporal differential network. Mathematics 2023, 11, 686.
- Wu, X.; Li, L.; Zhu, H.; Zhou, G.; Li, L.; Su, F.; He, S.; Wang, Y.; Long, X. EG-Net: Appearance-based eye gaze estimation using an efficient gaze network with attention mechanism. Expert Syst. Appl. 2023, 238, 122363.
- Zhang, X.; Park, S.; Beeler, T.; Bradley, D.; Tang, S.; Hilliges, O. ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2020; pp. 365–381.
- Zhang, M.; Liu, Y.; Lu, F. GazeOnce: Real-time multi-person gaze estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4197–4206.
- Balim, H.; Park, S.; Wang, X.; Zhang, X.; Hilliges, O. EFE: End-to-end frame-to-gaze estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2688–2697.
- Chen, J.; He, T.; Zhuo, W.; Ma, L.; Ha, S.; Chan, S.-H.G. TVConv: Efficient translation variant convolution for layout-aware visual processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12548–12558.
- Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Liu, Z. Mobile-Former: Bridging MobileNet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5270–5279.
- Sun, X.; Hassani, A.; Wang, Z.; Huang, G.; Shi, H. DiSparse: Disentangled sparsification for multitask model compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12382–12392.
- Xia, M.; Zhong, Z.; Chen, D. Structured pruning learns compact and accurate models. arXiv 2022, arXiv:2204.00408.
- Li, J.; Wen, Y.; He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162.
- Yang, S.; Wang, X.; Li, Y.; Fang, Y.; Fang, J.; Liu, W.; Zhao, X.; Shan, Y. Temporally efficient vision transformer for video instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2885–2895.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
- Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 3341–3347.
- Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Al-Hamadi, A. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. arXiv 2022, arXiv:2203.03339.
- Yan, C.; Pan, W.; Xu, C.; Dai, S.; Li, X. Gaze estimation via strip pooling and multi-criss-cross attention networks. Appl. Sci. 2023, 13, 5901.
- Nagpure, V.; Okuma, K. Searching efficient neural architecture with multi-resolution fusion transformer for appearance-based gaze estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, January 2023; pp. 890–899.
- Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6912–6921.
Angular error (°) on the entire Gaze360 dataset:

| Method | 360° | 180° | Front Face |
|---|---|---|---|
| Gaze360 [30] | 13.50 | 11.40 | 11.10 |
| MCGaze [6] | 12.96 | 10.74 | 10.02 |
| RSP-MCGaze (Ours) | 13.03 | 10.88 | 9.86 |
Angular error (°) on the detectable-face subset of Gaze360 (N/A: not reported):

| Method | Detectable Faces | Front 180° | Front Face |
|---|---|---|---|
| Gaze360 [30] | 11.04 | N/A | N/A |
| CA-Net [11] | 11.20 | N/A | N/A |
| GazeTR [26] | 10.62 | N/A | N/A |
| L2CS-Net [27] | 10.60 | 10.41 | 9.04 |
| SPMCCA-Net [28] | N/A | 10.13 | 8.40 |
| CADSE [12] | 10.70 | N/A | N/A |
| GazeNAS-ETH [29] | 10.52 | N/A | N/A |
| MCGaze [6] | 10.02 | 9.81 | 7.57 |
| RSP-MCGaze (Ours) | 9.66 | 9.08 | 7.11 |