Gaze-Swin: Enhancing Gaze Estimation with a Hybrid CNN-Transformer Network and Dropkey Mechanism
Abstract
1. Introduction
2. Related Work
2.1. Appearance-Based Gaze Estimation Using CNN
2.2. Appearance-Based Gaze Estimation Using Transformer
3. Gaze-Swin
3.1. Overall Architecture
3.2. Transformer-Based Global Feature Extractor
3.3. CNN-based Local Feature Extractor
3.4. Unified Prediction Module
3.5. DA-Attention for Gaze-Swin
4. Experiment
4.1. Datasets
4.2. Setup
4.2.1. Training
4.2.2. Evaluation Metric
4.3. Comparison with State of the Art
4.4. Hyper-Parameters in Gaze-Swin
4.4.1. Transformer Layers
4.4.2. Input Channel C
4.4.3. Heads N
4.4.4. CNN Layers L2
4.5. Dropkey Parameter Analysis and Comparison with Dropout
4.6. Ablation Study
4.6.1. Gaze-Swin without CNN Branch
4.6.2. Gaze-Swin without Transformer Branch
4.7. Comparison with Visual Transformer
4.8. Impact of Pre-Training
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372.
2. Jacob, R.J.; Karn, K.S. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind’s Eye; Elsevier: Amsterdam, The Netherlands, 2003; pp. 573–605.
3. Mutlu, B.; Shiwa, T.; Kanda, T.; Ishiguro, H.; Hagita, N. Footing in human-robot conversations: How robots might shape participant roles using gaze cues. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction, La Jolla, CA, USA, 9–13 March 2009; pp. 61–68.
4. Morimoto, C.H.; Mimica, M.R. Eye gaze tracking techniques for interactive applications. Comput. Vis. Image Underst. 2005, 98, 4–24.
5. Patney, A.; Kim, J.; Salvi, M.; Kaplanyan, A.; Wyman, C.; Benty, N.; Lefohn, A.; Luebke, D. Perceptually-based foveated virtual reality. In Proceedings of SIGGRAPH ’16: ACM SIGGRAPH 2016 Emerging Technologies, Anaheim, CA, USA, 24–28 July 2016; pp. 1–2.
6. Demiris, Y. Prediction of intent in robotics and multi-agent systems. Cogn. Process. 2007, 8, 151–158.
7. Park, H.S.; Jain, E.; Sheikh, Y. Predicting primary gaze behavior using social saliency fields. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 3503–3510.
8. Yoo, D.H.; Chung, M.J. A novel non-intrusive eye gaze estimation using cross-ratio under large head motion. Comput. Vis. Image Underst. 2005, 98, 25–51.
9. Zhu, Z.; Ji, Q. Eye gaze tracking under natural head movements. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 918–923.
10. Zhu, Z.; Ji, Q.; Bennett, K.P. Nonlinear eye gaze mapping function estimation via support vector regression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 1, pp. 1132–1135.
11. Hennessey, C.; Noureddin, B.; Lawrence, P. A single camera eye-gaze tracking system with free head motion. In Proceedings of the 2006 Symposium on Eye Tracking Research & Applications, San Diego, CA, USA, 27–29 March 2006; pp. 87–94.
12. Ishikawa, T.; Baker, S.; Matthews, I.; Kanade, T. Passive driver gaze tracking with active appearance models. In Proceedings of the 11th World Congress on Intelligent Transportation Systems, Nagoya, Japan, 18–22 October 2004.
13. Chen, J.; Ji, Q. 3D gaze estimation with a single camera without IR illumination. In Proceedings of the 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4.
14. Valenti, R.; Sebe, N.; Gevers, T. Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 2011, 21, 802–815.
15. Hansen, D.W.; Pece, A.E. Eye tracking in the wild. Comput. Vis. Image Underst. 2005, 98, 155–181.
16. Huang, M.X.; Li, J.; Ngai, G.; Leong, H.V. ScreenGlint: Practical, in-situ gaze estimation on smartphones. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 2546–2557.
17. Ansari, M.F.; Kasprowski, P.; Obetkal, M. Gaze tracking using an unmodified web camera and convolutional neural network. Appl. Sci. 2021, 11, 9068.
18. Li, Y.; Huang, L.; Chen, J.; Wang, X.; Tan, B. Appearance-based gaze estimation method using static transformer temporal differential network. Mathematics 2023, 11, 686.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
22. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019.
23. Li, B.; Hu, Y.; Nie, X.; Han, C.; Jiang, X.; Guo, T.; Liu, L. DropKey for vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22700–22709.
24. Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 3341–3347.
25. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4511–4520.
26. Fischer, T.; Chang, H.J.; Demiris, Y. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–352.
27. Cheng, Y.; Lu, F.; Zhang, X. Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 100–115.
28. Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.; Matusik, W.; Torralba, A. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2176–2184.
29. Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 51–60.
30. Cheng, Y.; Lu, F. DVGaze: Dual-view gaze estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 20632–20641.
31. Nagpure, V.; Okuma, K. Searching efficient neural architecture with multi-resolution fusion transformer for appearance-based gaze estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 890–899.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Zhang, X.; Park, S.; Beeler, T.; Bradley, D.; Tang, S.; Hilliges, O. ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part V; Springer: Cham, Switzerland, 2020; pp. 365–381.
34. Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6912–6921.
35. Funes Mora, K.A.; Monay, F.; Odobez, J.M. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, Safety Harbor, FL, USA, 26–28 March 2014; pp. 255–258.
36. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
37. Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision, Perth, WA, Australia, 2–6 December 2018; Springer: Cham, Switzerland, 2018; pp. 309–324.
38. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10623–10630.
39. Palmero, C.; Selva, J.; Bagheri, M.; Escalera, S. Recurrent CNN for 3D gaze estimation using appearance and shape cues. arXiv 2018, arXiv:1805.03064.
40. Oh, J.O.; Chang, H.J.; Choi, S.I. Self-attention with convolution and deconvolution for efficient eye gaze estimation from a full face image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4992–5000.
Table 1. Comparison with the state of the art (mean angular error in degrees; lower is better).

| Method | Gaze360 | RT-Gene | EyeDiap |
|---|---|---|---|
| FullFace [29] | 14.99° | 10.00° | 6.53° |
| RT-Gene [26] | 12.26° | 8.60° | 6.02° |
| Dilated-Net [37] | 13.73° | 8.38° | 6.19° |
| CA-Net [38] | 11.20° | 8.27° | 5.27° |
| RCNN [39] | 11.23° | 10.30° | 5.31° |
| Gaze360 [34] | 11.04° | 7.06° | 5.36° |
| CADSE [40] | 10.70° | 7.00° | 5.25° |
| GazeTR-Pure [24] | 13.58° | 8.06° | 5.72° |
| GazeTR-Hybrid [24] | 10.62° | 6.55° | 5.17° |
| GazeNAS-ETH [31] | 10.52° | 6.40° | 5.00° |
| Gaze-Swin | 10.14° | 6.38° | 4.90° |
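The errors in Table 1 (and in the tables that follow) are mean 3D angular errors. As a reference, here is a minimal sketch, not taken from the paper, of how this metric is commonly computed when gaze is parameterized as (pitch, yaw); the coordinate convention and function names are assumptions for illustration.

```python
import numpy as np

def gaze_to_vector(pitch_yaw: np.ndarray) -> np.ndarray:
    """Convert (pitch, yaw) in radians to unit 3D gaze vectors.

    Uses a common gaze-estimation convention: pitch rotates about the
    horizontal axis, yaw about the vertical axis, camera looking along -z.
    """
    pitch, yaw = pitch_yaw[..., 0], pitch_yaw[..., 1]
    x = -np.cos(pitch) * np.sin(yaw)
    y = -np.sin(pitch)
    z = -np.cos(pitch) * np.cos(yaw)
    return np.stack([x, y, z], axis=-1)

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Angle in degrees between predicted and ground-truth gaze directions."""
    p, g = gaze_to_vector(pred), gaze_to_vector(gt)
    cos = np.sum(p * g, axis=-1)  # both are unit vectors, so dot = cos(angle)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: a pure 5-degree yaw error yields an angular error of ~5 degrees.
print(angular_error_deg(np.array([[0.0, np.radians(5)]]),
                        np.array([[0.0, 0.0]])))
```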
Table 2. Hyper-parameter study for Gaze-Swin (mean angular error).

| Hyper-Parameter | Value | Gaze360 | RT-Gene | EyeDiap |
|---|---|---|---|---|
| Transformer layers | 3 | 10.22° | 6.69° | 5.16° |
| Transformer layers | 6 | 10.14° | 6.38° | 4.90° |
| Transformer layers | 12 | 10.04° | 6.83° | 5.27° |
| Input channel C | 48 | 10.17° | 6.87° | 5.46° |
| Input channel C | 96 | 10.14° | 6.38° | 4.90° |
| Input channel C | 192 | 10.29° | 6.63° | 5.14° |
| Heads N | 6 | 10.14° | 6.38° | 4.90° |
| Heads N | 12 | 10.16° | 6.67° | 5.12° |
| Heads N | 24 | 10.05° | 7.01° | 5.30° |
| CNN layers | 18 | 10.14° | 6.38° | 4.90° |
| CNN layers | 34 | 10.04° | 6.73° | 4.98° |
| CNN layers | 50 | 10.01° | 6.97° | 5.06° |
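The configuration that performs best on RT-Gene and EyeDiap in Table 2 uses 6 Transformer layers, input channel C = 96, N = 6 heads, and an 18-layer CNN branch. A small config object makes these choices explicit; this is a hedged sketch with field names of my own choosing, not the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GazeSwinConfig:
    """Best-performing hyper-parameters from Table 2 (illustrative names)."""
    transformer_layers: int = 6    # depth of the Swin-based global branch
    input_channels: int = 96       # embedding dimension C
    num_heads: int = 6             # attention heads N
    cnn_layers: int = 18           # CNN-branch depth (18/34/50 in the study)
    dropkey_ratio: float = 0.1     # d in the DropKey analysis (Table 3)
    dropkey_alpha: float = 0.005   # alpha in the DropKey analysis (Table 3)

cfg = GazeSwinConfig()
```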
Table 3. DropKey parameter analysis and comparison with Dropout (mean angular error).

| Method | d | α | Gaze360 | RT-Gene | EyeDiap |
|---|---|---|---|---|---|
| Dropkey | 0.1 | 0 | 10.19° | 6.52° | 5.01° |
| Dropkey | 0.1 | 0.005 | 10.14° | 6.38° | 4.90° |
| Dropkey | 0.1 | 0.01 | 10.12° | 6.50° | 5.06° |
| Dropkey | 0.2 | 0 | 10.18° | 6.49° | 4.99° |
| Dropkey | 0.2 | 0.01 | 10.18° | 6.42° | 4.89° |
| Dropkey | 0.2 | 0.02 | 10.15° | 6.51° | 4.97° |
| Dropkey | 0.3 | 0 | 10.36° | 6.57° | 5.28° |
| Dropkey | 0.3 | 0.01 | 10.31° | 6.48° | 5.23° |
| Dropkey | 0.3 | 0.02 | 10.27° | 6.51° | 5.08° |
| Dropout | 0 | 0 | 10.26° | 6.49° | 5.13° |
| Dropout | 0.1 | 0 | 10.20° | 6.51° | 5.06° |
| Dropout | 0.2 | 0 | 10.22° | 6.43° | 5.11° |
| Dropout | 0.3 | 0 | 10.41° | 6.54° | 5.44° |
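Unlike Dropout, which zeroes attention weights after the softmax, DropKey [23] masks a random subset of keys before the softmax, so the surviving attention weights renormalize. A minimal PyTorch sketch of that core mechanism is below; the α schedule from Table 3 belongs to the paper's DA-Attention and is not reproduced here.

```python
import torch

def attention_with_dropkey(q, k, v, drop_ratio: float = 0.1,
                           training: bool = True):
    """Scaled dot-product attention with DropKey regularization.

    DropKey adds -inf to a random subset of attention logits, i.e. it
    drops keys *before* the softmax, so the remaining attention weights
    renormalize to sum to one (in contrast to post-softmax Dropout).
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale       # (..., Lq, Lk)
    if training and drop_ratio > 0:
        mask = torch.rand_like(logits) < drop_ratio  # Bernoulli key mask
        logits = logits.masked_fill(mask, float("-inf"))
    attn = logits.softmax(dim=-1)
    return attn @ v

# Example: batch of 2, 4 heads, 49 window tokens, head dimension 32.
q = k = v = torch.randn(2, 4, 49, 32)
out = attention_with_dropkey(q, k, v, drop_ratio=0.1)
```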
Table 4. Ablation study of the two branches (mean angular error).

| Method | Gaze360 | RT-Gene | EyeDiap |
|---|---|---|---|
| Gaze-Swin | 10.14° | 6.38° | 4.90° |
| Without CNN branch | 10.40° | 6.61° | 5.19° |
| Without Transformer branch | 10.77° | 6.64° | 5.23° |
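Table 4 shows that each branch contributes: removing either one costs roughly 0.2–0.6° across datasets. The sketch below illustrates the two-branch fusion pattern the ablation probes, assuming (as Sections 3.1–3.4 suggest) a Transformer-based global extractor and a CNN-based local extractor feeding a unified prediction head; the placeholder backbones and module names are mine, not the paper's.

```python
import torch
import torch.nn as nn

class TwoBranchGazeNet(nn.Module):
    """Illustrative two-branch layout mirroring Sections 3.1-3.4.

    Simple convolutional stand-ins replace the paper's Swin-based global
    extractor and CNN-based local extractor; only the fusion pattern that
    Table 4 ablates is shown.
    """

    def __init__(self, global_dim: int = 768, local_dim: int = 512):
        super().__init__()
        # Stand-in for the Transformer (global) branch: patchify + tokens.
        self.global_branch = nn.Sequential(
            nn.Conv2d(3, global_dim, kernel_size=32, stride=32),
            nn.Flatten(2),                       # (B, global_dim, tokens)
        )
        # Stand-in for the CNN (local) branch: conv features + global pool.
        self.local_branch = nn.Sequential(
            nn.Conv2d(3, local_dim, kernel_size=7, stride=4, padding=3),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(1),                       # (B, local_dim)
        )
        # Unified prediction module: fused features -> (pitch, yaw).
        self.head = nn.Sequential(
            nn.Linear(global_dim + local_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        g = self.global_branch(face).mean(dim=-1)  # pool patch tokens
        l = self.local_branch(face)
        return self.head(torch.cat([g, l], dim=-1))

model = TwoBranchGazeNet()
gaze = model(torch.randn(2, 3, 224, 224))  # -> (2, 2) pitch/yaw estimates
```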
Table 5. Comparison with vision-transformer baselines (parameters, FLOPs, mean angular error).

| Method | Params | FLOPs | Gaze360 | RT-Gene | EyeDiap |
|---|---|---|---|---|---|
| Gaze-Swin | 32.28 M | 5.17 G | 10.14° | 6.38° | 4.90° |
| Visual Transformer | 227.19 M | 44.84 G | 13.58° | 8.06° | 5.72° |
| Gaze-ViT | 238.37 M | 46.66 G | 10.36° | 6.60° | 5.06° |