Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism
Abstract
1. Introduction
- (1) We introduce Swin-T into the gaze estimation task and propose a Swin-T-based gaze estimation network (SwinT-GE).
- (2) We propose a novel network, Res-Swin-GE, that combines a CNN with Swin-T and achieves improved performance on two public datasets; a sketch of this kind of hybrid follows this list.
- (3) We analyze the performance improvement gained by using a CNN as the pre-feature-extraction module.
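The paper's code is not reproduced here; as an illustration of contribution (2), the following is a minimal PyTorch sketch of a residual convolutional stem used for pre-feature extraction ahead of a Swin-T backbone, with the classifier head replaced by a two-dimensional (pitch, yaw) gaze regressor. The module names, channel widths, and the use of torchvision's `swin_t` are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: a ResNet-style convolutional stem feeding a Swin-T backbone
# for 2D gaze regression. This is NOT the authors' Res-Swin-GE code; layer
# sizes and the torchvision backbone are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import swin_t

class ResidualStem(nn.Module):
    """A small residual block mapping a face image back to a 3-channel
    feature map, so the standard Swin-T patch embedding can consume it."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(3, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.proj = nn.Conv2d(3, channels, 1)  # match channels for the skip
        self.out = nn.Conv2d(channels, 3, 1)   # back to 3 channels for Swin-T
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        y = self.relu(y + self.proj(x))        # residual connection
        return self.out(y)

class ResSwinGaze(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = ResidualStem()
        self.backbone = swin_t(weights=None)
        # Replace the 1000-way classifier with a (pitch, yaw) regressor.
        self.backbone.head = nn.Linear(self.backbone.head.in_features, 2)

    def forward(self, x):          # x: (B, 3, 224, 224) face crops
        return self.backbone(self.stem(x))

model = ResSwinGaze()
gaze = model(torch.randn(1, 3, 224, 224))  # -> (1, 2): pitch, yaw
```

Keeping the stem's output at three channels lets the unmodified Swin-T patch embedding consume it directly; the authors' actual Res-Swin-GE may fuse the convolutional and transformer stages differently.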
2. Related Works
3. Gaze Estimation Based on Swin-T
3.1. SwinT-GE Applied to Gaze Estimation
3.2. Res-Swin-GE Applied to Gaze Estimation
4. Experiments
4.1. Experimental Details
4.2. Datasets
4.3. Evaluation Metric
4.4. Window Size Parameter Analysis of Res-Swin-GE
4.5. Experimental Results Analysis
4.5.1. Angular Error of SwinT-GE and Res-Swin-GE on Different Subjects
4.5.2. Angular Error of SwinT-GE and Res-Swin-GE at Different Gaze Angles
4.5.3. Comparison of Slicing-and-Mapping Mechanism and ResNet Block Feature Extraction Effect
4.5.4. Comparison to the State-of-the-Art
4.6. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- He, H.; She, Y.; Xiahou, J.; Yao, J.; Li, J.; Hong, Q.; Ji, Y. Real-time eye-gaze based interaction for human intention prediction and emotion analysis. In Proceedings of the Computer Graphics International 2018, Bintan Island, Indonesia, 11–14 June 2018; pp. 185–194.
- Breen, M.; Reed, T.; Nishitani, Y.; Jones, M.; Breen, H.M.; Breen, M.S. Wearable and Non-Invasive Sensors for Rock Climbing Applications: Science-Based Training and Performance Optimization. Sensors 2023, 23, 5080.
- Canavan, S.; Chen, M.; Chen, S.; Valdez, R.; Yaeger, M.; Lin, H.; Yin, L. Combining gaze and demographic feature descriptors for autism classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3750–3754.
- Patney, A.; Kim, J.; Salvi, M.; Kaplanyan, A.; Wyman, C.; Benty, N.; Lefohn, A.; Luebke, D. Perceptually-based foveated virtual reality. In Proceedings of the ACM SIGGRAPH 2016 Emerging Technologies, Anaheim, CA, USA, 24–28 July 2016; pp. 1–2.
- Pérez-Reynoso, F.D.; Rodríguez-Guerrero, L.; Salgado-Ramírez, J.C.; Ortega-Palacios, R. Human–Machine Interface: Multiclass Classification by Machine Learning on 1D EOG Signals for the Control of an Omnidirectional Robot. Sensors 2021, 21, 5882.
- Mohammad, Y.; Nishida, T. Controlling gaze with an embodied interactive control architecture. Appl. Intell. 2010, 32, 148–163.
- Roy, K.; Chanda, D. A Robust Webcam-based Eye Gaze Estimation System for Human-Computer Interaction. In Proceedings of the 2022 International Conference on Innovations in Science, Engineering and Technology (ICISET), Istanbul, Turkey, 23–25 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 146–151.
- Lystbæk, M.N.; Pfeuffer, K.; Grønbæk, J.E.S.; Gellersen, H. Exploring gaze for assisting freehand selection-based text entry in AR. Proc. ACM Hum.-Comput. Interact. 2022, 6, 141.
- dos Santos, R.d.O.J.; de Oliveira, J.H.C.; Rocha, J.B.; Giraldi, J.d.M.E. Eye tracking in neuromarketing: A research agenda for marketing studies. Int. J. Psychol. Stud. 2015, 7, 32.
- Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.M.; Matusik, W.; Torralba, A. Eye Tracking for Everyone. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2176–2184.
- Recasens, A.; Khosla, A.; Vondrick, C.; Torralba, A. Where are they looking? Adv. Neural Inf. Process. Syst. 2015, 28, 1251.
- Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10623–10630.
- Alberto Funes Mora, K.; Odobez, J.M. Geometric generative gaze estimation (G3E) for remote RGB-D cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1773–1780.
- Liu, J.; Chi, J.; Sun, H. An Automatic Calibration Method for Kappa Angle Based on a Binocular Gaze Constraint. Sensors 2023, 23, 3929.
- Guestrin, E.D.; Eizenman, M. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Trans. Biomed. Eng. 2006, 53, 1124–1133.
- Mokatren, M.; Kuflik, T.; Shimshoni, I. 3D Gaze Estimation Using RGB-IR Cameras. Sensors 2023, 23, 381.
- Martinikorena, I.; Cabeza, R.; Villanueva, A.; Urtasun, I.; Larumbe, A. Fast and robust ellipse detection algorithm for head-mounted eye tracking systems. Mach. Vis. Appl. 2018, 29, 845–860.
- Baluja, S.; Pomerleau, D. Non-intrusive gaze tracking using artificial neural networks. Adv. Neural Inf. Process. Syst. 1993, 6, 753–760.
- Tan, K.H.; Kriegman, D.J.; Ahuja, N. Appearance-based eye gaze estimation. In Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision (WACV 2002), Orlando, FL, USA, 3–4 December 2002; IEEE: Piscataway, NJ, USA, 2002; pp. 191–195.
- Sugano, Y.; Matsushita, Y.; Sato, Y. Appearance-based gaze estimation using visual saliency. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 329–341.
- Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4511–4520.
- Fischer, T.; Chang, H.J.; Demiris, Y. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 334–352.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Cheng, Y.; Lu, F.; Zhang, X. Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 100–115.
- Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272.
- Park, S.; Spurr, A.; Hilliges, O. Deep pictorial gaze estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 721–738.
- Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part VI. Springer: Berlin/Heidelberg, Germany, 2019; pp. 309–324.
- Kellnhofer, P.; Recasens, A.; Stent, S.; Matusik, W.; Torralba, A. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6912–6921.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
- Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. CMT: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12175–12185.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Cheng, Y.; Lu, F. Gaze estimation using transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3341–3347.
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
- Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101.
- Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 162–175.
- Funes Mora, K.A.; Monay, F.; Odobez, J.M. EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, Safety Harbor, FL, USA, 26–28 March 2014; pp. 255–258.
- Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based gaze estimation with deep learning: A review and benchmark. arXiv 2021, arXiv:2104.12668.
- Zhou, X.; Cai, H.; Li, Y.; Liu, H. Two-eye model-based gaze estimation from a Kinect sensor. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1646–1653.
- Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. It’s written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 51–60.
- Bao, Y.; Cheng, Y.; Liu, Y.; Lu, F. Adaptive feature fusion network for gaze tracking in mobile tablets. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9936–9943.
Window-size analysis of Res-Swin-GE (Section 4.4), mean angular error (lower is better):

| Window Size (WS) | MpiiFaceGaze [38] | Eyediap [39] |
|---|---|---|
| WS = 1 | 3.94° | 4.86° |
| WS = 2 | 3.75° | 4.78° |
| WS = 3 | 4.76° | 4.85° |
| WS = 4 | 3.82° | 5.18° |
| WS = 5 | 5.13° | 4.98° |
| WS = 6 | 4.67° | 4.91° |
| WS = 7 | 4.60° | 4.95° |
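Section 4.4 sweeps the attention window size WS; the sketch below shows how that hyperparameter enters a Swin backbone, assuming torchvision's `SwinTransformer` with otherwise Swin-T-sized stages. This is illustrative, not the authors' configuration. WS = 2 is the best setting in the table on both datasets and matches the final Res-Swin-GE results (3.75°/4.78°).

```python
# Hedged sketch: a Swin backbone with a configurable window size, the
# hyperparameter varied in the table above (WS = 1..7). Stage widths/depths
# follow torchvision's Swin-T preset; this is not the authors' code.
from torchvision.models.swin_transformer import SwinTransformer

def swin_backbone(window_size: int) -> SwinTransformer:
    return SwinTransformer(
        patch_size=[4, 4],
        embed_dim=96,
        depths=[2, 2, 6, 2],         # Swin-T stage depths
        num_heads=[3, 6, 12, 24],    # Swin-T heads per stage
        window_size=[window_size, window_size],
        num_classes=2,               # regress (pitch, yaw) directly
    )

model = swin_backbone(window_size=2)  # WS = 2 performed best in the table
```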
Per-subject angular error on MpiiFaceGaze [38] (Section 4.5.1), subjects P0–P14:

| Methods | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
|---|---|---|---|---|---|---|---|---|
| SwinT-GE | 8.46° | 8.69° | 8.89° | 9.80° | 9.08° | 8.81° | 10.32° | 9.54° |
| Res-Swin-GE | 2.22° | 2.66° | 3.60° | 3.84° | 3.01° | 3.49° | 3.16° | 4.80° |

| Methods | P8 | P9 | P10 | P11 | P12 | P13 | P14 | Avg |
|---|---|---|---|---|---|---|---|---|
| SwinT-GE | 9.79° | 9.08° | 8.74° | 8.51° | 8.65° | 9.11° | 10.83° | 9.29° |
| Res-Swin-GE | 4.41° | 4.48° | 3.33° | 3.49° | 4.83° | 3.79° | 5.10° | 3.75° |
Per-subject angular error on Eyediap [39] (Section 4.5.1), subjects P1–P16:

| Methods | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 |
|---|---|---|---|---|---|---|---|---|---|
| SwinT-GE | 10.36° | 9.67° | 9.84° | 10.63° | 10.48° | 10.45° | 10.68° | 10.44° | 9.21° |
| Res-Swin-GE | 4.46° | 4.15° | 3.62° | 6.29° | 5.07° | 6.21° | 6.15° | 5.82° | 6.14° |

| Methods | P10 | P11 | P12 | P13 | P14 | P15 | P16 | Avg |
|---|---|---|---|---|---|---|---|---|
| SwinT-GE | 10.75° | 9.55° | - | - | 8.28° | 9.59° | 9.95° | 9.99° |
| Res-Swin-GE | 5.00° | 4.16° | - | - | 3.18° | 3.39° | 3.32° | 4.78° |
Comparison to the state of the art (Section 4.5.4), mean angular error:

| Methods | MpiiFaceGaze [38] | Eyediap [39] |
|---|---|---|
| iTracker [10] | 7.33° | 7.13° |
| RT-Gene [41] | 4.66° | 6.02° |
| FullFace [42] | 4.93° | 6.53° |
| Dilated-Net [22] | 4.42° | 6.19° |
| CA-Net [12] | 4.27° | 5.27° |
| Gaze360 [23] | 4.06° | 5.36° |
| GazeTR-Hybrid [35] | 4.00° | 5.17° |
| AFF-Net [43] | 3.73° | 6.41° |
| Res-Swin-GE (ours) | 3.75° | 4.78° |
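All results above are mean angular errors in degrees. For reference, here is a minimal sketch of the standard gaze-benchmark metric: the angle between predicted and ground-truth 3D gaze vectors, assuming an MPIIGaze-style (pitch, yaw)-to-vector convention. Sign conventions vary across datasets, and this code is not taken from the paper.

```python
# Hedged sketch of the standard angular-error metric (degrees) used in the
# tables above. The (pitch, yaw) -> 3D-vector conversion follows the common
# MPIIGaze convention; it is an assumption, not the authors' code.
import numpy as np

def pitchyaw_to_vector(py: np.ndarray) -> np.ndarray:
    """Convert (N, 2) pitch/yaw angles in radians to (N, 3) unit vectors."""
    pitch, yaw = py[:, 0], py[:, 1]
    return np.stack([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ], axis=1)

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-sample angle (degrees) between predicted and true gaze vectors."""
    a, b = pitchyaw_to_vector(pred), pitchyaw_to_vector(gt)
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)  # vectors are unit norm
    return np.degrees(np.arccos(cos))

pred = np.array([[0.10, 0.20]])
gt = np.array([[0.12, 0.18]])
print(angular_error_deg(pred, gt).mean())  # mean angular error in degrees
```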
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Li, Y.; Chen, J.; Ma, J.; Wang, X.; Zhang, W. Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism. Sensors 2023, 23, 6226. https://doi.org/10.3390/s23136226