A Novel Multimodal Hand Gesture Recognition Model Using Combined Approach of Inter-Frame Motion and Shared Attention Weights
Abstract
1. Introduction
- A novel inter-frame motion attention (IFMA) mechanism is proposed, which narrows the spatiotemporal search range to hand-related areas so that the model focuses more accurately on hand motion features, improving gesture recognition accuracy (a schematic sketch follows this list).
- A novel inter-modal attention weights (IMAW) loss is proposed for training the model. It strengthens the interaction between the depth and RGB modalities, thereby reducing redundant information in the final fused features (also sketched below).
- Experiments are conducted on three challenging datasets: EgoGesture, NVGesture, and Jester. The results demonstrate that the proposed model outperforms existing methods in hand gesture recognition accuracy, and ablation experiments highlight the contribution of each component.
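The paper's exact formulations appear in Sections 3.5 and 3.7; the following is only a minimal PyTorch-style sketch of the two ideas, assuming motion is scored by per-patch frame differences and cross-modal attention agreement is penalized with an MSE term. The function names, tensor layouts, and top-k gating are our illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def motion_gated_attention(tokens, patch_intensity, keep_ratio=0.25):
    """Sketch of the inter-frame motion idea (illustrative, not the paper's
    code): keep only the patches with the largest inter-frame motion, so
    self-attention searches hand-related areas rather than the full frame.

    tokens:          (B, T, P, C) patch embeddings
    patch_intensity: (B, T, P)    mean pixel intensity per patch
    """
    B, T, P, C = tokens.shape
    # Motion score |I_t - I_{t-1}| per patch; pad so frame 0 gets a zero score.
    motion = F.pad((patch_intensity[:, 1:] - patch_intensity[:, :-1]).abs(),
                   (0, 0, 1, 0))                                # (B, T, P)
    n_keep = max(1, int(P * keep_ratio))
    idx = motion.topk(n_keep, dim=-1).indices                   # (B, T, n_keep)
    kept = torch.gather(tokens, 2,
                        idx.unsqueeze(-1).expand(-1, -1, -1, C))
    # Ordinary scaled dot-product self-attention over the reduced token set.
    x = kept.reshape(B, T * n_keep, C)
    attn = torch.softmax(x @ x.transpose(-2, -1) / C ** 0.5, dim=-1)
    return attn @ x, attn

def imaw_loss(attn_rgb, attn_depth):
    """Illustrative stand-in for the inter-modal attention weights (IMAW)
    loss: penalize disagreement between the RGB and depth attention maps so
    the two branches attend to the same regions and fuse less redundancy."""
    return F.mse_loss(attn_rgb, attn_depth)

# Training objective, with lambda the coefficient tuned in Section 4.1.2:
#   total = cross_entropy(logits, labels) + lambda * imaw_loss(a_rgb, a_depth)
```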
2. Related Work
2.1. Attention Mechanism
2.2. Multimodal Hand Gesture Recognition
3. Materials and Methods
3.1. Datasets
3.2. Computer Setup
3.3. Evaluation Metrics
- Top-1 Accuracy: the percentage of samples for which the class with the highest predicted probability matches the ground-truth label.
- Top-5 Accuracy: counts a prediction as correct if the ground-truth label is among the five classes with the highest predicted probabilities. This is particularly informative for datasets with many classes or fine-grained distinctions, where several predictions may be plausible (a sketch of both metrics follows this list).
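As a concrete illustration, here is a minimal PyTorch-style sketch of both metrics (our own illustration; the paper's evaluation code is not reproduced here):

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Top-k accuracy in %: a sample counts as correct when its ground-truth
    label is among the k highest-scoring classes.
    logits: (N, num_classes), labels: (N,)."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)            # (N, max_k), best first
    hits = pred.eq(labels.view(-1, 1))             # (N, max_k) boolean
    return [hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks]

# Example with random scores; EgoGesture, for instance, has 83 gesture classes.
logits = torch.randn(16, 83)
labels = torch.randint(0, 83, (16,))
top1, top5 = topk_accuracy(logits, labels)
```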
3.4. Patch Embedding
3.5. Inter-Frame Motion Attention
3.5.1. Feature Extractor
3.5.2. Attention Calculator
3.6. Adaptive Down-Sampling Module
3.7. Inter-Modal Attention Weights Loss
4. Results
4.1. Hyperparameter Selection
4.1.1. Embedding Channel Number C
4.1.2. Loss Function Coefficient λ
4.1.3. Number of Feature Extractor Layers N
4.2. Ablation Study
4.2.1. Single-Modal Modules
4.2.2. Multimodal Modules
4.3. Comparison with State-of-the-Art Methods
4.3.1. Compared Methods and Naming Convention
- Backbone: an existing state-of-the-art model (e.g., SlowOnly) without any of our modules.
- Ours+Backbone: a single-modal (RGB) backbone enhanced with our proposed IFMA, ASDS, and ATDS modules; for example, Ours+SlowOnly denotes the SlowOnly model integrated with our complete single-modal framework.
- Ours+SlowOnly (RGB-D): our full multimodal model built upon the SlowOnly architecture, using both the AMDS module for fusion and the IMAW loss for training.
4.3.2. Single-Modal (RGB) Performance
4.3.3. Multimodal (RGB-D) Performance
5. Discussion
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Rahman, M.M.; Uzzaman, A.; Khatun, F.; Aktaruzzaman, M.; Siddique, N. A Comparative Study of Advanced Technologies and Methods in Hand Gesture Analysis and Recognition Systems. Expert Syst. Appl. 2025, 266, 125929.
2. Xu, C.; Wu, X.; Wang, M.; Qiu, F.; Liu, Y.; Ren, J. Improving Dynamic Gesture Recognition in Untrimmed Videos by an Online Lightweight Framework and a New Gesture Dataset ZJUGesture. Neurocomputing 2023, 523, 58–68.
3. Qi, J.; Ma, L.; Cui, Z.; Yu, Y. Computer Vision-Based Hand Gesture Recognition for Human-Robot Interaction: A Review. Complex Intell. Syst. 2024, 10, 1581–1606.
4. Yang, L.I.; Huang, J.; Feng, T.; Hong-An, W.; Guo-Zhong, D.A.I. Gesture Interaction in Virtual Reality. Virtual Real. Intell. Hardw. 2019, 1, 84–112.
5. Sharma, S.; Singh, S. Vision-Based Hand Gesture Recognition Using Deep Learning for the Interpretation of Sign Language. Expert Syst. Appl. 2021, 182, 115657.
6. Hashi, A.O.; Hashim, S.Z.M.; Asamah, A.B. A Systematic Review of Hand Gesture Recognition: An Update from 2018 to 2024. IEEE Access 2024, 12, 143599–143626.
7. Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.-G. SVFormer: Semi-Supervised Video Transformer for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18816–18826.
8. Gedamu, K.; Ji, Y.; Gao, L.; Yang, Y.; Shen, H.T. Relation-Mining Self-Attention Network for Skeleton-Based Human Action Recognition. Pattern Recognit. 2023, 139, 109455.
9. Esteva, A.; Chou, K.; Yeung, S.; Naik, N.; Madani, A.; Mottaghi, A.; Liu, Y.; Topol, E.; Dean, J.; Socher, R. Deep Learning-Enabled Medical Computer Vision. npj Digit. Med. 2021, 4, 5.
10. Zhao, D.; Yang, Q.; Zhou, X.; Li, H.; Yan, S. A Local Spatial–Temporal Synchronous Network to Dynamic Gesture Recognition. IEEE Trans. Comput. Soc. Syst. 2022, 10, 2226–2233.
11. Dong, S.; Wang, P.; Abbas, K. A Survey on Deep Learning and Its Applications. Comput. Sci. Rev. 2021, 40, 100379.
12. Rastgoo, R.; Kiani, K.; Escalera, S. Sign Language Recognition: A Deep Survey. Expert Syst. Appl. 2021, 164, 113794.
13. Niu, Z.; Zhong, G.; Yu, H. A Review on the Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62.
14. Saini, M.; Fatemi, M.; Alizad, A. Fast Inter-Frame Motion Correction in Contrast-Free Ultrasound Quantitative Microvasculature Imaging Using Deep Learning. Sci. Rep. 2024, 14, 26161.
15. Shao, Z.; Zhu, H.; Zhou, Y.; Xiang, X.; Liu, B.; Yao, R.; Ma, L. Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample. Int. J. Comput. Vis. 2025, 133, 1711–1726.
16. Wang, Y.; Yang, G.; Li, S.; Li, Y.; He, L.; Liu, D. Arrhythmia Classification Algorithm Based on Multi-Head Self-Attention Mechanism. Biomed. Signal Process. Control 2023, 79, 104206.
17. Li, X.; Li, M.; Yan, P.; Li, G.; Jiang, Y.; Luo, H.; Yin, S. Deep Learning Attention Mechanism in Medical Image Analysis: Basics and Beyonds. Int. J. Netw. Dyn. Intell. 2023, 2, 93–116.
18. Chen, Y.; Zhao, L.; Peng, X.; Yuan, J.; Metaxas, D.N. Construct Dynamic Graphs for Hand Gesture Recognition via Spatial-Temporal Attention. arXiv 2019, arXiv:1907.08871.
19. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action-Gesture Recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
20. Miah, A.S.M.; Hasan, M.A.M.; Shin, J.; Okuyama, Y.; Tomioka, Y. Multistage Spatial Attention-Based Neural Network for Hand Gesture Recognition. Computers 2023, 12, 13.
21. Zhang, W.; Lin, Z.; Cheng, J.; Ma, C.; Deng, X.; Wang, H. STA-GCN: Two-Stream Graph Convolutional Network with Spatial–Temporal Attention for Hand Gesture Recognition. Vis. Comput. 2020, 36, 2433–2444.
22. Ohn-Bar, E.; Trivedi, M.M. Hand Gesture Recognition in Real Time for Automotive Interfaces: A Multimodal Vision-Based Approach and Evaluations. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2368–2377.
23. Miao, Q.; Li, Y.; Ouyang, W.; Ma, Z.; Xu, X.; Shi, W.; Cao, X. Multimodal Gesture Recognition Based on the ResC3D Network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3047–3055.
24. Zhang, X.; Zeng, X.; Sun, W.; Ren, Y.; Xu, T. Multimodal Spatiotemporal Feature Map for Dynamic Gesture Recognition. Comput. Syst. Sci. Eng. 2023, 46, 671–686.
25. Zhang, W.; Wang, J.; Lan, F. Dynamic Hand Gesture Recognition Based on Short-Term Sampling Neural Networks. IEEE/CAA J. Autom. Sin. 2020, 8, 110–120.
26. Elboushaki, A.; Hannane, R.; Afdel, K.; Koutti, L. MultiD-CNN: A Multi-Dimensional Feature Learning Approach Based on Deep Convolutional Networks for Gesture Recognition in RGB-D Image Sequences. Expert Syst. Appl. 2020, 139, 112829.
27. Yu, Z.; Zhou, B.; Wan, J.; Wang, P.; Chen, H.; Liu, X.; Li, S.Z.; Zhao, G. Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition. IEEE Trans. Image Process. 2021, 30, 5626–5640.
28. Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. TMMF: Temporal Multi-Modal Fusion for Single-Stage Continuous Gesture Recognition. IEEE Trans. Image Process. 2021, 30, 7689–7701.
29. Li, J.; Xie, X.; Pan, Q.; Cao, Y.; Zhao, Z.; Shi, G. SGM-Net: Skeleton-Guided Multimodal Network for Action Recognition. Pattern Recognit. 2020, 104, 107356.
30. Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Trans. Multimed. 2018, 20, 1038–1050.
31. Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215.
32. Materzynska, J.; Berger, G.; Bax, I.; Memisevic, R. The Jester Dataset: A Large-Scale Video Dataset of Human Gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30.
35. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
36. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
37. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9912, pp. 20–36.
38. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
39. Hara, K.; Kataoka, H.; Satoh, Y. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3154–3160.
40. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
41. Abavisani, M.; Joze, H.R.V.; Patel, V.M. Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1165–1174.
42. Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; Wang, L. TEA: Temporal Excitation and Aggregation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 909–918.
43. Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093.
44. Feichtenhofer, C. X3D: Expanding Architectures for Efficient Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213.
45. Wang, Z.; She, Q.; Smolic, A. ACTION-Net: Multipath Excitation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13214–13223.
Effect of the number of feature extractor layers N (Section 4.1.3); accuracy in %:

| N | EgoGesture Top-1 | EgoGesture Top-5 | Jester Top-1 | Jester Top-5 |
|---|---|---|---|---|
| 0 | 92.2 | 96.8 | 94.3 | 98.2 |
| 1 | 94.2 | 98.8 | 96.2 | 98.6 |
| 2 | 95.2 | 99.9 | 98.3 | 99.9 |
| 3 | 94.5 | 98.9 | 96.5 | 98.6 |
| 4 | 89.3 | 96.2 | 91.2 | 98.0 |
Ablation of the single-modal modules (Section 4.2.1); √ marks an enabled module; accuracy in %:

| IFMA | ASDS | ATDS | EgoGesture Top-1 | EgoGesture Top-5 | Jester Top-1 | Jester Top-5 |
|---|---|---|---|---|---|---|
| - | - | - | 94.6 | 99.8 | 97.2 | 99.9 |
| √ | - | - | 94.9 | 99.8 | 97.5 | 99.9 |
| - | √ | - | 93.2 | 98.6 | 95.8 | 99.8 |
| - | - | √ | 92.4 | 97.3 | 95.6 | 99.8 |
| √ | √ | - | 95.1 | 99.9 | 97.6 | 99.9 |
| √ | - | √ | 95.0 | 99.9 | 97.9 | 99.9 |
| - | √ | √ | 89.8 | 96.8 | 93.2 | 98.6 |
| √ | √ | √ | 95.2 | 99.9 | 98.3 | 99.9 |
Ablation of the multimodal modules (Section 4.2.2): the first two columns mark the input modalities, the next three the fusion scheme (FC baseline or our AMDS module) and the IMAW training loss; accuracy in %:

| RGB | Depth | FC | AMDS | IMAW | EgoGesture Top-1 | EgoGesture Top-5 | NVGesture Top-1 | NVGesture Top-5 |
|---|---|---|---|---|---|---|---|---|
| √ | - | - | - | - | 95.2 | 99.9 | 80.2 | 92.6 |
| - | √ | - | - | - | 93.5 | 98.3 | 78.5 | 91.3 |
| √ | √ | √ | - | - | 94.3 | 98.6 | 82.3 | 92.7 |
| √ | √ | - | √ | - | 95.2 | 99.9 | 83.2 | 96.5 |
| √ | √ | √ | - | √ | 95.3 | 99.9 | 83.6 | 96.8 |
| √ | √ | - | √ | √ | 95.8 | 99.9 | 84.2 | 98.6 |
Single-modal (RGB) comparison with state-of-the-art methods (Section 4.3.2); accuracy in %:

| Methods | EgoGesture Top-1 | EgoGesture Top-5 | Jester Top-1 | Jester Top-5 |
|---|---|---|---|---|
| VGG16 [36] | 63.1 | 95.4 | 67.2 | 96.3 |
| C3D [38] | 86.4 | 98.6 | 87.6 | 98.3 |
| CatNet [41] | 90.1 | 99.1 | 91.2 | 98.5 |
| TEA [42] | 92.1 | 98.2 | 96.7 | 99.8 |
| TSN [37] | 79.6 | 98.3 | 82.1 | 98.7 |
| TSM [43] | 92.2 | 98.6 | 94.5 | 99.8 |
| X3D [44] | 93.5 | 98.5 | 95.6 | 98.5 |
| Ours+X3D | 94.5 | 99.8 | 95.8 | 98.8 |
| ACTION-Net [45] | 94.4 | 98.8 | 97.1 | 99.9 |
| Ours+ACTION-Net | 95.1 | 99.8 | 98.2 | 99.9 |
| SlowOnly [35] | 94.6 | 99.8 | 97.2 | 99.9 |
| Ours+SlowOnly | 95.2 | 99.9 | 98.3 | 99.9 |
Multimodal (RGB-D) comparison with state-of-the-art methods (Section 4.3.3); accuracy in %:

| Methods | Modality | EgoGesture Top-1 | EgoGesture Top-5 | NVGesture Top-1 | NVGesture Top-5 |
|---|---|---|---|---|---|
| 3D ResNet-50 [39] | RGB-D | 86.2 | 97.5 | 79.6 | 92.6 |
| C3D [38] | RGB-D | 88.7 | 98.2 | 80.2 | 94.6 |
| I3D+FC [40] | RGB-D | 92.6 | 98.6 | 83.6 | 96.5 |
| SlowOnly+FC [35] | RGB-D | 95.1 | 99.9 | 83.8 | 98.3 |
| Ours+SlowOnly | RGB-D | 95.8 | 99.9 | 84.2 | 98.6 |