A Dynamic Position Embedding-Based Model for Student Classroom Complete Meta-Action Recognition
Abstract
1. Introduction
- A novel network architecture, student action recognition (SAR), is introduced that combines the Video Swin Transformer with ViT3D. Deep features extracted by the Video Swin Transformer network are fed into ViT3D for full attention computation, which enhances the model’s ability to capture action details and improves its understanding of the overall structure of an action.
- DPE-SAR is introduced, which uses a Dynamic Position Embedding (DPE) method for position encoding in SAR. By applying zero-padding in the convolution of a deep convolutional approach, each element in the data can infer its absolute position by gradually exploring its neighborhood information. This strengthens the model’s understanding of the spatial structure of action videos and improves its ability to capture local spatial–temporal action features.
- A student classroom meta-action dataset, GUET10, is constructed for smart classroom scenarios. The effectiveness and reliability of the SAR and DPE-SAR models are validated on diverse video data across various action recognition scenarios, providing a new technical approach for smart education and behavior analysis domains.
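
To make the first contribution more concrete, the following minimal sketch counts query-key pairs for window-restricted attention versus full attention over a 3D token grid. The grid and window sizes are illustrative assumptions and are not taken from this paper; the function names are hypothetical.

```python
from math import prod


def full_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    n_tokens = prod(grid)
    return n_tokens * n_tokens


def window_attention_pairs(grid, window):
    """Query-key pairs when attention stays inside non-overlapping windows."""
    if any(g % w for g, w in zip(grid, window)):
        raise ValueError("grid must be divisible by window in every dimension")
    n_windows = prod(g // w for g, w in zip(grid, window))
    tokens_per_window = prod(window)
    return n_windows * tokens_per_window ** 2


# Assumed token grids (frames, height, width) at a shallow and a deep stage.
shallow_grid = (16, 56, 56)
deep_grid = (16, 7, 7)
window = (8, 7, 7)  # assumed 3D attention window

shallow_ratio = full_attention_pairs(shallow_grid) // window_attention_pairs(shallow_grid, window)
deep_ratio = full_attention_pairs(deep_grid) // window_attention_pairs(deep_grid, window)
print(shallow_ratio, deep_ratio)
```

Under these assumed sizes, restricting attention to windows saves a factor of 128 at the shallow stage but only a factor of 2 at the deep stage, which is one way to see why computing full attention only on deep features is comparatively affordable.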
2. Related Work
3. Materials and Methods
3.1. Dataset
3.2. Dynamic Position Embedding-Based Model for Student Classroom Complete Meta-Action Recognition (DPE-SAR)
4. Experimental Section
4.1. Experimental Details
4.2. Experimental Evaluation of the SAR Action Recognition Model
4.3. Experimental Evaluation of DPE-SAR Action Recognition Model
5. Discussion
- Experiments were conducted on the outdoor sports datasets UCF101 and HMDB51, as well as the educational scenario dataset GUET10, to evaluate the proposed models. Both the SAR and DPE-SAR models achieved higher Top1 Acc and mean class acc than the baseline Video Swin Transformer on all three datasets; on GUET10, for example, Top1 Acc rose from 0.86150 (Swin) to 0.91136 (SAR) and 0.93075 (DPE-SAR). These results support the advantage of the proposed models in action recognition tasks.
- Comparing the SAR and Video Swin Transformer models shows that SAR improves both key performance metrics, Top1 Acc and mean class acc. This indicates that SAR performs better in action recognition tasks because it computes full attention at a deep network stage (Stage 4), which lets it focus more on the global characteristics of actions. The advantage of full attention is that it considers the contextual information of every element in the sequence rather than being restricted to a local window, which enhances the model’s understanding and recognition of the overall action structure.
- Comparing the DPE-SAR and SAR models shows further improvements in Top1 Acc and mean class acc. This result supports the effectiveness of Dynamic Position Embedding (DPE) in enhancing the model’s understanding of the spatial structure of action videos. With DPE, the DPE-SAR model captures the spatial–temporal features of actions more precisely and better follows their spatial–temporal evolution, yielding more accurate predictions in action recognition tasks.
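
The role of zero-padding in a convolution-based position embedding can be illustrated with a deliberately simplified sketch. It is one-dimensional, uses fixed kernel weights, and adds the convolution output back to the input as a residual; all of these are assumptions made for illustration (a real model would learn the weights and convolve over the spatial–temporal volume), and the function names are hypothetical rather than the authors' implementation.

```python
def zero_padded_conv1d(seq, kernel):
    """Same-length 1D convolution (cross-correlation) with zero padding."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(seq))
    ]


def add_position_embedding(seq, kernel):
    """Residual form: each token plus a convolution over its neighborhood."""
    return [x + p for x, p in zip(seq, zero_padded_conv1d(seq, kernel))]


tokens = [1.0] * 6        # identical tokens: the content itself carries no position
kernel = [1.0, 1.0, 1.0]  # assumed fixed weights; learned in a real model

one_layer = add_position_embedding(tokens, kernel)
two_layers = add_position_embedding(one_layer, kernel)

print(one_layer)   # [3.0, 4.0, 4.0, 4.0, 4.0, 3.0]
print(two_layers)  # [10.0, 15.0, 16.0, 16.0, 15.0, 10.0]
```

After one layer only the border tokens differ from the rest, because their neighborhoods overlap the zero padding. After a second layer the difference has spread one step inward, which mirrors the idea of each element locating itself by gradually exploring its neighborhood.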
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Dataset | Number of Classes | Number of Video Clips per Class | Average Duration per Clip |
|---|---|---|---|
| UCF101 | 101 | 100 to 200 | 3 to 10 s |
| HMDB51 | 51 | 100 to 110 | 2 to 20 s |
| GUET10 | 10 | 100 to 200 | 3 to 18 s |

| Experimental Environment | Environment Configuration |
|---|---|
| Operating system | Windows 11 |
| CPU | Ryzen 9 7950X |
| GPU | GeForce RTX 4090 |
| RAM | 64 GB |
| Storage | 2 TB SSD |
| Programming language | Python 3.9 |
| Framework | PyTorch |

| Dataset | Metric | Swin | SAR | DPE-SAR |
|---|---|---|---|---|
| UCF101 | Top1 Acc | 0.94140 | 0.95229 | 0.95680 |
| UCF101 | mean class acc | 0.93876 | 0.95009 | 0.95475 |
| HMDB51 | Top1 Acc | 0.54269 | 0.58203 | 0.58280 |
| HMDB51 | mean class acc | 0.55156 | 0.60071 | 0.60430 |
| GUET10 | Top1 Acc | 0.86150 | 0.91136 | 0.93075 |
| GUET10 | mean class acc | 0.84993 | 0.90919 | 0.92065 |

| Fps | Swin | DPE-SAR |
|---|---|---|
| 8 | 0.1436 | 0.1406 |
| 16 | 0.1688 | 0.1430 |
| 32 | 0.2546 | 0.2273 |
| 64 | 0.3326 | 0.3368 |
| 128 | 0.5667 | 0.5916 |
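
The two metrics reported in the results table can be computed as in the short sketch below. It assumes the common definitions: Top1 Acc is the fraction of clips whose predicted label matches the ground truth, and mean class acc is the unweighted average of per-class accuracies. The paper's exact evaluation code is not shown here, so these definitions, the function names, and the toy labels are assumptions.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracies (each class counts equally)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example with hypothetical class names.
y_true = ["write"] * 8 + ["raise_hand"] * 2
y_pred = ["write"] * 8 + ["raise_hand", "write"]

print(top1_accuracy(y_true, y_pred))        # 0.9
print(mean_class_accuracy(y_true, y_pred))  # 0.75
```

The toy example shows why the two numbers can diverge: a single error in a small class barely moves Top1 Acc but lowers mean class acc noticeably, so reporting both gives a fuller picture when class sizes differ.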
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Shou, Z.; Yuan, X.; Li, D.; Mo, J.; Zhang, H.; Zhang, J.; Wu, Z. A Dynamic Position Embedding-Based Model for Student Classroom Complete Meta-Action Recognition. Sensors 2024, 24, 5371. https://doi.org/10.3390/s24165371