Lightweight Multi-Scale Framework for Human Pose and Action Classification
Abstract
1. Introduction
- We propose a novel attention-based deep learning architecture that integrates multi-scale hierarchical features with both spatial and channel attention mechanisms for effective yoga pose classification.
- Achieves state-of-the-art accuracy on Yoga-82 and Stanford 40 Actions with an extremely low parameter count (0.79 million), making it suitable for real-time applications.
- Proposes modified attention modules, including a spatial attention (SPA) block and a context-aware channel attention module (CCAM).
- Introduces learnable gating to balance cross-attended and raw fused features, improving fine-grained pose discrimination (a minimal sketch of this gating follows this list).
- Employs explainable AI techniques to increase the interpretability and trustworthiness of our model.
- Accurately distinguishes subtle intra-class variations, which is crucial for yoga pose classification.
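The gating mentioned above can be pictured as a learnable blend of two feature streams. The sketch below is a minimal illustration only, assuming a per-channel sigmoid gate; the class name, tensor shapes, and initialization are ours, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend cross-attended features with the raw fused features via a
    learnable gate (illustrative only; names and shapes are assumptions)."""
    def __init__(self, channels: int):
        super().__init__()
        # one learnable gate value per channel; sigmoid(0) = 0.5 starts balanced
        self.gate = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, cross_attended: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)          # in (0, 1), learned end-to-end
        return g * cross_attended + (1.0 - g) * fused

# usage: two feature maps of shape (batch, channels, height, width)
x_attn, x_raw = torch.randn(2, 128, 14, 14), torch.randn(2, 128, 14, 14)
out = GatedFusion(128)(x_attn, x_raw)         # same shape as the inputs
```

A scalar or spatial gate would work the same way; the key point is that the mixing weight is learned end to end rather than fixed.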
2. Related Work
2.1. Transformers
2.2. Transfer Learning
2.3. Attention Mechanism
3. Proposed Method
3.1. Overview
3.2. Backbone
Algorithm 1: Overall explanation of our framework.
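The algorithm figure itself is not reproduced here. The sketch below is a rough, self-contained stand-in for the data flow implied by Sections 3.2–3.6 (backbone, multi-scale fusion, spatial and channel attention, gated combination); every sub-module is a simplified placeholder, not the authors' design, and all layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameworkSketch(nn.Module):
    """Minimal stand-in for the pipeline implied by Sections 3.2-3.6.
    Every sub-module here is a placeholder, not the authors' design."""
    def __init__(self, num_classes: int = 82, dim: int = 128):
        super().__init__()
        # 3.2 Backbone: stand-in for a hierarchical (Swin-style) feature extractor
        self.backbone = nn.Sequential(nn.Conv2d(3, dim, 4, stride=4), nn.GELU())
        # 3.3 Multi-scale: fuse a second, coarser resolution of the backbone output
        self.down = nn.Conv2d(dim, dim, 3, stride=2, padding=1)
        # 3.4 / 3.5 Attention branches (spatial map and channel descriptor)
        self.spatial = nn.Conv2d(dim, 1, 7, padding=3)
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1))
        # 3.6 Learnable gate balancing the two attended streams (cf. DWCA)
        self.gate = nn.Parameter(torch.zeros(1))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.backbone(x)                                      # base features
        ms = f + F.interpolate(self.down(f), size=f.shape[-2:])   # multi-scale fusion
        s = ms * torch.sigmoid(self.spatial(ms))                  # spatially attended
        c = ms * torch.sigmoid(self.channel(ms))                  # channel attended
        g = torch.sigmoid(self.gate)
        z = g * s + (1.0 - g) * c                                  # gated combination
        return self.head(F.adaptive_avg_pool2d(z, 1).flatten(1))  # class logits

logits = FrameworkSketch()(torch.randn(1, 3, 224, 224))           # -> shape (1, 82)
```

The published model reports 0.79 million parameters with a Swin-based backbone; this toy is only meant to make the order of operations concrete.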
3.3. Multi-Scale Feature Extraction
3.4. Spatial Attention
3.5. Context-Aware Channel Attention
3.6. Dual Weighted Cross Attention
3.7. Loss Function
4. Experimental Results
4.1. Dataset
4.1.1. Yoga-82
4.1.2. Stanford 40 Actions
4.1.3. Data Augmentation
4.2. Experimental Settings
4.3. Training and Validation Analysis
4.4. Evaluation Metrics
4.5. Comparison with State-of-the-Art Methods
4.6. Ablation Study
4.7. Confusion Matrix
4.8. Gradient-Weighted Class Activation Mapping
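Grad-CAM weights a late feature map by the spatially pooled gradients of the class score and applies a ReLU to obtain the localization map. The from-scratch sketch below follows that standard formulation; the stock ResNet-18 stand-in, hook names, and chosen layer are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Standard Grad-CAM recipe on a stand-in classifier (not the paper's network).
model = models.resnet18(weights=None).eval()
feats, grads = {}, {}
layer = model.layer4  # last convolutional stage

layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                        # placeholder input image
score = model(x)[0].max()                              # top-class logit
score.backward()                                       # fills grads["a"]

weights = grads["a"].mean(dim=(2, 3), keepdim=True)    # alpha_k: pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))        # weighted channel sum + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```

Overlaying `cam` on the input image then highlights the body regions that drove the predicted pose class.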
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Yan, L.; Du, Y. Exploring trends and clusters in human posture recognition research: An analysis using citespace. Sensors 2025, 25, 632.
2. Kakizaki, M.; Miah, A.S.M.; Hirooka, K.; Shin, J. Dynamic Japanese sign language recognition throw hand pose estimation using effective feature extraction and classification approach. Sensors 2024, 24, 826.
3. Hussain, S.; Siddiqui, H.U.R.; Saleem, A.A.; Raza, M.A.; Alemany-Iturriaga, J.; Velarde-Sotres, Á.; Díez, I.D.l.T.; Dudley, S. Smart physiotherapy: Advancing arm-based exercise classification with posenet and ensemble models. Sensors 2024, 24, 6325.
4. Duda-Goławska, J.; Rogowski, A.; Laudańska, Z.; Żygierewicz, J.; Tomalski, P. Identifying infant body position from inertial sensors with machine learning: Which parameters matter? Sensors 2024, 24, 7809.
5. Cruciata, L.; Contino, S.; Ciccarelli, M.; Pirrone, R.; Mostarda, L.; Papetti, A.; Piangerelli, M. Lightweight vision transformer for frame-level ergonomic posture classification in industrial workflows. Sensors 2025, 25, 4750.
6. Aydın, V.A. Comparison of CNN-based methods for yoga pose classification. Turk. J. Eng. 2024, 8, 65–75.
7. Cao, Q.; Yu, Q. Application analysis of artificial intelligence virtual reality technology in fitness training teaching. Int. J. High Speed Electron. Syst. 2025, 34, 2440084.
8. Galada, A.; Baytar, F. Design and evaluation of a problem-based learning VR module for apparel fit correction training. PLoS ONE 2025, 20, e0311587.
9. Meghana, J.; Chethan, H.; KS, S.K.; SP, S.P. Comprehensive analysis of pose estimation and machine learning classifiers for precise yoga pose detection and classification. Procedia Comput. Sci. 2025, 258, 3345–3356.
10. Shih, C.L.; Liu, J.Y.; Anggraini, I.T.; Xiao, Y.; Funabiki, N.; Fan, C.P. A Yoga Pose Difficulty Level Estimation Method Using OpenPose for Self-Practice System to Yoga Beginners. Information 2024, 15, 789.
11. Saber, A.; Fateh, A.; Parhami, P.; Siahkarzadeh, A.; Fateh, M.; Ferdowsi, S. Efficient and accurate pneumonia detection using a novel multi-scale transformer approach. Sensors 2025, 25, 7233.
12. Kumar, D.; Sinha, A. Yoga Pose Detection and Classification Using Deep Learning; LAP LAMBERT Academic Publishing: London, UK, 2020.
13. Agrawal, Y.; Shah, Y.; Sharma, A. Implementation of machine learning technique for identification of yoga poses. In Proceedings of the 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT), Gwalior, India, 10–12 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 40–43.
14. Knap, P. Human modelling and pose estimation overview. arXiv 2024, arXiv:2406.19290.
15. Fateh, A.; Rezvani, M.; Tajary, A.; Fateh, M. Providing a voting-based method for combining deep neural network outputs to layout analysis of printed documents. J. Mach. Vis. Image Process. 2022, 9, 47–64.
16. Saber, A.; Fakhim, M.S.; Fateh, A.; Fateh, M. A lightweight multi-scale refinement network for gastrointestinal disease classification. Expert Syst. Appl. 2026, 308, 131029.
17. Askari, F.; Fateh, A.; Mohammadi, M.R. Enhancing few-shot image classification through learnable multi-scale embedding and attention mechanisms. Neural Netw. 2025, 187, 107339.
18. Fateh, A.; Mohammadi, M.R.; Motlagh, M.R.J. MSDNet: Multi-scale decoder for few-shot semantic segmentation via transformer-guided prototyping. Image Vis. Comput. 2025, 162, 105672.
19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
20. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
21. Fakhim, M.S.; Fateh, M.; Fateh, A.; Jalali, Y. DA-COVSGNet: Double Attentional Network for COVID Severity Grading. Int. J. Eng. 2025, 38, 1568–1582.
22. Fateh, A.; Rezvani, Y.; Moayedi, S.; Rezvani, S.; Fateh, F.; Fateh, M. BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification with Swin-HAFNet. arXiv 2025, arXiv:2506.14318.
23. Hassanin, M.; Khamis, A.; Bennamoun, M.; Boussaid, F.; Radwan, I. Crossformer3D: Cross spatio-temporal transformer for 3D human pose estimation. Signal Image Video Process. 2025, 19, 618.
24. Zheng, H.; Li, H.; Dai, W.; Zheng, Z.; Li, C.; Zou, J.; Xiong, H. Hipart: Hierarchical pose autoregressive transformer for occluded 3d human pose estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 16807–16817.
25. İşgüder, E.; Durmaz İncel, Ö. FedOpenHAR: Federated Multitask Transfer Learning for Sensor-Based Human Activity Recognition. J. Comput. Biol. 2025, 32, 558–572.
26. Thukral, M.; Haresamudram, H.; Ploetz, T. Cross-domain har: Few-shot transfer learning for human activity recognition. ACM Trans. Intell. Syst. Technol. 2025, 16, 22.
27. Akash, M.; Mohalder, R.D.; Khan, M.A.M.; Paul, L.; Ali, F.B. Yoga Pose classification using transfer learning. In Data-Driven Applications for Emerging Technologies; CRC Press: Boca Raton, FL, USA, 2025; p. 197.
28. Wei, X.; Wang, Z. TCN-attention-HAR: Human activity recognition based on attention mechanism time convolutional network. Sci. Rep. 2024, 14, 7414.
29. Zhang, L.; Yu, J.; Gao, Z.; Ni, Q. A multi-channel hybrid deep learning framework for multi-sensor fusion enabled human activity recognition. Alex. Eng. J. 2024, 91, 472–485.
30. Sun, W.; Ma, Y.; Wang, R. k-NN attention-based video vision transformer for action recognition. Neurocomputing 2024, 574, 127256.
31. Dastbaravardeh, E.; Askarpour, S.; Saberi Anari, M.; Rezaee, K. Channel attention-based approach with autoencoder network for human action recognition in low-resolution frames. Int. J. Intell. Syst. 2024, 2024, 1052344.
32. Verma, M.; Kumawat, S.; Nakashima, Y.; Raman, S. Yoga-82: A new dataset for fine-grained classification of human poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 1038–1039.
33. Yao, B.; Jiang, X.; Khosla, A.; Lin, A.L.; Guibas, L.; Fei-Fei, L. Human action recognition by learning bases of action attributes and parts. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1331–1338.
34. Borthakur, D.; Paul, A.; Kapil, D.; Saikia, M.J. Yoga pose estimation using angle-based feature extraction. Healthcare 2023, 11, 3133.
35. Ashrafi, S.S.; Shokouhi, S.B.; Ayatollahi, A. Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection. Multimed. Tools Appl. 2021, 80, 32567–32593.
36. Li, Y.; Li, K.; Wang, X. Recognizing actions in images by fusing multiple body structure cues. Pattern Recognit. 2020, 104, 107341.
37. Gkioxari, G.; Girshick, R.; Malik, J. Contextual action recognition with R*CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1080–1088.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
39. Zheng, Y.; Zheng, X.; Lu, X.; Wu, S. Spatial attention based visual semantic learning for action recognition in still images. Neurocomputing 2020, 413, 383–396.
40. Yan, S.; Smith, J.S.; Lu, W.; Zhang, B. Multibranch attention networks for action recognition in still images. IEEE Trans. Cogn. Dev. Syst. 2017, 10, 1116–1125.
41. Bas, C.; Ikizler-Cinbis, N. Top-down and bottom-up attentional multiple instance learning for still image action recognition. Signal Process. Image Commun. 2022, 104, 116664.
42. Hosseyni, S.R.; Seyedin, S.; Taheri, H. Human Action Recognition in Still Images Using ConViT. In Proceedings of the 2024 32nd International Conference on Electrical Engineering (ICEE), Tehran, Iran, 14–16 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7.
| Model | Number of Classes | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Params (Million) |
|---|---|---|---|---|---|---|
| Verma et al. [32] | 6 | 87.2 | - | - | - | 22.59 |
| Borthakur et al. [34] | 6 | 85.0 | - | - | - | - |
| Akash et al. [27] | 6 | 85 | 87 | 83 | 83 | - |
| Proposed method | 6 | 90.40 | 90.29 | 89.24 | 89.73 | 0.79 |
| Verma et al. [32] | 20 | 84.42 | - | - | - | 22.59 |
| Proposed method | 20 | 87.44 | 87.26 | 85.75 | 86.39 | 0.79 |
| Verma et al. [32] | 82 | 78.88 | - | - | - | 22.59 |
| Proposed method | 82 | 80.16 | 81.01 | 77.78 | 78.41 | 0.79 |
| Model | mAP (%) | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| R*CNN [37] | 90.90 | - | - | - | - |
| ResNet-50 [38] | 87.20 | - | - | - | - |
| SAAM-Nets [39] | 93.00 | - | - | - | - |
| Multi-Branch Attention [40] | 90.70 | - | - | - | - |
| Top-down + Bottom-up Attention [41] | 91.00 | - | - | - | - |
| Multi-Attention Guided Network [35] | 94.20 | - | - | - | - |
| Body Structure Cues [36] | 93.80 | - | - | - | - |
| Hosseyni et al. [42] | 93.10 | - | - | - | - |
| Proposed method | 94.28 | 90.27 | 89.65 | 89.47 | 89.30 |
| Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| ResNet-50 | 79.88 | 79.13 | 78.00 | 78.38 |
| VGG-16 | 65.70 | 64.41 | 61.93 | 62.21 |
| EfficientNet | 70.01 | 67.98 | 66.01 | 66.98 |
| Swin Transformer | 87.44 | 87.26 | 85.75 | 86.39 |
| Baseline | Multi-Scale | Spatial Attention | CCAM | DWCA | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Parameters (Million) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Swin Transformer | | | | | 64.50 | 62.65 | 61.53 | 61.74 | 0.14 | 15.17 | 20.07 |
| | ✓ | | | | 77.69 | 79.10 | 73.39 | 74.62 | 0.25 | 15.26 | 19.87 |
| | | ✓ | | | 70.88 | 71.26 | 65.75 | 66.72 | 0.14 | 15.17 | 19.69 |
| | | | ✓ | | 69.51 | 69.47 | 63.19 | 64.54 | 0.14 | 15.17 | 20.99 |
| | | | | ✓ | 70.76 | 69.69 | 66.24 | 66.34 | 0.60 | 16.61 | 25.7 |
| | ✓ | ✓ | | | 81.35 | 81.04 | 79.62 | 79.87 | 0.25 | 15.26 | 21.69 |
| | ✓ | | ✓ | | 82.19 | 81.71 | 81.14 | 81.02 | 0.25 | 15.26 | 21.61 |
| | ✓ | | | ✓ | 80.00 | 80.08 | 76.49 | 77.26 | 0.71 | 16.70 | 26.41 |
| | ✓ | ✓ | ✓ | | 83.48 | 82.81 | 82.21 | 82.10 | 0.29 | 15.47 | 21.16 |
| | ✓ | ✓ | | ✓ | 79.43 | 80.15 | 74.78 | 76.29 | 0.71 | 16.70 | 26.56 |
| | ✓ | | ✓ | ✓ | 81.11 | 81.57 | 77.45 | 78.85 | 0.71 | 16.70 | 26.34 |
| | | ✓ | ✓ | | 72.20 | 71.33 | 67.99 | 68.69 | 0.14 | 15.17 | 20.05 |
| | | ✓ | | ✓ | 69.54 | 67.69 | 65.14 | 65.49 | 0.60 | 16.61 | 25.15 |
| | | ✓ | ✓ | ✓ | 76.49 | 75.75 | 73.34 | 74.16 | 0.60 | 16.61 | 26.41 |
| | | | ✓ | ✓ | 70.25 | 69.35 | 63.93 | 64.45 | 0.60 | 16.61 | 25.88 |
| | ✓ | ✓ | ✓ | ✓ | 87.44 | 87.26 | 85.75 | 86.39 | 0.79 | 16.91 | 26.91 |
| Channels | Height | Width | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| 64 | 56 | 56 | 78.47 | 79.34 | 74.54 | 75.69 |
| 64 | 28 | 28 | 78.53 | 78.52 | 74.23 | 75.42 |
| 64 | 14 | 14 | 80.06 | 79.19 | 76.45 | 77.44 |
| 64 | 7 | 7 | 79.49 | 79.84 | 74.68 | 75.95 |
| 128 | 56 | 56 | 87.44 | 87.26 | 85.75 | 86.39 |
| 128 | 28 | 28 | 80.18 | 79.51 | 76.71 | 76.89 |
| 128 | 14 | 14 | 78.29 | 77.65 | 75.85 | 76.22 |
| 128 | 7 | 7 | 82.52 | 82.47 | 79.56 | 80.41 |
| 256 | 56 | 56 | 78.59 | 78.63 | 76.12 | 76.85 |
| 256 | 28 | 28 | 77.21 | 79.01 | 72.68 | 74.31 |
| 256 | 14 | 14 | 79.43 | 77.73 | 77.08 | 76.92 |
| 256 | 7 | 7 | 77.78 | 78.24 | 73.12 | 74.43 |
| Model | Conv 7 × 7 | Conv 3 × 3 | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| Spatial attention (from original CBAM) | ✓ | | 80.06 | 79.64 | 77.32 | 78.01 |
| Spatial attention (from original CBAM) | | ✓ | 80.72 | 80.14 | 79.56 | 79.44 |
| Our Spatial attention | ✓ | ✓ | 87.44 | 87.26 | 85.75 | 86.39 |
| Model | Residual Connection | Conv 1 × 1 | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| Non-Context-Aware Channel Attention | ✓ | | 82.07 | 82.68 | 78.87 | 80.24 |
| Context-Aware Channel Attention | ✓ | ✓ | 87.44 | 87.26 | 85.75 | 86.39 |
| Model | Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Proposed Method | Concat | 83.48 | 82.79 | 82.74 | 82.57 |
| Proposed Method | Addition | 82.52 | 82.32 | 81.01 | 81.07 |
| Proposed Method | Cross-Attention | 85.10 | 85.64 | 82.88 | 84.10 |
| Proposed Method | DWCA | 87.44 | 87.26 | 85.75 | 86.39 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.