CDO-POSE: A Lightweight Model for 2D Human Pose Estimation
Abstract
1. Introduction
- We introduce an improved C3k2CAA module, an attention-augmented structure that replaces the original module in the backbone network, reducing the parameter count by 32.3% while maintaining accuracy.
- Employing DySample for upsampling in place of conventional interpolation improves accuracy without increasing model complexity.
- Incorporating the OKS-Loss function further improves estimation accuracy without adding parameters.
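The OKS-Loss mentioned above builds on the COCO Object Keypoint Similarity, which scores a predicted pose against the ground truth with a per-keypoint Gaussian falloff. As a rough illustration only (not the paper's exact loss formulation), a minimal OKS computation might look like the sketch below; the per-keypoint sigmas are the standard COCO values, and a loss would typically be formed as 1 − OKS.

```python
import math

# Standard COCO per-keypoint sigmas (nose, eyes, ears, shoulders,
# elbows, wrists, hips, knees, ankles).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079,
               0.072, 0.072, 0.062, 0.062, 0.107, 0.107, 0.087,
               0.087, 0.089, 0.089]

def oks(pred, gt, visible, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between a predicted and a
    ground-truth pose. `pred`/`gt` are lists of (x, y) keypoints,
    `visible` flags which ground-truth keypoints are labeled, and
    `area` is the object segment area (the scale term s^2)."""
    num, count = 0.0, 0
    for (px, py), (gx, gy), v, sig in zip(pred, gt, visible, sigmas):
        if not v:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sig  # COCO uses k_i = 2 * sigma_i
        # Gaussian similarity: exp(-d^2 / (2 * s^2 * k_i^2))
        num += math.exp(-d2 / (2.0 * area * k * k + 1e-9))
        count += 1
    return num / count if count else 0.0
```

A perfect prediction yields OKS = 1, and the similarity decays smoothly as predicted joints drift from the ground truth, which is what makes 1 − OKS usable as a differentiable training objective in OKS-based losses such as YOLO-pose's [47].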
2. Related Work
3. Methodologies
3.1. Preliminary
3.2. C3k2 with Context Anchor Attention
3.3. DySample
3.4. OKS-Loss
4. Experiments
4.1. Datasets
4.2. Baselines
- Stacked Hourglass network employs repeated encoder-decoder hourglass modules to capture multi-scale context and iteratively refine heatmaps for precise keypoint localization [49].
- PETR is a transformer-based end-to-end pose estimator that casts multi-person pose as a DETR-style set prediction problem and uses learned queries to directly output human keypoints without separate detection or grouping [50].
- ED-Pose unifies person detection and pose estimation by treating keypoints as explicit detection targets and decoding them with queries so that a single network handles multi-person pose in a fully end-to-end manner [51].
- SimpleBaseline follows a top-down design with a ResNet backbone and a small stack of transposed convolutions to produce heatmaps, showing that a minimal architecture is highly competitive [34].
- Mask R-CNN extends a two-stage detector with a dedicated keypoint branch that predicts human joints together with bounding boxes and segmentation masks, enabling joint detection and pose within one framework [4].
- YOLO5s6_pose_ti_lite modifies YOLOv5 to directly regress body joints in a single forward pass and couples person detection with keypoint regression for low-latency inference [47].
- YOLOv8s-pose adds a dedicated keypoint head on top of a lightweight YOLOv8 backbone and uses decoupled detection heads to deliver strong accuracy at real-time speeds [38].
- YOLOv11s-pose further refines the YOLO design with an improved backbone and head to raise pose accuracy while preserving throughput [52].
- YOLOv5s_pose adapts YOLOv5 to output keypoint coordinates for each detected person and provides a simple and fast baseline [47].
- YOLOv7-tiny-pose is an extremely compact variant that emphasizes high-speed inference with a lightweight pose head and accepts some accuracy trade-offs [53].
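Several of the heatmap-based baselines above (Stacked Hourglass, SimpleBaseline, Mask R-CNN's keypoint branch) localize each joint as the peak of a predicted per-joint heatmap at reduced resolution. A minimal sketch of that decoding step is shown below; the function name and the `stride` parameter are illustrative, not any baseline's actual API.

```python
def decode_heatmap(heatmap, stride=4):
    """Return (x, y, score) for the strongest response in a
    single-joint heatmap, given as a list of rows. `stride` maps
    heatmap cells back to input-image pixel coordinates (heatmaps
    are typically predicted at 1/4 input resolution)."""
    best, bx, by = float("-inf"), 0, 0
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best:
                best, bx, by = val, x, y
    # Scale the winning cell back to image resolution.
    return (bx * stride, by * stride, best)
```

Regression-based detectors such as the YOLO-pose family skip this step entirely and output keypoint coordinates directly from the detection head, which is part of why they suit low-latency settings.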
4.3. Training Details
4.4. Experimental Results
4.5. Ablation Study
4.6. Edge Device Deployment
4.7. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based Human Pose Estimation: A Survey. ACM Comput. Surv. 2023, 56, 11. [Google Scholar] [CrossRef]
- Zhang, J.; Tu, Z.; Yang, J.; Chen, Y.; Yuan, J. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13232–13242. [Google Scholar]
- Tu, Z.; Liu, Y.; Zhang, Y.; Mu, Q.; Yuan, J. DTCM: Joint optimization of dark enhancement and action recognition in videos. IEEE Trans. Image Process. 2023, 32, 3507–3520. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Wei, J.; Yu, B.; Zou, T.; Zheng, Y.; Qiu, X.; Hu, M.; Yu, H.; Xiao, D.; Yu, Y.; Liu, J. A review of transformer-based human pose estimation: Delving into the relation modeling. Neurocomputing 2025, 639, 130210. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V; Springer: Cham, Switzerland, 2014. [Google Scholar]
- Bulling, A.; Blanke, U.; Schiele, B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. 2014, 46, 1–33. [Google Scholar] [CrossRef]
- Satyanarayanan, M. The emergence of edge computing. Computer 2017, 50, 30–39. [Google Scholar] [CrossRef]
- To, H.T.; Le, T.K.; Le, C.L. Real-Time End-to-End 3D Human Pose Prediction on AI Edge Devices. In Proceedings of the International Conference on Intelligent Systems & Networks, Hanoi, Vietnam, 19–20 March 2021. [Google Scholar]
- Liu, L.; Blancaflor, E.B.; Abisado, M. A lightweight multi-person pose estimation scheme based on Jetson Nano. Appl. Comput. Sci. 2023, 19, 1–14. [Google Scholar] [CrossRef]
- Zhang, Y.; Gan, J.; Zhao, Z.; Chen, J.; Chen, X.; Diao, Y.; Tu, S. A real-time fall detection model based on BlazePose and improved ST-GCN. J. Real-Time Image Process. 2023, 20, 121. [Google Scholar] [CrossRef]
- Xu, D.; Li, T.; Li, Y.; Su, X.; Tarkoma, S.; Hui, P. A Survey on Edge Intelligence. arXiv 2020, arXiv:2003.12172. [Google Scholar]
- Osokin, D. Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose. arXiv 2018, arXiv:1811.12004. [Google Scholar]
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef]
- Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
- Jo, B.; Kim, S. Comparative analysis of OpenPose, PoseNet, and MoveNet models for pose estimation in mobile devices. Trait. Du Signal 2022, 39, 119–124. [Google Scholar] [CrossRef]
- Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Yang, S.; Quan, Z.; Nie, M.; Yang, W. Transpose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. Hrformer: High-resolution vision transformer for dense prediction. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2021. [Google Scholar]
- Kreiss, S.; Bertoni, L.; Alahi, A. Pifpaf: Composite fields for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
- Papandreou, G.; Zhu, T.; Chen, L.C.; Gidaris, S.; Tompson, J.; Murphy, K. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Computer Vision—ECCV 2018, Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
- Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-up human pose estimation via disentangled keypoint regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 19–25 June 2021. [Google Scholar]
- Zhou, S.; Duan, X.; Zhou, J. Human pose estimation based on frequency domain and attention module. Neurocomputing 2024, 604, 128318. [Google Scholar] [CrossRef]
- Li, T.; Geng, P.; Lu, X.; Li, W.; Lyu, L. Skeleton-based action recognition through attention guided heterogeneous graph neural network. Knowl.-Based Syst. 2025, 309, 112868. [Google Scholar] [CrossRef]
- Chang, H.; Ren, P.; Zhang, H.; Xie, L.; Chen, H.; Yin, E. Hierarchical-aware orthogonal disentanglement framework for fine-grained skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 11252–11261. [Google Scholar]
- Kang, G.Y.; Lu, Z.Q.; Lu, Z.M. Lightweight Human Pose Estimation Network and Angle-based Action Recognition. J. Netw. Intell. 2020, 5, 240–249. [Google Scholar]
- Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 19–25 June 2021. [Google Scholar]
- Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Computer Vision—ECCV 2018, Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
- Xu, Y.; Zhao, L.; Gong, C.; Li, G.; Wang, D.; Wang, N. Dynpose: Largely improving the efficiency of human pose estimation by a simple dynamic framework. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 1160–1169. [Google Scholar]
- Jamil, S. PoseSynViT: Lightweight and Scalable Vision Transformers for Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 3912–3921. [Google Scholar]
- Guo, B.; Zhou, C.; Guo, F.; Luo, X.; Luo, G.; Zhang, F. LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation. arXiv 2025, arXiv:2506.04561. [Google Scholar]
- Ultralytics. YOLO by Ultralytics (Version 8.0.0). 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2023).
- Alif, M.A.R. YOLOv11 for Vehicle Detection: Advancements, Performance, and Applications in Intelligent Transportation Systems. arXiv 2024, arXiv:2410.22898. [Google Scholar] [CrossRef]
- Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
- Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
- Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance Cheap Operation with Long-Range Attention. In NIPS ’22: Proceedings of the 36th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
- Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. Yolo-pose: Enhancing yolo for multi person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Li, J.; Wang, C.; Zhu, H.; Mao, Y.; Fang, H.S.; Lu, C. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VIII; Springer: Cham, Switzerland, 2016. [Google Scholar]
- Shi, D.; Wei, X.; Li, L.; Ren, Y.; Tan, W. End-to-end multi-person pose estimation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Yang, J.; Zeng, A.; Liu, S.; Li, F.; Zhang, R.; Zhang, L. Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation. arXiv 2023, arXiv:2302.01593. [Google Scholar]
- Ultralytics. YOLO by Ultralytics (Version 11.0.0). 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 30 September 2024).
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2017, arXiv:1608.03983. [Google Scholar] [CrossRef]
- Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar]
| Method | Params (M) | mAP50 | mAP75 | mAP50-95 | mAPM | mAPL |
|---|---|---|---|---|---|---|
| Large model | | | | | | |
| Hourglass [49] | 277.8 | 81.8 | 61.8 | 56.6 | - | 67.0 |
| PETR [50] | 220.5 | 90.4 | 79.6 | - | 68.1 | 79.0 |
| ED-Pose [51] | 218 | 92.3 | 80.9 | 72.7 | 67.6 | 80.8 |
| SimpleBaseline [34] | 68.6 | 91.2 | 80.1 | 71.6 | 68.7 | 77.2 |
| Mask R-CNN [4] | 63.0 | 87.3 | 68.7 | 63.1 | 57.8 | 71.4 |
| CDO-POSE | 6.7 | 85.3 | 62.1 | 57.0 | 51.0 | 67.1 |
| Small model | | | | | | |
| YOLO5s6_pose_ti_lite [47] | 15.1 | 85.0 | 65.4 | 59.2 | 51.5 | 70.1 |
| YOLOv8s-pose [38] | 11.6 | 86.4 | 64.6 | 58.7 | 53.2 | 68.1 |
| YOLOv11s-pose [52] | 9.9 | 86.3 | 63.2 | 57.3 | 52.9 | 65.7 |
| YOLOv5s_pose [47] | 7.2 | 83.2 | 58.2 | 53.9 | 50.3 | 59.6 |
| YOLOv7-tiny-pose [53] | 6.1 | 80.3 | 47.3 | 45.8 | 45.9 | 47.1 |
| CDO-POSE | 6.7 | 85.3 | 62.1 | 57.0 | 51.0 | 67.1 |
| Method | Params (M) | mAP50 | mAP75 | mAP50-95 | mAPM | mAPL |
|---|---|---|---|---|---|---|
| Large model | | | | | | |
| PETR [50] | 220.5 | 90.4 | 78.3 | 77.3 | 72.0 | 65.8 |
| ED-Pose [51] | 218 | 90.5 | 79.8 | 80.5 | 73.8 | 63.8 |
| SimpleBaseline [34] | 68.6 | 81.4 | 60.3 | 71.4 | 61.2 | 51.2 |
| Mask R-CNN [4] | 63.0 | 83.5 | 60.3 | 69.4 | 57.9 | 45.8 |
| CDO-POSE | 6.7 | 84.6 | 63.4 | 66.2 | 56.9 | 50.4 |
| Small model | | | | | | |
| YOLO5s6_pose_ti_lite [47] | 15.1 | 81.3 | 54.1 | 50.7 | 50.7 | 43.1 |
| YOLOv8s-pose [38] | 11.6 | 85.8 | 66.0 | 68.3 | 61.7 | 51.6 |
| YOLOv11s-pose [52] | 9.9 | 85.9 | 65.0 | 67.7 | 61.1 | 50.8 |
| YOLOv5s_pose [47] | 7.2 | 78.2 | 49.4 | 46.7 | 46.7 | 39.7 |
| YOLOv7-tiny-pose [53] | 6.1 | 78.7 | 51.1 | 47.8 | 47.8 | 40.7 |
| CDO-POSE | 6.7 | 84.6 | 63.4 | 66.2 | 56.9 | 50.4 |
| C3k2CAA | DySample | OKS-Loss | Params (M) | mAP50 | mAP75 | mAP50-95 | mAPM | mAPL |
|---|---|---|---|---|---|---|---|---|
| × | × | × | 9.9 | 86.3 | 63.2 | 57.3 | 52.9 | 65.7 |
| ✓ | × | × | 6.7 | 85.2 | 60.1 | 54.9 | 50.3 | 63.5 |
| × | ✓ | × | 9.9 | 86.6 | 63.8 | 57.7 | 53.7 | 65.8 |
| × | × | ✓ | 9.9 | 86.6 | 66.0 | 59.8 | 54.6 | 69.2 |
| ✓ | ✓ | × | 6.7 | 85.4 | 60.7 | 55.5 | 50.9 | 64.2 |
| × | ✓ | ✓ | 9.9 | 86.9 | 65.9 | 59.9 | 54.6 | 69.3 |
| ✓ | ✓ | ✓ | 6.7 | 85.3 | 62.1 | 57.0 | 51.0 | 67.1 |
| Model | Params (M) | FPS (480p) | FPS (720p) |
|---|---|---|---|
| YOLOv8-pose | 69.4 | 8.94 | 6.40 |
| YOLOv11-pose | 58.8 | 10.37 | 6.54 |
| CDO-POSE | 6.7 | 39.79 | 29.23 |
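End-to-end FPS figures like those above are typically obtained by timing repeated forward passes after a short warm-up, so that lazy initialization and GPU clock ramp-up do not distort the measurement. The generic timing sketch below illustrates the methodology only; `run_inference` is a stand-in callable, not CDO-POSE's actual interface.

```python
import time

def measure_fps(run_inference, frames, warmup=10):
    """Average frames per second of `run_inference` over `frames`
    timed calls, after `warmup` untimed calls to let the pipeline
    reach steady state."""
    for _ in range(warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(frames):
        run_inference()
    elapsed = time.perf_counter() - start
    return frames / elapsed
```

On accelerators, asynchronous execution means a device synchronization is also needed before reading the clock; otherwise the measured time reflects kernel launches rather than completed frames.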
| Dataset | Devices | Model | Params (M) | GFLOPS | mAP50-95 | mAP50 | mAP75 |
|---|---|---|---|---|---|---|---|
| COCO2017 Val | NVIDIA 4090D GPU | CDO-POSE | 6.7 | 17.8 | 58.4 | 83.8 | 63.2 |
| | Jetson Orin Nano | CDO-POSE | 6.7 | 17.8 | 58.3 | 83.5 | 63.5 |
| COCO2017 test-dev | NVIDIA 4090D GPU | CDO-POSE | 6.7 | 17.8 | 57.0 | 85.3 | 62.1 |
| | Jetson Orin Nano | CDO-POSE | 6.7 | 17.8 | 56.9 | 84.9 | 62.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Xu, H.; Chen, J.; Cai, S.; Guo, J. CDO-POSE: A Lightweight Model for 2D Human Pose Estimation. Sensors 2026, 26, 2159. https://doi.org/10.3390/s26072159
