CGFusionFormer: Exploring Compact Spatial Representation for Robust 3D Human Pose Estimation with Low Computation Complexity
Abstract
1. Introduction
- We propose a compact spatial representation (CSR) that reduces subsequent computational cost while improving robustness to unreliable 2D pose detections. To our knowledge, CSR is the first attempt to model 2D pose sequences locally while simultaneously reducing the cost of the subsequent 2D-to-3D lifting.
- We design an effective hybrid adaptive fusion (HAF) module that integrates features from both the CSR domain and the frequency domain, improving 3D pose accuracy at only a small computational overhead. HAF enriches the residual network structure and generalizes to the fusion of multiple feature streams.
- Our CGFusionFormer achieves competitive results on two challenging 3D HPE benchmarks, demonstrating a superior speed–accuracy trade-off compared with previous methods.
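The components above are detailed in Section 3. As a rough, hedged illustration of the frequency-domain side of the design (this is not the paper's actual FMC/HAF implementation; the function names are ours), keeping only the lowest DCT coefficients of a 2D pose sequence along time shrinks the number of tokens the 2D-to-3D lifting network must process:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; rows are basis vectors, so inverse = transpose.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    M[0] *= np.sqrt(0.5)
    return M

def compress_sequence(seq2d, keep):
    # seq2d: (T, J, 2) 2D pose sequence. Project onto the DCT basis along time
    # and keep only the `keep` lowest-frequency coefficients, reducing the
    # temporal tokens fed to the lifting network from T to `keep`.
    T = seq2d.shape[0]
    D = dct_matrix(T)
    coeffs = np.einsum('kt,tjc->kjc', D, seq2d)
    return coeffs[:keep]                      # (keep, J, 2) low-frequency summary

def reconstruct(coeffs, T):
    # Approximate inverse: zero-padded high frequencies, then transpose basis.
    D = dct_matrix(T)
    return np.einsum('tk,kjc->tjc', D[:coeffs.shape[0]].T, coeffs)
```

Truncating high frequencies also acts as a low-pass filter on jittery 2D detections, which is one intuition behind combining a frequency-domain branch with the spatial CSR branch; the actual CSR/FMC modules additionally operate on body-part groupings and multiple hypotheses.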
2. Related Work
2.1. Transformer-Based Methods for 3D Human Pose Estimation
2.2. Strategies to Enhance Robustness in Computer Vision
2.2.1. Multi-Hypothesis Methods
2.2.2. Noise-Based Processing Methods
3. Method
3.1. Overview
3.2. Compact Spatial Representation (CSR)
3.2.1. Body-Part Spatial Transformer Encoder (BPS)
3.2.2. Filter-Based Multi-Hypothesis Compression (FMC)
3.3. Hybrid Adaptive Fusion (HAF)
3.4. Regression Head
4. Experiment
4.1. Datasets and Evaluation Metrics
- Human3.6M [20] is a widely used benchmark for 3D human pose estimation. It was recorded in a controlled indoor studio with synchronized and calibrated multi-view RGB cameras (typically four views). The dataset contains over 3.6 million video frames of 11 subjects performing 15 everyday actions (e.g., walking, phoning) and provides accurate 3D joint annotations captured by a motion-capture system. Following the standard subject split, we use S1, S5, S6, S7, and S8 for training and S9 and S11 for testing. We adopt the Human3.6M-17 joint convention with pelvis as the root and report errors in millimeters in the camera coordinate system. The results are reported with two common metrics: MPJPE (Mean Per-Joint Position Error; Protocol #1) and P-MPJPE (Protocol #2), which computes MPJPE after a rigid Procrustes alignment [35].
- MPI-INF-3DHP [21] is likewise an RGB-camera-captured benchmark. Compared to Human3.6M, it covers more diverse motions and scenes, including indoor green-screen setups with varied lighting and a portion of outdoor environments, resulting in larger viewpoint changes, more occlusions, and more varied backgrounds. The 3D labels are obtained via multi-view reconstruction and motion-capture pipelines with manual verification, making the dataset well suited to evaluating generalization. Following [4], we report MPJPE, PCK (Percentage of Correct Keypoints), and AUC (Area Under the Curve).
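For concreteness, the evaluation metrics named above can be sketched in NumPy. This is a minimal sketch assuming a single pose of shape (J, 3) in millimeters and the 150 mm PCK threshold commonly used for MPI-INF-3DHP; the official evaluation scripts may differ in threshold grids and averaging order:

```python
import numpy as np

def mpjpe(pred, gt):
    # Protocol #1: mean Euclidean distance per joint (same units as input, e.g. mm).
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def p_mpjpe(pred, gt):
    # Protocol #2: MPJPE after a rigid (similarity) Procrustes alignment of pred to gt.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    X, Y = pred - mu_p, gt - mu_g                 # center both point sets
    U, s, Vt = np.linalg.svd(X.T @ Y)             # 3x3 cross-covariance
    R = U @ Vt
    if np.linalg.det(R) < 0:                      # avoid an improper reflection
        Vt[-1] *= -1
        s[-1] *= -1
        R = U @ Vt
    scale = s.sum() / (X ** 2).sum()
    aligned = scale * X @ R + mu_g
    return mpjpe(aligned, gt)

def pck(pred, gt, thresh=150.0):
    # Percentage of joints whose error falls below `thresh` (150 mm for 3DHP).
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1) <= thresh) * 100.0)

def auc(pred, gt, thresholds=np.linspace(0.0, 150.0, 31)):
    # Area under the PCK curve, averaged over a 0-150 mm threshold grid.
    return float(np.mean([pck(pred, gt, t) for t in thresholds]))
```

By construction, `p_mpjpe` is invariant to global rotation, translation, and scale of the prediction, which is why Protocol #2 scores are consistently lower than Protocol #1.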
4.2. Implementation Details and Analysis
4.2.1. Hyperparameter Settings
4.2.2. Experimental Settings
4.3. Comparison with State-of-the-Art Methods
4.3.1. Results on Human3.6M
4.3.2. Results on MPI-INF-3DHP
4.3.3. Robustness and Qualitative Comparisons
4.3.4. Ablation Study
4.3.5. Computational Complexity Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Rommel, C.; Letzelter, V.; Samet, N.; Marlet, R.; Cord, M.; Pérez, P.; Valle, E. ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 14 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 107350–107378. [Google Scholar]
- Xu, J.; Guo, Y.; Peng, Y. FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 561–570. [Google Scholar] [CrossRef]
- He, R.; Xiang, S.; Tao, P.; Yu, Y. Monocular 3D Human Pose Estimation Based on Global Temporal-Attentive and Joints-Attention In Video. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5. [Google Scholar]
- Hu, W.; Zhang, C.; Zhan, F.; Zhang, L.; Wong, T.T. Conditional Directed Graph Convolution for 3D Human Pose Estimation. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; MM ’21. pp. 602–611. [Google Scholar] [CrossRef]
- Liu, K.; Zou, Z.; Tang, W. Learning Global Pose Features in Graph Convolutional Networks for 3D Human Pose Estimation. In Proceedings of the Computer Vision—ACCV 2020, Kyoto, Japan, 30 November–4 December 2020; Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J., Eds.; Springer: Cham, Switzerland, 2021; pp. 89–105. [Google Scholar] [CrossRef]
- Shin, S.; Kim, J.; Halilaj, E.; Black, M.J. WHAM: Reconstructing World-Grounded Humans with Accurate 3D Motion. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2070–2080. [Google Scholar] [CrossRef]
- Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7307–7316. [Google Scholar] [CrossRef]
- Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1263–1272. [Google Scholar] [CrossRef]
- Li, S.; Chan, A.B. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Proceedings of the Computer Vision—ACCV 2014, Singapore, 1–5 November 2014; Cremers, D., Reid, I., Saito, H., Yang, M.H., Eds.; ACCV: Cham, Switzerland, 2015; pp. 332–347. [Google Scholar] [CrossRef]
- Lie, W.N.; Vann, V. Estimating a 3D Human Skeleton from a Single RGB Image by Fusing Predicted Depths from Multiple Virtual Viewpoints. Sensors 2024, 24, 8017. [Google Scholar] [CrossRef] [PubMed]
- Zheng, H.; Li, H.; Dai, W.; Zheng, Z.; Li, C.; Zou, J.; Xiong, H. HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 16807–16817. [Google Scholar] [CrossRef]
- Shan, W.; Liu, Z.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. P-STMO: Pre-trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 461–478. [Google Scholar] [CrossRef]
- Chen, H.; He, J.Y.; Xiang, W.; Cheng, Z.Q.; Liu, W.; Liu, H.; Luo, B.; Geng, Y.; Xie, X. HDFormer: High-order Directed Transformer for 3D Human Pose Estimation. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, Macao, China, 19–25 August 2023; pp. 581–589. [Google Scholar] [CrossRef]
- Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar] [CrossRef]
- Wei, F.; Xu, G.; Wu, Q.; Qin, P.; Pan, L.; Zhao, Y. Whole-Body 3D Pose Estimation Based on Body Mass Distribution and Center of Gravity Constraints. Sensors 2025, 25, 3944. [Google Scholar] [CrossRef] [PubMed]
- Wang, H.; Quan, W.; Zhao, R.; Zhang, M.; Jiang, N. Learning Temporal–Spatial Contextual Adaptation for Three-Dimensional Human Pose Estimation. Sensors 2024, 24, 4422. [Google Scholar] [CrossRef] [PubMed]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation with Spatial and Temporal Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11636–11645. [Google Scholar] [CrossRef]
- Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13137–13146. [Google Scholar] [CrossRef]
- Zhao, Q.; Zheng, C.; Liu, M.; Wang, P.; Chen, C. PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8877–8886. [Google Scholar] [CrossRef]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef]
- Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 506–516. [Google Scholar] [CrossRef]
- Hassanin, M.; Khamis, A.; Bennamoun, M.; Boussaïd, F.; Radwan, I. CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation. arXiv 2022, arXiv:2203.13387. [Google Scholar] [CrossRef]
- Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation. IEEE Trans. Multimed. 2023, 25, 1282–1293. [Google Scholar] [CrossRef]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15979–15988. [Google Scholar] [CrossRef]
- Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7745–7754. [Google Scholar] [CrossRef]
- Zhang, J.; Tu, Z.; Yang, J.; Chen, Y.; Yuan, J. MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13222–13232. [Google Scholar] [CrossRef]
- Qian, X.; Tang, Y.; Zhang, N.; Han, M.; Xiao, J.; Huang, M.; Lin, R. HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation. arXiv 2023, arXiv:2301.07322. [Google Scholar] [CrossRef]
- Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. MotionBERT: Unified Pretraining for Human Motion Analysis. arXiv 2022, arXiv:2210.06551. [Google Scholar] [CrossRef]
- Wehrbein, T.; Rudolph, M.; Rosenhahn, B.; Wandt, B. Probabilistic Monocular 3D Human Pose Estimation with Normalizing Flows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 11179–11188. [Google Scholar] [CrossRef]
- Rezende, D.J.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 1530–1538. [Google Scholar]
- Tu, Z.; Milanfar, P.; Talebi, H. MULLER: Multilayer Laplacian Resizer for Vision. arXiv 2023, arXiv:2304.02859. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
- Shan, W.; Liu, Z.; Zhang, X.; Wang, Z.; Han, K.; Wang, S.; Ma, S.; Gao, W. Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 14715–14725. [Google Scholar] [CrossRef]
- Simo-Serra, E.; Ramisa, A.; Alenyà, G.; Torras, C.; Moreno-Noguer, F. Single image 3D human pose estimation from noisy observations. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2673–2680. [Google Scholar] [CrossRef]
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based Human Pose Estimation: A Survey. ACM Comput. Surv. 2024, 56, 1–37. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar] [CrossRef]
- Lin, J.; Lee, G.H. Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation. In Proceedings of the 30th British Machine Vision Conference, Cardiff, UK, 9–12 September 2019; p. 101. [Google Scholar]
- Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-Aware 3D Human Pose Estimation With Bone-Based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 198–209. [Google Scholar] [CrossRef]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar]
| Method | f | Seq. Len. | MFLOPs | MPJPE↓/P-MPJPE↓ |
|---|---|---|---|---|
| PoseFormer [17] | 27 | 27 | 542.1 | 47.0/- |
| StridedTrans [23] | 81 | 81 | 342.5 | 47.5/- |
| MHFormer [18] | 9 | 9 | 342.9 | 47.8/- |
| MHFormer [18] | 27 | 27 | 1031.8 | 45.9/- |
| P-STMO [12] (*) | 81 | 81 | 493 | 45.6/- |
| STCFormer [14] | 27 | 27 | 2173 | 44.1/- |
| PoseFormerV2 [19] | 1 | 27 | 77.2 | 48.7/37.8 |
| CGFusionFormer (ours) | 1 | 27 | 46.7 | 48.5/37.8 |
| PoseFormerV2 [19] | 1 | 81 | 77.2 | 47.6/37.3 |
| CGFusionFormer (ours) | 1 | 81 | 46.7 | 47.3/37.4 |
| PoseFormerV2 [19] | 3 | 27 | 117.3 | 47.9/37.4 |
| CGFusionFormer (ours) | 3 | 27 | 71.3 | 47.6/37.3 |
| PoseFormerV2 [19] | 3 | 81 | 117.3 | 47.1/37.3 |
| CGFusionFormer (ours) | 3 | 81 | 71.3 | 47.1/37.4 |
| PoseFormerV2 [19] | 9 | 81 | 351.7 | 46.0/36.1 |
| CGFusionFormer (ours) | 9 | 81 | 215 | 46.5/36.5 |
| Method | Seq. Len. | PCK↑ | AUC↑ | MPJPE↓ |
|---|---|---|---|---|
| Pavllo et al. [25] | 81 | 86.0 | 51.9 | 84.0 |
| Pavllo et al. [25] | 243 | 85.5 | 51.5 | 84.8 |
| Lin et al. [39] | 25 | 83.6 | 51.4 | 79.8 |
| Chen et al. [40] | 81 | 87.9 | 54.0 | 78.8 |
| PoseFormer [17] | 9 | 95.4 | 63.2 | 57.7 |
| MHFormer [18] | 9 | 93.8 | 63.3 | 58.0 |
| MixSTE [26] | 27 | 94.4 | 66.5 | 54.9 |
| P-STMO [12] (*) | 81 | 97.9 | 75.8 | 32.2 |
| PoseFormerV2 [19] | 81 | 97.9 | 78.8 | 27.8 |
| CGFusionFormer (ours) | 81 | 97.9 | 78.5 | 27.2 |
| Method | # | BPS | FMC | HAF | MFLOPs | MPJPE↓ |
|---|---|---|---|---|---|---|
| Baseline | #1 | | | | 117.3 | 47.9 |
| +BPS | #2 | ✓ | | | 117.3 | 47.5 |
| +FMC | #3 | ✓ | ✓ | | 67.7 | 48.4 |
| +HAF | #4 | ✓ | ✓ | ✓ | 71.7 | 47.6 |
| Method | f | Seq. Len. | MFLOPs | FPS (l) | FPS (e) | Perform. Drop (mm) |
|---|---|---|---|---|---|---|
| MHFormer [18] CVPR’22 | 81 | 81 | 3132.2 | 82.57 | 14.59 | 5.25 |
| PoseFormerV2 [19] CVPR’23 | 3 | 81 | 117.3 | 89.93 | 17.74 | 4.3 |
| CGFusionFormer | 1 | 81 | 46.7 | 97.95 | 18.21 | 3.7 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lu, T.; Wang, H.; Xiao, D. CGFusionFormer: Exploring Compact Spatial Representation for Robust 3D Human Pose Estimation with Low Computation Complexity. Sensors 2025, 25, 6052. https://doi.org/10.3390/s25196052