STAG-Net: A Lightweight Spatial–Temporal Attention GCN for Real-Time 6D Human Pose Estimation in Human–Robot Collaboration Scenarios
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Graph Configuration
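Since the graph configuration underpins everything that follows, a minimal sketch of how such a skeleton graph can be assembled is given below. It assumes the 17-joint Human3.6M layout; the `PARENTS` kinematic tree is illustrative and not necessarily STAG-Net's exact configuration.

```python
import numpy as np

# Illustrative kinematic tree for a 17-joint Human3.6M skeleton:
# joint i hangs off PARENTS[i]; the pelvis/root (index 0) has no parent.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def build_adjacency(parents=PARENTS) -> np.ndarray:
    """Symmetric adjacency with self-loops (A + I) over the skeleton edges."""
    n = len(parents)
    A = np.eye(n, dtype=np.float32)
    for child, parent in enumerate(parents):
        if parent >= 0:
            A[child, parent] = A[parent, child] = 1.0
    return A

A = build_adjacency()   # (17, 17), shared by every graph convolution layer
```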
3.2. Propagation Rules
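For context, the baseline propagation rule of Kipf and Welling [5] is H(l+1) = σ(D̂^(-1/2) Â D̂^(-1/2) H(l) W(l)), with Â = A + I and D̂ its degree matrix. The sketch below implements only this baseline; the modulated node–edge (M-NE/M-NEA) layers described in this section extend it, so treat it as a starting point rather than the paper's exact layer.

```python
import torch
import torch.nn as nn

class VanillaGraphConv(nn.Module):
    """Kipf-Welling propagation rule [5]: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).
    A baseline sketch only; STAG-Net's modulated node-edge layers build on it."""
    def __init__(self, in_dim: int, out_dim: int, adj: torch.Tensor):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        deg = adj.sum(dim=1)                 # adj is assumed to already include self-loops
        d = deg.pow(-0.5)
        self.register_buffer("norm_adj", d[:, None] * adj * d[None, :])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, joints, in_dim) -> (batch, joints, out_dim)
        return torch.relu(self.norm_adj @ self.W(h))
```

Here `adj` can be the matrix from the Section 3.1 sketch, e.g. `torch.from_numpy(build_adjacency())`.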
3.3. 6D Rotation Representation
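For readers unfamiliar with the representation, Zhou et al. [27] showed that dropping the third column of a rotation matrix yields a continuous 6D encoding that avoids the discontinuities of Euler angles and quaternions; the full matrix is recovered with a Gram–Schmidt step. A minimal decoding sketch:

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(x: torch.Tensor) -> torch.Tensor:
    """Decode a 6D rotation representation (..., 6) into rotation matrices
    (..., 3, 3) via Gram-Schmidt orthogonalization, as in Zhou et al. [27]."""
    a1, a2 = x[..., 0:3], x[..., 3:6]
    b1 = F.normalize(a1, dim=-1)                                        # first column
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1) # orthogonalized
    b3 = torch.cross(b1, b2, dim=-1)                                    # right-handed frame
    return torch.stack((b1, b2, b3), dim=-1)                            # columns b1, b2, b3
```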
3.4. Skip-TCN
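The exact Skip-TCN wiring is specific to this paper; the sketch below only illustrates the ingredients the name implies, a dilated 1D convolution (the à trous scheme [28], as used in the TCNs of Pavllo et al. [1]) wrapped in a residual/skip connection. Channel sizes and the single-block structure are assumptions.

```python
import torch
import torch.nn as nn

class DilatedSkipBlock(nn.Module):
    """Illustrative temporal block: dilated 1D conv + BN + ReLU with a skip path.
    Growing the dilation across blocks lets a few layers cover long clips
    (e.g., F = 243) at low cost."""
    def __init__(self, channels: int, kernel: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation   # keep the temporal length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel,
                              padding=pad, dilation=dilation)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); the skip connection eases gradient flow
        return torch.relu(x + self.bn(self.conv(x)))
```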
3.5. Architecture
4. Experimental Results
4.1. Implementation Details
4.2. Comparison of Joint Position Estimation
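The Human3.6M tables in this subsection report MPJPE under Protocol 1 and P-MPJPE under Protocol 2 (error after a rigid Procrustes alignment), both in millimetres. A minimal sketch of the two metrics for a single pose, assuming predicted and ground-truth joints as (J, 3) arrays in millimetres:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Protocol 1: mean Euclidean distance over joints, in mm."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def p_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Protocol 2: MPJPE after an optimal similarity (Procrustes) alignment."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    x, y = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(x.T @ y)        # optimal rotation via SVD
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    s[-1] *= d
    U[:, -1] *= d
    R = U @ Vt
    scale = s.sum() / (x ** 2).sum()         # optimal isotropic scale
    return mpjpe(scale * x @ R + mu_g, gt)
```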
MPJPE (mm) on Human3.6M under Protocol 1 with single-frame input (F = 1); lower is better.

| Method | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Martinez et al. [2] (2017) | 51.8 | 56.2 | 58.1 | 59.0 | 69.5 | 78.4 | 55.2 | 58.1 | 74.0 | 94.6 | 62.3 | 59.1 | 65.1 | 49.5 | 52.4 | 62.9 |
| Fang et al. [40] (2018) | 50.1 | 54.3 | 57.0 | 57.1 | 66.6 | 73.3 | 53.4 | 55.7 | 72.8 | 88.6 | 60.3 | 57.7 | 62.7 | 47.5 | 50.6 | 60.4 |
| Pavlakos et al. [41] (2018) | 48.5 | 54.4 | 54.4 | 52.0 | 59.4 | 65.3 | 49.9 | 52.9 | 65.8 | 71.1 | 56.6 | 52.9 | 60.9 | 44.7 | 47.8 | 56.2 |
| Lee et al. [42] (2018) | 40.2 | 49.2 | 47.8 | 52.6 | 50.1 | 75.0 | 50.2 | 43.0 | 55.8 | 73.9 | 54.1 | 55.6 | 58.2 | 43.3 | 43.3 | 52.8 |
| Zhao et al. [10] (2019) | 47.3 | 60.7 | 51.4 | 60.5 | 61.1 | 49.9 | 47.3 | 68.1 | 86.2 | 55.0 | 67.8 | 61.0 | 42.1 | 60.6 | 45.3 | 57.6 |
| Ci et al. [43] (2019) | 46.8 | 52.3 | 44.7 | 50.4 | 52.9 | 68.9 | 49.6 | 46.4 | 60.2 | 78.9 | 51.2 | 50.0 | 54.8 | 40.4 | 43.3 | 52.7 |
| Pavllo et al. [1] (2019) | 47.1 | 50.6 | 49.0 | 51.8 | 53.6 | 61.4 | 49.4 | 47.4 | 59.3 | 67.4 | 52.4 | 49.5 | 55.3 | 39.5 | 42.7 | 51.8 |
| Cai et al. [20] (2019) § | 46.5 | 48.8 | 47.6 | 50.9 | 52.9 | 61.3 | 48.3 | 45.8 | 59.2 | 64.4 | 51.2 | 48.4 | 53.5 | 39.2 | 41.2 | 50.6 |
| Xu et al. [23] (2021) | 45.2 | 49.9 | 47.5 | 50.9 | 54.9 | 66.1 | 48.5 | 46.3 | 59.7 | 71.5 | 51.4 | 48.6 | 53.9 | 39.9 | 44.1 | 51.9 |
| Zhao et al. [44] (2022) | 45.2 | 50.8 | 48.0 | 50.0 | 54.9 | 65.0 | 48.2 | 47.1 | 60.2 | 70.0 | 51.6 | 48.7 | 54.1 | 39.7 | 43.1 | 51.8 |
| Banik et al. [19] (2024) | 48.9 | 50.1 | 46.7 | 50.4 | 54.6 | 63.0 | 48.8 | 47.9 | 64.1 | 68.6 | 50.5 | 48.7 | 53.9 | 39.3 | 42.2 | 51.7 |
| STG-Net (Ours, F = 1) | 44.6 | 50.9 | 48.0 | 49.3 | 52.0 | 61.2 | 48.3 | 46.5 | 58.8 | 69.8 | 51.4 | 47.6 | 52.9 | 38.6 | 41.6 | 50.8 |

P-MPJPE (mm) on Human3.6M under Protocol 2 with single-frame input (F = 1).

| Method | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Martinez et al. [2] (2017) | 39.5 | 43.2 | 46.4 | 47.0 | 51.0 | 56.0 | 41.4 | 40.6 | 56.5 | 69.4 | 49.2 | 45.0 | 49.5 | 38.0 | 43.1 | 47.7 |
| Fang et al. [40] (2018) | 38.2 | 41.7 | 43.7 | 44.9 | 48.5 | 55.3 | 40.2 | 38.2 | 54.5 | 64.4 | 47.2 | 44.3 | 47.3 | 36.7 | 41.7 | 45.7 |
| Pavlakos et al. [41] (2018) | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
| Lee et al. [42] (2018) | 34.9 | 35.2 | 43.2 | 42.6 | 46.2 | 55.0 | 37.6 | 38.8 | 50.9 | 67.3 | 48.9 | 35.2 | 31.0 | 50.7 | 34.6 | 43.4 |
| Pavllo et al. [1] (2019) | 36.0 | 38.7 | 38.0 | 41.7 | 40.1 | 45.9 | 37.1 | 35.4 | 46.8 | 53.4 | 41.4 | 36.9 | 43.1 | 30.3 | 34.8 | 40.0 |
| Cai et al. [20] (2019) | 36.8 | 38.7 | 38.2 | 41.7 | 40.7 | 46.8 | 37.9 | 35.6 | 47.6 | 51.7 | 41.3 | 36.8 | 42.7 | 31.0 | 34.7 | 40.2 |
| Liu et al. [8] (2020) | 35.9 | 40.0 | 38.0 | 41.5 | 42.5 | 51.4 | 37.8 | 36.0 | 48.6 | 56.6 | 41.8 | 38.3 | 42.7 | 31.7 | 36.2 | 41.2 |
| Zou et al. [9] (2021) | 35.7 | 38.6 | 36.3 | 40.5 | 39.2 | 44.5 | 37.0 | 35.4 | 46.4 | 51.2 | 40.5 | 35.6 | 41.7 | 30.7 | 33.9 | 39.1 |
| STG-Net (Ours, F = 1) | 35.5 | 39.6 | 38.2 | 40.7 | 40.3 | 46.2 | 36.2 | 35.7 | 48.4 | 54.9 | 42.0 | 36.1 | 42.7 | 31.1 | 34.4 | 40.1 |

MPJPE (mm) on Human3.6M under Protocol 1 with 243-frame input (F = 243).

| Method | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pavllo et al. [1] (2019) | 45.2 | 46.7 | 43.3 | 45.6 | 48.1 | 55.1 | 44.6 | 44.3 | 57.3 | 65.8 | 47.1 | 44.0 | 49.0 | 32.8 | 33.9 | 46.8 |
| Liu et al. [14] (2020) | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1 |
| Zeng et al. [45] (2020) | 46.6 | 47.1 | 43.9 | 41.6 | 45.8 | 49.6 | 46.5 | 40.0 | 53.4 | 61.1 | 46.1 | 42.6 | 43.1 | 31.5 | 32.6 | 44.8 |
| Chen et al. [46] (2021) | 41.4 | 43.5 | 40.1 | 42.9 | 46.6 | 51.9 | 41.7 | 42.3 | 53.9 | 60.2 | 45.4 | 41.7 | 46.0 | 31.5 | 32.7 | 44.1 |
| Li et al. [3] (2022) | 40.3 | 43.3 | 40.2 | 42.3 | 45.6 | 52.3 | 41.8 | 40.5 | 55.9 | 60.6 | 44.2 | 43.0 | 44.2 | 30.0 | 30.2 | 43.7 |
| Yu et al. [6] (2023) | 41.3 | 44.3 | 40.8 | 41.8 | 45.9 | 54.1 | 42.1 | 41.5 | 57.8 | 62.9 | 45.0 | 42.8 | 45.9 | 29.4 | 29.9 | 44.4 |
| Zhao et al. [47] (2023) | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 45.2 |
| Islam et al. [24] (2024) | 38.7 | 43.9 | 42.3 | 43.8 | 44.8 | 48.1 | 42.4 | 41.2 | 52.6 | 63.8 | 43.5 | 42.7 | 44.7 | 34.1 | 34.5 | 44.1 |
| Song et al. [22] (2024) | 41.1 | 43.3 | 40.4 | 41.3 | 44.9 | 53.2 | 41.7 | 41.1 | 54.9 | 65.2 | 43.5 | 41.3 | 42.7 | 29.1 | 29.2 | 43.5 |
| Lin et al. [38] (2025) | 39.0 | 42.5 | 40.7 | 41.1 | 45.9 | 51.3 | 41.1 | 40.5 | 54.1 | 59.9 | 44.1 | 41.3 | 43.4 | 29.5 | 29.8 | 43.0 |
| Hao et al. [39] (2025) | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 43.0 |
| STAG-Net (Ours, F = 243, S) | 40.6 | 44.5 | 39.4 | 41.0 | 45.6 | 50.9 | 42.6 | 43.2 | 54.3 | 59.9 | 44.6 | 40.8 | 43.1 | 29.2 | 30.4 | 43.3 |
| STAG-Net (Ours, F = 243, L) | 37.8 | 43.5 | 38.9 | 41.0 | 42.4 | 52.6 | 41.3 | 38.8 | 51.6 | 57.2 | 42.0 | 40.2 | 42.4 | 27.0 | 29.8 | 41.8 |

P-MPJPE (mm) on Human3.6M under Protocol 2 with 243-frame input (F = 243).

| Method | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pavllo et al. [1] (2019) | 34.1 | 36.1 | 34.4 | 37.2 | 36.4 | 42.2 | 34.4 | 33.6 | 45.0 | 52.5 | 37.4 | 33.8 | 37.8 | 25.6 | 27.3 | 36.5 |
| Liu et al. [14] (2020) | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6 |
| Zeng et al. [45] (2020) | 34.8 | 32.1 | 28.5 | 30.7 | 31.4 | 36.9 | 35.6 | 30.5 | 38.9 | 40.5 | 32.5 | 31.0 | 29.9 | 22.5 | 24.5 | 32.0 |
| Chen et al. [46] (2021) | 32.6 | 35.1 | 32.8 | 35.4 | 36.3 | 40.4 | 32.4 | 32.3 | 42.7 | 49.0 | 36.8 | 32.4 | 36.0 | 24.9 | 26.5 | 35.0 |
| Li et al. [3] (2022) | 32.7 | 35.5 | 32.5 | 35.4 | 35.9 | 41.6 | 33.0 | 31.9 | 45.1 | 50.1 | 36.3 | 33.5 | 35.1 | 23.9 | 25.0 | 35.2 |
| Yu et al. [6] (2023) | 32.4 | 35.3 | 32.6 | 34.2 | 35.0 | 42.1 | 32.1 | 31.9 | 45.5 | 49.5 | 36.1 | 32.4 | 35.6 | 23.5 | 24.7 | 34.8 |
| Zhao et al. [47] (2023) | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 35.6 |
| Islam et al. [24] (2024) | 33.0 | 36.1 | 34.1 | 37.4 | 36.2 | 40.4 | 33.6 | 32.4 | 44.1 | 54.4 | 36.5 | 34.5 | 36.2 | 26.4 | 27.4 | 36.2 |
| Song et al. [22] (2024) | 31.1 | 34.9 | 32.4 | 33.7 | 36.3 | 42.8 | 31.6 | 31.2 | 44.7 | 48.6 | 36.9 | 32.4 | 35.4 | 24.1 | 24.4 | 34.7 |
| Lin et al. [38] (2025) | 31.2 | 34.4 | 33.0 | 33.9 | 35.4 | 39.4 | 32.3 | 31.7 | 43.3 | 48.1 | 35.9 | 32.8 | 34.6 | 24.0 | 24.5 | 34.3 |
| Hao et al. [39] (2025) | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 28.3 |
| STAG-Net (Ours, F = 243, S) | 31.3 | 35.1 | 31.0 | 32.7 | 34.6 | 39.1 | 32.1 | 32.4 | 42.7 | 47.7 | 35.3 | 31.0 | 33.4 | 22.5 | 23.7 | 33.6 |
| STAG-Net (Ours, F = 243, L) | 30.2 | 33.2 | 30.9 | 32.4 | 32.1 | 40.3 | 30.4 | 29.3 | 41.3 | 45.2 | 33.9 | 30.2 | 33.3 | 21.6 | 24.5 | 32.6 |

Results on the MPI-INF-3DHP test set (3DPCK and AUC, %; higher is better).

| Method | 3DPCK | AUC |
|---|---|---|
| Luo et al. [48] (2018) | 81.8 | 45.2 |
| Wandt et al. [49] (2019) | 82.5 | 58.5 |
| Sárándi et al. [50] (2021) | 90.6 | 56.2 |
| Zheng et al. [51] (2021) | 88.6 | 56.4 |
| Gong et al. [52] (2022) | 89.1 | 53.1 |
| Oreshkin et al. [53] (2023) | 88.6 | 48.9 |
| Shetty et al. [54] (2023) | 91.8 | 52.3 |
| Qian et al. [55] (2023) | 97.3 | 71.5 |
| Hao et al. [39] (2025) | 94.0 | 55.2 |
| STG-Net (Ours, F = 1) | ||
| STAG-Net (Ours, F = 27) | ||
| STAG-Net (Ours, F = 81) |

P-MPJPE (mm) on HumanEva-I for the Walk and Jog actions of subjects S1–S3.

| Method | Walk S1 | Walk S2 | Walk S3 | Jog S1 | Jog S2 | Jog S3 | Avg. |
|---|---|---|---|---|---|---|---|
| Martinez et al. [2] (2017) | 19.7 | 17.4 | 46.8 | 26.9 | 18.2 | 18.6 | 24.6 |
| Lee et al. [42] (2018) | 18.6 | 19.9 | 30.5 | 25.7 | 16.8 | 17.7 | 21.5 |
| Pavllo et al. [1] (2019) | 13.9 | 10.2 | 46.6 | 20.9 | 13.1 | 13.8 | 19.8 |
| Zhang et al. [56] (2021) | 13.7 | 9.5 | 47.1 | 21.0 | 12.6 | 13.4 | 19.5 |
| Li et al. [3] (2022) † | 9.7 | 7.6 | 15.8 | 12.3 | 9.4 | 11.2 | 11.0 |
| Aouaidjia et al. [25] (2025) † | 8.7 | 6.5 | 17.9 | 13.5 | 7.8 | 8.5 | 10.4 |
| STAG-Net (Ours, F = 3) † | 8.5 | 6.2 | 10.2 | 10.0 | 7.8 | 8.4 | 8.5 |
4.3. Qualitative Results
4.4. Comparison of Joint Orientation Estimation
Joint position (MPJPE, mm) versus joint orientation (MPJAE, rad) estimation with single-frame input (F = 1).

| Method | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| Banik et al. [19] (2024) | 51.7 | 0.26 |
| STG-Net (Ours, F = 1) | 50.8 | 0.23 |
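MPJAE values are in radians and measure the geodesic distance between predicted and ground-truth joint rotations on SO(3) [31]. A minimal sketch, assuming both poses are given as (J, 3, 3) rotation matrices:

```python
import numpy as np

def mpjae(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Mean per-joint angular error (rad): geodesic distance on SO(3),
    theta = arccos((trace(R_pred R_gt^T) - 1) / 2), averaged over joints."""
    R_rel = R_pred @ np.transpose(R_gt, (0, 2, 1))           # relative rotations
    cos = (np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())  # clip for stability
```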
4.5. Model Size and Computational Complexity
Model size and computational complexity versus MPJPE (mm) on Human3.6M.

| Method | Parameters | FLOPs | MPJPE (mm) | Frames |
|---|---|---|---|---|
| Hossain et al. [57] (2018) | 16.96 M | 33.88 M | 58.3 | 1 |
| Pavllo et al. [1] (2019) | – | – | 51.8 | 1 |
| Banik et al. [19] (2024) | 4.6 M | – | 51.7 | 1 |
| STG-Net (Ours) | 2.44 M | 0.34 M | 50.8 | 1 |
| STAG-Net (Ours) | 5.51 M | 55.66 M | 51.1 | 1 |
| Pavllo et al. [1] (2019) | 16.95 M | 33.87 M | 46.8 | 243 |
| Li et al. [3] (2022) | 4.23 M | 1.37 G | 44.0 | 243 |
| Yu et al. [6] (2023) | 1.3 M | 1.5 G | 44.4 | 243 |
| Zhu et al. [59] (2023) | 42.5 M | 174.7 G | 39.2 | 243 |
| Tang et al. [30] (2023) | 4.75 M | 19.56 G | 41.0 | 243 |
| Mehraban et al. [58] (2024) | 19.0 M | 78.3 G | 38.4 | 243 |
| Lin et al. [38] (2025) | 9.3 M | 1.29 G | 43.0 | 243 |
| STAG-Net (Ours, model-S) | 3.06 M | 6.93 G | 43.3 | 243 |
| STAG-Net (Ours, model-L) | 6.26 M | 13.88 G | 41.8 | 243 |
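Parameter counts of the kind reported above are straightforward to reproduce in PyTorch [36]; FLOPs depend on the profiling tool and input length, so only the parameter count is sketched here. The `toy` model is a stand-in, since the full STAG-Net definition is outside the scope of this table.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameters in millions, as reported in the table above."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Stand-in model for illustration only (not the STAG-Net architecture):
toy = nn.Sequential(nn.Linear(34, 256), nn.ReLU(), nn.Linear(256, 51))
print(f"{count_parameters(toy):.2f} M parameters")
```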
4.6. Ablation Study
Ablation of spatial and temporal components (MPJPE in mm, MPJAE in rad):

| Model | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| TCN (F = 1) | 52.6 | – |
| GCN (F = 1) | 52.7 | 0.23 |
| STG-Net (F = 1) | 50.8 | 0.23 |
| STAG-Net (F = 1) | 51.1 | 0.23 |
| TCN (F = 27) | 51.4 | – |
| Attention (F = 27) | 47.0 | – |
| GCN+TCN (F = 27) | 49.4 | 0.22 |
| GCN+Attention (F = 27) | 47.6 | 0.21 |
| STAG-Net (F = 27) | 45.9 | 0.20 |
| TCN (F = 81) | 48.8 | – |
| Attention (F = 81) | 45.2 | – |
| GCN+TCN (F = 81) | 48.6 | 0.21 |
| GCN+Attention (F = 81) | 47.0 | 0.21 |
| STAG-Net (F = 81) | 45.1 | 0.20 |

Effect of the feature dimension:

| Input Dimension | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| 128 (F = 27) | 47.1 | 0.21 |
| 192 (F = 27) | 45.9 | 0.20 |
| 256 (F = 27) | 48.1 | 0.22 |
| 384 (F = 27) | 45.9 | 0.20 |
| 128 (F = 81) | 46.0 | 0.21 |
| 192 (F = 81) | 45.1 | 0.20 |
| 256 (F = 81) | 45.9 | 0.20 |
| 384 (F = 81) | 45.5 | 0.19 |
| 128 (F = 243) | 43.3 | 0.19 |
| 192 (F = 243) | 41.8 | 0.19 |
| 256 (F = 243) | 43.7 | 0.20 |
| 384 (F = 243) | 42.4 | 0.19 |

Effect of the number of M-NEA residual blocks:

| M-NEA Residual Blocks | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| 2 (F = 27) | 47.4 | 0.21 |
| 3 (F = 27) | 45.9 | 0.20 |
| 4 (F = 27) | 48.1 | 0.22 |
| 5 (F = 27) | 46.5 | 0.21 |

Node attention versus node–edge attention (MPJPE in mm):

| Model | MPJPE (mm) |
|---|---|
| Node-Attention (F = 1) | 51.7 |
| Node-Edge-Attention (F = 1) | 51.1 |
| Node-Attention (F = 27) | 46.9 |
| Node-Edge-Attention (F = 27) | 45.9 |
| Node-Attention (F = 81) | 45.9 |
| Node-Edge-Attention (F = 81) | 45.1 |
5. Real-Time Application
6. Discussion
7. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| STAG-Net | Spatial–Temporal Attention Graph Network |
| STG-Net | Spatial–Temporal Graph Network |
| GCN | Graph Convolutional Network |
| TCN | Temporal Convolutional Network |
| CPN | Cascaded Pyramid Network |
| BN | Batch Normalization |
| ECA | Efficient Channel Attention |
| GNN | Graph Neural Network |
| M-NE | Modulated Node–Edge |
| M-NEA | Modulated Node–Edge–Attention |
| FC | Fully Connected |
| MPJPE | Mean Per Joint Position Error |
| P-MPJPE | Procrustes-aligned Mean Per Joint Position Error |
| IDev | Identity Deviation |
| MPJAE | Mean Per Joint Angular Error |
| 3DPCK | 3D Percentage of Correct Keypoints |
| AUC | Area Under the Curve |
References
- Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7753–7762. [Google Scholar]
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2640–2649. [Google Scholar]
- Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Trans. Multimed. 2022, 25, 1282–1293. [Google Scholar] [CrossRef]
- Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13147–13156. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Yu, B.X.; Zhang, Z.; Liu, Y.; Zhong, S.H.; Liu, Y.; Chen, C.W. Gla-gcn: Global-local adaptive graph convolutional network for 3d human pose estimation from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8818–8829. [Google Scholar]
- Manessi, F.; Rozza, A.; Manzo, M. Dynamic graph convolutional networks. Pattern Recognit. 2020, 97, 107000. [Google Scholar] [CrossRef]
- Liu, K.; Ding, R.; Zou, Z.; Wang, L.; Tang, W. A comprehensive study of weight sharing in graph networks for 3d human pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X; Springer: Berlin/Heidelberg, Germany, 2020; pp. 318–334. [Google Scholar]
- Zou, Z.; Tang, W. Modulated graph convolutional network for 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11477–11487. [Google Scholar]
- Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Semantic graph convolutional networks for 3d human pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3425–3435. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7025–7034. [Google Scholar]
- Sosa, J.; Hogg, D. Self-supervised 3d human pose estimation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4787–4796. [Google Scholar]
- Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509. [Google Scholar]
- Liu, R.; Shen, J.; Wang, H.; Chen, C.; Cheung, S.C.; Asari, V. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5064–5073. [Google Scholar]
- Lee, K.; Kim, W.; Lee, S. From human pose similarity metric to 3D human pose estimator: Temporal propagating LSTM networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1781–1797. [Google Scholar] [CrossRef]
- Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7122–7131. [Google Scholar]
- Zhang, H.; Tian, Y.; Zhou, X.; Ouyang, W.; Liu, Y.; Wang, L.; Sun, Z. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11446–11456. [Google Scholar]
- Fisch, M.; Clark, R. Orientation keypoints for 6D human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 10145–10158. [Google Scholar] [CrossRef] [PubMed]
- Banik, S.; Avagyan, E.; Auddy, S.; Gracia, A.M.; Knoll, A. PoseGraphNet++: Enriching 3D human pose with orientation estimation. arXiv 2023, arXiv:2308.11440. [Google Scholar]
- Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.J.; Yuan, J.; Thalmann, N.M. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2272–2281. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Song, X.; Li, Z.; Chen, S.; Demachi, K. Quater-gcn: Enhancing 3d human pose estimation with orientation and semi-supervised training. arXiv 2024, arXiv:2404.19279. [Google Scholar]
- Xu, T.; Takano, W. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16105–16114. [Google Scholar]
- Islam, Z.; Hamza, A.B. Multi-hop graph transformer network for 3D human pose estimation. J. Vis. Commun. Image Represent. 2024, 101, 104174. [Google Scholar] [CrossRef]
- Aouaidjia, K.; Li, A.; Zhang, W.; Zhang, C. 3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer. arXiv 2025, arXiv:2505.01003. [Google Scholar] [CrossRef]
- Jiang, X.; Zhu, R.; Li, S.; Ji, P. Co-embedding of nodes and edges with graph neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 7075–7086. [Google Scholar] [CrossRef]
- Zhou, Y.; Barnes, C.; Lu, J.; Yang, J.; Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5745–5753. [Google Scholar]
- Holschneider, M.; Kronland-Martinet, R.; Morlet, J.; Tchamitchian, P. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, Proceedings of the International Conference, Marseille, France, 14–18 December 1987; Springer: Berlin/Heidelberg, Germany, 1990; pp. 286–297. [Google Scholar]
- Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Elgharib, M.; Fua, P.; Seidel, H.P.; Rhodin, H.; Pons-Moll, G.; Theobalt, C. XNect: Real-time multi-person 3D motion capture with a single RGB camera. ACM Trans. Graph. 2020, 39, 82. [Google Scholar] [CrossRef]
- Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3d human pose estimation with spatio-temporal criss-cross attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar]
- Huynh, D.Q. Metrics for 3D rotations: Comparison and analysis. J. Math. Imaging Vis. 2009, 35, 155–164. [Google Scholar] [CrossRef]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef]
- Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3d human pose estimation in the wild using improved cnn supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; IEEE: New York, NY, USA, 2017; pp. 506–516. [Google Scholar]
- Sigal, L.; Balan, A.O.; Black, M.J. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 2010, 87, 4–27. [Google Scholar] [CrossRef]
- Li, J.; Xu, C.; Chen, Z.; Bian, S.; Yang, L.; Lu, C. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3383–3393. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. 2017. Available online: https://pytorch.org (accessed on 1 March 2026).
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
- Lin, H.; Xu, S.; Su, C. MSTFormer: Multi-granularity spatial-temporal transformers for 3D human pose estimation. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 1–19. [Google Scholar] [CrossRef]
- Hao, X.; Li, H. Perspose: 3d human pose estimation with perspective encoding and perspective rotation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8110–8119. [Google Scholar]
- Fang, H.S.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning pose grammar to encode human body configuration for 3d pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal depth supervision for 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7307–7316. [Google Scholar]
- Lee, K.; Lee, I.; Lee, S. Propagating lstm: 3d pose estimation based on joint interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 119–135. [Google Scholar]
- Ci, H.; Wang, C.; Ma, X.; Wang, Y. Optimizing network structure for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2262–2271. [Google Scholar]
- Zhao, W.; Wang, W.; Tian, Y. Graformer: Graph-oriented transformer for 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20438–20447. [Google Scholar]
- Zeng, A.; Sun, X.; Huang, F.; Liu, M.; Xu, Q.; Lin, S. Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 507–523. [Google Scholar]
- Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 198–209. [Google Scholar] [CrossRef]
- Zhao, Q.; Zheng, C.; Liu, M.; Wang, P.; Chen, C. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8877–8886. [Google Scholar]
- Luo, C.; Chu, X.; Yuille, A. Orinet: A fully convolutional network for 3d human pose estimation. arXiv 2018, arXiv:1811.04989. [Google Scholar] [CrossRef]
- Wandt, B.; Rosenhahn, B. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7782–7791. [Google Scholar]
- Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. Metrabs: Metric-scale truncation-robust heatmaps for absolute 3d human pose estimation. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 16–30. [Google Scholar] [CrossRef]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11656–11665. [Google Scholar]
- Gong, K.; Li, B.; Zhang, J.; Wang, T.; Huang, J.; Mi, M.B.; Feng, J.; Wang, X. PoseTriplet: Co-evolving 3D human pose estimation, imitation, and hallucination under self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11017–11027. [Google Scholar]
- Oreshkin, B.N. 3d human pose and shape estimation via hybrik-transformer. arXiv 2023, arXiv:2302.04774. [Google Scholar] [CrossRef]
- Shetty, K.; Birkhold, A.; Jaganathan, S.; Strobel, N.; Kowarschik, M.; Maier, A.; Egger, B. Pliks: A pseudo-linear inverse kinematic solver for 3d human body estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 574–584. [Google Scholar]
- Qian, X.; Tang, Y.; Zhang, N.; Han, M.; Xiao, J.; Huang, M.C.; Lin, R.S. Hstformer: Hierarchical spatial-temporal transformers for 3d human pose estimation. arXiv 2023, arXiv:2301.07322. [Google Scholar] [CrossRef]
- Zhang, J.; Wang, Y.; Zhou, Z.; Luan, T.; Wang, Z.; Qiao, Y. Learning dynamical human-joint affinity for 3d pose estimation in videos. IEEE Trans. Image Process. 2021, 30, 7914–7925. [Google Scholar] [CrossRef] [PubMed]
- Hossain, M.R.I.; Little, J.J. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 68–84. [Google Scholar]
- Mehraban, S.; Adeli, V.; Taati, B. MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6920–6930. [Google Scholar]
- Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15085–15099. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.