Human Trajectory Prediction Based on a Single Frame of Pose and Initial Velocity Information
Abstract
1. Introduction
- We design a novel Dual-GRU architecture. The hierarchical design decouples coarse trajectory shaping from fine-grained refinement, which is distinct from prior single-step predictors.
- We propose two novel loss functions. The pose loss models human motion continuity in quaternion space for joint angles. The trajectory loss preserves realistic motion independently of the global coordinate frame.
- We introduce a method that sets the loss coefficients automatically by dynamically adjusting for the scale of the skeleton.
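The hierarchical Dual-GRU idea in the first contribution can be illustrated with a minimal sketch: a coarse GRU shapes a rough next-step output and a fine GRU adds a residual refinement. All layer sizes, names, and the residual coupling here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, input_size, hidden_size, rng):
        s = 1.0 / np.sqrt(hidden_size)
        shape = (hidden_size, input_size + hidden_size)
        self.Wz = rng.uniform(-s, s, shape)
        self.Wr = rng.uniform(-s, s, shape)
        self.Wh = rng.uniform(-s, s, shape)

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                      # update gate
        r = sigmoid(self.Wr @ xh)                      # reset gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1.0 - z) * h + z * h_cand

class DualGRU:
    """Hierarchical pair: the coarse GRU produces a rough next-step output;
    the fine GRU, conditioned on that output, adds a residual refinement."""
    def __init__(self, in_size, hidden_size, out_size, rng):
        self.coarse = GRUCell(in_size, hidden_size, rng)
        self.fine = GRUCell(in_size + out_size, hidden_size, rng)
        s = 1.0 / np.sqrt(hidden_size)
        self.Wc = rng.uniform(-s, s, (out_size, hidden_size))
        self.Wf = rng.uniform(-s, s, (out_size, hidden_size))

    def step(self, x, h_coarse, h_fine):
        h_coarse = self.coarse.step(x, h_coarse)
        rough = self.Wc @ h_coarse                     # coarse shaping
        h_fine = self.fine.step(np.concatenate([x, rough]), h_fine)
        refined = rough + self.Wf @ h_fine             # fine-grained residual
        return refined, h_coarse, h_fine

# Autoregressive rollout from a single input frame (pose features + speed),
# feeding each prediction back as the next input.
rng = np.random.default_rng(0)
net = DualGRU(in_size=10, hidden_size=16, out_size=10, rng=rng)
x = rng.standard_normal(10)
h_c, h_f = np.zeros(16), np.zeros(16)
outputs = []
for _ in range(4):
    x, h_c, h_f = net.step(x, h_c, h_f)
    outputs.append(x)
print(len(outputs), outputs[0].shape)
```

The single-step predictors this design is contrasted with would collapse the two stages into one recurrent output head; here the refinement stage sees the coarse output explicitly.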
2. Related Work
2.1. Pose Prediction
2.2. Social Motion Forecasting
2.3. Motion Synthesis
3. Pose and Trajectory Prediction
3.1. Trajectory and Skeleton Model
3.2. Network Structure
3.3. Training Loss
3.3.1. Pose Loss
3.3.2. Trajectory Loss
4. Experimental Settings
4.1. Dataset and Data Pre-Processing
4.2. Network Implementation
5. Results
5.1. Human Pose Prediction
- Method I: Use the dynamic adjustment mechanism? If yes, we gradually reduce the ratio at which ground-truth frames are fed to the network. If no, we always feed the ground truth for the first 50 frames.
- Method II: Use the quaternion loss? If yes, we use both the position loss and the quaternion loss. If no, we use the position loss only.
- Method III: Use the geodesic distance loss? If yes, we calculate the geodesic distance between the ground truth and the predicted values. If no, we calculate the L2 loss instead.
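The distinction in Method III can be sketched as follows. On unit quaternions, the geodesic distance is the angle of the relative rotation, and the absolute value of the dot product handles the double cover (q and −q encode the same rotation), which a plain L2 loss ignores. This is a generic illustration, not the paper's exact formulation.

```python
import numpy as np

def geodesic_distance(q1, q2):
    """Rotation angle between two unit quaternions; |dot| handles the
    double cover (q and -q represent the same rotation)."""
    dot = np.clip(abs(np.dot(q1, q2)), -1.0, 1.0)
    return 2.0 * np.arccos(dot)

def l2_distance(q1, q2):
    """Plain L2 baseline (Method III 'no' branch); blind to the double cover."""
    return np.linalg.norm(q1 - q2)

identity = np.array([1.0, 0.0, 0.0, 0.0])
flipped = -identity                            # same rotation, opposite sign
print(geodesic_distance(identity, flipped))    # 0.0: recognized as identical
print(l2_distance(identity, flipped))          # 2.0: heavily penalized anyway
```

The example shows why the geodesic variant is the safer training signal: the L2 loss can push the network away from a prediction that is already rotationally correct.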
5.2. Human Trajectory Prediction
6. Limitations
- Pose Quality: The framework assumes high-quality input poses (e.g., from motion capture systems). Noisy or heavily occluded poses can degrade prediction quality.
- No Scene or Interaction Modeling: Our model does not account for environmental constraints (e.g., obstacles) or social interactions with other agents. It may therefore struggle in dense or interactive environments.
- Failure Cases: Abrupt changes in motion intent (e.g., sudden turns or stops) and pattern transitions (e.g., walking to running) are common failure modes due to the lack of temporal or contextual cues.
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Villegas, R.; Yang, J.; Ceylan, D.; Lee, H. Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8639–8648. [Google Scholar]
- Maskeliūnas, R.; Damaševičius, R.; Blažauskas, T.; Canbulut, C.; Adomavičienė, A.; Griškevičius, J. BiomacVR: A virtual reality-based system for precise human posture and motion analysis in rehabilitation exercises using depth sensors. Electronics 2023, 12, 339. [Google Scholar] [CrossRef]
- Holden, D.; Saito, J.; Komura, T. A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. (TOG) 2016, 35, 1–11. [Google Scholar] [CrossRef]
- Harvey, F.G.; Yurick, M.; Nowrouzezahrai, D.; Pal, C. Robust motion in-betweening. ACM Trans. Graph. (TOG) 2020, 39, 60:1–60:12. [Google Scholar] [CrossRef]
- Wang, C.; He, S.; Wu, M.; Lam, S.K.; Tiwari, P.; Gao, X. Looking Clearer with Text: A Hierarchical Context Blending Network for Occluded Person Re-Identification. IEEE Trans. Inf. Forensics Secur. 2025, 20, 4296–4307. [Google Scholar] [CrossRef]
- Gao, X.; Chen, Z.; Wei, J.; Wang, R.; Zhao, Z. Deep Mutual Distillation for Unsupervised Domain Adaptation Person Re-identification. IEEE Trans. Multimed. 2025, 27, 1059–1071. [Google Scholar] [CrossRef]
- Ha, T.; Choi, C.H. An effective trajectory generation method for bipedal walking. Robot. Auton. Syst. 2007, 55, 795–810. [Google Scholar] [CrossRef]
- Collins, S.H.; Adamczyk, P.G.; Kuo, A.D. Dynamic arm swinging in human walking. Proc. R. Soc. B Biol. Sci. 2009, 276, 3679–3688. [Google Scholar] [CrossRef]
- Hirukawa, H.; Hattori, S.; Kajita, S.; Harada, K.; Kaneko, K.; Kanehiro, F.; Morisawa, M.; Nakaoka, S. A pattern generator of humanoid robots walking on a rough terrain. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, 10–14 April 2007; pp. 2181–2187. [Google Scholar]
- Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4346–4354. [Google Scholar]
- Martinez, J.; Black, M.J.; Romero, J. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2891–2900. [Google Scholar]
- Ghosh, P.; Song, J.; Aksan, E.; Hilliges, O. Learning human motion models for long-term predictions. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 458–466. [Google Scholar]
- Li, C.; Zhang, Z.; Lee, W.S.; Lee, G.H. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5226–5234. [Google Scholar]
- Li, M.; Chen, S.; Zhao, Y.; Zhang, Y.; Wang, Y.; Tian, Q. Multiscale Spatio-Temporal Graph Neural Networks for 3D Skeleton-Based Motion Prediction. IEEE Trans. Image Process. 2021, 30, 7760–7775. [Google Scholar] [CrossRef]
- Pang, C.; Gao, X.; Chen, Z.; Lyu, L. Self-adaptive graph with nonlocal attention network for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 17057–17069. [Google Scholar] [CrossRef]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation with Spatial and Temporal Transformers. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021. [Google Scholar]
- Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. MotionBERT: A Unified Perspective on Learning Human Motion Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
- Mehraban, S.; Adeli, V.; Taati, B. Motionagformer: Enhancing 3d human pose estimation with a transformer-gcnformer network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6920–6930. [Google Scholar]
- Barsoum, E.; Kender, J.; Liu, Z. Hp-gan: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1418–1427. [Google Scholar]
- Kundu, J.N.; Gor, M.; Babu, R.V. Bihmp-gan: Bidirectional 3d human motion prediction gan. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8553–8560. [Google Scholar]
- Wang, C.; Cao, R.; Wang, R. Learning discriminative topological structure information representation for 2D shape and social network classification via persistent homology. Knowl.-Based Syst. 2025, 311, 113125. [Google Scholar] [CrossRef]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Gated feedback recurrent neural networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2067–2075. [Google Scholar]
- Liu, X.; Yin, J.; Liu, J.; Ding, P.; Liu, J.; Liu, H. Trajectorycnn: A new spatio-temporal feature learning network for human motion prediction. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2133–2146. [Google Scholar] [CrossRef]
- Sak, H.; Senior, A.W.; Beaufays, F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. Proc. Interspeech 2014, 2014, 338–342. [Google Scholar]
- Pavllo, D.; Grangier, D.; Auli, M. Quaternet: A quaternion-based recurrent model for human motion. arXiv 2018, arXiv:1805.06485. [Google Scholar]
- Fujita, T.; Kawanishi, Y. Future pose prediction from 3d human skeleton sequence with surrounding situation. Sensors 2023, 23, 876. [Google Scholar] [CrossRef]
- Liu, Z.; Su, P.; Wu, S.; Shen, X.; Chen, H.; Hao, Y.; Wang, M. Motion prediction using trajectory cues. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 13299–13308. [Google Scholar]
- Zaier, M.; Wannous, H.; Drira, H.; Boonaert, J. A dual perspective of human motion analysis-3d pose estimation and 2d trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2189–2199. [Google Scholar]
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971. [Google Scholar]
- Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 683–700. [Google Scholar]
- Adeli, V.; Adeli, E.; Reid, I.; Niebles, J.C.; Rezatofighi, H. Socially and contextually aware human motion and pose forecasting. IEEE Robot. Autom. Lett. 2020, 5, 6033–6040. [Google Scholar] [CrossRef]
- Daniel, N.; Larey, A.; Aknin, E.; Osswald, G.A.; Caldwell, J.M.; Rochman, M.; Collins, M.H.; Yang, G.Y.; Arva, N.C.; Capocelli, K.E.; et al. PECNet: A deep multi-label segmentation network for eosinophilic esophagitis biopsy diagnostics. arXiv 2021, arXiv:2103.02015. [Google Scholar]
- Taylor, G.W.; Hinton, G.E. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1025–1032. [Google Scholar]
- Grassia, F.S. Practical parameterization of rotations using the exponential map. J. Graph. Tools 1998, 3, 29–48. [Google Scholar] [CrossRef]
- Ernst, M.J.; Rast, F.M.; Bauer, C.M.; Marcar, V.L.; Kool, J. Determination of thoracic and lumbar spinal processes by their percentage position between C7 and the PSIS level. BMC Res. Notes 2013, 6, 58. [Google Scholar] [CrossRef]
- Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 1998, 6, 107–116. [Google Scholar] [CrossRef]
- Jiang, X.; Sun, J.; Li, C.; Ding, H. Video image defogging recognition based on recurrent neural network. IEEE Trans. Ind. Informatics 2018, 14, 3281–3288. [Google Scholar] [CrossRef]
- Lin, C.B.; Dong, Z.; Kuan, W.K.; Huang, Y.F. A framework for fall detection based on OpenPose skeleton and LSTM/GRU models. Appl. Sci. 2020, 11, 329. [Google Scholar] [CrossRef]
- Ma, H.; Cao, J.; Mi, B.; Huang, D.; Liu, Y.; Li, S. A gru-based lightweight system for can intrusion detection in real time. Secur. Commun. Netw. 2022, 2022, 5827056. [Google Scholar] [CrossRef]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
- CMU Graphics Lab. CMU Graphics Lab Motion Capture Database Converted to FBX. Available online: https://mocap.cs.cmu.edu/ (accessed on 25 June 2025).
- Müller, M.; Röder, T.; Clausen, M.; Eberhardt, B.; Krüger, B.; Weber, A. Mocap Database HDM05; Technical Report CG-2007-2; Institut für Informatik II, Universität Bonn: Bonn, Germany, 2007. Available online: https://resources.mpi-inf.mpg.de/HDM05/ (accessed on 25 June 2025).
- Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; Bajcsy, R. Berkeley MHAD: A comprehensive multimodal human action database. In Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), Clearwater Beach, FL, USA, 15–17 January 2013; pp. 53–60. [Google Scholar]
- Xia, S.; Wang, C.; Chai, J.; Hodgins, J. Realtime style transfer for unlabeled heterogeneous human motion. ACM Trans. Graph. (TOG) 2015, 34, 1–10. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.d., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). arXiv 2015, arXiv:1511.07289. [Google Scholar]
- Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2255–2264. [Google Scholar]
- Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14424–14432. [Google Scholar]
- Wang, H.; Dong, J.; Cheng, B.; Feng, J. PVRED: A Position-Velocity Recurrent Encoder-Decoder for Human Motion Prediction. IEEE Trans. Image Process. 2021, 30, 6096–6106. [Google Scholar] [CrossRef]
| Type | Pose Network | Trajectory Network |
| --- | --- | --- |
| Inputs | Pose (quaternion, joints); velocity (scalar speed) | Pose (quaternion, joints); velocity (scalar speed) |
| Outputs | Pose (quaternion, joints) | Root joint trajectory (2D XZ position) |
| Model | I | II | III | Euler Yaw | Euler Pitch | Euler Roll | Quat. w | Quat. x | Quat. y | Quat. z | EMAE | QMAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | ✓ | ✓ | ✓ | 3.080 | 1.540 | 3.931 | 0.0050 | 0.0178 | 0.0121 | 0.0119 | 2.850 | 0.0117 |
| ➀ | | ✓ | ✓ | 3.296 | 1.704 | 4.182 | 0.0055 | 0.0223 | 0.0131 | 0.0141 | 3.061 | 0.0138 |
| ➁ | | | ✓ | 7.194 | 6.629 | 6.931 | 0.0143 | 0.0441 | 0.0559 | 0.0477 | 6.918 | 0.0405 |
| ➂ | ✓ | ✓ | | 3.809 | 1.803 | 4.400 | 0.0053 | 0.0179 | 0.0148 | 0.0137 | 3.338 | 0.0129 |
| ➃ | ✓ | | | 3.867 | 1.747 | 4.396 | 0.0055 | 0.0183 | 0.0140 | 0.0131 | 3.336 | 0.0127 |
| ➄ | | | | 10.453 | 9.617 | 10.460 | 0.0214 | 0.0654 | 0.0860 | 0.0680 | 10.176 | 0.0602 |
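The EMAE and QMAE totals in the ablation table are consistent with simple per-axis and per-component means; checking the full-model row:

```python
import numpy as np

# Full-model row of the ablation table.
euler = [3.080, 1.540, 3.931]              # yaw, pitch, roll errors
quat = [0.0050, 0.0178, 0.0121, 0.0119]    # per-component quaternion errors

emae = np.mean(euler)                      # Euler mean absolute error
qmae = np.mean(quat)                       # quaternion mean absolute error
print(round(float(emae), 3))               # 2.85  (table: 2.850)
print(round(float(qmae), 4))               # 0.0117
```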
| Methods | Velocity | Pose | Walking ADE (m) ↓ | Walking FDE (m) ↓ |
| --- | --- | --- | --- | --- |
| Baseline 1 | ✓ | ✓ | 1.734 | 3.224 |
| Baseline 2 | ✓ | ✓ | 1.459 | 2.277 |
| Baseline 3 | ✓ | | 0.709 | 1.345 |
| Ours | ✓ | ✓ | | |
| Ours * | ✓ | ✓ | 0.509 | 0.964 |
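The ADE and FDE metrics reported for trajectory prediction follow the standard definitions: average displacement error is the mean Euclidean distance over all predicted timesteps, and final displacement error is the distance at the last step. A minimal sketch with a toy drifting trajectory:

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE: mean Euclidean error over timesteps; FDE: error at the final step.
    pred, gt: (T, 2) arrays of XZ root positions in metres."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean(), dists[-1]

# Toy example: prediction drifts laterally by 0.1 m per step from a straight walk.
T = 5
gt = np.stack([np.arange(T, dtype=float), np.zeros(T)], axis=1)
pred = gt + np.stack([np.zeros(T), 0.1 * np.arange(1, T + 1)], axis=1)
ade, fde = ade_fde(pred, gt)
print(round(float(ade), 2), round(float(fde), 2))  # 0.3 0.5
```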
Huang, Y.; Yan, H. Human Trajectory Prediction Based on a Single Frame of Pose and Initial Velocity Information. Electronics 2025, 14, 2636. https://doi.org/10.3390/electronics14132636