End-to-End Pedestrian Trajectory Forecasting with Transformer Network
Abstract
1. Introduction
- First, we propose an effective, end-to-end trainable framework built upon the transformer architecture and embedded with a random deviation query for trajectory forecasting. The self-correcting ability introduced by the random deviation query enhances the robustness of the existing transformer network. In detail, we design an attention mask to resolve the assignment problem between parallel input queries and sequential prediction.
- Second, we present a co-training strategy based on a classification branch to improve training. The whole scheme is trained collaboratively with the original loss and the classification loss, which improves the accuracy of the results. Experiments against state-of-the-art methods show that the proposed method predicts plausible trajectories with higher accuracy. (A minimal sketch of both ideas follows this list.)
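To make the two contributions concrete, the sketch below is a minimal PyTorch illustration, not the authors' reference implementation: the module and function names, the Gaussian noise scale `noise_std`, the triangular `tgt_mask`, and the binary "accurate within ADD" labels driving the classification branch are all assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeviationQueryDecoder(nn.Module):
    """Decoder fed one learned query per future step (hypothetical sketch)."""

    def __init__(self, d_model=128, nhead=8, num_layers=6, pred_len=12, noise_std=0.1):
        super().__init__()
        self.pred_len = pred_len
        self.noise_std = noise_std
        # One learned query embedding per future time step, consumed in parallel.
        self.queries = nn.Parameter(torch.randn(pred_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.coord_head = nn.Linear(d_model, 2)  # (x, y) prediction per step
        self.cls_head = nn.Linear(d_model, 1)    # auxiliary per-step accuracy logit

    def forward(self, memory):  # memory: (B, T_obs, d_model) from the encoder
        B = memory.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        if self.training:
            # Random deviation query: perturb the queries during training so the
            # decoder learns to correct small deviations (self-correcting ability).
            q = q + self.noise_std * torch.randn_like(q)
        # Attention mask: query t attends only to queries <= t, reconciling the
        # parallel input queries with sequential prediction.
        mask = torch.triu(
            torch.full((self.pred_len, self.pred_len), float("-inf"), device=memory.device),
            diagonal=1,
        )
        h = self.decoder(q, memory, tgt_mask=mask)
        return self.coord_head(h), self.cls_head(h)

def co_training_loss(coords, logits, target, add=0.3, weight=1.0):
    """Co-training objective: regression loss plus classification loss, where a
    step counts as accurate if its error falls within the accuracy
    discrimination distance (ADD; cf. Section 4.3.3)."""
    err = torch.norm(coords - target, dim=-1)              # (B, pred_len) L2 errors
    labels = (err.detach() < add).float()                  # 1 if within ADD, else 0
    cls = F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels)
    return err.mean() + weight * cls
```

At inference the perturbation is disabled (`model.eval()`), so the attention mask alone governs how the parallel queries interact.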
2. Related Work
3. Materials and Methods
3.1. Problem Formulation
3.2. Encoder-Decoder Transformer
3.3. Random Deviation Query
3.4. Final Objective
4. Results
4.1. Experiment Setup
- Datasets: Following related prior research, we evaluate the proposed method on two public datasets: ETH [40] and UCY [41]. Together these datasets contain 5 video sequences (Hotel, ETH, UCY, ZARA1, and ZARA2) with 1536 pedestrians in total, exhibiting different movement patterns and social interactions: people walk in parallel, move in groups, turn corners, and avoid collisions when walking face-to-face. These are common scenarios that involve social behaviors. The sequences are recorded at 25 frames per second (fps) and contain 4 different scene backgrounds.
- Metrics: Average Displacement Error (ADE), the mean square error over all estimated points of the predicted trajectory and the ground-truth trajectory; and Final Displacement Error (FDE), the distance between the predicted final destination and the ground-truth final destination. They can be mathematically defined as follows.
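Writing $\hat{Y}_i^{t}$ for the predicted and $Y_i^{t}$ for the ground-truth position of pedestrian $i$ at time stamp $t$, with $N$ pedestrians, an observation horizon $T_{obs}$, and a prediction horizon $T_{pred}$ (notation assumed here), the standard definitions are:

$$\mathrm{ADE} = \frac{1}{N \, T_{pred}} \sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{obs}+T_{pred}} \left\lVert \hat{Y}_i^{t} - Y_i^{t} \right\rVert_2, \qquad \mathrm{FDE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{Y}_i^{T_{obs}+T_{pred}} - Y_i^{T_{obs}+T_{pred}} \right\rVert_2 .$$

A minimal NumPy sketch of both metrics, assuming predictions and ground truth arrive as (N, T_pred, 2) arrays of world coordinates in metres (the array layout is our assumption):

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """ADE/FDE for trajectories shaped (N, T_pred, 2)."""
    dist = np.linalg.norm(pred - gt, axis=-1)              # per-pedestrian, per-step L2 error
    return float(dist.mean()), float(dist[:, -1].mean())   # ADE over all steps, FDE at last step
```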
4.2. Experiment on ETH and UCY Dataset
4.3. Ablation Study
4.3.1. Effect of Different Numbers of Encoder-Decoder Blocks
4.3.2. Effect of Different Key Parameter Values
4.3.3. Effect of Different Accuracy Discrimination Distances
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Li, F.; Savarese, S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
- Zhang, P.; Ouyang, W.; Zhang, P.; Xue, J.; Zheng, N. SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Bisagno, N.; Zhang, B.; Conci, N. Group LSTM: Group Trajectory Prediction in Crowded Scenarios; Springer: Cham, Switzerland, 2018.
- Huynh, M.; Alaghband, G. Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM. In Proceedings of the International Symposium on Visual Computing, Lake Tahoe, NV, USA, 7–9 October 2019.
- Manh, H.; Alaghband, G. Scene-LSTM: A Model for Human Trajectory Prediction. arXiv 2018, arXiv:1808.04018.
- Chandra, R.; Guan, T.; Panuganti, S.; Mittal, T.; Bhattacharya, U.; Bera, A.; Manocha, D. Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in Graph-LSTMs. arXiv 2019, arXiv:1912.01118.
- Tao, C.; Jiang, Q.; Duan, L.; Luo, P. Dynamic and Static Context-Aware LSTM for Multi-Agent Motion Prediction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
- Cheng, Q.; Wang, C. A Method of Trajectory Prediction Based on Kalman Filtering Algorithm and Support Vector Machine Algorithm. In Proceedings of the 2017 Chinese Intelligent Systems Conference (CISC), Mudanjiang, China, 14–15 October 2017; pp. 495–504.
- Chen, F.; Chen, Z.; Biswas, S.; Lei, S.; Ramakrishnan, N.; Lu, C. Graph Convolutional Networks with Kalman Filtering for Traffic Prediction. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems (SIGSPATIAL), Seattle, WA, USA, 3–6 November 2020.
- Dendorfer, P.; Ošep, A.; Leal-Taixé, L. Goal-GAN: Multimodal Trajectory Prediction Based on Goal Position Estimation. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
- Sadeghian, A.; Kosaraju, V.; Sadeghian, A.; Hirose, N.; Savarese, S. SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. GD-GAN: Generative Adversarial Networks for Trajectory Prediction and Group Detection in Crowds. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018.
- Amirian, J.; Hayet, J.-B.; Pettré, J. Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019.
- Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
- Haddad, S.; Wu, M.; Wei, H.; Lam, S.K. Situation-Aware Pedestrian Trajectory Prediction with Spatio-Temporal Attention Model. In Proceedings of the 24th Computer Vision Winter Workshop (CVWW), Stift Vorau, Austria, 6–8 February 2019.
- Yu, J.; Zhou, M.; Wang, X.; Pu, G.; Cheng, C.; Chen, B. A Dynamic and Static Context-Aware Attention Network for Trajectory Prediction. ISPRS Int. J. Geo-Inf. 2021, 10, 336.
- Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection. Neural Netw. 2018, 108, 466–478.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
- Fan, Z.; Gong, Y.; Liu, D.; Wei, Z.; Wang, S.; Jiao, J.; Duan, N.; Zhang, R.; Huang, X. Mask Attention Networks: Rethinking and Strengthen Transformer. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Online, 6–11 June 2021; pp. 1692–1701.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Chen, X.; Wu, Y.; Wang, Z.; Liu, S.; Li, J. Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset. arXiv 2020, arXiv:2010.11395.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image Transformer. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 4052–4061.
- Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888.
- Gulati, A.; Qin, J.; Chiu, C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. Proc. Interspeech 2020, 5036–5040.
- Giuliari, F.; Hasan, I.; Cristani, M.; Galasso, F. Transformer Networks for Trajectory Forecasting. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021.
- Seitz, M.J.; Dietrich, F.; Köster, G. The Effect of Stepping on Pedestrian Trajectories. Phys. A Stat. Mech. Its Appl. 2015, 421, 594–604.
- Caramuta, C.; Collodel, G.; Giacomini, C.; Gruden, C.; Longo, G.; Piccolotto, P. Survey of Detection Techniques, Mathematical Models and Simulation Software in Pedestrian Dynamics. Transp. Res. Procedia 2017, 25, 551–567.
- Boltes, M.; Seyfried, A. Collecting Pedestrian Trajectories. Neurocomputing 2013, 100, 127–133.
- Gruden, C.; Campisi, T.; Canale, A.; Tesoriere, G.; Sraml, M. A Cross-Study on Video Data Gathering and Microsimulation Techniques to Estimate Pedestrian Safety Level in a Confined Space. IOP Conf. Ser. Mater. Sci. Eng. 2019, 603, 042008.
- Ma, W.C.; Huang, D.A.; Lee, N.; Kitani, K.M. Forecasting Interactive Dynamics of Pedestrians with Fictitious Play. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-BiGAT: Multimodal Trajectory Forecasting Using Bicycle-GAN and Graph Attention Networks. arXiv 2019, arXiv:1907.03395.
- Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Multi-Agent Generative Trajectory Forecasting with Heterogeneous Data for Control. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Kothari, P.; Kreiss, S.; Alahi, A. Human Trajectory Forecasting in Crowds: A Deep Learning Perspective. IEEE Trans. Intell. Transp. Syst. 2021.
- Xue, H.; Huynh, D.Q.; Reynolds, M. A Location-Velocity-Temporal Attention LSTM Model for Pedestrian Trajectory Prediction. IEEE Access 2020, 8, 44576–44589.
- Yu, C.; Ma, X.; Ren, J.; Zhao, H.; Yi, S. Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020.
- Xu, Y.; Piao, Z.; Gao, S. Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Yi, S.; Li, H.; Wang, X. Understanding Pedestrian Behaviors from Stationary Crowd Groups. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 11–13 April 2011.
- Pellegrini, S.; Ess, A.; Schindler, K.; Van Gool, L. You’ll Never Walk Alone: Modeling Social Behavior for Multi-Target Tracking. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, Japan, 27 September–4 October 2009; pp. 261–268.
- Lerner, A.; Chrysanthou, Y.; Lischinski, D. Crowds by Example. In Computer Graphics Forum; Wiley: Hoboken, NJ, USA, 2007; Volume 26, pp. 655–664.
Symbol | Representation
---|---
t | time stamp (frame)
N | total number of pedestrians
 | the trajectory of a pedestrian
 | the start time stamp and the end time stamp
 | the trajectory of a pedestrian from the start time stamp to the end time stamp
 | the spatial location of a pedestrian at time stamp t
 | observed trajectory
 | predicted trajectory
 | ground-truth future trajectories
 | displacement vector
 | weights matrix
 | input embedding data
 | temporal position embedding
 | spatial trajectory embedding
Q | query embedding inputs
K | key embedding inputs
V | value embedding inputs
W^Q, W^K, W^V, W^O | weight matrices for queries, keys, values, and output
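The Q, K, and V entries follow the scaled dot-product attention of the standard transformer [18]; for reference, in the usual notation (with $d_k$ the key dimension):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad \mathrm{head}_j = \mathrm{Attention}\left(QW_j^{Q},\, KW_j^{K},\, VW_j^{V}\right),$$

where the per-head projections $W_j^{Q}, W_j^{K}, W_j^{V}$ and the output projection $W^{O}$ are the weight matrices listed in the last row of the table.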
Performance (ADE/FDE)

Method | ETH | Hotel | UCY | Zara1 | Zara2 | Average
---|---|---|---|---|---|---
SGAN [14] | 1.13/2.21 | 1.01/2.18 | 0.60/1.28 | 0.42/0.91 | 0.52/1.11 | 0.74/1.54
Social LSTM [14] | 1.09/2.35 | 0.79/1.76 | 0.67/1.40 | 0.47/1.00 | 0.56/1.17 | 0.72/1.54
S-Attention [32] | 0.39/3.74 | 0.29/2.64 | 0.20/0.52 | 0.30/2.13 | 0.33/3.92 | 0.30/2.35
Trajectron++ [33] | 0.50/1.19 | 0.24/0.59 | 0.36/0.89 | 0.29/0.72 | 0.27/0.67 | 0.34/0.84
LSTM [14] | 1.09/2.94 | 0.86/1.91 | 0.61/1.31 | 0.41/0.88 | 0.52/1.11 | 0.70/1.62
TF [26] | 1.03/2.10 | 0.36/0.71 | 0.53/1.32 | 0.44/1.00 | 0.34/0.76 | 0.54/1.17
Ours | 0.98/2.00 | 0.33/0.65 | 0.53/1.16 | 0.40/0.88 | 0.31/0.68 | 0.51/1.07
Performance (ADE/FDE)

Method | ETH | Hotel | UCY | Zara1 | Zara2 | Average
---|---|---|---|---|---|---
SGAN [14] | 0.87/1.62 | 0.67/1.37 | 0.76/1.52 | 0.35/0.68 | 0.42/0.84 | 0.61/1.21
Sophie [11] | 0.70/1.43 | 0.76/1.67 | 0.54/1.24 | 0.30/0.63 | 0.38/0.78 | 0.54/1.15
Social-BiGAT [32] | 0.69/1.29 | 0.49/1.01 | 0.55/1.32 | 0.30/0.62 | 0.36/0.75 | 0.48/1.00
Trajectron++ [33] | 0.35/0.77 | 0.18/0.38 | 0.22/0.48 | 0.14/0.28 | 0.14/0.30 | 0.21/0.45
SGAN-ind [14] | 0.81/1.52 | 0.72/1.61 | 0.60/1.26 | 0.34/0.69 | 0.42/0.84 | 0.58/1.18
TF [26] | 0.61/1.12 | 0.18/0.30 | 0.35/0.65 | 0.22/0.38 | 0.17/0.32 | 0.31/0.55
Ours | 0.49/0.82 | 0.17/0.27 | 0.34/0.61 | 0.22/0.38 | 0.13/0.30 | 0.27/0.48
Performance (ADE/FDE)

Blocks | ETH | Hotel | UCY | Zara1 | Zara2 | Average
---|---|---|---|---|---|---
2 | 1.021/2.092 | 0.336/0.645 | 0.554/1.224 | 0.418/0.927 | 0.329/0.746 | 0.532/1.127
4 | 1.023/2.121 | 0.348/0.668 | 0.543/1.199 | 0.414/0.916 | 0.321/0.717 | 0.530/1.125
6 | 0.987/2.005 | 0.337/0.650 | 0.537/1.168 | 0.404/0.886 | 0.311/0.688 | 0.515/1.079
8 | 0.976/1.953 | 0.338/0.654 | 0.535/1.156 | 0.408/0.898 | 0.314/0.687 | 0.514/1.070
10 | 0.981/1.916 | 0.316/0.593 | 0.540/1.164 | 0.406/0.886 | 0.321/0.711 | 0.513/1.054
Performance (ADE/FDE)

Key parameter | ETH | Hotel | UCY | Zara1 | Zara2 | Average | ADE + FDE
---|---|---|---|---|---|---|---
1 | 1.021/2.005 | 0.380/0.788 | 0.536/1.161 | 0.429/0.922 | 0.316/0.704 | 0.536/1.116 | 1.64
10 | 1.016/2.113 | 0.342/0.660 | 0.531/1.153 | 0.406/0.885 | 0.313/0.686 | 0.522/1.100 | 1.62
30 | 0.991/2.037 | 0.352/0.685 | 0.538/1.176 | 0.413/0.911 | 0.314/0.694 | 0.521/1.101 | 1.62
50 | 0.987/2.005 | 0.337/0.650 | 0.537/1.168 | 0.404/0.886 | 0.311/0.688 | 0.515/1.079 | 1.58
70 | 1.008/2.083 | 0.340/0.668 | 0.547/1.178 | 0.423/0.938 | 0.336/0.732 | 0.531/1.120 | 1.65
100 | 1.038/2.178 | 0.337/0.661 | 0.570/1.219 | 0.427/0.940 | 0.323/0.712 | 0.539/1.142 | 1.67
Performance (ADE/FDE)

ADD | ETH | Hotel | UCY | Zara1 | Zara2 | Average
---|---|---|---|---|---|---
0.01 | 1.029/2.099 | 0.350/0.698 | 0.532/1.158 | 0.464/1.028 | 0.329/0.721 | 0.541/1.141
0.03 | 1.007/2.040 | 0.344/0.679 | 0.538/1.178 | 0.471/1.031 | 0.315/0.696 | 0.535/1.125
0.05 | 1.036/2.097 | 0.374/0.770 | 0.533/1.171 | 0.400/0.878 | 0.332/0.738 | 0.535/1.131
0.1 | 1.002/2.016 | 0.358/0.702 | 0.527/1.148 | 0.402/0.880 | 0.319/0.697 | 0.522/1.089
0.3 | 0.987/2.005 | 0.337/0.650 | 0.537/1.168 | 0.404/0.886 | 0.311/0.688 | 0.515/1.079
0.5 | 0.998/2.024 | 0.337/0.651 | 0.540/1.174 | 0.405/0.888 | 0.321/0.721 | 0.520/1.092
0.7 | 1.037/2.153 | 0.332/0.646 | 0.581/1.237 | 0.410/0.905 | 0.330/0.740 | 0.538/1.136
1 | 1.033/2.141 | 0.346/0.677 | 0.569/1.246 | 0.413/0.917 | 0.325/0.726 | 0.537/1.141
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).