HiPro-AD: Sparse Trajectory Transformer for End-to-End Autonomous Driving with Hybrid Spatiotemporal Attention
Abstract
1. Introduction
- Efficient Feature Extraction Network: an IM-ResNet-34 backbone that incorporates depthwise separable convolutions and Efficient Channel Attention (ECA), substantially reducing computational overhead while maintaining feature quality (a minimal sketch of these two building blocks follows this list).
- STFormer: a novel sparse transformer that iteratively refines trajectory proposals via proposal-anchored deformable self-attention, explicit temporal fusion from a BEV memory bank, and geometry-constrained spatial cross-attention (see the sampling sketch after this list). Combined with a Top-k multi-modal regression loss, STFormer removes the need for dense intermediate representations and trains stably without closed-loop rollouts.
- Pairwise Ranking Scorer: a lightweight scorer that directly optimizes relative proposal quality against simulation-derived composite metrics, enabling precise selection of the best trajectory from diverse high-quality candidates and improving interpretability (a sketch of the ranking objective follows this list).
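To make the backbone contribution concrete, the sketch below shows the two building blocks it names: a depthwise separable convolution gated by Efficient Channel Attention [26]. This is a minimal PyTorch illustration; the module names, channel sizes, and the exact placement of these blocks inside IM-ResNet-34 are our assumptions, not the paper's released configuration.

```python
import math
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention [26]: global average pooling followed by
    a cheap 1D convolution across the channel axis (no dimensionality
    reduction, negligible parameter overhead)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size heuristic from the ECA-Net paper.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.mean(dim=(2, 3), keepdim=True)          # (B, C, 1, 1) descriptor
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # 1D conv over channels
        y = y.transpose(1, 2).unsqueeze(-1)           # back to (B, C, 1, 1)
        return x * torch.sigmoid(y)                   # channel-wise gating


class DWSeparableECABlock(nn.Module):
    """Hypothetical IM-ResNet-34 block: depthwise 3x3 + pointwise 1x1
    convolution (far fewer FLOPs than a dense 3x3 conv), then ECA gating."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.eca = ECA(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.eca(torch.relu(self.bn(self.pointwise(self.depthwise(x)))))
```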
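The sparse attention inside STFormer can be illustrated by proposal-anchored deformable sampling over a BEV feature map: each proposal query predicts a handful of bounded offsets around its anchor, bilinearly samples the map at those locations, and fuses the samples with learned weights. The sketch below is a simplified, hedged rendering of that mechanism; the tensor layout, the 0.1 offset bound, and the module name are assumptions, and the full STFormer additionally performs self-attention among proposals, temporal fusion from the BEV memory bank, and geometry-constrained cross-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalDeformableAttn(nn.Module):
    """Each trajectory proposal samples the BEV map at a few learned offsets
    around its anchor point and aggregates them with softmax weights."""

    def __init__(self, dim: int = 256, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Linear(dim, n_points * 2)  # (dx, dy) per point
        self.weight_head = nn.Linear(dim, n_points)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, anchors, bev):
        # queries: (B, N, C); anchors: (B, N, 2) in [-1, 1] BEV coordinates;
        # bev: (B, C, H, W) dense bird's-eye-view feature map.
        B, N, C = queries.shape
        offsets = 0.1 * torch.tanh(self.offset_head(queries))   # small, bounded
        locs = anchors.unsqueeze(2) + offsets.view(B, N, self.n_points, 2)
        # grid_sample expects a (B, H_out, W_out, 2) grid in [-1, 1].
        sampled = F.grid_sample(bev, locs, align_corners=False)  # (B, C, N, P)
        weights = self.weight_head(queries).softmax(-1)          # (B, N, P)
        fused = (sampled.permute(0, 2, 3, 1) * weights.unsqueeze(-1)).sum(2)
        return self.out_proj(fused)                              # (B, N, C)
```

Because features are read only at N × N_ref sampled locations rather than over the full H × W grid, the cost scales with the number of proposals, which is what allows a proposal-centric design to skip dense intermediate representations.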
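The scorer's pairwise objective can be sketched as a logistic ranking loss over all ordered proposal pairs, using simulation-derived composite metrics as targets. This minimal version assumes one scalar target per proposal; the paper's exact pair construction, weighting, and metric composition may differ.

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(scores: torch.Tensor,
                          targets: torch.Tensor) -> torch.Tensor:
    """For every ordered pair (i, j) where proposal i has a strictly higher
    simulation-derived metric than proposal j, penalize the scorer unless
    score_i exceeds score_j. scores, targets: (N,) for one scene."""
    diff_s = scores.unsqueeze(1) - scores.unsqueeze(0)    # (N, N) score gaps
    diff_t = targets.unsqueeze(1) - targets.unsqueeze(0)  # (N, N) metric gaps
    mask = diff_t > 0                                     # pairs with a clear winner
    if not mask.any():
        return scores.new_zeros(())
    # softplus(-x) = -log sigmoid(x): binary cross-entropy per comparison.
    return F.softplus(-diff_s[mask]).mean()
```

Unlike per-proposal binary cross-entropy (the "BCE" rows in the ablation table), a loss of this form optimizes only the relative ordering of candidates, which is what trajectory selection actually requires.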
2. Related Work
2.1. End-to-End Autonomous Driving
2.2. Attention Mechanism
3. Methods
3.1. Scene Encoder
3.2. STFormer
3.3. Scorer
4. Experiments
4.1. NAVSIM Benchmark
4.2. Bench2Drive Benchmark
4.3. Ablation Studies
4.4. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Le Mero, L.; Yi, D.; Dianati, M.; Mouzakitis, A. A survey on imitation learning techniques for end-to-end autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 14128–14147.
- Chen, J.; Li, S.E.; Tomizuka, M. Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5068–5078.
- Coelho, D.; Oliveira, M. A review of end-to-end autonomous driving in urban environments. IEEE Access 2022, 10, 75296–75311.
- Zhu, K.; Wang, Z.; Li, Z.; Xu, C.Z. Secure observer-based collision-free control for autonomous vehicles under non-Gaussian noises. IEEE Trans. Ind. Inform. 2024, 21, 2184–2193.
- Zhu, K.; Wang, Z.; Ding, D.; Hu, J.; Dong, H. Cloud-based collision avoidance adaptive cruise control for autonomous vehicles under external disturbances with token bucket shapers. IEEE Trans. Ind. Inform. 2025, 21, 8759–8769.
- Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
- Jiang, B.; Chen, S.; Xu, Q.; Liao, B.; Chen, J.; Zhou, H.; Zhang, Q.; Liu, W.; Huang, C.; Wang, X. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023.
- Liao, B.; Chen, S.; Yin, H.; Jiang, B.; Wang, C.; Yan, S.; Zhang, X.; Li, X.; Zhang, Y.; Zhang, Q.; et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
- Zhang, Z.; Liniger, A.; Dai, D.; Yu, F.; Van Gool, L. End-to-end urban driving by imitating a reinforcement learning coach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
- Hu, A.; Corrado, G.; Griffiths, N.; Murez, Z.; Gurau, C.; Yeo, H.; Kendall, A.; Cipolla, R.; Shotton, J. Model-based imitation learning for urban driving. In Proceedings of the Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 28 November–9 December 2022; pp. 20703–20716.
- Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12878–12895.
- Wu, P.; Jia, X.; Chen, L.; Yan, J.; Li, H.; Qiao, Y. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In Proceedings of the Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 28 November–9 December 2022; pp. 6119–6132.
- Wang, X.; Zhu, Z.; Huang, G.; Chen, X.; Zhu, J.; Lu, J. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv 2023, arXiv:2309.09777.
- Xu, H.; Gao, Y.; Yu, F.; Darrell, T. End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Prakash, A.; Chitta, K.; Geiger, A. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J.A.; Abbeel, P.; Peters, J. An algorithmic perspective on imitation learning. Found. Trends® Robot. 2018, 7, 1–179.
- Codevilla, F.; Müller, M.; Lopez, A.; Koltun, V.; Dosovitskiy, A. End-to-end driving via conditional imitation learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018.
- Thrun, S.; Littman, M.L. Reinforcement learning: An introduction. AI Mag. 2000, 21, 103.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
- Liang, X.; Wang, T.; Yang, L.; Xing, E. CIRL: Controllable imitative reinforcement learning for vision-based self-driving. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514.
- Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021.
- Dauner, D.; Hallgarten, M.; Li, T.; Weng, X.; Huang, Z.; Yang, Z.; Li, H.; Gilitschenski, I.; Ivanovic, B.; Pavone, M. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In Proceedings of the 38th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024.
- Guo, K.; Liu, H.; Wu, X.; Pan, J.; Lv, C. iPad: Iterative proposal-centric end-to-end autonomous driving. arXiv 2025, arXiv:2505.15111.
- Dauner, D.; Hallgarten, M.; Geiger, A.; Chitta, K. Parting with misconceptions about learning-based vehicle motion planning. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023.
- Chen, S.; Jiang, B.; Gao, H.; Liao, B.; Xu, Q.; Zhang, Q.; Huang, C.; Liu, W.; Wang, X. VADv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv 2024, arXiv:2402.13243.
- Yuan, C.; Zhang, Z.; Sun, J.; Sun, S.; Huang, Z.; Lee, C.D.W.; Li, D.; Han, Y.; Wong, A.; Tee, K.P.; et al. DRAMA: An efficient end-to-end motion planner for autonomous driving with Mamba. arXiv 2024, arXiv:2408.03601.
- Weng, X.; Ivanovic, B.; Wang, Y.; Wang, Y.; Pavone, M. PARA-Drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15449–15458.
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017.
- Jia, X.; Yang, Z.; Li, Q.; Zhang, Z.; Yan, J. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In Proceedings of the Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024.
- Zhai, J.T.; Feng, Z.; Du, J.; Mao, Y.; Liu, J.J.; Tan, Z.; Zhang, Y.; Ye, X.; Wang, J. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes. arXiv 2023, arXiv:2305.10430.
- Jia, X.; You, J.; Zhang, Z.; Yan, J. DriveTransformer: Unified transformer for scalable end-to-end autonomous driving. arXiv 2025, arXiv:2503.07656.
| Learning Mode | Representative Methods | Advantages | Limitations |
|---|---|---|---|
| Imitation Learning (IL) | UniAD/VAD [6,7] | Unified Optimization: Jointly optimizes perception, prediction, and planning to mitigate cascading errors and information loss. | Resource Intensity: High computational complexity and inference latency hinder real-time deployment on edge devices. |
| | TransFuser [13] | Data Scalability: Directly utilizes large-scale expert demonstrations. | Causal Confusion: Prone to learning spurious correlations (e.g., background bias). |
| Reinforcement Learning (RL) | PPO [9] | Long-horizon Planning: Optimizes for long-term cumulative rewards. | Sample Inefficiency: Requires extensive interactions for convergence. |
| | SAC [10] | Super-human Potential: Explores novel strategies without reliance on human labels. | Reality Gap: Difficult to transfer simulation-trained policies to the real world safely. |
| Knowledge Distillation | Roach [11] | Feature Enhancement: Student models acquire robust representations from privileged teachers. | Pipeline Complexity: Involves a convoluted multi-stage training protocol. |
| | TCP [14] | Inference Efficiency: Achieves high performance with limited sensor inputs. | Oracle Dependency: Strictly relies on ground-truth states available only in simulators. |
| World Models | MILE [12] | Spatiotemporal Modeling: Deep understanding of scene dynamics and future states. | Computational Cost: High resource demands for both training and inference. |
| | DriveDreamer [15] | Self-Supervision: Learns from massive unlabeled video data. | Physical Inconsistency: Risk of generative hallucinations that may violate physical laws or geometric constraints. |
| Attention Mechanism | Properties | Adoption | Rationale |
|---|---|---|---|
| SENet/GSoP-Net [24,25] | Global pooling; 2nd-order stats | No | Inefficient: High computational cost or parameter overhead compared to ECA. |
| ECA-Net [26] | Local 1D cross-channel interaction | Yes | Efficient: Enhances channel semantics with negligible overhead; ideal for our lightweight encoder. |
| STN [27] | Explicit spatial transformation | No | Rigid: Limited flexibility compared to modern deformable sampling. |
| Non-Local/ViT [28,29] | Global dense self-attention | No | High Latency: Quadratic complexity on dense grids makes real-time planning infeasible. |
| Swin Transformer [30] | Hierarchical window-based attention | No | Dense: Still processes dense regions; incompatible with our proposal-centric sparse paradigm. |
| Deformable self-attention | Sparse adaptive point sampling | Yes | Sparse: Focuses computation strictly on trajectory proposals, ignoring irrelevant background. |
| CBAM/BAM [31,32] | Serial/Parallel channel-spatial fusion | No | Redundant: Complex multi-branch designs increase latency without proportional gains for our task. |
| Triplet Attention [33] | Cross-dimension interaction | No | Redundant: Our proposal anchors and positional encodings (PE) already explicitly model geometry and position. |
| Hyperparameter | Value |
|---|---|
| Proposal number N | 64 |
| Iteration number K | 4 |
| Planning time step interval | 0.5 s |
| Channel dimension C | 256 |
| Hidden size | 256 |
| Feed-forward size | 1024 |
| Pillar reference point number N_ref | 4 |
| Proposal loss discount λ | 0.1 |
| NAVSIM future planning horizon T | 8 |
| NAVSIM image input down-sample rate | 0.4 |
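For readers wiring these values into code, the table maps onto a configuration object such as the hypothetical dataclass below; the class and field names are our choices, not those of a released implementation.

```python
from dataclasses import dataclass


@dataclass
class HiProADConfig:
    """Hypothetical container mirroring the hyperparameter table above."""
    num_proposals: int = 64         # N
    num_iterations: int = 4         # K refinement steps in STFormer
    step_interval_s: float = 0.5    # planning time step interval (seconds)
    channel_dim: int = 256          # C
    hidden_size: int = 256
    ffn_size: int = 1024
    num_ref_points: int = 4         # pillar reference points N_ref
    proposal_loss_discount: float = 0.1  # lambda
    planning_horizon: int = 8       # NAVSIM future planning steps T
    img_downsample: float = 0.4     # NAVSIM image input down-sample rate
```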
| Method | Input | NC | DAC | TTC | Comf. | EP | PDMS |
|---|---|---|---|---|---|---|---|
| PDM-Closed [36] (Rule-based) | Perception GT | 94.6 | 99.8 | 86.9 | 99.9 | 89.9 | 89.1 |
| VADV2-V8192 [37] | Camera & Lidar | 97.2 | 89.1 | 91.6 | 100 | 76.0 | 80.9 |
| Transfuser [13] | Camera & Lidar | 97.7 | 92.8 | 92.8 | 100 | 79.2 | 84.0 |
| DRAMA [38] | Camera & Lidar | 98.0 | 93.1 | 94.8 | 100 | 80.1 | 85.5 |
| DiffusionDrive [8] | Camera & Lidar | 98.2 | 96.2 | 94.7 | 100 | 82.2 | 88.1 |
| UniAD [6] | Camera | 97.8 | 91.9 | 92.9 | 100 | 78.8 | 83.4 |
| PARA-Drive [39] | Camera | 97.9 | 92.4 | 93.0 | 99.8 | 79.3 | 84.0 |
| HiPro-AD (Ours) | Camera | 98.6 | 98.7 | 95.3 | 100 | 89.2 | 92.6 |
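As context for the PDMS column: the predictive driver model score used by NAVSIM [34] gates a weighted average of the soft metrics (EP, TTC, Comfort) with the hard multiplicative penalties (NC, DAC). The sketch below assumes the 5/5/2 weighting reported in the benchmark paper. Note that PDMS is computed per scene and then averaged, so plugging the aggregated column values into this formula will not exactly reproduce the PDMS column.

```python
def pdm_score(nc: float, dac: float, ttc: float, comf: float, ep: float) -> float:
    """Per-scene composite score: hard penalties (No Collision, Drivable Area
    Compliance) multiply a weighted average of Ego Progress, Time-to-Collision,
    and Comfort. All sub-scores are assumed to lie in [0, 1]."""
    return nc * dac * (5.0 * ep + 5.0 * ttc + 2.0 * comf) / 12.0


# Example: perfect compliance, TTC, and comfort, but 80% ego progress.
print(pdm_score(nc=1.0, dac=1.0, ttc=1.0, comf=1.0, ep=0.8))  # ~0.917
```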
| Method | Latency | Avg. L2 (m, open-loop) | Efficiency (closed-loop) | Comfort (closed-loop) | Success Rate (%, closed-loop) | Driving Score (closed-loop) |
|---|---|---|---|---|---|---|
| AD-MLP [42] | 4 ms | 3.64 | 48.45 | 22.63 | 0.00 | 18.05 |
| UniAD-Tiny [6] | 445 ms | 0.80 | 123.92 | 47.04 | 13.18 | 40.73 |
| UniAD-Base [6] | 558 ms | 0.73 | 129.21 | 43.58 | 16.36 | 45.81 |
| VAD [7] | 359 ms | 0.91 | 157.94 | 46.01 | 15.00 | 42.35 |
| DriveTransformer [43] | 212 ms | 0.62 | 100.64 | 20.78 | 35.01 | 63.46 |
| HiPro-AD (Ours) | 67 ms | 0.75 | 159.31 | 32.19 | 37.31 | 65.48 |
| Backbone | Planner | Scorer | NC | DAC | TTC | EP | PDMS |
|---|---|---|---|---|---|---|---|
| ResNet-34 | BEVFormer | BCE | 97.6 | 93.0 | 92.9 | 68.9 | 78.5 |
| IM-ResNet-34 | BEVFormer | BCE | 98.0 | 94.9 | 93.8 | 77.5 | 83.6 |
| IM-ResNet-34 | STFormer | BCE | 98.4 | 97.2 | 94.8 | 86.2 | 89.4 |
| IM-ResNet-34 | STFormer | Pairwise | 98.6 | 98.7 | 95.3 | 89.2 | 92.6 |