Abstract
End-to-end (E2E) autonomous driving offers a promising alternative to traditional modular pipelines: by mapping raw sensor data directly to vehicle controls, it mitigates error propagation. However, prevalent approaches rely largely on dense Bird’s-Eye-View (BEV) feature maps, which incur high computational overhead and require complex post-processing for trajectory generation. To address these limitations, we propose HiPro-AD, a proposal-centric sparse E2E planning framework that departs from the dense BEV paradigm. HiPro-AD combines an efficiency-oriented IM-ResNet-34 encoder with a novel transformer, STFormer, which dynamically fuses multi-view spatial features and historical temporal context through a proposal-anchored mechanism, concentrating computation on the regions relevant to sparse trajectory proposals. A Pairwise Ranking Scorer then refines trajectory selection, identifying the best plan among the diverse candidates based on their relative quality. On the NAVSIM benchmark, HiPro-AD achieves a PDM Score (PDMS) of 92.6 using only camera input, surpassing prior dense BEV and multimodal methods. On the closed-loop Bench2Drive benchmark, it attains a 37.31% success rate and a driving score of 65.48 at a latency of 67 ms, demonstrating real-time capability. These results support the efficiency and robustness of the sparse paradigm in complex driving scenarios.