HBEVOcc: Height-Aware Bird’s-Eye-View Representation for 3D Occupancy Prediction from Multi-Camera Images
Abstract
1. Introduction
- We design HBEVOcc, a framework that combines a BEV representation with a novel height-aware deformable attention (HADA) module for 3D occupancy prediction. By exploiting the latent height information embedded in BEV features, HADA compensates for the missing vertical dimension of BEV representations, yielding a significant improvement in 3D occupancy prediction performance.
- Our proposed method learns 3D occupancy prediction from multi-camera images through both explicit and implicit view transformations (EVT and IVT). It enables efficient fusion of explicit, implicit, and multi-scale BEV features, significantly reducing the memory usage of 3D occupancy prediction whilst maintaining high performance. To further strengthen supervision along the height axis, we introduce a height-aware voxel loss (HAVL) with adaptive weighting over height.
- Through extensive experiments on the Occ3D-nuScenes and OpenOcc datasets, we demonstrate that HBEVOcc outperforms existing methods on this challenging task. Our results surpass not only BEV-based but also voxel-based methods, achieving a better trade-off between memory consumption and accuracy.
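The height-aware deformable attention named in the contributions can be illustrated with a minimal sketch. This is not the paper's implementation: the tensor sizes, the number of height anchors `Z`, the points-per-anchor `P`, and the random offsets and weights (which in practice would be predicted from the BEV query by linear layers) are illustrative assumptions, and nearest-neighbor sampling stands in for bilinear interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 8, 8, 16   # BEV grid size and channel dim (illustrative)
Z, P = 4, 2          # height anchors per cell, sampled points per anchor

bev = rng.standard_normal((H, W, C))  # input BEV feature map

def height_aware_deform_attn(bev, offsets, weights):
    """For each BEV cell and height anchor, sample P points at the
    predicted (dy, dx) offsets and combine them with softmax weights,
    producing a height-expanded feature volume of shape (H, W, Z, C)."""
    Hh, Ww, Cc = bev.shape
    out = np.zeros((Hh, Ww, Z, Cc))
    # softmax over the Z * P sampled points of each cell
    w = np.exp(weights - weights.max(axis=(2, 3), keepdims=True))
    w = w / w.sum(axis=(2, 3), keepdims=True)
    for y in range(Hh):
        for x in range(Ww):
            for z in range(Z):
                for p in range(P):
                    dy, dx = offsets[y, x, z, p]
                    # nearest-neighbor sampling, clipped to the grid
                    sy = int(np.clip(np.rint(y + dy), 0, Hh - 1))
                    sx = int(np.clip(np.rint(x + dx), 0, Ww - 1))
                    out[y, x, z] += w[y, x, z, p] * bev[sy, sx]
    return out

# In the real module these would be predicted from the BEV query by
# linear layers; random values here only exercise the data flow.
offsets = rng.standard_normal((H, W, Z, P, 2))
weights = rng.standard_normal((H, W, Z, P))
vox = height_aware_deform_attn(bev, offsets, weights)
print(vox.shape)  # (8, 8, 4, 16)
```

The key point the sketch captures is that each 2D BEV cell emits `Z` distinct outputs, one per height anchor, which is what recovers a voxel-shaped feature volume from a flat BEV map.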
2. Related Work
2.1. Vision-Based 3D Occupancy Prediction
2.2. 3D Semantic Scene Completion
2.3. BEV-Based 3D Scene Representation
3. Proposed Method
3.1. Problem Formulation
3.2. Overview
3.3. Explicit View Transformation
3.4. Implicit View Transformation
3.5. Height-Aware Deformable Attention

3.6. 3D Occupancy Prediction Head
3.7. Height-Aware Voxel Loss
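As summarized in the contributions, the height-aware voxel loss applies adaptive weighting along the height axis so that the sparse voxels above the ground plane are not dominated by the ground surface during training. A minimal sketch, assuming a simple linear weighting w_z = 1 + α·z/(Z−1) over a per-voxel softmax cross-entropy (the paper's adaptive scheme may differ):

```python
import numpy as np

def height_aware_voxel_loss(logits, target, alpha=0.5):
    """Per-voxel softmax cross-entropy weighted along the height axis,
    so voxels higher above the ground receive stronger supervision.
    logits: (H, W, Z, K); target: (H, W, Z) integer class labels.
    The linear weighting w_z = 1 + alpha * z / (Z - 1) is an assumed
    stand-in for the paper's adaptive scheme."""
    Z = logits.shape[2]
    # numerically stable log-softmax over the class axis
    m = logits.max(axis=-1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(-1, keepdims=True))
    ce = -np.take_along_axis(logp, target[..., None], axis=-1)[..., 0]
    w = 1.0 + alpha * np.arange(Z) / (Z - 1)  # grows with height index z
    return float((ce * w).mean())

rng = np.random.default_rng(1)
logits = rng.standard_normal((4, 4, 8, 18))  # 18 classes, as in Occ3D-nuScenes
target = rng.integers(0, 18, size=(4, 4, 8))
loss = height_aware_voxel_loss(logits, target)
print(loss > 0)  # True
```

With alpha = 0 the loss reduces to an unweighted voxel cross-entropy, so the height weighting can be ablated with a single parameter.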
3.8. Model Optimization
4. Experiments
4.1. Dataset
4.2. Experimental Settings
Implementation Details
4.3. Evaluation Metrics
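The result tables report mIoU over the semantic classes (and RayIoU for the ray-based protocol). As a reference for how per-class IoU and its mean are computed, a minimal sketch on flattened label arrays:

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU: for each class c, IoU_c = |pred ∩ gt| / |pred ∪ gt|,
    averaged over classes that appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:          # class absent from both: skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2])
gt = np.array([0, 1, 1, 1, 2])
# class 0: 1/2, class 1: 2/3, class 2: 1/1 -> mean = 13/18
print(round(miou(pred, gt, 3), 4))  # 0.7222
```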
Training
4.4. Main Results
4.4.1. 3D Occupancy Prediction Results on Occ3D-nuScenes
4.4.2. 3D Occupancy Prediction Results on OpenOcc
4.5. Ablation Study
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
- Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–18. [Google Scholar]
- Li, Y.; Bao, H.; Ge, Z.; Yang, J.; Sun, J.; Li, Z. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2023; Volume 37, pp. 1486–1494. [Google Scholar]
- Huang, J.; Huang, G. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
- Wu, X.; Ma, D.; Qu, X.; Jiang, X.; Zeng, D. Depth dynamic center difference convolutions for monocular 3D object detection. Neurocomputing 2023, 520, 73–81. [Google Scholar]
- Tang, Y.; He, H.; Wang, Y.; Mao, Z.; Wang, H. Multi-modality 3D object detection in autonomous driving: A review. Neurocomputing 2023, 553, 126587. [Google Scholar] [CrossRef]
- Zhao, T.; Chen, Y.; Wu, Y.; Liu, T.; Du, B.; Xiao, P.; Qiu, S.; Yang, H.; Li, G.; Yang, Y.; et al. Improving Bird’s Eye View Semantic Segmentation by Task Decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 15512–15521. [Google Scholar]
- Xu, Z.; Li, S.; Peng, L.; Jiang, B.; Huang, R.; Chen, Y. Ultra-fast semantic map perception model for autonomous driving. Neurocomputing 2024, 599, 128162. [Google Scholar] [CrossRef]
- Agarwal, A.; Arora, C. Attention attention everywhere: Monocular depth prediction with skip attention. In IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 5861–5870. [Google Scholar]
- Masoumian, A.; Rashwan, H.A.; Abdulwahab, S.; Cristiano, J.; Asif, M.S.; Puig, D. GCNDepth: Self-supervised monocular depth estimation based on graph convolutional network. Neurocomputing 2023, 517, 81–92. [Google Scholar] [CrossRef]
- Zhao, G.; Wei, H.; He, H. IAFMVS: Iterative Depth Estimation with Adaptive Features for Multi-View Stereo. Neurocomputing 2025, 629, 129682. [Google Scholar] [CrossRef]
- Hu, A.; Murez, Z.; Mohan, N.; Dudas, S.; Hawke, J.; Badrinarayanan, V.; Cipolla, R.; Kendall, A. Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 15273–15282. [Google Scholar]
- Xu, H.; Chen, J.; Meng, S.; Wang, Y.; Chau, L.P. A survey on occupancy perception for autonomous driving: The information fusion perspective. Inf. Fusion 2025, 114, 102671. [Google Scholar] [CrossRef]
- Tian, X.; Jiang, T.; Yun, L.; Mao, Y.; Yang, H.; Wang, Y.; Wang, Y.; Zhao, H. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Adv. Neural Inf. Process. Syst. 2024, 36, 64318–64330. [Google Scholar]
- Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-perspective view for vision-based 3d semantic occupancy prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 9223–9232. [Google Scholar]
- Wei, Y.; Zhao, L.; Zheng, W.; Zhu, Z.; Zhou, J.; Lu, J. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 21729–21740. [Google Scholar]
- Zhang, Y.; Zhu, Z.; Du, D. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 9433–9443. [Google Scholar]
- Lu, Y.; Zhu, X.; Wang, T.; Ma, Y. Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries. Adv. Neural Inf. Process. Syst. 2024, 37, 79618–79641. [Google Scholar]
- Hou, J.; Li, X.; Guan, W.; Zhang, G.; Feng, D.; Du, Y.; Xue, X.; Pu, J. FastOcc: Accelerating 3D Occupancy Prediction by Fusing the 2D Bird’s-Eye View and Perspective View. arXiv 2024, arXiv:2403.02710. [Google Scholar]
- Yu, Z.; Shu, C.; Deng, J.; Lu, K.; Liu, Z.; Yu, J.; Yang, D.; Li, H.; Chen, Y. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv 2023, arXiv:2311.12058. [Google Scholar]
- Wang, Y.; Chen, Y.; Liao, X.; Fan, L.; Zhang, Z. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 17158–17168. [Google Scholar]
- Pan, M.; Liu, J.; Zhang, R.; Huang, P.; Li, X.; Xie, H.; Wang, B.; Liu, L.; Zhang, S. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2024; pp. 12404–12411. [Google Scholar]
- Huang, Y.; Zheng, W.; Zhang, B.; Zhou, J.; Lu, J. Selfocc: Self-supervised vision-based 3d occupancy prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 19946–19956. [Google Scholar]
- Zhang, C.; Yan, J.; Wei, Y.; Li, J.; Liu, L.; Tang, Y.; Duan, Y.; Lu, J. Occnerf: Advancing 3d occupancy prediction in lidar-free environments. IEEE Trans. Image Process. 2025, 34, 3096–3107. [Google Scholar] [CrossRef]
- Li, Z.; Yu, Z.; Wang, W.; Anandkumar, A.; Lu, T.; Alvarez, J.M. Fb-bev: Bev representation from forward-backward view transformations. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 6919–6928. [Google Scholar]
- Ma, Q.; Tan, X.; Qu, Y.; Ma, L.; Zhang, Z.; Xie, Y. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 19936–19945. [Google Scholar]
- Tan, Q.; Liu, W.; Bi, H.; Wang, L.; Yang, L.; Qiao, Y.; Zhao, Z.; Jiang, Y.; Guo, Q.; Liu, H.; et al. SAMOccNet: Refined SAM-based Surrounding Semantic Occupancy Perception for Autonomous Driving. Neurocomputing 2025, 650, 130918. [Google Scholar] [CrossRef]
- Murhij, Y.; Yudin, D. OFMPNet: Deep end-to-end model for occupancy and flow prediction in urban environment. Neurocomputing 2024, 586, 127649. [Google Scholar] [CrossRef]
- Liao, Z.; Wei, P.; Chen, S.; Wang, H.; Ren, Z. Stcocc: Sparse spatial-temporal cascade renovation for 3d occupancy and scene flow prediction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2025; pp. 1516–1526. [Google Scholar]
- Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 1746–1754. [Google Scholar]
- Cao, A.Q.; De Charette, R. Monoscene: Monocular 3d semantic scene completion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 3991–4001. [Google Scholar]
- Li, Y.; Yu, Z.; Choy, C.; Xiao, C.; Alvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 9087–9098. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 1290–1299. [Google Scholar]
- Miao, R.; Liu, W.; Chen, M.; Gong, Z.; Xu, W.; Hu, C.; Zhou, S. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv 2023, arXiv:2302.13540. [Google Scholar]
- Jiang, H.; Cheng, T.; Gao, N.; Zhang, H.; Lin, T.; Liu, W.; Wang, X. Symphonize 3d semantic scene completion with contextual instance queries. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 20258–20267. [Google Scholar]
- Wu, Y.; Yan, Z.; Wang, Z.; Li, X.; Hui, L.; Yang, J. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv 2024, arXiv:2409.07972. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
- Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 4794–4803. [Google Scholar]
- Berman, M.; Triki, A.R.; Blaschko, M.B. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 4413–4421. [Google Scholar]
- Yu, Z.; Shu, C.; Sun, Q.; Linghu, J.; Wei, X.; Yu, J.; Liu, Z.; Yang, D.; Li, H.; Chen, Y. Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center. arXiv 2024, arXiv:2406.10527. [Google Scholar]
- Tong, W.; Sima, C.; Wang, T.; Chen, L.; Wu, S.; Deng, H.; Gu, Y.; Lu, L.; Luo, P.; Lin, D.; et al. Scene as occupancy. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 8406–8415. [Google Scholar]
- Liu, H.; Chen, Y.; Wang, H.; Yang, Z.; Li, T.; Zeng, J.; Chen, L.; Li, H.; Wang, L. Fully sparse 3d occupancy prediction. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 54–71. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Shi, Y.; Cheng, T.; Zhang, Q.; Liu, W.; Wang, X. Occupancy as set of points. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 72–87. [Google Scholar]
- Ye, Z.; Jiang, T.; Xu, C.; Li, Y.; Zhao, H. CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction. arXiv 2024, arXiv:2409.13430. [Google Scholar]
- Li, J.; He, X.; Zhou, C.; Cheng, X.; Wen, Y.; Zhang, D. ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers. arXiv 2024, arXiv:2405.04299. [Google Scholar]
- Tan, X.; Wu, W.; Zhang, Z.; Fan, C.; Peng, Y.; Zhang, Z.; Xie, Y.; Ma, L. Geocc: Geometrically enhanced 3d occupancy network with implicit-explicit depth fusion and contextual self-supervision. IEEE Trans. Intell. Transp. Syst. 2025, 26, 5613–5623. [Google Scholar] [CrossRef]
- Gan, W.; Mo, N.; Xu, H.; Yokoya, N. A Comprehensive Framework for 3D Occupancy Estimation in Autonomous Driving. IEEE Trans. Intell. Veh. 2024, 9, 7852–7864. [Google Scholar] [CrossRef]
- He, Y.; Chen, W.; Xun, T.; Tan, Y. Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement. arXiv 2024, arXiv:2407.13155. [Google Scholar]
- Liu, Y.; Mou, L.; Yu, X.; Han, C.; Mao, S.; Xiong, R.; Wang, Y. Let occ flow: Self-supervised 3d occupancy flow prediction. arXiv 2024, arXiv:2407.07587. [Google Scholar] [CrossRef]

| Method | Mask | History Frames | Backbone | Image Size | mIoU (%) ↑ | others | barrier | bicycle | bus | car | cons. veh. | motorcycle | pedestrian | traffic cone | trailer | truck | drive. surf. | other flat | sidewalk | terrain | manmade | vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MonoScene [31] | ✗ | ✗ | ResNet-101 | 900 × 1600 | 6.06 | 1.75 | 7.23 | 4.26 | 4.93 | 9.38 | 5.67 | 3.98 | 3.01 | 5.90 | 4.45 | 7.17 | 14.91 | 6.32 | 7.92 | 7.43 | 1.01 | 7.65 |
| OccFormer [17] | ✗ | ✗ | ResNet-101 | 900 × 1600 | 21.93 | 5.94 | 30.29 | 12.32 | 34.40 | 39.17 | 14.44 | 16.45 | 17.22 | 9.27 | 13.90 | 26.36 | 50.99 | 30.96 | 34.66 | 22.73 | 6.76 | 6.97 |
| TPVFormer [15] | ✗ | ✗ | ResNet-101 | 900 × 1600 | 28.34 | 6.67 | 39.20 | 14.24 | 41.54 | 46.98 | 19.21 | 22.64 | 17.87 | 14.54 | 30.20 | 35.51 | 56.18 | 33.65 | 35.69 | 31.61 | 19.97 | 16.12 |
| CTF-Occ [14] | ✗ | ✗ | ResNet-101 | 900 × 1600 | 28.53 | 8.09 | 39.33 | 20.56 | 38.29 | 42.24 | 16.93 | 24.52 | 22.72 | 21.05 | 22.98 | 31.11 | 53.33 | 33.84 | 37.98 | 33.23 | 20.79 | 18.00 |
| HBEVOcc (ours) | ✗ | ✗ | ResNet-50 | 256 × 704 | 29.13 | 6.48 | 37.65 | 18.05 | 38.66 | 42.56 | 18.45 | 21.72 | 19.94 | 18.13 | 21.43 | 30.03 | 62.66 | 34.33 | 39.94 | 37.39 | 24.01 | 23.73 |
| BEVFormer [2] | ✗ | 3 | ResNet-101 | 900 × 1600 | 23.67 | 5.03 | 38.79 | 9.98 | 34.41 | 41.09 | 13.24 | 16.50 | 18.15 | 17.83 | 18.66 | 27.70 | 48.95 | 27.73 | 29.08 | 25.38 | 15.41 | 14.46 |
| BEVStereo [3] | ✗ | 1 | ResNet-101 | 900 × 1600 | 24.51 | 5.73 | 38.41 | 7.88 | 38.70 | 41.20 | 17.56 | 17.33 | 14.69 | 10.31 | 16.84 | 29.62 | 54.08 | 28.92 | 32.68 | 26.54 | 18.74 | 17.49 |
| SparseOcc [42] | ✗ | 16 | ResNet-50 | 256 × 704 | 30.9 | 10.6 | 39.2 | 20.2 | 32.9 | 43.3 | 19.4 | 23.8 | 23.4 | 29.3 | 21.4 | 29.3 | 67.7 | 36.3 | 44.6 | 40.9 | 22.0 | 21.9 |
| HBEVOcc (ours) | ✗ | 1 | ResNet-50 | 256 × 704 | 34.34 | 10.51 | 45.41 | 24.32 | 41.10 | 47.65 | 23.79 | 26.59 | 24.68 | 27.29 | 27.88 | 34.58 | 65.02 | 35.38 | 42.83 | 40.49 | 35.03 | 31.23 |
| HBEVOcc (ours) | ✗ | 8 | ResNet-50 | 256 × 704 | 36.43 | 12.69 | 49.13 | 27.13 | 41.18 | 49.32 | 23.47 | 29.71 | 27.36 | 32.12 | 29.03 | 36.43 | 67.01 | 37.16 | 44.71 | 41.71 | 37.69 | 33.48 |
| BEVDetOcc [4] | ✔ | ✗ | ResNet-50 | 256 × 704 | 31.64 | 6.65 | 36.97 | 8.33 | 38.69 | 44.46 | 15.21 | 13.67 | 16.39 | 15.27 | 27.11 | 31.04 | 78.70 | 36.45 | 48.27 | 51.68 | 36.82 | 32.09 |
| FlashOcc [20] | ✔ | ✗ | ResNet-50 | 256 × 704 | 31.95 | 6.21 | 39.57 | 11.27 | 36.32 | 43.95 | 16.25 | 14.73 | 16.89 | 15.76 | 28.56 | 30.91 | 78.16 | 37.52 | 47.42 | 51.35 | 36.79 | 31.42 |
| DHD-S [36] | ✔ | ✗ | ResNet-50 | 256 × 704 | 36.50 | 10.59 | 43.21 | 23.02 | 40.61 | 47.31 | 21.68 | 23.25 | 23.85 | 23.40 | 31.75 | 34.15 | 80.16 | 41.30 | 49.95 | 54.07 | 38.73 | 33.51 |
| HBEVOcc (ours) | ✔ | ✗ | ResNet-50 | 256 × 704 | 36.93 | 11.00 | 44.07 | 23.83 | 40.46 | 48.9 | 22.29 | 24.49 | 25.80 | 25.80 | 29.19 | 34.24 | 79.82 | 41.32 | 50.33 | 53.56 | 38.54 | 34.16 |
| BEVDet4D [4] | ✔ | 1 | ResNet-50 | 256 × 704 | 36.01 | 8.22 | 44.21 | 10.34 | 42.08 | 49.63 | 23.37 | 17.41 | 21.49 | 19.70 | 31.33 | 37.09 | 80.13 | 37.37 | 50.41 | 54.29 | 45.56 | 39.59 |
| FlashOcc [20] | ✔ | 1 | ResNet-50 | 256 × 704 | 37.84 | 9.08 | 46.32 | 17.71 | 42.7 | 50.64 | 23.72 | 20.13 | 22.34 | 24.09 | 30.26 | 37.39 | 81.68 | 40.13 | 52.34 | 56.46 | 47.69 | 40.6 |
| OSP [44] | ✔ | 1 | ResNet-101 | 900 × 1600 | 41.21 | 10.95 | 49.0 | 27.68 | 50.24 | 55.99 | 22.96 | 31.02 | 30.91 | 30.25 | 35.60 | 41.23 | 82.09 | 42.59 | 51.9 | 55.1 | 44.82 | 38.17 |
| COTR (BEVDet4D) [26] | ✔ | 1 | ResNet-50 | 256 × 704 | 41.39 | 12.20 | 48.51 | 29.08 | 44.66 | 53.33 | 27.01 | 29.19 | 28.91 | 30.98 | 35.03 | 39.50 | 81.83 | 42.53 | 53.71 | 56.86 | 48.18 | 42.09 |
| DHD-M [36] | ✔ | 1 | ResNet-50 | 256 × 704 | 41.49 | 12.72 | 48.68 | 26.31 | 43.22 | 52.92 | 27.33 | 28.49 | 28.52 | 30.02 | 35.81 | 40.24 | 83.12 | 44.67 | 54.71 | 57.69 | 48.87 | 42.09 |
| HBEVOcc (ours) | ✔ | 1 | ResNet-50 | 256 × 704 | 41.84 | 13.28 | 49.96 | 28.88 | 45.76 | 53.77 | 28.19 | 29.68 | 29.20 | 32.38 | 34.77 | 40.27 | 82.28 | 44.00 | 53.60 | 56.78 | 47.23 | 41.20 |
| FBOCC [25] | ✔ | 16 | ResNet-50 | 256 × 704 | 39.11 | 13.57 | 44.74 | 27.01 | 45.41 | 49.1 | 25.15 | 26.33 | 27.86 | 27.79 | 32.28 | 36.75 | 80.07 | 42.76 | 51.18 | 55.13 | 42.19 | 37.53 |
| FastOcc [19] | ✔ | 16 | ResNet-101 | 640 × 1600 | 39.21 | 12.06 | 43.53 | 28.04 | 44.80 | 52.16 | 22.96 | 29.14 | 29.68 | 26.98 | 30.81 | 38.44 | 82.04 | 41.93 | 51.92 | 53.71 | 41.04 | 35.49 |
| BEVFormer [2] | ✔ | 3 | ResNet-101 | 900 × 1600 | 39.24 | 10.13 | 47.91 | 24.90 | 47.57 | 54.52 | 20.23 | 28.85 | 28.02 | 25.73 | 33.03 | 38.56 | 81.98 | 40.65 | 50.93 | 53.02 | 43.86 | 37.15 |
| BEVDet4D [4] | ✔ | 8 | ResNet-50 | 384 × 704 | 39.26 | 9.33 | 47.05 | 19.23 | 41.47 | 52.21 | 27.19 | 22.23 | 23.32 | 21.58 | 35.77 | 38.94 | 82.48 | 40.42 | 53.75 | 57.71 | 49.94 | 45.76 |
| CVT-Occ [45] | ✔ | 6 | ResNet-101 | 900 × 1600 | 40.34 | 9.45 | 49.46 | 23.57 | 49.18 | 55.63 | 23.1 | 27.85 | 28.88 | 29.07 | 34.97 | 40.98 | 81.44 | 40.92 | 51.37 | 54.25 | 45.94 | 39.71 |
| ViewFormer [46] | ✔ | 3 | ResNet-50 | 256 × 704 | 41.85 | 12.94 | 50.11 | 27.97 | 44.61 | 52.85 | 22.38 | 29.62 | 28.01 | 29.28 | 35.18 | 39.40 | 84.71 | 49.39 | 57.44 | 59.69 | 47.37 | 40.56 |
| PanoOcc [21] | ✔ | 3 | ResNet-101 | 900 × 1600 | 42.13 | 11.67 | 50.48 | 29.64 | 49.44 | 55.52 | 23.29 | 33.26 | 30.55 | 30.99 | 34.43 | 42.57 | 83.31 | 44.23 | 54.40 | 56.04 | 45.94 | 40.40 |
| GEOcc [47] | ✔ | 8 | ResNet-50 | 256 × 704 | 43.64 | 14.29 | 51.27 | 31.11 | 46.13 | 55.09 | 29.12 | 30.46 | 30.99 | 35.47 | 35.2 | 41.82 | 84.0 | 47.0 | 55.52 | 59.5 | 50.03 | 44.82 |
| HBEVOcc (ours) | ✔ | 8 | ResNet-50 | 256 × 704 | 43.98 | 14.38 | 52.89 | 30.65 | 46.29 | 55.84 | 29.00 | 33.29 | 32.15 | 36.42 | 37.12 | 41.99 | 82.86 | 45.48 | 54.91 | 58.96 | 50.77 | 44.6 |
| BEVDet4D [4] | ✔ | 1 | Swin-B | 512 × 1408 | 42.02 | 12.15 | 49.63 | 25.1 | 52.02 | 54.46 | 27.87 | 27.99 | 28.94 | 27.23 | 36.43 | 42.22 | 82.31 | 43.29 | 54.46 | 57.9 | 48.61 | 43.55 |
| FlashOcc [20] | ✔ | 1 | Swin-B | 512 × 1408 | 43.52 | 13.42 | 51.07 | 27.68 | 51.57 | 56.22 | 27.27 | 29.98 | 29.93 | 29.80 | 37.77 | 43.52 | 83.81 | 46.55 | 56.15 | 59.56 | 50.84 | 44.67 |
| GEOcc [47] | ✔ | 8 | Swin-B | 512 × 1408 | 44.67 | 14.02 | 51.4 | 33.08 | 52.08 | 56.72 | 30.04 | 33.54 | 32.34 | 35.83 | 39.34 | 44.18 | 83.49 | 46.77 | 55.72 | 58.94 | 48.85 | 43.0 |
| HBEVOcc (ours) | ✔ | 1 | Swin-B | 512 × 1408 | 45.20 | 15.03 | 52.51 | 33.66 | 52.98 | 56.93 | 29.03 | 34.54 | 33.41 | 35.83 | 38.58 | 44.29 | 83.79 | 47.42 | 56.24 | 59.33 | 50.53 | 44.30 |
| Method | Mask | History Frames | Backbone | Input Size | Epoch | RayIoU (%) ↑ | RayIoU@1m ↑ | RayIoU@2m ↑ | RayIoU@4m ↑ | mIoU (%) ↑ | FPS ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Training GPU | Testing GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SimpleOccupancy [48] | ✔ | ✗ | ResNet-101 | 336 × 672 | 12 | 22.5 | 17.0 | 22.7 | 27.9 | 31.8 | 9.7 | - | - | A100 | A100 |
| BEVFormer [2] | ✔ | 3 | ResNet-101 | 900 × 1600 | 24 | 32.4 | 26.1 | 32.9 | 38.0 | 39.2 | 3.0 | 25.1 | 6.7 | A100 | A100 |
| BEVDet4D [4] | ✔ | 1 | ResNet-50 | 256 × 704 | 90 | 29.6 | 23.6 | 30.0 | 35.1 | 36.1 | 2.6 | 8.4 | 4.7 | A100 | A100 |
| BEVDet4D [4] | ✔ | 8 | ResNet-50 | 384 × 704 | 90 | 32.6 | 26.6 | 33.1 | 38.2 | 39.3 | 0.8 | 10.1 | 6.4 | A100 | A100 |
| FBOcc [25] | ✔ | 16 | ResNet-50 | 256 × 704 | 90 | 33.5 | 26.7 | 34.1 | 39.7 | 39.1 | 10.3 | 11.1 | 5.5 | A100 | A100 |
| HBEVOcc-Fast(ours) | ✔ | 1 | ResNet-50 | 256 × 704 | 24 | 31.4 | 24.8 | 31.8 | 37.6 | 39.1 | 18.9 | 6.4 | 2.7 | RTX 2080Ti | RTX 4090 |
| HBEVOcc-Fast(ours) | ✔ | 8 | ResNet-50 | 256 × 704 | 24 | 33.4 | 26.9 | 33.8 | 39.4 | 41.2 | 14.6 | 6.9 | 2.8 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | ✔ | 1 | ResNet-50 | 256 × 704 | 24 | 33.4 | 26.9 | 33.8 | 39.4 | 41.8 | 8.2 | 7.3 | 3.0 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | ✔ | 8 | ResNet-50 | 256 × 704 | 24 | 34.9 | 28.6 | 35.4 | 40.8 | 44.0 | 5.4 | 7.5 | 3.1 | RTX 2080Ti | RTX 4090 |
| SparseOcc [42] | ✗ | 8 | ResNet-50 | 256 × 704 | 24 | 34.0 | 28.0 | 34.7 | 39.4 | 30.1 | 17.1 | 12.2 | 5.4 | A100 | RTX 4090 |
| SparseOcc [42] | ✗ | 16 | ResNet-50 | 256 × 704 | 24 | 35.1 | 29.1 | 35.8 | 40.3 | 30.6 | 14.1 | 22.9 | 6.9 | A100 | RTX 4090 |
| SparseOcc [42] | ✗ | 16 | ResNet-50 | 256 × 704 | 48 | 36.1 | 30.2 | 36.8 | 41.2 | 30.9 | 14.1 | 22.9 | 6.9 | A100 | RTX 4090 |
| Panoptic-FlashOcc [40] | ✗ | 1 | ResNet-50 | 256 × 704 | 24 | 36.0 | 30.1 | 36.8 | 41.1 | 29.6 | 39.4 | 6.1 | 2.2 | A100 | RTX 4090 |
| Panoptic-FlashOcc [40] | ✗ | 8 | ResNet-50 | 256 × 704 | 24 | 38.5 | 32.8 | 39.3 | 43.4 | 31.5 | 20.4 | 6.3 | 2.4 | A100 | RTX 4090 |
| GSD-Occ [49] | ✗ | 16 | ResNet-50 | 256 × 704 | 24 | 38.9 | - | - | - | - | 20.0 | - | 4.8 | A100 | A100 |
| HBEVOcc-Fast (ours) | ✗ | 1 | ResNet-50 | 256 × 704 | 24 | 37.1 | 30.9 | 37.9 | 42.5 | 31.7 | 18.9 | 6.4 | 2.7 | RTX 2080Ti | RTX 4090 |
| HBEVOcc-Fast (ours) | ✗ | 8 | ResNet-50 | 256 × 704 | 24 | 39.6 | 33.4 | 40.3 | 45.0 | 34.0 | 14.6 | 6.9 | 2.8 | RTX 2080Ti | RTX 4090 |
| HBEVOcc-Fast (ours) | ✗ | 8 | ResNet-50 | 256 × 704 | 48 | 40.1 | 34.2 | 40.8 | 45.3 | 34.2 | 14.6 | 6.9 | 2.8 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | ✗ | 1 | ResNet-50 | 256 × 704 | 24 | 39.2 | 33.3 | 40.0 | 44.4 | 34.3 | 8.2 | 7.3 | 3.0 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | ✗ | 8 | ResNet-50 | 256 × 704 | 24 | 41.0 | 35.9 | 41.8 | 45.5 | 36.4 | 5.4 | 7.5 | 3.1 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | ✗ | 8 | ResNet-50 | 256 × 704 | 48 | 41.5 | 36.1 | 42.2 | 46.3 | 36.5 | 5.4 | 7.5 | 3.1 | RTX 2080Ti | RTX 4090 |
| Method | Sup. | Backbone | Input Size | History Frames | Epoch | RayIoU (%) ↑ | mAVE ↓ | FPS ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Training GPU | Testing GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OccNeRF-C [24] | C | R101 | 900 × 1600 | - | - | 21.6 | 1.53 | - | - | - | - | - |
| OccNeRF-L [24] | L | R101 | 900 × 1600 | - | - | 31.7 | 1.59 | - | - | - | - | - |
| RenderOcc [22] | L | R101 | 900 × 1600 | 6 | 12 | 36.7 | 1.63 | - | - | - | - | - |
| Let Occ Flow [50] | C+L | R101 | 512 × 1408 | 2 | 16 | 40.5 | 1.45 | - | - | - | - | - |
| OccNet [41] | 3D | R101 | 900 × 1600 | 3 | 24 | 39.7 | 1.61 | - | - | - | - | - |
| BEVFormer [2] | 3D | R50 | 900 × 1600 | 3 | 24 | 28.1 | 1.12 | 3.0 | 26.0 | 6.7 | A100 | A100 |
| FB-Occ [25] | 3D | R50 | 256 × 704 | 16 | 90 | 32.3 | 0.83 | 10.3 | 11.1 | 5.5 | A100 | A100 |
| SparseOcc [42] | 3D | R50 | 256 × 704 | 8 | 48 | 33.4 | 0.87 | 17.1 | 15.8 | 5.4 | A100 | RTX 4090 |
| STCOcc [29] | 3D | R50 | 256 × 704 | 16 | 48 | 40.8 | 0.44 | 4.7 | 10.0 | 5.6 | RTX 4090 | RTX 4090 |
| HBEVOcc (ours) | 3D | R50 | 256 × 704 | 1 | 24 | 39.4 | 0.52 | 8.2 | 7.3 | 3.0 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | 3D | R50 | 256 × 704 | 8 | 24 | 40.8 | 0.41 | 5.4 | 7.5 | 3.1 | RTX 2080Ti | RTX 4090 |
| HBEVOcc (ours) | 3D | R50 | 256 × 704 | 8 | 48 | 41.4 | 0.39 | 5.4 | 7.5 | 3.1 | RTX 2080Ti | RTX 4090 |
| Baseline | EVT | IVT | HADA | HAVL | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|
| ✔ |  |  |  |  | 34.34 | 4.8 | 2.3 | 44.9 | 253.1 |
| ✔ |  |  | ✔ |  | 35.02 | 5.1 | 2.3 | 50.1 | 259.1 |
| ✔ |  |  |  | ✔ | 34.43 | 4.8 | 2.3 | 44.9 | 253.1 |
| ✔ |  |  | ✔ | ✔ | 35.13 | 5.1 | 2.3 | 50.1 | 259.1 |
|  | ✔ |  |  |  | 34.70 | 5.0 | 2.3 | 45.8 | 280.3 |
|  |  | ✔ |  |  | 34.40 | 4.1 | 2.2 | 28.1 | 148.7 |
|  | ✔ | ✔ |  |  | 35.68 | 5.3 | 2.5 | 50.7 | 384.5 |
|  | ✔ | ✔ | ✔ |  | 36.76 | 6.3 | 2.6 | 56.2 | 393.6 |
|  | ✔ | ✔ | ✔ | ✔ | 36.93 | 6.3 | 2.6 | 56.2 | 393.6 |
| Baseline | EVT | IVT | HADA | HAVL | RayIoU (%) ↑ | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|
| ✔ |  |  |  |  | 32.12 | 25.51 | 4.8 | 2.3 | 44.9 | 253.1 |
| ✔ |  |  | ✔ |  | 32.28 | 25.90 | 5.1 | 2.3 | 50.1 | 259.1 |
| ✔ |  |  |  | ✔ | 32.25 | 27.27 | 4.8 | 2.3 | 44.9 | 253.1 |
| ✔ |  |  | ✔ | ✔ | 33.11 | 27.81 | 5.1 | 2.3 | 50.1 | 259.1 |
|  | ✔ |  |  |  | 32.58 | 25.97 | 5.0 | 2.3 | 45.8 | 280.3 |
|  |  | ✔ |  |  | 32.51 | 26.24 | 4.1 | 2.2 | 28.1 | 148.7 |
|  | ✔ | ✔ |  |  | 33.63 | 26.97 | 5.3 | 2.5 | 50.7 | 384.5 |
|  | ✔ | ✔ | ✔ |  | 34.01 | 27.43 | 6.3 | 2.6 | 56.2 | 393.6 |
|  | ✔ | ✔ | ✔ | ✔ | 34.30 | 29.13 | 6.3 | 2.6 | 56.2 | 393.6 |
| Baseline | EVT | IVT | HADA | HAVL | RayIoU (%) ↑ | mAVE ↓ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|
| ✔ |  |  |  |  | 31.95 | 1.20 | 4.8 | 2.3 | 45.1 | 261.6 |
| ✔ |  |  | ✔ |  | 32.01 | 1.12 | 5.1 | 2.3 | 50.3 | 267.7 |
| ✔ |  |  |  | ✔ | 32.83 | 1.81 | 4.8 | 2.3 | 45.1 | 261.6 |
| ✔ |  |  | ✔ | ✔ | 32.92 | 1.75 | 5.1 | 2.3 | 50.3 | 267.7 |
|  | ✔ |  |  |  | 32.16 | 1.02 | 5.0 | 2.3 | 46.0 | 285.9 |
|  |  | ✔ |  |  | 32.09 | 1.37 | 4.1 | 2.2 | 28.2 | 154.1 |
|  | ✔ | ✔ |  |  | 33.19 | 1.14 | 5.3 | 2.5 | 51.0 | 395.1 |
|  | ✔ | ✔ | ✔ |  | 33.67 | 1.01 | 6.3 | 2.6 | 56.5 | 404.2 |
|  | ✔ | ✔ | ✔ | ✔ | 34.27 | 1.03 | 6.3 | 2.6 | 56.5 | 404.2 |
| Fusion Methods | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| Concat | 35.68 | 5.3 | 2.5 | 50.7 | 384.5 |
| Add | 35.42 | 5.2 | 2.3 | 47.7 | 297.6 |
| Gated Fusion | 35.58 | 5.4 | 2.4 | 48.1 | 308.8 |
| History Frames | Horizontal Points | Height Points | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|
| ✗ | 2 | 1 | 36.67 | 6.2 | 2.6 | 55.6 | 392.3 |
| ✗ | 2 | 2 | 36.68 | 6.2 | 2.6 | 56.2 | 393.3 |
| ✗ | 4 | 1 | 36.70 | 6.3 | 2.6 | 55.6 | 392.6 |
| ✗ | 4 | 2 | 36.76 | 6.3 | 2.6 | 56.2 | 393.6 |
| 1 | 2 | 1 | 41.36 | 7.2 | 3.0 | 73.4 | 640.8 |
| 1 | 2 | 2 | 41.63 | 7.2 | 3.0 | 74.3 | 642.3 |
| 1 | 4 | 1 | 41.56 | 7.3 | 3.0 | 73.4 | 641.2 |
| 1 | 4 | 2 | 41.64 | 7.3 | 3.0 | 74.3 | 642.7 |
| History Frames | Height | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| ✗ | 1 | 36.78 | 6.2 | 2.6 | 55.6 | 370.0 |
| ✗ | 8 | 36.93 | 6.3 | 2.6 | 56.2 | 393.6 |
| 1 | 1 | 41.48 | 7.3 | 3.0 | 73.5 | 578.2 |
| 1 | 8 | 41.84 | 7.3 | 3.0 | 74.3 | 642.7 |
| 4 | 1 | 42.94 | 7.3 | 3.1 | 74.1 | 946.8 |
| 4 | 8 | 42.21 | 7.5 | 3.1 | 75.0 | 1131.2 |
| 8 | 1 | 43.98 | 7.5 | 3.1 | 75.0 | 1450.7 |
| 8 | 8 | 42.58 | 7.6 | 3.1 | 75.9 | 1782.7 |
| HAVL Height | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| 2 | 35.80 | 5.3 | 2.5 | 50.7 | 384.5 |
| 4 | 35.86 | 5.3 | 2.5 | 50.7 | 384.5 |
| 8 | 35.82 | 5.3 | 2.5 | 50.7 | 384.5 |
| 16 | 36.00 | 5.3 | 2.5 | 50.7 | 384.5 |
| Sampled Positions | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| 2000 | 35.87 | 5.3 | 2.5 | 50.7 | 384.5 |
| 4000 | 36.00 | 5.3 | 2.5 | 50.7 | 384.5 |
| 20,000 | 35.71 | 5.3 | 2.5 | 50.7 | 384.5 |
| 40,000 | 35.81 | 5.3 | 2.5 | 50.7 | 384.5 |
| HAVL | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| ✗ | 35.68 | 5.3 | 2.5 | 50.7 | 384.5 |
| Fixed Weight | 35.89 | 5.3 | 2.5 | 50.7 | 384.5 |
| Fixed Weight | 35.80 | 5.3 | 2.5 | 50.7 | 384.5 |
| Fixed Weight | 35.84 | 5.3 | 2.5 | 50.7 | 384.5 |
| Height-aware Weight | 36.00 | 5.3 | 2.5 | 50.7 | 384.5 |
| Dataset | History Frames | RayIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| Occ3D-nuScenes | 1 | 39.21 | 7.3 | 3.0 | 74.3 | 642.7 |
| Occ3D-nuScenes | 4 | 40.50 | 7.3 | 3.1 | 74.1 | 946.8 |
| Occ3D-nuScenes | 8 | 41.05 | 7.5 | 3.1 | 75.0 | 1450.7 |
| OpenOcc | 1 | 39.43 | 7.3 | 3.0 | 74.6 | 653.3 |
| OpenOcc | 4 | 40.02 | 7.3 | 3.1 | 74.3 | 957.4 |
| OpenOcc | 8 | 40.78 | 7.5 | 3.1 | 75.3 | 1461.3 |
| HADA Height | mIoU (%) ↑ | Training Mem (G) ↓ | Inference Mem (G) ↓ | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| 2 | 36.66 | 6.3 | 2.6 | 56.8 | 395.6 |
| 4 | 36.59 | 6.3 | 2.6 | 56.4 | 394.2 |
| 8 | 36.76 | 6.3 | 2.6 | 56.2 | 393.6 |
| 16 | 36.70 | 6.3 | 2.6 | 56.1 | 393.2 |
| Methods | Representation | History Frames | mIoU * ↑ | mIoU ↑ | mIoU ↑ (+HADA) | ΔmIoU ↑ | ΔMem (G) ↓ |
|---|---|---|---|---|---|---|---|
| DHD-S [36] | BEV | ✗ | 36.50 | 36.51 | 36.99 | +0.48 | +0.94 |
| DHD-M [36] | BEV | 1 | 41.49 | 40.74 | 41.36 | +0.62 | +1.09 |
| FlashOcc:M2 [20] | BEV | ✗ | 32.08 | 32.62 | 33.54 | +0.92 | +0.29 |
| FlashOcc-4D-Stereo:M2 [20] | BEV | 1 | 37.84 | 38.80 | 39.73 | +0.93 | +0.31 |
| BEVDet4D [4] | Voxel | 1 | 36.01 | 37.40 | 38.35 | +0.95 | +0.98 |
| FBOcc [25] | BEV and Voxel | 16 | 39.11 | 40.21 | 40.65 | +0.44 | +0.66 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Lyu, C.; Li, W.; Liao, I.Y.; Ding, F.; Liu, H.; Zhou, H. HBEVOcc: Height-Aware Bird’s-Eye-View Representation for 3D Occupancy Prediction from Multi-Camera Images. Sensors 2026, 26, 934. https://doi.org/10.3390/s26030934