SpecBEV: An End-to-End BEV 3D Object Detection Algorithm Based on Frequency-Domain Analysis and Geometric Alignment
Abstract
1. Introduction
2. Related Work
2.1. Vision-Based BEV 3D Object Detection
2.2. Frequency-Domain Methods and Feature Enhancement
2.3. Cross-View Consistency and Geometric Alignment
3. Method
3.1. Overview
3.2. Frequency-Prior Spatial Attention Module (SA-Freq)
3.3. Cross-View Feature Alignment (CFA)
4. Experiments
4.1. Dataset and Evaluation Metrics
4.2. Implementation Details
4.3. Main Results
4.3.1. Comparison with State-of-the-Art Methods
4.3.2. Visualization Results
4.4. Ablation Studies
4.4.1. Module-Wise Ablation Study
4.4.2. Analysis of the Frequency-Selection Setting in SA-Freq
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Full Term | Mathematical Formulation/Definition |
|---|---|---|
| BEV | Bird’s-Eye View | |
| CNNs | Convolutional Neural Networks | |
| DCT | Discrete Cosine Transform | |
| conv | Convolutional Operation | Terms: input feature map; convolution kernel; bias term |
| mAP | mean Average Precision | |
| NDS | nuScenes Detection Score | |
| mATE | Mean Average Translation Error | Terms: predicted bounding box center coordinates; ground-truth bounding box center coordinates |
| mASE | Mean Average Scale Error | Terms: predicted bounding box dimensions (length, width, height); ground-truth bounding box dimensions |
| mAOE | Mean Average Orientation Error | Terms: predicted yaw angle; ground-truth yaw angle |
| mAVE | Mean Average Velocity Error | Terms: predicted velocity vector; ground-truth velocity vector |
| mAAE | Mean Average Attribute Error | Terms: predicted attribute; ground-truth attribute; indicator function |
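
For orientation, the metrics in the table above can be written in the form used by the public nuScenes detection benchmark. This is a sketch of those standard definitions, not a recovery of the formulas that originally filled the third column. The symbols are assumptions chosen for illustration: $\hat{\mathbf{c}}, \mathbf{c}$ are the predicted and ground-truth box centers on the ground plane, $\hat{\mathbf{s}}, \mathbf{s}$ the box dimensions, $\hat{\theta}, \theta$ the yaw angles, $\hat{\mathbf{v}}, \mathbf{v}$ the velocities, and $\hat{a}, a$ the attributes.

```latex
\begin{align}
\mathrm{NDS} &= \frac{1}{10}\Bigl[\,5\,\mathrm{mAP}
  + \sum_{\mathrm{mTP}\in\mathbb{TP}} \bigl(1 - \min(1,\ \mathrm{mTP})\bigr)\Bigr],
  \qquad \mathbb{TP} = \{\mathrm{mATE},\ \mathrm{mASE},\ \mathrm{mAOE},\ \mathrm{mAVE},\ \mathrm{mAAE}\} \\
\mathrm{ATE} &= \lVert \hat{\mathbf{c}} - \mathbf{c} \rVert_2 \\
\mathrm{ASE} &= 1 - \mathrm{IoU}(\hat{\mathbf{s}},\ \mathbf{s}) \\
\mathrm{AOE} &= \min\bigl(\lvert \hat{\theta} - \theta \rvert,\ 2\pi - \lvert \hat{\theta} - \theta \rvert\bigr) \\
\mathrm{AVE} &= \lVert \hat{\mathbf{v}} - \mathbf{v} \rVert_2 \\
\mathrm{AAE} &= 1 - \mathbb{1}[\hat{a} = a]
\end{align}
```

Each per-box error is computed on matched true positives; the IoU in ASE is taken after aligning box centers and orientations, and the yaw difference in AOE is assumed to lie in $[0, 2\pi]$. The mean errors (mATE and so on) average these values over matches and object classes.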







| Method | Year | VT | mAP ↑ | NDS ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ |
|---|---|---|---|---|---|---|---|---|---|
| BEVDet | 2022 | √ | 0.2828 | 0.3500 | 0.7734 | 0.2884 | 0.6976 | 0.8637 | 0.2908 |
| BEVDet4D | 2022 | √ | 0.3235 | 0.4241 | 0.6884 | 0.2723 | 0.6732 | 0.4590 | 0.2842 |
| BEVDepth | 2023 | √ | 0.3441 | 0.4410 | 0.7280 | 0.2783 | 0.5561 | 0.5131 | 0.2355 |
| FastBEV | 2024 | -- | 0.3288 | 0.4590 | 0.6455 | 0.2922 | 0.4570 | 0.4250 | 0.2343 |
| Ours | -- | √ | 0.3856 | 0.4871 | 0.5970 | 0.2355 | 0.5523 | 0.4581 | 0.2136 |
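
The NDS column in the comparison table can be cross-checked against the other columns. The sketch below assumes the standard nuScenes definition, NDS = (5 · mAP + Σ(1 − min(1, mTP))) / 10 over the five mean true-positive errors; the helper name `nds` and the `ROWS` dictionary are hypothetical and only restate the table values.

```python
# Recompute NDS from mAP and the five mean true-positive errors.
# Assumption: the standard nuScenes formula; names here are illustrative only.

def nds(m_ap, tp_errors):
    """Return (5 * mAP + sum(1 - min(1, err))) / 10 for the five TP errors."""
    if len(tp_errors) != 5:
        raise ValueError("expected mATE, mASE, mAOE, mAVE, mAAE")
    # Each error is clipped at 1 so that a TP score never goes negative.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# (mAP, [mATE, mASE, mAOE, mAVE, mAAE], reported NDS), copied from the table.
ROWS = {
    "BEVDet":   (0.2828, [0.7734, 0.2884, 0.6976, 0.8637, 0.2908], 0.3500),
    "BEVDet4D": (0.3235, [0.6884, 0.2723, 0.6732, 0.4590, 0.2842], 0.4241),
    "BEVDepth": (0.3441, [0.7280, 0.2783, 0.5561, 0.5131, 0.2355], 0.4410),
    "FastBEV":  (0.3288, [0.6455, 0.2922, 0.4570, 0.4250, 0.2343], 0.4590),
    "Ours":     (0.3856, [0.5970, 0.2355, 0.5523, 0.4581, 0.2136], 0.4871),
}

for name, (m_ap, errors, reported) in ROWS.items():
    print(f"{name:9s} recomputed={nds(m_ap, errors):.4f} reported={reported:.4f}")
```

For these five rows the recomputed score agrees with the reported NDS to within rounding (about 1e-4), which suggests the table follows the standard weighting of mAP against the five error terms.
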

| Method | Input Resolution | Params (M) | GFLOPs | FPS | mAP ↑ | NDS ↑ |
|---|---|---|---|---|---|---|
| BEVDet | 704 × 256 | 54.1 | 223.6 | 14.3 | 0.2693 | 0.3426 |
| BEVDet | 1056 × 384 | 54.1 | 452.0 | 9.3 | 0.2828 | 0.3500 |
| Ours | 704 × 256 | 54.5 | 229.0 | 13.6 | 0.3441 | 0.4410 |
| Ours | 1056 × 384 | 54.5 | 463.5 | 8.9 | 0.3856 | 0.4871 |
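
The cost side of the efficiency table is easier to read as relative change at matched input resolution. The following sketch only restates the table values; the dictionary layout and the `overhead` helper are hypothetical names introduced for illustration.

```python
# Relative cost of "Ours" versus BEVDet at the same input resolution.
# Tuples hold (Params in M, GFLOPs, FPS), copied from the efficiency table.

BEVDET = {"704x256": (54.1, 223.6, 14.3), "1056x384": (54.1, 452.0, 9.3)}
OURS = {"704x256": (54.5, 229.0, 13.6), "1056x384": (54.5, 463.5, 8.9)}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


def overhead(resolution):
    """Percentage change in parameters, GFLOPs and FPS for one resolution."""
    base_params, base_gflops, base_fps = BEVDET[resolution]
    our_params, our_gflops, our_fps = OURS[resolution]
    return {
        "params_pct": relative_change(our_params, base_params),
        "gflops_pct": relative_change(our_gflops, base_gflops),
        "fps_pct": relative_change(our_fps, base_fps),
    }


for res in BEVDET:
    stats = overhead(res)
    print(res, {key: round(value, 2) for key, value in stats.items()})
```

With the values as listed, the parameter count grows by about 0.7%, GFLOPs by about 2.4% (704 × 256) and 2.5% (1056 × 384), and FPS drops by about 4.9% and 4.3% respectively.
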

| ID | SAF | CFA | mAP ↑ | NDS ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ |
|---|---|---|---|---|---|---|---|---|---|
| M | -- | -- | 0.2828 | 0.3500 | 0.7734 | 0.2884 | 0.6976 | 0.8637 | 0.2908 |
| A | -- | √ | 0.3208 | 0.4008 | 0.6805 | 0.2622 | 0.6087 | 0.7825 | 0.2621 |
| B | √ | -- | 0.3451 | 0.4268 | 0.6489 | 0.2530 | 0.5701 | 0.7356 | 0.2495 |
| C () | √ | √ | 0.3414 | 0.4319 | 0.7151 | 0.2797 | 0.5786 | 0.5165 | 0.2421 |
| C () | √ | √ | 0.3641 | 0.4644 | 0.6283 | 0.2495 | 0.5626 | 0.4742 | 0.2378 |
| C () | √ | √ | 0.3856 | 0.4871 | 0.5970 | 0.2355 | 0.5523 | 0.4581 | 0.2136 |
| C () | √ | √ | 0.3374 | 0.4229 | 0.6355 | 0.2722 | 0.4570 | 0.4250 | 0.2343 |

| Kernel Size | K | mAP ↑ | NDS ↑ | mATE ↓ | mASE ↓ | mAOE ↓ | mAVE ↓ | mAAE ↓ |
|---|---|---|---|---|---|---|---|---|
| 3 × 3 | 9 | 0.3724 | 0.4716 | 0.6241 | 0.2607 | 0.5596 | 0.4725 | 0.2287 |
| 5 × 5 | 25 | 0.3781 | 0.4781 | 0.6018 | 0.2516 | 0.5642 | 0.4677 | 0.2241 |
| 7 × 7 | 49 | 0.3856 | 0.4871 | 0.5970 | 0.2355 | 0.5523 | 0.4581 | 0.2136 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lin, Y.; Jia, S. SpecBEV: An End-to-End BEV 3D Object Detection Algorithm Based on Frequency-Domain Analysis and Geometric Alignment. Sensors 2026, 26, 3551. https://doi.org/10.3390/s26113551

