SMM-POD: Panoramic 3D Object Detection via Spherical Multi-Stage Multi-Modal Fusion
Abstract
1. Introduction
- Spherical panoramic multi-modal framework: We present a spherical convolutional framework designed for panoramic image and point cloud fusion. This framework reduces geometric distortion and improves the alignment of features across different sensor types. To our knowledge, this is the first method that combines spherical CNNs and spherical positional encoding within a transformer-based detection pipeline. This design enables position-aware attention on quasi-uniform Voronoi sphere (UVS) structures, which are essential for accurate panoramic perception (a simplified sketch of such an encoding is given after this list).
- Attention-driven multi-stage fusion: We propose a new multi-stage fusion approach that enhances cross-modal interaction at both the feature extraction and context-encoding stages. Specifically, we use a cross-channel attention module to strengthen local feature alignment and a joint multi-head attention mechanism with a feature enhancement unit to improve global context understanding. This structure improves the model's ability to represent complex scenes and increases its generalization across object types (an illustrative cross-modal attention block is sketched after this list).
- Extensive validation on panoramic-FoV datasets: We conducted comprehensive experiments on the DAIR-V2X-I dataset (the infrastructure-side subset of the DAIR-V2X vehicle-infrastructure cooperative perception benchmark) and on our panoramic multi-modal dataset SHU-3DPOD. Comparisons with state-of-the-art single-modal and multi-modal fusion methods show that the proposed method achieves the highest detection accuracy on DAIR-V2X-I, especially for small objects such as pedestrians and cyclists, while maintaining consistent detection performance across all object categories. On the panoramic multi-modal dataset SHU-3DPOD, our method substantially improves detection accuracy and effectively eliminates the localization drift induced by image distortion.
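To make the first contribution concrete, the snippet below is a minimal sketch of what a spherical positional encoding over UVS vertices could look like: a sinusoidal embedding of each vertex's polar and azimuthal angles that is added to the token features before attention. The function name, embedding layout, and frequency schedule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def spherical_positional_encoding(theta: torch.Tensor,
                                  phi: torch.Tensor,
                                  d_model: int) -> torch.Tensor:
    """Hypothetical sketch of a spherical positional encoding (SPE).

    theta, phi: (N,) polar and azimuthal angles (radians) of the N
    quasi-uniform Voronoi sphere (UVS) vertices.
    Returns an (N, d_model) embedding added to the token features.
    """
    assert d_model % 4 == 0, "d_model must be divisible by 4 in this sketch"
    k = torch.arange(d_model // 4, dtype=torch.float32)      # frequency index
    freq = 1.0 / (10000.0 ** (4.0 * k / d_model))             # (d_model/4,)
    # Encode each angle with sin/cos at multiple frequencies.
    t = theta.unsqueeze(-1) * freq                             # (N, d_model/4)
    p = phi.unsqueeze(-1) * freq                               # (N, d_model/4)
    return torch.cat([t.sin(), t.cos(), p.sin(), p.cos()], dim=-1)

# Usage: tokens of shape (N, d_model) sampled on the UVS grid
# tokens = tokens + spherical_positional_encoding(theta, phi, tokens.shape[-1])
```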
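Similarly, the attention-driven fusion in the second contribution can be illustrated with a small cross-modal block in which each modality queries the other and residual connections preserve the original features. Class and tensor names are placeholders; the paper's actual cross-channel attention module and feature enhancement unit are more elaborate.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Illustrative sketch of attention-driven cross-modal fusion:
    each modality attends to the other, and residual connections keep
    the original features. Not the paper's exact module."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.img_from_pts = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pts_from_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(d_model)
        self.norm_pts = nn.LayerNorm(d_model)

    def forward(self, img_tokens: torch.Tensor, pts_tokens: torch.Tensor):
        # img_tokens: (B, N_img, C) spherical image features
        # pts_tokens: (B, N_pts, C) point-cloud features
        img_ctx, _ = self.img_from_pts(img_tokens, pts_tokens, pts_tokens)
        pts_ctx, _ = self.pts_from_img(pts_tokens, img_tokens, img_tokens)
        img_fused = self.norm_img(img_tokens + img_ctx)   # image enriched by geometry
        pts_fused = self.norm_pts(pts_tokens + pts_ctx)   # points enriched by appearance
        return img_fused, pts_fused

# Usage:
# block = CrossModalAttentionFusion(d_model=256, n_heads=8)
# img_fused, pts_fused = block(torch.randn(2, 1024, 256), torch.randn(2, 2048, 256))
```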
2. Literature Review
2.1. Vision-Based 3D Object Detection
2.2. LiDAR-Based 3D Object Detection
2.3. Multi-Modal Fusion Object Detection
3. Materials and Methods
3.1. Spherical Panoramic Multi-Modal Image Construction
3.2. Multi-Stage Fusion Framework
3.2.1. Interactive Feature Extraction with Cross-Attention Modules
3.2.2. Fusion Strategy with SPE in the Transformer Encoder and Decoder
3.3. Loss Function Definition
4. Results
4.1. Dataset Introduction
4.2. Experimental Parameters Setting
4.3. Comparison on DAIR-V2X-I Dataset
4.4. Comparison on Panoramic SHU-3DPOD Dataset
5. Discussion
5.1. Parameter Count and Efficiency Discussion
5.2. Ablation Experiments and Discussion
5.2.1. Ablation Study on Input Channels
5.2.2. Ablation Study on Network Architecture
- (1) Feature extraction fusion removed: All feature interactions in the feature extraction stage are removed. The two modality branches operate independently with no fusion until the encoding network, where late fusion is applied.
- (2) Partial feature extraction fusion removed: The multi-level cross-attention mechanism in the feature extraction stage is removed. Instead, features from both modalities are directly concatenated and passed to the next layer.
- (3) Cross-attention in the context encoder removed: Only multi-head self-attention modules are retained in the context encoder stage, discarding cross-modal attention.
- (4) Self-attention in the context encoder removed: Only cross-attention modules are used in the context encoder, omitting modality-specific self-attention branches.
- (5) SPE (spherical positional encoding) removed: The proposed spherical positional encoding is replaced by the standard learnable positional encoding used in conventional transformer architectures.
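For reference, the five ablation variants above can be summarized as a set of on/off switches over the network's fusion components. The flag names below are hypothetical and only mirror the textual descriptions; they are not configuration keys from the authors' code.

```python
# Hypothetical summary of the five ablation variants as feature toggles.
ABLATION_CONFIGS = {
    "full": dict(extraction_fusion=True,  cross_attn_extraction=True,
                 cross_attn_encoder=True, self_attn_encoder=True, use_spe=True),
    "(1)":  dict(extraction_fusion=False, cross_attn_extraction=False,
                 cross_attn_encoder=True, self_attn_encoder=True, use_spe=True),
    "(2)":  dict(extraction_fusion=True,  cross_attn_extraction=False,  # plain concatenation
                 cross_attn_encoder=True, self_attn_encoder=True, use_spe=True),
    "(3)":  dict(extraction_fusion=True,  cross_attn_extraction=True,
                 cross_attn_encoder=False, self_attn_encoder=True, use_spe=True),
    "(4)":  dict(extraction_fusion=True,  cross_attn_extraction=True,
                 cross_attn_encoder=True, self_attn_encoder=False, use_spe=True),
    "(5)":  dict(extraction_fusion=True,  cross_attn_extraction=True,
                 cross_attn_encoder=True, self_attn_encoder=True, use_spe=False),
}
```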
5.3. Limitations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
FOV | Field of View
ERP | Equirectangular Projection
UVS | Quasi-Uniform Voronoi Spherical
SCNN | Spherical Convolutional Neural Network
KNN | K-Nearest Neighbor
CAM | Cross-Attention Module
FPS | Frames per Second
SPE | Spherical Positional Encoding
References
- Tang, Y.; He, H.; Wang, Y.; Mao, Z.; Wang, H. Multi-modality 3D object detection in autonomous driving: A review. Neurocomputing 2023, 553, 126587. [Google Scholar] [CrossRef]
- Chu, H.; Liu, H.; Zhuo, J.; Chen, J.; Ma, H. Occlusion-guided multi-modal fusion for vehicle-infrastructure cooperative 3D object detection. Pattern Recognit. 2025, 157, 110939. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Zhang, M.; Li, H.; Li, Q.; Zheng, M.; Dvinianina, I. POD-YOLO: YOLOX-Based Object Detection Model for Panoramic Image. In Proceedings of the Image Processing, Electronics and Computers, Dalian, China, 12–14 April 2024; pp. 236–250. [Google Scholar]
- de La Garanderie, G.P.; Abarghouei, A.A.; Breckon, T.P. Eliminating the blind spot: Adapting 3d object detection and monocular depth estimation to 360 panoramic imagery. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 789–807. [Google Scholar]
- Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y.; et al. LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Zhang, J.; Xu, D.; Li, Y.; Zhao, L.; Su, R. FusionPillars: A 3D object detection network with cross-fusion and self-fusion. Remote Sens. 2023, 15, 2692. [Google Scholar] [CrossRef]
- Xu, X.; Dong, S.; Xu, T.; Ding, L.; Wang, J.; Jiang, P.; Song, L.; Li, J. Fusionrcnn: Lidar-camera fusion for two-stage 3d object detection. Remote Sens. 2023, 15, 1839. [Google Scholar] [CrossRef]
- Sindagi, V.A.; Zhou, Y.; Tuzel, O. Mvx-net: Multimodal voxelnet for 3d object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
- Liu, Z.; Huang, T.; Li, B.; Chen, X.; Wang, X.; Bai, X. EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 8324–8341. [Google Scholar] [CrossRef] [PubMed]
- Zhang, C.; Tian, B.; Sun, Y.; Zhang, R. Monocular 3D Ray-Aware RPN For Roadside View Object Detection. In Proceedings of the 2023 International Annual Conference on Complex Systems and Intelligent Science (CSIS-IAC), Shenzhen, China, 20–22 October 2023; pp. 841–846. [Google Scholar]
- Yang, L.; Zhang, X.; Li, J.; Wang, L.; Zhang, C.; Ju, L.; Li, Z.; Shen, Y. Towards Scenario Generalization for Vision-based Roadside 3D Object Detection. arXiv 2024, arXiv:2401.16110. [Google Scholar] [CrossRef]
- Li, Z.; Chen, Z.; Li, A.; Fang, L.; Jiang, Q.; Liu, X.; Jiang, J. Unsupervised domain adaptation for monocular 3d object detection via self-training. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 245–262. [Google Scholar]
- Li, P.; Zhao, H.; Liu, P.; Cao, F. Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 644–660. [Google Scholar]
- Ye, Z.; Zhang, H.; Gu, J.; Li, X. YOLOv7-3D: A Monocular 3D Traffic Object Detection Method from a Roadside Perspective. Appl. Sci. 2023, 13, 11402. [Google Scholar] [CrossRef]
- Yang, L.; Yu, J.; Zhang, X.; Li, J.; Wang, L.; Huang, Y.; Zhang, C.; Wang, H.; Li, Y. MonoGAE: Roadside monocular 3D object detection with ground-aware embeddings. arXiv 2023, arXiv:2310.00400. [Google Scholar] [CrossRef]
- Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1477–1485. [Google Scholar]
- Yang, L.; Yu, K.; Tang, T.; Li, J.; Yuan, K.; Wang, L.; Zhang, X.; Chen, P. Bevheight: A robust framework for vision-based roadside 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21611–21620. [Google Scholar]
- Shi, H.; Pang, C.; Zhang, J.; Yang, K.; Wu, Y.; Ni, H.; Lin, Y.; Stiefelhagen, R.; Wang, K. Cobev: Elevating roadside 3d object detection with depth and height complementarity. arXiv 2023, arXiv:2310.02815. [Google Scholar] [CrossRef] [PubMed]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
- Ye, M.; Xu, S.; Cao, T. Hvnet: Hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1631–1640. [Google Scholar]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
- Wang, Y.; Deng, J.; Hou, Y.; Li, Y.; Zhang, Y.; Ji, J.; Ouyang, W.; Zhang, Y. CluB: Cluster meets BEV for LiDAR-based 3D object detection. Adv. Neural Inf. Process. Syst. 2024, 36, 40438–40449. [Google Scholar]
- Jin, Z.; Ji, X.; Cheng, Y.; Yang, B.; Yan, C.; Xu, W. Pla-lidar: Physical laser attacks against lidar-based 3d object detection in autonomous vehicle. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–24 May 2023; pp. 1822–1839. [Google Scholar]
- Wu, G.; Cao, T.; Liu, B.; Chen, X.; Ren, Y. Towards universal LiDAR-based 3D object detection by multi-domain knowledge transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 8669–8678. [Google Scholar]
- Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. Rangedet: In defense of range view for lidar-based 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2918–2927. [Google Scholar]
- Wang, K.; Zhou, T.; Zhang, Z.; Chen, T.; Chen, J. PVF-DectNet: Multi-modal 3D detection network based on Perspective-Voxel fusion. Eng. Appl. Artif. Intell. 2023, 120, 105951. [Google Scholar] [CrossRef]
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467. [Google Scholar]
- Chen, M.; Liu, P.; Zhao, H. LiDAR-camera fusion: Dual transformer enhancement for 3D object detection. Eng. Appl. Artif. Intell. 2023, 120, 105815. [Google Scholar] [CrossRef]
- Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–52. [Google Scholar]
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020; pp. 10386–10393. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
- Lin, Y.; Fei, Y.; Gao, Y.; Shi, H.; Xie, Y. A LiDAR-Camera Calibration and Sensor Fusion Method with Edge Effect Elimination. In Proceedings of the 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 11–13 December 2022; pp. 28–34. [Google Scholar]
- Yang, Y.; Gao, Z.; Zhang, J.; Hui, W.; Shi, H.; Xie, Y. UVS-CNNs: Constructing general convolutional neural networks on quasi-uniform spherical images. Comput. Graph. 2024, 122, 103973. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
- Yu, H.; Luo, Y.; Shu, M.; Huo, Y.; Yang, Z.; Shi, Y.; Guo, Z.; Li, H.; Hu, X.; Yuan, J.; et al. Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21361–21370. [Google Scholar]
- Rukhovich, D.; Vorontsova, A.; Konushin, A. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2397–2406. [Google Scholar]
- Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–18. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
- Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Dataset | Batch | lr_backbone | Query | num_point | Optimizer | decay_rate
---|---|---|---|---|---|---
DAIR-V2X-I | 32 | | 20 | 2048 | Adam | 0.7
SHU-3DPOD | 64 | | 8 | 8192 | AdamW | 0.1
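As a rough illustration of how the per-dataset settings in the table might map onto a training setup, the sketch below builds the optimizer and a step-decay learning-rate schedule. Only the optimizer type and decay rate follow the table; the base learning rate and decay interval are placeholders, not reported values.

```python
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module, dataset: str):
    """Illustrative sketch mapping the table's per-dataset settings to an
    optimizer and a step-decay schedule."""
    base_lr = 1e-4      # placeholder, not taken from the table
    decay_step = 10     # placeholder, not taken from the table
    if dataset == "DAIR-V2X-I":
        opt = torch.optim.Adam(model.parameters(), lr=base_lr)
        decay_rate = 0.7
    elif dataset == "SHU-3DPOD":
        opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
        decay_rate = 0.1
    else:
        raise ValueError(f"unknown dataset: {dataset}")
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=decay_step, gamma=decay_rate)
    return opt, sched
```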
3D detection accuracy (%) on the DAIR-V2X-I dataset; the IoU threshold is 0.5 for vehicles and 0.25 for pedestrians and cyclists (modality: L = LiDAR, I = image).

Method | Modality | Veh. Easy | Veh. Mod. | Veh. Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard | Params (MB) | FPS
---|---|---|---|---|---|---|---|---|---|---|---|---
PointPillars [23] | L | 63.07 | 54.00 | 54.01 | 38.53 | 37.20 | 37.28 | 38.46 | 22.60 | 22.49 | - | -
SECOND [24] | L | 71.47 | 53.99 | 54.00 | 55.16 | 52.49 | 52.52 | 54.68 | 31.05 | 31.19 | - | -
PV-RCNN [26] | L | 71.47 | 54.29 | 54.30 | 66.22 | 59.27 | 59.25 | 50.37 | 28.72 | 29.15 | - | -
PV-RCNN++ [27] | L | 67.70 | 55.29 | 55.33 | 66.55 | 63.17 | 63.23 | 48.83 | 27.03 | 27.78 | - | -
ImVoxelNet [44] | I | 44.78 | 37.58 | 37.55 | 6.81 | 6.75 | 6.73 | 21.06 | 13.57 | 13.17 | - | -
BEVFormer [45] | I | 61.37 | 50.73 | 50.73 | 16.89 | 15.82 | 15.95 | 22.16 | 22.13 | 22.06 | - | -
BEVDepth [20] | I | 75.50 | 63.58 | 63.67 | 34.95 | 33.42 | 33.27 | 55.67 | 55.47 | 55.34 | - | -
BEVHeight [21] | I | 77.78 | 65.77 | 65.85 | 41.22 | 39.29 | 39.46 | 60.23 | 60.08 | 60.54 | - | -
CoBEV [22] | I | 75.53 | 63.46 | 63.55 | 30.75 | 30.08 | 29.17 | 51.42 | 54.78 | 54.97 | - | -
MVXNet [10] | | 71.04 | 53.71 | 53.76 | 55.83 | 54.45 | 54.40 | 54.05 | 30.79 | 31.06 | 33.87 | 9.61
FocalsConv [47] | | 65.46 | 53.32 | 53.35 | 68.99 | 68.17 | 68.16 | 42.56 | 25.22 | 25.25 | 8.08 | 11.08
LoGoNet [7] | | 71.67 | 54.40 | 62.59 | 66.96 | 65.49 | 65.48 | 50.61 | 29.12 | 30.05 | 43.36 | 18.57
SMM-POD (ours) | | 77.26 | 77.22 | 75.24 | 72.14 | 72.06 | 69.02 | 69.61 | 69.61 | 69.66 | 145.01 | 11.64
Comparison on the panoramic SHU-3DPOD dataset: per-class error in metres (m) and degrees (°), accuracy in percent (%), and runtime (FPS).

Method | Image Modality | Car (m) | Car (°) | Car (%) | Ped. (m) | Ped. (°) | Ped. (%) | Cyc. (m) | Cyc. (°) | Cyc. (%) | FPS
---|---|---|---|---|---|---|---|---|---|---|---
FocalsConv [47] | Fisheye image | 0.37 | 11.53 | 55.98 | 0.17 | 4.57 | 45.27 | 1.35 | 18.48 | 25.14 | 9.64
 | ERP | 0.39 | 14.09 | 47.61 | 0.22 | 3.44 | 35.80 | 1.79 | 21.73 | 33.36 | 9.26
 | Undistorted | 0.38 | 12.66 | 62.39 | 0.19 | 3.16 | 43.38 | 1.45 | 16.48 | 31.08 | 12.76
LoGoNet [7] | Fisheye image | 0.38 | 5.77 | 37.42 | 0.23 | 12.07 | 57.19 | 0.96 | 16.04 | 37.04 | 16.49
 | ERP | 0.31 | 5.77 | 47.27 | 0.26 | 17.10 | 57.19 | 1.35 | 13.83 | 26.81 | 17.10
 | Undistorted | 0.34 | 3.62 | 58.83 | 0.24 | 14.15 | 54.10 | 0.81 | 10.88 | 30.41 | 21.69
SMM-POD | UVS multi-modal | 0.24 | 0.45 | 81.27 | 0.082 | 0.035 | 94.95 | 0.103 | 0.28 | 85.66 | 11.33
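The (m) and (°) columns above report per-class metric and angular errors. As a minimal sketch, such errors are commonly computed per matched predicted/ground-truth box pair as shown below; the matching and averaging protocol used in the paper may differ, so this is an assumption for illustration only.

```python
import numpy as np

def box_errors(pred_center, pred_yaw, gt_center, gt_yaw):
    """Hedged sketch: localization error (metres) and heading error (degrees)
    between a matched predicted and ground-truth 3D box."""
    loc_err = float(np.linalg.norm(np.asarray(pred_center) - np.asarray(gt_center)))
    # Wrap the yaw difference into [-pi, pi] before converting to degrees.
    dyaw = (pred_yaw - gt_yaw + np.pi) % (2.0 * np.pi) - np.pi
    ang_err = float(np.degrees(abs(dyaw)))
    return loc_err, ang_err

# Example: a prediction 0.2 m off in x and 1 degree off in heading
# print(box_errors([10.2, 3.0, 0.9], np.radians(31.0), [10.0, 3.0, 0.9], np.radians(30.0)))
```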
Modality | Veh. Easy | Veh. Mod. | Veh. Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard
---|---|---|---|---|---|---|---|---|---
RGB + Depth | 74.08 | 72.43 | 69.41 | 64.42 | 63.87 | 58.37 | 58.83 | 54.94 | 54.71
Depth | 37.91 | 35.49 | 31.38 | 26.69 | 21.98 | 30.34 | 21.21 | 18.58 | 16.16
Multi-Modal | 77.26 | 77.22 | 75.24 | 72.14 | 72.06 | 69.02 | 69.61 | 69.61 | 69.66
Configuration | Veh. Easy | Veh. Mod. | Veh. Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard
---|---|---|---|---|---|---|---|---|---
(1) | 61.08 | 59.61 | 58.01 | 53.12 | 52.17 | 49.07 | 51.87 | 49.36 | 47.61
(2) | 63.03 | 60.49 | 57.21 | 58.75 | 56.46 | 52.48 | 52.51 | 52.48 | 48.05
(3) | 58.87 | 53.77 | 52.28 | 54.01 | 50.55 | 47.68 | 51.33 | 43.67 | 42.44
(4) | 70.43 | 62.93 | 61.17 | 65.46 | 54.48 | 55.54 | 58.53 | 51.10 | 48.97
(5) | 72.15 | 71.83 | 71.96 | 67.11 | 66.86 | 64.98 | 64.43 | 64.39 | 64.25
Full | 77.26 | 77.22 | 75.24 | 72.14 | 72.06 | 69.02 | 69.61 | 69.61 | 69.66
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).