RMP: Robust Multi-Modal Perception Under Missing Condition
Abstract
1. Introduction
- We present RMP, a robust multi-modal perception framework designed to address missing-modality scenarios.
- We design a missing feature reconstruction mechanism that exploits intra-modal feature correlations to recover the representations of missing camera views, alleviating the resulting performance degradation. We further introduce a cross-modal adaptive fusion strategy that learns input-dependent weights for fusing image and LiDAR features, improving the efficiency and reliability of cross-modal information interaction (a minimal sketch of both components follows this list).
- Experiments on nuScenes [17] show state-of-the-art-level performance and strong robustness under a variety of missing-camera-view conditions.
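To make the two contributions above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: every name in it (`CameraFeatureReconstructor`, `AdaptiveFusion`, `view_feats`, `missing`) is an illustrative assumption. It models missing feature reconstruction as cross-view attention over the available camera views, and cross-modal adaptive fusion as a learned soft gate over image and LiDAR features.

```python
import torch
import torch.nn as nn


class CameraFeatureReconstructor(nn.Module):
    """Recover features of missing camera views from the available ones,
    exploiting intra-modal correlations via cross-view attention."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, view_feats: torch.Tensor, missing: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, V, C) pooled per-view features, zeroed where missing.
        # missing:    (B, V) bool, True marks a missing camera view.
        recon, _ = self.attn(
            query=view_feats, key=view_feats, value=view_feats,
            key_padding_mask=missing,  # attend only to available views
        )
        # Keep observed features; substitute reconstructions in missing slots.
        return torch.where(missing.unsqueeze(-1), recon, view_feats)


class AdaptiveFusion(nn.Module):
    """Fuse image and LiDAR features with learned, input-dependent weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 2), nn.Softmax(dim=-1),
        )

    def forward(self, img: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        # Per-location convex combination of the two modalities.
        w = self.gate(torch.cat([img, lidar], dim=-1))  # (..., 2) weights
        return w[..., 0:1] * img + w[..., 1:2] * lidar
```

A quick shape check, equally illustrative:

```python
B, V, N, C = 2, 6, 1024, 256
views = torch.randn(B, V, C)
miss = torch.zeros(B, V, dtype=torch.bool)
miss[:, 1] = True  # simulate one missing camera view
views = CameraFeatureReconstructor(C)(views, miss)                     # (B, V, C)
fused = AdaptiveFusion(C)(torch.randn(B, N, C), torch.randn(B, N, C))  # (B, N, C)
```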
2. Related Work
2.1. Single-Modal Perception
2.2. Multi-Modal Perception
2.3. Missing Modality Perception
3. Method
3.1. Overall Framework
3.2. Feature Extraction
3.3. Missing Feature Reconstruction
3.4. Multi-Modal Fusion
4. Experiment
4.1. Dataset
4.2. Metrics
4.3. Implementation Details
4.4. SOTA Comparison
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Mohapatra, S.; Yogamani, S.; Gotzig, H.; Milz, S.; Mader, P. BEVDetNet: Bird’s eye view LiDAR point cloud based real-time 3D object detection for autonomous driving. In Proceedings of the IEEE International Intelligent Transportation Systems Conference, Indianapolis, IN, USA, 19–22 September 2021; pp. 2809–2815. [Google Scholar]
- Ma, R.; Chen, C.; Yang, B.; Li, D.; Wang, H.; Cong, Y.; Hu, Z. CG-SSD: Corner guided single stage 3D object detection from LiDAR point cloud. ISPRS J. Photogramm. Remote Sens. 2022, 191, 33–48. [Google Scholar] [CrossRef]
- Wu, X.; Hou, Y.; Huang, X.; Lin, B.; He, T.; Zhu, X.; Ma, Y.; Wu, B.; Liu, H.; Cai, D.; et al. TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15311–15320. [Google Scholar]
- Yan, S.; Wang, S.; Duan, Y.; Hong, H.; Lee, K.; Kim, D.; Hong, Y. An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security), Philadelphia, PA, USA, 14–16 August 2024. [Google Scholar]
- Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2DPASS: 2D priors assisted semantic segmentation on LiDAR point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 677–695. [Google Scholar]
- Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and asymmetrical 3D convolution networks for LiDAR segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9939–9948. [Google Scholar]
- Zhang, R.; Qiu, H.; Wang, T.; Guo, Z.; Cui, Z.; Qiao, Y.; Li, H.; Gao, P. MonoDETR: Depth-guided transformer for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 9155–9166. [Google Scholar]
- Zhou, Y.; Zhu, H.; Liu, Q.; Chang, S.; Guo, M. Monoatt: Online monocular 3D object detection with adaptive token transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17493–17503. [Google Scholar]
- Yin, J.; Shen, J.; Chen, R.; Li, W.; Yang, R.; Frossard, P.; Wang, W. IS-Fusion: Instance-scene collaborative fusion for multimodal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 14905–14915. [Google Scholar]
- Sun, T.; Zhang, Z.; Tan, X.; Peng, Y.; Qu, Y.; Xie, Y. Uni-to-multi modal knowledge distillation for bidirectional LiDAR-camera semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11059–11072. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Qi, X.; Chen, Y.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Voxel field fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1120–1129. [Google Scholar]
- Njima, W.; Chafii, M.; Shubair, R. GAN based data augmentation for indoor localization using labeled and unlabeled data. In Proceedings of the International Balkan Conference on Communications and Networking, Novi Sad, Serbia, 20–22 September 2021; pp. 36–39. [Google Scholar]
- Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Deformable feature aggregation for dynamic multi-modal 3D object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 628–644. [Google Scholar]
- Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y. LoGoNet: Towards accurate 3D object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17524–17534. [Google Scholar]
- Li, J.; Dai, H.; Han, H.; Ding, Y. MSeg3D: Multi-modal 3D semantic segmentation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21694–21704. [Google Scholar]
- Ge, C.; Chen, J.; Xie, E.; Wang, Z.; Hong, L.; Lu, H.; Li, Z.; Luo, P. MetaBEV: Solving sensor failures for 3D detection and map segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 8721–8731. [Google Scholar]
- Caesar, H.; Bankiti, V.; Lang, A.; Vora, S.; Liong, V.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631. [Google Scholar]
- Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical transformer for LiDAR-based 3D recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17545–17555. [Google Scholar]
- Qi, C.; Su, H.; Mo, K.; Guibas, L. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud-based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
- Li, Z.; Wang, F.; Wang, N. LiDAR R-CNN: An efficient and universal 3D object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7546–7555. [Google Scholar]
- Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position embedding transformation for multi-view 3D object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 531–548. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv 2022, arXiv:2205.13542. [Google Scholar]
- Xie, Y.; Xu, C.; Rakotosaona, M.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. SparseFusion: Fusing multi-modal sparse representations for multi-sensor 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17591–17602. [Google Scholar]
- Zou, J.; Huang, T.; Yang, G.; Guo, Z.; Luo, T.; Feng, C.; Zuo, W. UniM2AE: Multi-modal masked autoencoders with unified 3D representation for 3D perception in autonomous driving. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 296–313. [Google Scholar]
- Cui, L.; Li, X.; Meng, M.; Mo, X. MMFusion: A generalized multi-modal fusion detection framework. In Proceedings of the IEEE International Conference on Development and Learning, Macau, China, 9–12 November 2023; pp. 415–422. [Google Scholar]
- Njima, W.; Bazzi, A.; Chafii, M. DNN-Based Indoor Localization Under Limited Dataset Using GANs and Semi-Supervised Learning. IEEE Access 2022, 10, 69896–69909. [Google Scholar] [CrossRef]
- Yu, H.; Chan, S.; Zhou, X.; Zhang, X. SGFormer: Semantic-Geometry Fusion Transformer for Multi-modal 3D Panoptic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–27 February 2025; pp. 9616–9625. [Google Scholar]
- Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2647–2664. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. PolarNet: An improved grid representation for online LiDAR point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9601–9610. [Google Scholar]
- Yan, X.; Gao, J.; Li, J.; Zhang, R.; Li, Z.; Huang, R.; Cui, S. Sparse single sweep LiDAR point cloud segmentation via learning contextual shape priors from scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; pp. 3101–3109. [Google Scholar]
- Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching efficient 3D architectures with sparse point-voxel convolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 685–702. [Google Scholar]
- Cheng, R.; Razani, R.; Taghavi, E.; Li, E.; Liu, B. (AF)2-S3Net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12547–12556. [Google Scholar]
- Tan, S.; Fazlali, H.; Xu, Y.; Ren, Y.; Liu, B. Uplifting range-view-based 3D semantic segmentation in real-time with multi-sensor fusion. In Proceedings of the IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024; pp. 16162–16169. [Google Scholar]
- Zhuang, Z.; Li, R.; Jia, K.; Wang, Q.; Li, Y.; Tan, M. Perception-aware multi-sensor fusion for 3D LiDAR semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16280–16290. [Google Scholar]
- Wu, Z.; Zhang, Y.; Lan, R.; Qiu, S.; Ran, S.; Liu, Y. APPFNet: Adaptive point-pixel fusion network for 3D semantic segmentation with neighbor feature aggregation. Expert Syst. Appl. 2024, 251, 123990. [Google Scholar] [CrossRef]
- Tan, M.; Zhuang, Z.; Chen, S.; Li, R.; Jia, K.; Wang, Q.; Li, Y. EPMF: Efficient perception-aware multi-sensor fusion for 3D semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8258–8273. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Dai, H.; Ding, Y. Self-distillation for robust LiDAR semantic segmentation in autonomous driving. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–676. [Google Scholar]
- Genova, K.; Yin, X.; Kundu, A.; Pantofaru, C.; Cole, F.; Sud, A.; Brewington, B.; Shucker, B.; Funkhouser, T. Learning 3D semantic segmentation with only 2D image supervision. In Proceedings of the International Conference on 3D Vision, London, UK, 1–3 December 2021; pp. 361–372. [Google Scholar]
Per-class IoU (%) comparison with state-of-the-art methods on the nuScenes test set. Input: L = LiDAR, LC = LiDAR + camera.

| Method | Input | mIoU | Barrier | Bicycle | Bus | Car | Construction Vehicle | Motorcycle | Pedestrian | Traffic Cone | Trailer | Truck | Driveable Surface | Other Flat | Sidewalk | Terrain | Manmade | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PolarNet [32] | L | 69.4 | 72.2 | 16.8 | 77.0 | 86.5 | 51.1 | 69.7 | 64.8 | 54.1 | 69.7 | 63.5 | 96.6 | 67.1 | 77.7 | 72.1 | 87.1 | 84.5 |
| JS3C-Net [33] | L | 73.6 | 80.1 | 26.2 | 87.8 | 84.5 | 55.2 | 72.6 | 71.3 | 66.3 | 66.8 | 71.2 | 96.8 | 64.5 | 76.9 | 74.1 | 87.5 | 86.1 |
| SPVNAS [34] | L | 77.4 | 80.0 | 30.0 | 91.9 | 90.8 | 64.7 | 79.0 | 75.6 | 70.9 | 81.0 | 74.6 | 97.4 | 69.2 | 80.0 | 76.1 | 89.3 | 87.1 |
| Cylinder3D [6] | L | 77.2 | 82.8 | 29.8 | 84.3 | 89.4 | 63.0 | 79.3 | 77.2 | 73.4 | 84.6 | 69.1 | 97.7 | 70.2 | 80.3 | 75.5 | 90.4 | 87.6 |
| AF2S3Net [35] | L | 78.3 | 78.9 | 52.2 | 89.9 | 84.2 | 77.4 | 74.3 | 77.3 | 72.0 | 83.9 | 73.8 | 97.1 | 66.5 | 77.5 | 74.0 | 87.7 | 86.8 |
| SphereFormer [18] | L | 78.1 | 81.5 | 39.7 | 93.4 | 87.5 | 66.4 | 75.7 | 77.2 | 70.6 | 85.6 | 73.6 | 97.6 | 64.8 | 79.8 | 75.0 | 92.2 | 89.0 |
| LaCRange [36] | LC | 75.3 | 78.0 | 32.6 | 88.3 | 84.5 | 63.9 | 81.5 | 75.6 | 72.5 | 64.7 | 68.0 | 96.6 | 65.9 | 78.6 | 75.0 | 90.4 | 88.3 |
| PMF-ResNet50 [37] | LC | 77.0 | 82.1 | 40.3 | 80.9 | 86.4 | 63.7 | 79.2 | 79.8 | 75.9 | 81.2 | 67.1 | 97.3 | 67.7 | 78.1 | 74.5 | 90.0 | 88.5 |
| APPFNet [38] | LC | 78.1 | 77.2 | 52.2 | 90.9 | 93.6 | 54.2 | 79.2 | 80.7 | 71.2 | 64.1 | 84.2 | 97.5 | 73.9 | 77.2 | 75.2 | 91.1 | 87.9 |
| PMF [37] | LC | 75.5 | 80.1 | 35.7 | 79.7 | 86.0 | 62.4 | 76.3 | 76.9 | 73.6 | 78.5 | 66.9 | 97.1 | 65.3 | 77.6 | 74.4 | 89.5 | 88.5 |
| EPMF [39] | LC | 79.2 | 76.9 | 39.8 | 90.3 | 87.8 | 72.0 | 86.4 | 79.6 | 76.6 | 84.1 | 74.9 | 97.7 | 66.4 | 79.5 | 76.4 | 91.1 | 87.9 |
| MSeg3D-H48 [15] | LC | 81.1 | 83.1 | 42.5 | 94.9 | 92.0 | 67.1 | 78.6 | 85.7 | 80.5 | 87.5 | 77.3 | 97.7 | 69.8 | 81.2 | 77.8 | 92.4 | 90.1 |
| MSeg3D ★ [15] | LC | 72.6 | 71.5 | 39.7 | 92.5 | 87.8 | 45.0 | 79.5 | 76.9 | 61.2 | 51.0 | 83.0 | 95.7 | 69.5 | 69.3 | 70.3 | 85.3 | 84.1 |
| Ours | LC | 79.8 | 82.5 | 44.9 | 92.7 | 91.0 | 71.5 | 73.9 | 82.5 | 76.1 | 85.7 | 75.5 | 97.4 | 69.1 | 79.9 | 76.3 | 90.3 | 87.4 |
Per-class IoU (%) comparison with state-of-the-art methods on the nuScenes validation set. Input: L = LiDAR, LC = LiDAR + camera.

| Method | Input | mIoU | Barrier | Bicycle | Bus | Car | Construction Vehicle | Motorcycle | Pedestrian | Traffic Cone | Trailer | Truck | Driveable Surface | Other Flat | Sidewalk | Terrain | Manmade | Vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AF2S3Net [35] | L | 62.2 | 60.3 | 12.6 | 82.9 | 80.0 | 20.1 | 62.0 | 59.0 | 49.0 | 42.2 | 67.4 | 94.2 | 68.0 | 64.1 | 68.6 | 82.9 | 82.4 |
| PolarNet [32] | L | 71.0 | 74.7 | 28.2 | 85.3 | 90.9 | 35.1 | 77.5 | 71.3 | 58.8 | 57.4 | 76.1 | 96.5 | 71.1 | 74.7 | 74.0 | 87.3 | 85.7 |
| Cylinder3D [6] | L | 76.1 | 76.4 | 40.3 | 91.2 | 93.8 | 51.3 | 78.0 | 78.9 | 64.9 | 62.1 | 84.4 | 96.8 | 71.6 | 76.4 | 75.4 | 90.5 | 87.4 |
| 2DPASS [5] | L | 76.4 | 74.4 | 44.3 | 93.6 | 92.0 | 54.0 | 79.7 | 78.9 | 57.2 | 72.5 | 85.7 | 96.2 | 72.7 | 74.1 | 74.5 | 87.5 | 85.4 |
| SphereFormer [18] | L | 78.4 | 77.7 | 43.8 | 94.5 | 93.1 | 52.4 | 86.9 | 81.2 | 65.4 | 73.4 | 85.3 | 97.0 | 73.4 | 75.4 | 75.0 | 91.0 | 89.2 |
| SDSeg3D [40] | L | 77.7 | 77.5 | 49.4 | 93.9 | 92.5 | 54.9 | 86.7 | 80.1 | 67.8 | 65.7 | 86.0 | 96.4 | 74.0 | 74.9 | 74.5 | 86.0 | 82.8 |
| APPFNet [38] | LC | 78.1 | 77.2 | 52.2 | 90.9 | 93.6 | 54.2 | 79.2 | 80.7 | 71.2 | 64.1 | 84.2 | 97.5 | 73.9 | 77.2 | 75.2 | 91.1 | 87.9 |
| PMF-ResNet50 [37] | LC | 79.0 | 74.9 | 55.4 | 91.0 | 93.0 | 60.5 | 80.3 | 83.2 | 73.6 | 67.2 | 84.5 | 95.9 | 75.1 | 74.6 | 75.5 | 90.3 | 89.0 |
| 2D3DNet [41] | LC | 79.0 | 78.3 | 55.1 | 95.4 | 87.7 | 59.4 | 79.3 | 80.7 | 70.2 | 68.2 | 86.6 | 96.1 | 74.9 | 75.7 | 75.1 | 91.4 | 89.9 |
| MSeg3D-H48 [15] | LC | 80.0 | 79.2 | 59.8 | 96.1 | 89.4 | 54.1 | 89.3 | 82.2 | 72.8 | 70.4 | 86.0 | 96.7 | 73.6 | 76.1 | 75.6 | 89.3 | 88.3 |
| MSeg3D ★ [15] | LC | 74.7 | 72.4 | 48.1 | 93.8 | 88.4 | 47.0 | 82.6 | 79.7 | 64.6 | 56.1 | 84.1 | 96.0 | 69.5 | 70.2 | 71.0 | 87.0 | 86.2 |
| MSeg3D ★ [15] | LC | 79.0 | 78.4 | 55.4 | 95.3 | 89.4 | 57.3 | 88.4 | 83.4 | 68.8 | 66.5 | 86.1 | 96.4 | 74.9 | 74.3 | 73.0 | 89.0 | 87.1 |
| Ours | LC | 80.3 | 79.2 | 56.8 | 96.1 | 89.7 | 56.7 | 89.5 | 84.4 | 72.6 | 71.1 | 87.7 | 96.9 | 75.9 | 76.5 | 76.1 | 89.3 | 87.4 |
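As a note on the metric: mIoU in both tables is the mean of the per-class IoUs. A minimal sketch of the standard computation from a confusion matrix follows; the function name `miou` and its interface are illustrative, not the evaluation code used in the paper.

```python
import torch


def miou(conf: torch.Tensor) -> torch.Tensor:
    """Mean IoU from a (K, K) confusion matrix (rows: ground truth, cols: prediction)."""
    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp  # predicted as class k but wrong
    fn = conf.sum(dim=1).float() - tp  # instances of class k that were missed
    iou = tp / (tp + fp + fn).clamp(min=1)  # per-class IoU; guard against empty classes
    return iou.mean()
```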