Geometry-Aware Cross-Modal Translation with Temporal Consistency for Robust Multi-Sensor Fusion in Autonomous Driving
Abstract
1. Introduction
- Spatio-Temporal Consistency Module: Reduces temporal inconsistency by 77.5% (IoU variation 0.182 → 0.041) through optical flow-guided synthesis with separate static/dynamic handling, achieving stable perception for control.
- Multi-Scale Geometric Alignment: Preserves metric accuracy via pyramid processing with reciprocal constraints, improving Chamfer distance to 0.086 m while maintaining 87.3% surface normal consistency (both headline metrics are sketched in code after this list).
- Calibrated Uncertainty Estimation: First framework to provide reliable confidence scores (ECE = 0.034) for synthesized data through dual-branch aleatoric/epistemic modeling, enabling risk-aware behavioral adaptation.
- Real-World Validation: The comprehensive SPEED dataset (69,620 frames, 15 failure modes) demonstrates that safety is maintained under sustained sensor loss, with 4.7× faster recovery than baselines.
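To make the two headline metrics concrete, the sketch below shows one standard way to compute the symmetric Chamfer distance and the frame-to-frame IoU variation quoted above. This is a minimal NumPy illustration under our own assumptions (brute-force nearest neighbors; IoU variation taken as the standard deviation of consecutive-frame BEV-mask IoUs), not the authors' implementation.

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance (meters) between point clouds p (N, 3) and q (M, 3)."""
    # Brute-force pairwise distances; fine for a sketch, too slow for full LiDAR scans.
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return 0.5 * float(d.min(axis=1).mean() + d.min(axis=0).mean())

def temporal_iou_variation(masks: np.ndarray) -> float:
    """Std. dev. of consecutive-frame BEV-mask IoU over a clip (T, H, W); lower = steadier."""
    ious = []
    for a, b in zip(masks[:-1], masks[1:]):
        union = np.logical_or(a, b).sum()
        ious.append(np.logical_and(a, b).sum() / max(union, 1))
    return float(np.std(ious))
```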
2. Related Work
2.1. Neural Rendering and Cross-Modal Synthesis
2.1.1. LiDAR Synthesis and Novel View Generation
2.1.2. Generative Models for Sensor Data Augmentation
2.2. Multi-Sensor Fusion for Autonomous Driving
2.2.1. Bird’s-Eye-View Representation Learning
2.2.2. Robust Fusion Under Adverse Conditions
2.2.3. Real-Time Multi-Modal Architectures
2.3. Uncertainty Estimation and Reliability Assessment
2.3.1. Uncertainty Quantification in Autonomous Driving
2.3.2. Motion-Aware Temporal Processing
2.4. Comprehensive Multi-Modal Datasets and Benchmarks
2.4.1. Large-Scale Multi-Modal Perception
2.4.2. Industry-Scale Approaches
2.5. Cross-Modal Conflict Resolution
2.6. Positioning of Our Approach
- Joint Cross-Modal and Temporal Modeling: While existing methods focus on single-frame translation, we jointly address stereo-to-LiDAR synthesis and temporal consistency in a unified framework.
- Geometry-Consistent Adversarial Training: Our approach anchors generated point clouds to scene structure through multi-scale geometric constraints, ensuring both local and global geometric fidelity.
- Uncertainty-Aware Real-Time Processing: We provide the first uncertainty quantification framework for cross-modal translation that operates within real-time constraints suitable for deployment (a fusion-weight sketch follows this list).
- Comprehensive Sensor Failure Evaluation: Our evaluation protocol specifically addresses realistic sensor failure scenarios with up to 20% missing sensor streams, filling a critical gap in existing benchmarks.
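As referenced above, here is one plausible reading of the uncertainty-aware fusion mechanism (Sections 3.4.1–3.4.3): per-modality uncertainty maps are converted into softmax fusion weights so that less reliable synthesized data contributes less. The function name, tensor shapes, and temperature parameter are illustrative assumptions; the paper's actual adaptive weighting may differ.

```python
import numpy as np

def adaptive_fusion(feats: list[np.ndarray], uncertainties: list[np.ndarray],
                    temperature: float = 1.0) -> np.ndarray:
    """Fuse per-modality feature maps with softmax weights derived from predicted
    uncertainty: the higher a modality's uncertainty at a location, the smaller
    its contribution. Each feats[i] / uncertainties[i] has shape (C, H, W)."""
    logits = -np.stack(uncertainties) / temperature  # (num_modalities, C, H, W)
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)                # softmax over the modality axis
    return (w * np.stack(feats)).sum(axis=0)
```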
3. Enhanced Cross-Modal Translation Framework
3.1. Framework Overview
3.2. Spatio-Temporal Consistency Module
3.2.1. Optical Flow-Based Temporal Alignment
3.2.2. Temporal Smoothness Regularization
3.2.3. Motion-Aware Synthesis for Dynamic Objects
3.3. Multi-Scale Geometric Alignment Network
3.3.1. Pyramid-Based Processing
3.3.2. Multi-Scale Geometric Loss
3.4. Uncertainty-Aware Fusion Mechanism
3.4.1. Aleatoric and Epistemic Uncertainty Estimation
3.4.2. Uncertainty-Weighted Loss Function
3.4.3. Adaptive Fusion Weights
3.5. Enhanced Generation Networks with Geometric Priors
3.5.1. Depth-Aware Residual Generator with Multi-Scale Injection
3.5.2. Geometry-Consistent Adversarial Training
3.6. Composite Loss Optimization with Curriculum Learning
3.7. Training Strategy and Implementation Details
3.7.1. Progressive Multi-Stage Training Protocol
3.7.2. Enhanced Data Augmentation with Physical Constraints
- Sensor Noise Modeling: Additive Gaussian noise on the input streams;
- Environmental Perturbations: Random global brightness modulation;
- Occlusion Simulation: Random rectangular masks with bounded coverage;
- Temporal Dropout: Random frame dropping to simulate sensor failures (all four augmentations are sketched in code after this list).
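The numeric ranges for these augmentations did not survive extraction, so every default in the sketch below (noise σ, brightness range, occlusion coverage, drop probability) is a placeholder rather than the paper's setting; only the structure of the four augmentation families follows the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(frame: np.ndarray, sigma: float = 0.01, brightness: float = 0.2,
            max_cover: float = 0.2, drop_prob: float = 0.05):
    """Apply the four augmentation families listed above to a float image in [0, 1].
    Returns None when the frame is dropped (temporal dropout).
    All default parameter values are placeholders, not the paper's settings."""
    if rng.random() < drop_prob:                        # temporal dropout
        return None
    img = frame.astype(np.float32)
    img += rng.normal(0.0, sigma, img.shape)            # additive Gaussian sensor noise
    img *= 1.0 + rng.uniform(-brightness, brightness)   # global brightness modulation
    h, w = img.shape[:2]                                # one random rectangular occlusion
    side = np.sqrt(rng.uniform(0.0, max_cover))
    mh, mw = int(h * side), int(w * side)
    y, x = rng.integers(0, h - mh + 1), rng.integers(0, w - mw + 1)
    img[y:y + mh, x:x + mw] = 0.0
    return np.clip(img, 0.0, 1.0)
```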
4. Comprehensive Experimental Evaluation
4.1. Experimental Setup
4.1.1. Implementation Details
4.1.2. Real-World Sensor Failure Dataset (SPEED)
4.2. Comparison with State-of-the-Art Methods
4.3. Cross-Modal Translation Quality Analysis
4.3.1. Stereo View Synthesis Evaluation
4.3.2. Point Cloud Reconstruction Fidelity
4.4. Real-World Multi-Modal Validation
4.5. Sensor Failure Robustness Evaluation
4.6. Uncertainty Quantification and Calibration Analysis
4.7. Comprehensive Ablation Studies
Extended Ablation Studies
4.8. Computational Efficiency and Real-Time Performance
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Singh, S.; Saini, B. Automation of cars: A review and analysis of the state of art technology. Int. J. Eng. Res. Technol. 2021, 10, 449–456.
- Waymo LLC. Waymo Safety Report 2023; Technical Report; Waymo: Mountain View, CA, USA, 2023.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
- Palladin, E.; Dietze, R.; Narayanan, P.; Bijelic, M.; Heide, F. SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024.
- Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469.
- Cao, Y.; Xiao, C.; Cyr, B.; Zhou, Y.; Park, W.; Rampazzi, S.; Chen, Q.A.; Fu, K.; Mao, Z.M. Adversarial sensor attack on lidar-based perception in autonomous driving. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 2267–2281.
- Shetty, A.; Yu, M.; Kurzhanskiy, A.; Grembek, O.; Tavafoghi, H.; Varaiya, P. Safety challenges for autonomous vehicles in the absence of connectivity. Transp. Res. Part C Emerg. Technol. 2021, 128, 103133.
- Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–18.
- Park, J.I.; Park, J.; Kim, K.S. Fast and accurate desnowing algorithm for LiDAR point clouds. IEEE Access 2020, 8, 160202–160212.
- Yang, W.; Tan, R.T.; Wang, S.; Fang, Y.; Liu, J. Single image deraining: From model-based to data-driven and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4059–4077.
- Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4770–4778.
- Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. DADA: Depth-aware domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7364–7373.
- Sensoy, M.; Kaplan, L.; Kandemir, M. Evidential deep learning to quantify classification uncertainty. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31.
- Feng, D.; Rosenbaum, L.; Dietmayer, K. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. arXiv 2018, arXiv:1804.05132.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
- Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 102.
- Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139.
- Chabot, F.; Trouvé-Peloux, P.; Bursuc, A. GaussianBeV: 3D Gaussian Representation Meets Perception Models for BeV Segmentation. arXiv 2024, arXiv:2407.14108.
- Zheng, Z.; Lu, F.; Xue, W.; Chen, G.; Jiang, C. LiDAR4D: Dynamic Neural Fields for Novel Space-Time View LiDAR Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5145–5154.
- Yao, G.; Zhang, X.; Hao, S.; Wang, X.; Pan, Y. Monocular Visual Place Recognition in LiDAR Maps via Cross-Modal State Space Model and Multi-View Matching. arXiv 2024, arXiv:2410.06285.
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851.
- Zhao, X.; Zhang, X.; Yang, D.; Sun, M.; Wang, Z.; Li, B.; Jiao, J. MaskBEV: Towards a Unified Framework for BEV Detection and Map Segmentation. In Proceedings of the ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024.
- Yuan, Y.; Sester, M. StreamLTS: Query-Based Temporal-Spatial LiDAR Fusion for Cooperative Object Detection. In Computer Vision—ECCV 2024 Workshops, Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15630.
- Weng, X.; Ivanovic, B.; Wang, Y.; Wang, Y.; Pavone, M. PARA-Drive: Parallelized Architecture for Real-Time Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15449–15458.
- Xu, D.; Li, H.; Wang, Q.; Song, Z.; Chen, L.; Deng, H. M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving. arXiv 2024, arXiv:2403.12552.
- Chen, L.; Wang, J.; Mortlock, T.; Khargonekar, P.; Al Faruque, M.A. Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025.
- Park, G.; Koh, J.; Kim, J.; Moon, J.; Choi, J. LiDAR-Based 3D Temporal Object Detection via Motion-Aware LiDAR Feature Fusion. Sensors 2024, 24, 4667.
- Wang, T.; Mao, X.; Zhu, C.; Xu, R.; Lyu, R.; Li, P.; Chen, X.; Zhang, W.; Chen, K.; Xue, T.; et al. EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.
- Lyu, R.; Lin, J.; Wang, T.; Yang, S.; Mao, X.; Chen, Y.; Xu, R.; Huang, H.; Zhu, C.; Lin, D.; et al. MMScan: Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024.
- NVIDIA Research Team. Hydra-MDP: End-to-End Multimodal Planning. In Proceedings of the CVPR 2024 Autonomous Grand Challenge, Seattle, WA, USA, 17–21 June 2024.
- Gao, S.; Yang, J.; Chen, L.; Chitta, K.; Qiu, Y.; Geiger, A.; Zhang, J.; Li, H. Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024.
- Fu, J.; Gao, C.; Ai, J.; Wang, Y.; Zheng, Y.; Liang, X.; Xing, E.P. Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection. arXiv 2024, arXiv:2403.07593.
- Manivasagam, S.; Wang, S.; Wong, K.; Zeng, W.; Sazanovich, M.; Tan, S.; Yang, B.; Ma, W.C.; Urtasun, R. LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11164–11173.
- Li, C.; Ren, Y.; Liu, B. PCGen: Point Cloud Generator for LiDAR Simulation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 11676–11682.
| Method | KITTI-360 CD (m) ↓ | nuScenes CD (m) ↓ | BEV mIoU (%) ↑ | Temporal Consistency ↑ | Real-Time (fps) ↑ | Uncertainty ECE ↓ | Params (M) |
|---|---|---|---|---|---|---|---|
| LiDARsim [33] | 3.223 | 12.138 | 52.1 | 0.234 | 8 | - | - |
| PCGen [34] | 0.464 | 2.200 | 58.7 | 0.312 | 12 | - | 28.5 |
| LiDAR4D [19] | 0.144 | 0.323 | 67.4 | 0.678 | 6 | - | 67.2 |
| GS-LiDAR [18] | 0.138 | 0.312 | 69.8 | 0.721 | 8 | - | 89.4 |
| SAMFusion [4] | 0.156 | 0.334 | 72.3 | 0.645 | 11 | 0.089 | 78.6 |
| PARA-Drive [24] | 0.149 | 0.329 | 75.2 | 0.712 | 14 | - | 105.2 |
| Ours (Enhanced) | 0.096 | 0.298 | 79.3 | 0.727 | 17 | 0.074 | 34.0 |
| Environment | Failure Type | Duration (s) | BEV mIoU (%) | 3D mAP (%) | Chamfer (m) | Uncertainty Score | Recovery Time (s) |
|---|---|---|---|---|---|---|---|
| Urban Dense | Camera Fog/Rain | 12.3 | 86.2 ± 2.1 | 72.9 ± 1.8 | 0.089 | 0.923 | 2.1 |
| | LiDAR Dust/Debris | 8.7 | 87.8 ± 1.9 | 74.3 ± 2.2 | 0.082 | 0.934 | 1.8 |
| | Complete Camera Loss | 25.6 | 83.4 ± 3.2 | 69.7 ± 2.8 | 0.104 | 0.887 | 3.2 |
| | Complete LiDAR Loss | 45.2 | 81.9 ± 3.8 | 67.1 ± 3.4 | 0.127 | 0.847 | 4.1 |
| Highway | Sensor Disconnection | 5.8 | 88.1 ± 1.6 | 75.8 ± 1.4 | 0.078 | 0.945 | 1.5 |
| | Heavy Rain Impact | 18.9 | 85.6 ± 2.4 | 71.4 ± 2.1 | 0.092 | 0.901 | 2.8 |
| | Snow/Ice Accumulation | 32.1 | 82.2 ± 3.1 | 68.6 ± 2.9 | 0.118 | 0.876 | 3.7 |
| Average | - | 21.2 | 85.0 ± 2.6 | 71.4 ± 2.5 | 0.098 | 0.902 | 2.7 |
| Baseline (No Recovery) | - | - | 58.3 ± 5.2 | 41.7 ± 4.8 | 0.342 | 0.567 | 12.8 |
| Condition | ECE ↓ | MCE ↓ | Reliability | Sharpness | AUROC ↑ |
|---|---|---|---|---|---|
| Clear Weather | 0.021 | 0.087 | 0.953 | 0.234 | 0.94 |
| Light Rain | 0.034 | 0.112 | 0.923 | 0.189 | 0.91 |
| Heavy Rain | 0.056 | 0.167 | 0.887 | 0.145 | 0.87 |
| Dense Fog | 0.071 | 0.203 | 0.834 | 0.123 | 0.83 |
| Snow/Ice | 0.089 | 0.234 | 0.798 | 0.098 | 0.79 |
| Night (Low Light) | 0.045 | 0.134 | 0.901 | 0.167 | 0.88 |
| Average | 0.053 | 0.156 | 0.883 | 0.159 | 0.85 |
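The ECE and MCE columns in the calibration table above follow their standard definitions. A minimal sketch, assuming equal-width confidence bins (15 bins is our choice, not necessarily the paper's):

```python
import numpy as np

def ece_mce(conf: np.ndarray, correct: np.ndarray, n_bins: int = 15):
    """Expected and Maximum Calibration Error over equal-width confidence bins.
    conf: predicted confidences in [0, 1]; correct: 1 if the prediction was right."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
        ece += in_bin.mean() * gap  # bin gap weighted by bin population
        mce = max(mce, gap)
    return float(ece), float(mce)
```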
| Configuration | Chamfer (m) ↓ | BEV mIoU (%) | Temporal Consistency | Uncertainty ECE ↓ | 3D mAP (%) | FPS | Params (M) |
|---|---|---|---|---|---|---|---|
| Baseline (U-Net + PatchGAN) | 0.312 | 74.6 | 0.721 | 0.145 | 58.3 | 31 | 22.5 |
| + Multi-scale Geometry | 0.186 | 76.8 | 0.739 | 0.132 | 61.7 | 28 | 25.3 |
| + Temporal Consistency | 0.124 | 78.2 | 0.891 | 0.118 | 65.4 | 25 | 28.1 |
| + Uncertainty Estimation | 0.098 | 81.1 | 0.903 | 0.067 | 68.9 | 23 | 30.6 |
| + Enhanced Adversarial | 0.089 | 85.7 | 0.915 | 0.058 | 70.2 | 22 | 32.4 |
| Full Framework | 0.086 | 89.3 | 0.927 | 0.034 | 71.4 | 17 | 34.0 |
| Configuration | CD (m) ↓ | BEV mIoU (%) ↑ | Temporal Consistency ↑ | FPS ↑ |
|---|---|---|---|---|
| Pyramid Levels | | | | |
| 1 level (1×) | 0.142 | 76.3 | 0.834 | 34 |
| 2 levels (1×, 0.5×) | 0.109 | 81.7 | 0.878 | 29 |
| 3 levels (ours) | 0.086 | 89.3 | 0.927 | 27 |
| 4 levels (+0.125×) | 0.088 | 89.1 | 0.925 | 19 |
| Component Isolation | | | | |
| Temporal constraint only | 0.187 | 77.9 | 0.893 | 29 |
| Geometric constraint only | 0.102 | 82.4 | 0.742 | 28 |
| Both (full) | 0.086 | 89.3 | 0.927 | 27 |
| Loss Weight Sensitivity | | | | |
| 0.1 | 0.098 | 85.2 | 0.856 | 27 |
| 1.0 (ours) | 0.086 | 89.3 | 0.927 | 27 |
| 5.0 | 0.104 | 84.3 | 0.918 | 27 |
| Flow Quality | | | | |
| RAFT-tiny | 0.103 | 86.1 | 0.874 | 31 |
| RAFT-small (ours) | 0.086 | 89.3 | 0.927 | 27 |
| RAFT-large | 0.084 | 89.7 | 0.934 | 18 |
| Method | Training Time (h) | Memory (GB) | FLOPs (G) | Inference (ms) | GPU Util (%) | Power (W) | Deployment Ready |
|---|---|---|---|---|---|---|---|
| LiDAR4D | 72 | 24.5 | 156.8 | 167 | 95 | 185 | No |
| GS-LiDAR | 48 | 18.2 | 134.2 | 125 | 87 | 162 | No |
| SAMFusion | 54 | 21.3 | 98.4 | 91 | 82 | 148 | Partial |
| PARA-Drive | 36 | 16.8 | 89.6 | 71 | 78 | 134 | Yes |
| Ours | 24 | 12.8 | 62.1 | 37 | 68 | 98 | Yes |
Lu, Z.; Pang, J.; Zhou, Z. Geometry-Aware Cross-Modal Translation with Temporal Consistency for Robust Multi-Sensor Fusion in Autonomous Driving. Electronics 2025, 14, 4663. https://doi.org/10.3390/electronics14234663