Real-Time 3D Scene Understanding for Road Safety: Depth Estimation and Object Detection for Autonomous Vehicle Awareness
Abstract
1. Introduction
2. State of the Art
2.1. Deep Learning-Based Stereo Models
- DeepPruner [3] introduced a differentiable PatchMatch module that prunes disparity candidates efficiently, reducing the computational burden of constructing a full cost volume. Instead of evaluating all possible disparities across the stereo image pair, the network learns to select a sparse set of promising candidates for each pixel. This approach enables near real-time inference while maintaining competitive accuracy on KITTI and SceneFlow benchmarks.
- Stereo Anywhere [4] introduced a dual-branch architecture that fuses geometric stereo constraints with monocular depth priors from vision foundation models. It achieves robust zero-shot generalization and handles challenging cases such as mirrors and transparent surfaces by leveraging large-scale monocular priors alongside stereo cues. The authors also present the MonoTrap dataset, designed to evaluate stereo systems under optical illusions and perspective ambiguities.
- S3M-Net [5] proposed a joint learning framework that combines semantic segmentation and stereo matching. This integration improves structural consistency and scene understanding, which is critical for autonomous navigation. Using a feature fusion module and a semantic consistency-guided loss, S3M-Net encourages disparity maps to respect object boundaries and semantic structures, reducing errors in occluded or textureless regions.
- FAMNet [12] is a lightweight stereo matching network designed for real-time depth estimation in autonomous driving. It uses attention-guided cost volume construction and multi-scale aggregation to balance accuracy and efficiency. The Fusion Attention-based Cost Volume module reduces reliance on computationally expensive 3D convolutions, while multi-scale attention aggregation enhances disparity prediction through hierarchical feature integration. FAMNet achieves significant improvements over baseline models on KITTI while maintaining low latency, making it suitable for deployment on resource-constrained automotive platforms.
- Cross-Spectral Gated RGB Stereo [13] combines gated imaging with stereo HDR cameras to improve depth estimation in low-light and long-range scenarios, outperforming LiDAR in certain conditions. By fusing RGB, near-infrared, and active illumination cues, this approach addresses some limitations of passive stereo and single-modality sensors, including in some adverse weather and nighttime conditions.
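The saving from candidate pruning described for DeepPruner can be illustrated with a small sketch. The code below is not DeepPruner's differentiable PatchMatch module; it is a hand-written, non-learned heuristic on a synthetic scanline, assumed here only to show how restricting each pixel to a few disparity candidates reduces the number of cost evaluations compared with a full search. All function names, the absolute-difference cost, and the test data are hypothetical.

```python
def matching_cost(left, right, x, d):
    """Absolute intensity difference between left[x] and right[x - d]."""
    xr = x - d
    if xr < 0:
        return float("inf")  # candidate falls outside the right scanline
    return abs(left[x] - right[xr])


def best_disparity(left, right, x, candidates):
    """Return the lowest-cost candidate (ties go to the smaller disparity)."""
    return min((matching_cost(left, right, x, d), d) for d in candidates)[1]


def full_search(left, right, max_disp):
    """Evaluate every disparity in [0, max_disp) at every pixel."""
    disparities = [best_disparity(left, right, x, range(max_disp))
                   for x in range(len(left))]
    return disparities, len(left) * max_disp


def pruned_search(left, right, max_disp, radius=1):
    """Full search on the first max_disp pixels, then only test disparities
    within `radius` of the previous pixel's result (assumed heuristic)."""
    disparities, evaluations = [], 0
    for x in range(len(left)):
        if x < max_disp:
            candidates = range(max_disp)
        else:
            prev = disparities[-1]
            candidates = range(max(0, prev - radius),
                               min(max_disp - 1, prev + radius) + 1)
        evaluations += len(candidates)
        disparities.append(best_disparity(left, right, x, candidates))
    return disparities, evaluations


# Synthetic scanline: unique intensities, constant true disparity of 3 px.
right = [(i * 37) % 101 for i in range(32)]
left = [200 + x for x in range(3)] + right[:-3]

full, n_full = full_search(left, right, max_disp=8)
pruned, n_pruned = pruned_search(left, right, max_disp=8)
print("same result:", full == pruned, "| evaluations:", n_full, "vs", n_pruned)
```

On this idealized input both searches return the same disparities while the pruned variant evaluates fewer candidates. Real images contain disparity discontinuities, where such a fixed neighbor-based rule would fail; this is one reason a learned candidate selection is attractive.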
2.2. Transformer-Based Stereo Matching
- STereo TRansformer (STTR) [15] formulates stereo matching as a sequence-to-sequence prediction task, using a pure transformer architecture to estimate disparities directly without constructing a traditional cost volume. The model employs cross-attention to establish correspondences between features from the left and right images, and self-attention to enhance contextual understanding within each view. This design lets STTR reason globally along epipolar lines, which makes it particularly effective at handling large disparity ranges, occlusions, and textureless regions. STTR applies positional encodings along the horizontal axis, in line with the epipolar geometry of rectified stereo pairs, so the model maintains spatial awareness while reducing computational complexity.
- Context-Enhanced Stereo Transformer (CSTR) [16] builds on STTR by introducing the Context Enhanced Path (CEP) module, which captures global scene information beyond epipolar constraints. This addition addresses failure cases in uniform or textureless regions where local cues are insufficient. By integrating CEP into the Transformer pipeline, CSTR improves performance in zero-shot synthetic-to-real scenarios.
- FoundationStereo [6] is a foundation model for stereo depth estimation. It is trained on over one million synthetic stereo pairs, with a self-curation pipeline to ensure data diversity and quality. Architecturally, FoundationStereo integrates Transformer modules with CNN backbones and introduces side-tuning adapters to incorporate monocular priors from vision foundation models. This hybrid approach helps bridge the sim-to-real gap and enables strong zero-shot generalization, without fine-tuning, across indoor and outdoor domains and on reflective and transparent surfaces. It achieves state-of-the-art performance on benchmarks such as Middlebury and ETH3D.
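The idea of matching through cross-attention along an epipolar line, as described for STTR, can be sketched for a single rectified image row. This is a toy illustration under strong assumptions: the features are hand-made one-hot vectors rather than learned ones, and learned projections, multi-head attention, positional encodings, and every other STTR component are omitted. The function names and the `temperature` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def row_cross_attention(left_feats, right_feats, temperature=0.1):
    """For each left pixel, attend over all right pixels on the same row.

    Scores are plain dot products scaled by an assumed temperature.
    """
    attention = []
    for query in left_feats:
        scores = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in right_feats]
        attention.append(softmax(scores))
    return attention


def soft_disparity(attention):
    """Expected disparity (x_left - x_right) under each attention row."""
    return [sum(w * (x_left - x_right) for x_right, w in enumerate(row))
            for x_left, row in enumerate(attention)]


# Toy row: every right pixel has a distinct one-hot feature, and the left
# row is the right row shifted by a true disparity of 2 px.
n, true_disp = 8, 2
right_feats = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
left_feats = [right_feats[x - true_disp] if x >= true_disp else [0.0] * n
              for x in range(n)]

attention = row_cross_attention(left_feats, right_feats)
print([round(d, 3) for d in soft_disparity(attention)])
```

Because each left pixel attends over the whole row, the search range is not capped by a fixed maximum disparity. The first two left pixels have no counterpart in the right row, so their attention is uniform and their estimate is meaningless; such cases would need explicit occlusion handling.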
3. Materials and Methods
4. Results
4.1. Accuracy Analysis of the Static Cameras
4.2. Qualitative Results
4.3. Textureless Regions, Reflective Surfaces and Varying Illumination Conditions
4.4. Roadside Conditions
4.5. KITTI 2015 Stereo Accuracy Evaluation
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AbsRel | Absolute Relative Error |
| CEP | Context Enhanced Path |
| CNN | Convolutional Neural Network |
| CPU | Central Processing Unit |
| CUDA | Compute Unified Device Architecture |
| cx, cy | Principal point coordinates |
| DLA | Deep Learning Accelerator |
| EPE | End-Point Error |
| FOV | Field of View |
| FPS | Frames Per Second |
| FP16 | 16-bit Floating Point |
| fx, fy | Focal lengths |
| GPU | Graphics Processing Unit |
| HD | High Definition |
| HDR | High Dynamic Range |
| ISP | Image Signal Processor |
| KITTI | Karlsruhe Institute of Technology and Toyota Technological Institute dataset |
| LiDAR | Light Detection and Ranging |
| ONNX | Open Neural Network Exchange |
| PCIe | Peripheral Component Interconnect Express |
| RGB | Red Green Blue |
| RMSE | Root Mean Square Error |
| ROS | Robot Operating System |
| S2M2 | Scalable Stereo Matching Model |
| S3M-Net | Semantic Segmentation and Stereo Matching Network |
| SoC | System on Chip |
| STTR | STereo TRansformer |
| VAP | Virtual Alignment Point |
| V2X | Vehicle-to-Everything |
| YOLO11 | You Only Look Once version 11 |
References
1. Fan, R.; Wang, L.; Bocus, M.J.; Pitas, I. Computer Stereo Vision for Autonomous Driving. arXiv 2020, arXiv:2012.03194.
2. Fan, R.; Guo, S.; Bocus, M.J. (Eds.) Autonomous Driving Perception: Fundamentals and Applications; Advances in Computer Vision and Pattern Recognition; Springer Nature: Singapore, 2023.
3. Blankenberg, E.; Blankenberg, S. Survey of Disparity Map Algorithms Intended for Real-Time Stereoscopic Depth Estimation. 2021. Available online: https://www.semanticscholar.org/paper/fe59e7f76602b1979ecd120b0a88fe97b78e1d96 (accessed on 4 November 2025).
4. Bartolomei, L.; Tosi, F.; Poggi, M.; Mattoccia, S. Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail. arXiv 2025, arXiv:2412.04472.
5. Wu, Z.; Feng, Y.; Liu, C.-W.; Yu, F.; Chen, Q.; Fan, R. S3M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving. IEEE Trans. Intell. Veh. 2024, 9, 3940–3951.
6. Wen, B.; Trepte, M.; Aribido, J.; Kautz, J.; Gallo, O.; Birchfield, S. FoundationStereo: Zero-Shot Stereo Matching. arXiv 2025, arXiv:2501.09898.
7. Ultralytics. YOLO11. Available online: https://docs.ultralytics.com/models/yolo11 (accessed on 4 November 2025).
8. Tosi, F.; Bartolomei, L.; Poggi, M. A Survey on Deep Stereo Matching in the Twenties. Int. J. Comput. Vis. 2025, 133, 4245–4276.
9. Chang, J.R.; Chang, P.C.; Chen, Y.S. Attention-Aware Feature Aggregation for Real-Time Stereo Matching on Edge Devices. In Computer Vision—ACCV 2020; Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12622.
10. Huang, Q.; Zhang, Y.; Zheng, J.; Shang, G.; Chen, G. A CNN-Based Real-Time Dense Stereo SLAM System on Embedded FPGA. In Artificial Intelligence. CICAI 2023; Fang, L., Pei, J., Zhai, G., Wang, R., Eds.; Lecture Notes in Computer Science; Springer: Singapore, 2024; Volume 14474.
11. Iqbal, W.; Paffenholz, J.A.; Mehltretter, M. Guiding Deep Learning with Expert Knowledge for Dense Stereo Matching. PFG–J. Photogramm. Remote Sens. Geoinf. Sci. 2023, 91, 365–380.
12. Zhang, J.; Tong, Q.; Yan, N.; Liu, X. FAMNet: A Lightweight Stereo Matching Network for Real-Time Depth Estimation in Autonomous Driving. Symmetry 2025, 17, 1214.
13. Brucker, S.; Walz, S.; Bijelic, M.; Heide, F. Cross-Spectral Gated-RGB Stereo Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Seattle, WA, USA, 2024; pp. 21654–21665.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
15. Li, Z.; Liu, X.; Drenkow, N.; Ding, A.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers. arXiv 2021, arXiv:2011.02910.
16. Guo, W.; Li, Z.; Yang, Y.; Wang, Z.; Taylor, R.H.; Unberath, M.; Yuille, A.; Li, Y. Context-Enhanced Stereo Transformer. arXiv 2022, arXiv:2210.11719.
17. Min, J.; Jeon, Y.; Kim, J.; Choi, M. S2M2: Scalable Stereo Matching Model for Reliable Depth Estimation. arXiv 2025, arXiv:2507.13229.
18. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
19. Kannala, J.; Brandt, S.S. A Generic Camera Model and Calibration Method for Conventional, Wide-Angle, and Fish-Eye Lenses. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1335–1340.
20. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004.
21. Hartley, R.I. Theory and Practice of Projective Rectification. Int. J. Comput. Vis. 1999, 35, 115–127.
22. Menze, M.; Heipke, C.; Geiger, A. Object Scene Flow. ISPRS J. Photogramm. Remote Sens. 2018, 140, 60–76.
23. Menze, M.; Heipke, C.; Geiger, A. Joint 3D Estimation of Vehicles and Scene Flow. In Proceedings of the ISPRS Workshop on Image Sequence Analysis (ISA); International Society for Photogrammetry and Remote Sensing (ISPRS): Munich, Germany, 2015.
24. Jegham, N.; Koh, C.Y.; Abdelatti, M.; Hendawi, A. Evaluating the Evolution of YOLO Models: A Comprehensive Benchmark Study of YOLO11 and Its Predecessors. arXiv 2025, arXiv:2411.00201v1. Available online: https://arxiv.org/html/2411.00201v1 (accessed on 21 November 2025).
25. Simeonov, M.; Kamencay, P.; Dado, M. Evaluating YOLOv11’s Role in Robust Real-Time Object Detection for Autonomous Driving. In Proceedings of RADIOELEKTRONIKA 2025; IEEE: Bratislava, Slovakia, 2025; p. 5.
26. 5GAA. C-V2X Roadmap White Paper III. 5G Automotive Association, January 2025. Available online: https://5gaa.org/content/uploads/2025/01/5gaa-wi-cv2xrm-iii-roadmap-white-paper.pdf (accessed on 21 November 2025).
27. ETSI. Intelligent Transport Systems (ITS); Service Requirements for V2X Services. ETSI TS 122 186 V17.0.0 (2025-01). Available online: https://www.etsi.org/deliver/etsi_ts/122100_122199/122186/17.00.00_60/ts_122186v170000p.pdf (accessed on 21 November 2025).
28. IEEE Spectrum. Camera Crushes Lidar, Claims Startup. IEEE Spectrum 2021. Available online: https://spectrum.ieee.org/camera-crushes-lidar (accessed on 26 November 2025).
29. NODAR. Stereo Vision Technology for Autonomous Vehicles. Available online: https://www.nodarsensor.com/ (accessed on 26 November 2025).
30. You, Y.; Wang, Y.; Chao, W.-L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection. arXiv 2019, arXiv:1906.06310.
31. Cucor, B.; Petrov, T.; Kamencay, P.; Pourhashem, G.; Dado, M. Physical and Digital Infrastructure Readiness Index for Connected and Automated Vehicles. Sensors 2022, 22, 7315.
32. Cucor, B.; Petrov, T.; Kamencay, P.; Simeonov, M.; Dado, M. Digital Infrastructure Quality Assessment System Methodology for Connected and Automated Vehicles. Electronics 2023, 12, 3886.
33. ETH3D. Low-Resolution Two-View Stereo Benchmark. Available online: https://www.eth3d.net/low_res_two_view (accessed on 20 December 2025).
34. Middlebury Stereo Vision. Stereo Evaluation (Eval3). Available online: https://vision.middlebury.edu/stereo/eval3/ (accessed on 20 December 2025).
35. Wen, B.; Dewan, S.; Birchfield, S. Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching. arXiv 2025, arXiv:2512.11130.

| Device | Resolution | Latency (ms) | FPS |
|---|---|---|---|
| RTX A6000 | 256 × 256 | 16.362 | 61.1 |
| RTX A6000 | 512 × 512 | 94.772 | 10.55 |
| AGX Orin | 256 × 256 | 38.04 | 26.28 |
| AGX Orin | 512 × 512 | 198.3 | 5.04 |

| Method | Baseline (m) | Mean (m) | Max (m) | Std (m) |
|---|---|---|---|---|
| Sampling | 3.6 | 0.0114 | 0.0369 | 0.0058 |
| Sampling | 3.3 | 0.0144 | 0.0444 | 0.0095 |
| Raytrace | 3.6 | 0.0094 | 0.0344 | 0.0065 |
| Raytrace | 3.3 | 0.0123 | 0.0441 | 0.0106 |

| Resolution | EPE (px) | D1 (%) | AbsRel | RMSE (m) |
|---|---|---|---|---|
| 448 × 160 | 0.901 | 3.483 | 0.075 | 20.387 |

| Frame | EPE (px) | D1 (%) | AbsRel | RMSE (m) |
|---|---|---|---|---|
| 000079_10 | 0.896 | 2.800 | 0.101 | 20.417 |
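
The EPE, D1, AbsRel, and RMSE columns in the two tables above are not defined in this excerpt. The sketch below shows how such measures are commonly computed, assuming the usual KITTI 2015 convention that a pixel is a D1 outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity, and assuming that AbsRel and RMSE are computed on depth obtained from Z = f · B / d. Whether the authors used exactly these definitions is an assumption, and the function names, focal length, and baseline are hypothetical placeholders.

```python
import math


def disparity_metrics(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """EPE (px) and D1 (%) over pixels with valid ground truth (gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    errors = [abs(p - g) for p, g in pairs]
    epe = sum(errors) / len(errors)
    # Outlier: error above 3 px AND above 5% of the true disparity.
    outliers = sum(1 for e, (_, g) in zip(errors, pairs)
                   if e > abs_thresh and e > rel_thresh * g)
    return epe, 100.0 * outliers / len(errors)


def depth_metrics(pred, gt, focal_px, baseline_m):
    """AbsRel and RMSE (m) after converting disparity to depth Z = f * B / d."""
    pairs = [(focal_px * baseline_m / p, focal_px * baseline_m / g)
             for p, g in zip(pred, gt) if p > 0 and g > 0]
    abs_rel = sum(abs(zp - zg) / zg for zp, zg in pairs) / len(pairs)
    rmse = math.sqrt(sum((zp - zg) ** 2 for zp, zg in pairs) / len(pairs))
    return abs_rel, rmse


# Hypothetical disparities in pixels; a ground truth of 0 marks an invalid pixel.
pred_disp = [10.0, 20.0, 36.0, 50.0]
gt_disp = [10.0, 21.0, 40.0, 0.0]

epe, d1 = disparity_metrics(pred_disp, gt_disp)
abs_rel, rmse = depth_metrics(pred_disp, gt_disp, focal_px=700.0, baseline_m=0.5)
print(f"EPE={epe:.3f} px, D1={d1:.1f} %, AbsRel={abs_rel:.3f}, RMSE={rmse:.3f} m")
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a depth error that grows roughly with the square of the distance. Under these assumed definitions, a sub-pixel EPE can therefore coexist with an RMSE of several metres when distant pixels are included in the evaluation.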
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Simeonov, M.; Kurdiumov, A.; Dado, M. Real-Time 3D Scene Understanding for Road Safety: Depth Estimation and Object Detection for Autonomous Vehicle Awareness. Vehicles 2026, 8, 28. https://doi.org/10.3390/vehicles8020028

