Off-Road Autonomous Vehicle Semantic Segmentation and Spatial Overlay Video Assembly
Abstract
1. Introduction
- A Novel Off-Road Dataset: We present a large-scale dataset (14,879 images) specifically designed to tackle the variability of unstructured environments. By providing a more realistic and comprehensive training foundation than existing benchmarks, this dataset enables perception models to generalize more effectively to challenging off-road conditions.
- Confusion-Aware Loss (CAL) Function: To enhance pixel-level classification accuracy, we propose this novel loss component as a complement to standard Cross-Entropy (CE). By leveraging a confusion matrix computed after each epoch, CAL identifies and penalizes systematic classification errors, enabling the model to distinguish visually similar terrain features better. Empirical results using an NVIDIA SegFormer model demonstrate a significant performance gain, with mean Intersection over Union (mIoU) increasing from 68.66% to 70.06%. The generalization of this approach is further validated through consistent performance gains on the Cityscapes benchmark.
- Spatial Overlay Video Encoding Scheme: Building on our semantic segmentation results, we developed a spatial overlay video encoding scheme that optimizes data transmission. The method significantly reduces bandwidth requirements by combining a high-fidelity Region of Interest (ROI) from the original video frame with background elements derived from the semantic segmentation output. Crucial details are preserved for a remote operator, while less critical areas are compressed more efficiently, improving video quality and situational awareness for real-time remote driving.
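The overlay contribution reduces to a simple per-pixel compositing step. The sketch below is a minimal illustration of the idea (the function name and the boolean-mask ROI representation are our assumptions, not the paper's implementation):

```python
import numpy as np

def assemble_overlay_frame(frame, seg_colored, roi_mask):
    """Hypothetical sketch of spatial overlay assembly: keep original
    camera pixels inside the ROI, replace the background with the
    flat-colored segmentation rendering.

    frame:       (H, W, 3) uint8 camera frame
    seg_colored: (H, W, 3) uint8 class-color rendering of the
                 segmentation output
    roi_mask:    (H, W)    bool, True inside the Region of Interest
    """
    out = seg_colored.copy()
    out[roi_mask] = frame[roi_mask]  # restore full-fidelity ROI pixels
    return out
```

The flat-colored background consists of large uniform regions with low entropy, which is why the composite stream compresses far better than the original frame while the ROI retains photographic detail.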
2. Background
2.1. Datasets for Off-Road Environments
2.1.1. Foundational Forest and Terrain: Dataset and Model Architecture
2.1.2. Multimodal Sensor Fusion and Off-Road Benchmarks
2.1.3. Specialized and Extreme-Scale Off-Road Data
2.1.4. Data Integration and Synthetic Augmentation
2.2. Semantic Segmentation Frameworks
2.2.1. Real-Time Multi-Branch Architectures
2.2.2. Transformer-Based Architectures
2.2.3. Transfer Learning and Domain Adaptation
2.3. Semantic Communication for Teleoperation
2.3.1. Foundations of Semantic Communication in Vehicular Networks
2.3.2. Task-Oriented Transmission and Edge Perception
3. Dataset Preview
4. Off-Road and On-Road Raw Video Acquisition
5. Methods
5.1. Loss Function
5.1.1. Normalized Confusion Matrix
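A row-normalized confusion matrix accumulated from the model's predictions after each epoch can be sketched as follows (a minimal helper of our own; the paper's accumulation pipeline is not shown in this excerpt):

```python
import numpy as np

def normalized_confusion_matrix(preds, targets, num_classes):
    """Row-normalized confusion matrix: entry [t, j] is the fraction
    of class-t pixels that the model predicted as class j.

    preds, targets: 1-D integer arrays of flattened pixel labels.
    """
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    # Unbuffered accumulation so repeated (target, pred) pairs all count.
    np.add.at(cm, (targets, preds), 1.0)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes absent from the batch.
    return cm / np.maximum(row_sums, 1.0)
```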
5.1.2. CAL Formulation
- y_i is the ground-truth class label for pixel i.
- p_{i,j} is the network’s predicted probability for class j at pixel i.
- 1[j ≠ y_i] is an indicator function equal to 1 if class j is not the true class y_i, and 0 otherwise.
- ε is a small constant (e.g., 10⁻⁶) used for numerical stability to prevent taking the logarithm of zero.
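Under these definitions, the CAL term can be sketched in a few lines. This is a minimal NumPy illustration assuming the common form in which off-diagonal confusion-matrix entries weight a penalty on probability mass assigned to frequently confused classes; the paper's exact formulation may differ:

```python
import numpy as np

def confusion_aware_loss(probs, targets, conf_matrix, eps=1e-6):
    """Hypothetical sketch of a Confusion-Aware Loss (CAL) term.

    probs:       (N, C) per-pixel softmax probabilities p_{i,j}
    targets:     (N,)   ground-truth class indices y_i
    conf_matrix: (C, C) row-normalized confusion matrix from the
                 previous epoch
    """
    n, c = probs.shape
    # Per-pixel confusion weights: the matrix row of the true class.
    weights = conf_matrix[targets].copy()        # (N, C)
    # Indicator 1[j != y_i]: zero out the true-class column.
    weights[np.arange(n), targets] = 0.0
    # Penalize probability mass placed on confused classes.
    penalty = -np.log(1.0 - probs + eps)         # (N, C)
    return (weights * penalty).sum(axis=1).mean()
```

With an identity confusion matrix (no systematic errors) the term vanishes, so CAL only activates where the previous epoch showed class confusion.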
5.2. Model Training
- Phase 1 (Baseline): The model was trained for 500 epochs using OHEM loss. To mitigate overfitting and ensure optimal weight convergence, we integrated an early stopping mechanism [34], with the total computation time for this stage spanning 336 GPU hours.
- Phase 2 (Retrain): The model underwent an additional 50 epochs of fine-tuning using the proposed Composite Confusion-Aware Loss (CCAL) function.
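The early stopping used in Phase 1 [34] can be illustrated with a generic patience-based monitor on validation mIoU (a sketch under assumed hyperparameters; the paper's exact criterion is not specified in this excerpt):

```python
class EarlyStopping:
    """Minimal patience-based early stopping on validation mIoU."""

    def __init__(self, patience=20, min_delta=1e-4):
        self.patience = patience      # epochs without improvement allowed
        self.min_delta = min_delta    # minimum change that counts as progress
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_miou):
        """Record one epoch's validation mIoU; return True to stop."""
        if val_miou > self.best + self.min_delta:
            self.best = val_miou
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```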
5.3. Results of the Spatial Overlay Video Assembly
5.4. System-Level Latency and Teleoperation Impact
5.5. Influence on Path Planning and Motion Control
6. Conclusions
7. Limitations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| ATV | All-Terrain Vehicle |
| AV1 | AOMedia Video 1 |
| AVC | Advanced Video Coding |
| BEV | Bird’s Eye View |
| BiSeNet | Bilateral Segmentation Network |
| CAL | Confusion-Aware Loss |
| CCAL | Composite Confusion-Aware Loss |
| CE | Cross-Entropy |
| CNN | Convolutional Neural Network |
| CWT | Complex Worksite Terrain |
| DAPPM | Deep Aggregation Pyramid Pooling Module |
| DCNN | Deep Convolutional Neural Network |
| DDRNet | Deep Dual-resolution Network |
| DSLR | Digital Single-Lens Reflex |
| EVI | Enhanced Vegetation Index |
| FFV1 | FF Video 1 |
| FPS | Frames Per Second |
| GAN | Generative Adversarial Network |
| G-MSC | Generative AI-enhanced Multimodal Semantic Communication |
| HDMI | High-Definition Multimedia Interface |
| HEVC | High Efficiency Video Coding |
| HFOV | Horizontal Field of View |
| IoV | Internet of Vehicles |
| LiDAR | Light Detection and Ranging |
| mIoU | mean Intersection over Union |
| MiT | Mix Transformer |
| MLP | Multi-Layer Perceptron |
| OHEM | Online Hard Example Mining |
| ORFD | Off-Road Freespace Detection |
| OWL | Web Ontology Language |
| PID | Proportional Integral Derivative |
| PSNR | Peak Signal-to-Noise Ratio |
| QoS | Quality of Service |
| QP | Quantization Parameter |
| RGB | Red, Green, Blue |
| ROI | Region of Interest |
| SA-PSNR | Semantic-Aware Peak Signal-to-Noise Ratio |
| SA-SSIM | Semantic-Aware Structural Similarity |
| SAC | Semantic-Aware Video Compression |
| SETR | Segmentation Transformer |
| SNR | Signal-to-Noise Ratio |
| SSD | Solid-State Drive |
| UGV | Unmanned Ground Vehicle |
| V2I | Vehicle-to-Infrastructure |
| V2V | Vehicle-to-Vehicle |
| VANET | Vehicular Ad-hoc Network |
| VMAF | Video Multi-Method Assessment Fusion |
| YCOR | Yamaha-CMU Off-Road |
| YUV | Luminance (Y) and Chrominance (UV) Color Space |
References
- Sheela, S.; Nataraj, K.R.; Mallikarjunaswamy, S. A Comprehensive Exploration of Resource Allocation Strategies within Vehicle Ad-Hoc Networks. Mechatron. Intell. Transp. Syst. 2023, 2, 169–190. [Google Scholar] [CrossRef]
- Pallikonda, A.K.; Bandarapalli, V.K.; Aruna, V. Enhancing Performance and Reducing Latency in Autonomous Systems Through Edge Computing for Real-Time Data Processing. Mechatron. Intell. Transp. Syst. 2025, 4, 154–165. [Google Scholar] [CrossRef]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar] [CrossRef]
- Brostow, G.J.; Fauqueur, J.; Cipolla, R. Semantic object classes in video: A high-definition ground truth database. Pattern Recognit. Lett. 2009, 30, 88–97. [Google Scholar] [CrossRef]
- Valada, A.; Oliveira, G.L.; Brox, T.; Burgard, W. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In International Symposium on Experimental Robotics; Springer: Cham, Switzerland, 2016; pp. 465–477. [Google Scholar]
- Maturana, D.; Chou, P.W.; Uenoyama, M.; Scherer, S. Real-time semantic mapping for autonomous off-road navigation. In Field and Service Robotics: Results of the 11th International Conference; Springer: Cham, Switzerland, 2017; pp. 335–350. [Google Scholar]
- Min, C.; Jiang, W.; Zhao, D.; Xu, J.; Xiao, L.; Nie, Y.; Dai, B. ORFD: A Dataset and Benchmark for Off-Road Freespace Detection. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2532–2538. [Google Scholar] [CrossRef]
- Jiang, P.; Osteen, P.; Wigness, M.; Saripalli, S. RELLIS-3D Dataset: Data, Benchmarks and Analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 1110–1116. [Google Scholar] [CrossRef]
- Viswanath, K.; Singh, K.; Jiang, P.; Sujit, P.B.; Saripalli, S. OffSeg: A Semantic Segmentation Framework for Off-Road Driving. In Proceedings of the 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE), Lyon, France, 23–27 August 2021; pp. 354–359. [Google Scholar]
- Shaban, A.; Meng, X.; Lee, J.; Boots, B.; Fox, D. Semantic terrain classification for off-road autonomous driving. In Conference on Robot Learning; PMLR: New York, NY, USA, 2022; pp. 619–629. [Google Scholar]
- Guan, T.; He, Z.; Song, R.; Manocha, D.; Zhang, L. TNS: Terrain Traversability Mapping and Navigation System for Autonomous Excavators. In Proceedings of the Robotics: Science and Systems (RSS), New York, NY, USA, 27 June–1 July 2022. [Google Scholar] [CrossRef]
- Min, C.; Mei, J.; Zhai, H.; Wang, S.; Sun, T.; Kong, F.; Li, H.; Mao, F.; Liu, F.; Wang, S.; et al. Advancing Off-Road Autonomous Driving: The Large-Scale ORAD-3D Dataset and Comprehensive Benchmarks. arXiv 2025, arXiv:2510.16500. [Google Scholar]
- Medellin, A.; Bhamri, A.; Ma, A.; Lanagri, R.; Gopalswamy, S.; Grabowsky, D.; Mikulski, D. Applications of Unifying Off-Road Datasets Through Ontology; Technical Report, SAE Technical Paper; SAE: Warrendale, PA, USA, 2024. [Google Scholar]
- Małek, K.; Dybała, J.; Kordecki, A.; Hondra, P.; Kijania, K. OffRoadSynth Open Dataset for Semantic Segmentation Using Synthetic-Data-Based Weight Initialization for Autonomous UGV in Off-Road Environments. J. Intell. Robot. Syst. 2024, 110, 76. [Google Scholar] [CrossRef]
- Wijayathunga, L.; Dabare, D.; Rassau, A.; Chai, D.; Islam, S.M.S. A High-fidelity Multimodal Synthetic Dataset Generation Framework for Off-road Unstructured Terrain Navigation Training of Autonomous Robots. J. Intell. Robot. Syst. 2026, 112, 10. [Google Scholar] [CrossRef]
- Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539. [Google Scholar]
- Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. (NeurIPS) 2021, 34, 12077–12090. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Sharma, S.; Ball, J.E.; Tang, B.; Carruth, D.W.; Doude, M.; Islam, M.A. Semantic segmentation with transfer learning for off-road autonomous driving. Sensors 2019, 19, 2577. [Google Scholar] [CrossRef] [PubMed]
- Ye, S.; Wu, Q.; Fan, P.; Fan, Q. A survey on semantic communications in internet of vehicles. Entropy 2025, 27, 445. [Google Scholar] [CrossRef] [PubMed]
- Lv, J.; Tong, H.; Pan, Q.; Zhang, Z.; He, X.; Luo, T.; Yin, C. Importance-aware image segmentation-based semantic communication for autonomous driving. arXiv 2024, arXiv:2401.10153. [Google Scholar]
- Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
- Chakraborty, P.; Bandopadhyay, A.; Agrawal, K.; Zhang, J.; Leung, M.F. IndiSegNet: Real-Time Semantic Segmentation for Unstructured Road Scenes in Intelligent Transportation Systems. Intell. Syst. Appl. 2026, 29, 200629. [Google Scholar] [CrossRef]
- Niedermayer, M.; Rice, D.; Martinez, J. RFC 9043: FFV1 Video Coding Format Versions 0, 1, and 3. 2021. Available online: https://datatracker.ietf.org/doc/html/rfc9043 (accessed on 15 March 2026).
- FFmpeg Team. FFmpeg, version 6.1; A Complete, Cross-Platform Solution to Record, Convert and Stream Audio and Video. 2025. Available online: https://www.ffmpeg.org/ (accessed on 10 March 2026).
- Sithu, M. Semantic-Segmentation: A Collection of Various Semantic Segmentation Models. GitHub Repository. 2022. Available online: https://github.com/sithu31296/semantic-segmentation (accessed on 15 March 2026).
- Azad, R.; Heidari, M.; Yilmaz, K.; Hüttemann, M.; Karimijafarbigloo, S.; Wu, Y.; Schmeink, A.; Merhof, D. Loss Functions in the Era of Semantic Segmentation: A Survey and Outlook. arXiv 2023, arXiv:2312.05391. [Google Scholar] [CrossRef]
- Csurka, G.; Volpi, R.; Chidlovskii, B. Semantic image segmentation: Two decades of research. arXiv 2023, arXiv:2302.06378. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
- Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
- Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar] [CrossRef]
- Li, Z.; Aaron, A.; Katsavounidis, I.; Anantharaman, A.; Guo, F. Toward a Practical Perceptual Video Quality Metric. Netflix Technology Blog. 2016. Available online: https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652 (accessed on 15 March 2026).
- Kamtam, S.B.; Lu, Q.; Bouali, F.; Haas, O.C.; Birrell, S. Network Latency in Teleoperation of Connected and Autonomous Vehicles: A Review of Trends, Challenges, and Mitigation Strategies. Sensors 2024, 24, 3957. [Google Scholar] [CrossRef]
- Adwani, N.; Silvestrini-Cordero, K.; Rojas-Cessa, R.; Han, T.; Wang, C. A special-purpose video streaming codec for internet-based remote driving. In Proceedings of the 2024 IEEE International Conference on Advanced Intelligent Mechatronics (AIM), Boston, MA, USA, 15–19 July 2024; pp. 1109–1116. [Google Scholar]
- Nowakowski, M. Operational Environment Impact on Sensor Capabilities in Special Purpose Unmanned Ground Vehicles. In Proceedings of the 2024 21st International Conference on Mechatronics-Mechatronika (ME), Brno, Czech Republic, 4–6 December 2024; pp. 1–5. [Google Scholar]
- Feng, Y.; Shen, H.; Shan, Z.; Yang, Q.; Shi, X. Semantic Communication for Edge Intelligence Enabled Autonomous Driving System. IEEE Netw. 2025, 39, 149–157. [Google Scholar] [CrossRef]
- Musau, H.; Ruganuza, D.; Indah, D.; Mukwaya, A.; Gyimah, N.K.; Patil, A.; Bhosale, M.; Gupta, P.; Mwakalonge, J.; Jia, Y.; et al. A Review of Off-Road Datasets, Sensor Technologies and Terrain Traversability Analysis. SAE Int. J. Adv. Curr. Pract. Mobil. 2025, 5, 2568–2588. [Google Scholar] [CrossRef]
| Loss Function | Best mIoU (%) | Mean mIoU (%) | Std | Best Improvement (%) |
|---|---|---|---|---|
| CCAL | 76.73 | 76.49 | 0.026 | 0.49 |
| Focal | 76.64 | 76.37 | 0.061 | 0.40 |
| Balanced CE | 76.48 | 76.27 | 0.310 | 0.24 |
| Dice | 76.05 | 75.71 | 0.033 | −0.19 |
| Tversky | 75.48 | 75.18 | 0.108 | −0.76 |
| Model Variant | Baseline [19] mIoU (%) | CCAL mIoU (%) | mIoU Improvement (%) |
|---|---|---|---|
| SegFormer-B0 | 76.24 | 76.73 | 0.49 |
| SegFormer-B1 | 78.55 | 78.71 | 0.16 |
| SegFormer-B2 | 80.83 | 81.11 | 0.28 |
| SegFormer-B3 | 81.53 | 81.96 | 0.43 |
| SegFormer-B4 | 82.33 | 82.66 | 0.33 |
| SegFormer-B5 | 82.26 | 82.60 | 0.34 |
| Configuration (1920 × 1080) | Video a (Mbps) | Video b (Mbps) |
|---|---|---|
| Original YUV420p baseline | 746 | 746 |
| Lossless FFV1 (Standard) | 280 | 396 |
| Purely Semantic (FFV1) | 14.5 | 24.0 |
| Spatial overlay composite (FFV1) | 117 | 151 |
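The 746 Mbps baseline can be checked arithmetically: YUV420p stores 1.5 bytes per pixel (full-resolution luma plus two quarter-resolution chroma planes), so at 1920 × 1080 and an assumed 30 fps (the frame rate is not stated in this excerpt) the raw stream is 1920 × 1080 × 1.5 × 8 × 30 ≈ 746.5 Mbps:

```python
def raw_yuv420p_mbps(width, height, fps):
    """Raw bitrate of uncompressed YUV420p video in Mbps.

    YUV420p carries 1.5 bytes per pixel: a full-resolution Y plane
    plus U and V planes subsampled 2x in each dimension.
    """
    bits_per_frame = width * height * 1.5 * 8
    return bits_per_frame * fps / 1e6

# Assuming 30 fps, this reproduces the ~746 Mbps table baseline:
# raw_yuv420p_mbps(1920, 1080, 30) -> 746.496
```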
| Bandwidth (Mbps) | Codec | Standard PSNR (dB) | Standard VMAF | Semantic Map PSNR (dB) | Semantic Map VMAF | Spatial Overlay PSNR (dB) | Spatial Overlay VMAF |
|---|---|---|---|---|---|---|---|
| 1 | H.264 | 32.0 | 25.0 | 33.4 | 38.1 | 32.1 | 26.1 |
| 2 | H.264 | 33.8 | 44.0 | 41.1 | 89.1 | 35.5 | 66.8 |
| 4 | H.264 | 35.7 | 64.0 | 43.6 | 94.8 | 38.7 | 79.3 |
| 1 | H.265 | 26.1 | 25.6 | 30.3 | 68.7 | 33.5 | 51.8 |
| 2 | H.265 | 27.1 | 39.2 | 32.8 | 83.3 | 36.0 | 70.8 |
| 4 | H.265 | 28.4 | 54.5 | 35.9 | 91.5 | 38.9 | 86.2 |
| 1 | AV1 | 31.2 | 42.8 | 35.7 | 71.7 | 33.2 | 54.2 |
| 2 | AV1 | 32.5 | 56.5 | 38.0 | 83.1 | 35.3 | 71.4 |
| 4 | AV1 | 34.2 | 70.7 | 41.7 | 93.3 | 37.7 | 85.2 |
| Bandwidth (Mbps) | Encoding Scheme | Overall PSNR (dB) | ROI PSNR (dB) | Background PSNR (dB) |
|---|---|---|---|---|
| 1 | Standard | 31.2 | 31.7 | 30.9 |
| 1 | Semantic | 35.7 | 40.2 | 34.5 |
| 1 | Spatial Overlay | 33.2 | 31.8 | 34.0 |
| 2 | Standard | 32.5 | 32.9 | 32.3 |
| 2 | Semantic | 38.0 | 42.2 | 36.8 |
| 2 | Spatial Overlay | 35.3 | 33.5 | 36.4 |
| 4 | Standard | 34.2 | 34.5 | 34.1 |
| 4 | Semantic | 41.7 | 45.5 | 40.6 |
| 4 | Spatial Overlay | 37.7 | 35.6 | 39.0 |
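The Overall/ROI/Background split implies PSNR computed over masked pixel sets. A minimal sketch of such a region-restricted PSNR (our own helper, not the paper's measurement pipeline):

```python
import numpy as np

def masked_psnr(ref, test, mask, peak=255.0):
    """PSNR restricted to the pixels selected by a boolean mask.

    ref, test: uint8 images of the same shape
    mask:      boolean array selecting the region (e.g., ROI or background)
    """
    err = (ref.astype(np.float64) - test.astype(np.float64)) ** 2
    mse = err[mask].mean()  # MSE over the selected region only
    return 10.0 * np.log10(peak ** 2 / mse)
```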
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Dror, I.; Aviv, O.; Hadar, O. Off-Road Autonomous Vehicle Semantic Segmentation and Spatial Overlay Video Assembly. Sensors 2026, 26, 1944. https://doi.org/10.3390/s26061944