This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Review

Vision and Multimodal Perception for Autonomous Driving: Deep Learning Architectures, Tasks, and Sensor Fusion

by Savvas Nikolaidis and Paraskevas Koukaras *
School of Science and Technology, International Hellenic University, 14th km Thessaloniki-Moudania, 57001 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2026, 17(6), 277; https://doi.org/10.3390/wevj17060277
Submission received: 7 March 2026 / Revised: 8 May 2026 / Accepted: 19 May 2026 / Published: 22 May 2026
(This article belongs to the Section Automated and Connected Vehicles)

Abstract

The rapid development of autonomous vehicles depends mainly on their ability to perceive their environment accurately, with artificial intelligence and computer vision at the core of that perception. Deep learning-based perception architectures have revolutionized the field of autonomous driving. However, because a single sensor cannot ensure reliability in complex scenarios, multimodal sensor fusion has become an essential part of modern deep learning architectures. Covering the literature from 2020 to 2025, we analyze the transition from traditional Convolutional Neural Networks (CNNs) to modern Vision Transformers (ViTs) and examine data fusion design methodologies at various processing levels. We also identify significant limitations related to adverse weather conditions, dynamic environments, computational resources, and the overall quality and management of data. The comparative analysis indicates that vision-transformer and multimodal fusion methodologies provide higher accuracy in perception tasks, but at the cost of increased computational requirements and sensor synchronization challenges. Finally, we conclude that achieving full autonomy requires further research on collaborative perception, unsupervised domain adaptation, and lightweight models, and we outline a roadmap for future developments.
Keywords: autonomous driving; perception; object detection; semantic/instance segmentation; depth estimation; LiDAR; camera; multimodal sensor fusion; transformers; GNNs

Share and Cite

MDPI and ACS Style

Nikolaidis, S.; Koukaras, P. Vision and Multimodal Perception for Autonomous Driving: Deep Learning Architectures, Tasks, and Sensor Fusion. World Electr. Veh. J. 2026, 17, 277. https://doi.org/10.3390/wevj17060277

