Vision and Multimodal Perception for Autonomous Driving: Deep Learning Architectures, Tasks, and Sensor Fusion

Nikolaidis, Savvas; Koukaras, Paraskevas

doi:10.3390/wevj17060277

Open Access Review

by Savvas Nikolaidis and Paraskevas Koukaras *

School of Science and Technology, International Hellenic University, 14th km Thessaloniki-Moudania, 57001 Thessaloniki, Greece

* Author to whom correspondence should be addressed.

World Electr. Veh. J. 2026, 17(6), 277; https://doi.org/10.3390/wevj17060277

Submission received: 7 March 2026 / Revised: 8 May 2026 / Accepted: 19 May 2026 / Published: 22 May 2026

(This article belongs to the Section Automated and Connected Vehicles)

Abstract

The rapid development of autonomous vehicles is based mainly on their ability to accurately perceive their environment, where artificial intelligence and computer vision act as the core of environmental perception. In this regard, deep learning-based perception architectures have revolutionized the field of autonomous driving. However, as the use of single sensors fails to ensure reliability in complex scenarios, multimodal sensor fusion has become an essential part of modern deep learning architectures. In this context, covering the literature from 2020 to 2025, we analyze the transition from traditional Convolutional Neural Networks (CNNs) to modern Vision Transformers (ViTs) and explore data fusion design methodologies at various processing levels. In addition, significant limitations related to adverse weather conditions and dynamic environments, computational resources and overall quality and management of data are identified. The conducted comparative analysis indicates that vision-transformer and multimodal fusion methodologies provide higher accuracy in perception tasks but at the cost of increased computational requirements and sensor synchronization challenges. Finally, it becomes clear that achieving full autonomy requires further research in subjects such as collaborative perception, unsupervised domain adaptation and the creation of lightweight models, thus offering a roadmap for future developments.

Keywords: autonomous driving; perception; object detection; semantic/instance segmentation; depth estimation; LiDAR; camera; multimodal sensor fusion; transformers; GNNs

1. Introduction

Artificial Intelligence (AI) has gone through many winters and summers over the years, with each cycle influencing the development of autonomous driving technologies. The current AI summer marks the transition from the notion that “it is hard to imagine discovering a set of rules that can replicate a driver’s behavior” [1] to an immense data-driven disruption in the field of autonomous driving. Likewise, transportation is undergoing a profound transformation with the emergence of autonomous vehicles, heralding a new era defined by the tight integration of AI and computer vision systems [2]. Realizing fully autonomous vehicles requires a synergy of technological advancements, where AI algorithms act as the brain, processing vast amounts of sensory data, while computer vision serves as the eyes, interpreting the visual world [3]. This integration empowers vehicles to perceive, interpret, and navigate their surroundings with unprecedented levels of autonomy and precision [4,5].

1.1. Scope of the Review

Although deep learning and computer vision technologies have made significant progress, large-scale deployment of fully autonomous vehicles capable of operating without supervision in complex environments remains a substantial challenge. In this context, the scope of this study is to explore the integration of cutting-edge deep learning-based techniques that can enhance the accuracy, efficiency, and overall performance of computer vision systems in autonomous vehicles and address the key challenges and limitations of current frameworks and approaches. For this reason, this study adopts a literature-based methodology, utilizing a wide range of scientific sources published between 2020 and 2025, to offer insights into deep learning-based autonomous driving. As a result, this review serves as a strategic roadmap, focusing on a high-level synthesis of architectures rather than granular technical details. Its goal is to provide a conceptual framework that guides researchers toward primary sources for more specialized technical analysis.

1.2. Novelty and Contribution

Rather than presenting experimental results or developing a prototype system, the objective of this study is to provide a structured and comprehensive evaluation of the most promising vision and multimodal perception technologies for autonomous driving. Therefore, the main contributions of this paper are to:

Provide a structured and comprehensive evaluation of the most promising modern technologies. Unlike earlier reviews that focus on traditional CNN-based fusion, this work covers state-of-the-art technologies such as Vision Transformers (ViTs).
Provide a comparative analysis, including 2D vs. 3D perception, single-sensor vs. multimodal fusion, and classic vs. Transformer-based models, in order to highlight their respective strengths and limitations.
Identify key challenges and outline promising future directions, so that this work can serve not only as a review of the state-of-the-art but also as a roadmap for future research.

1.3. Related Work

While several surveys have explored sensor fusion in autonomous vehicles, we consider an updated analysis necessary, given the rapid evolution of deep learning and the introduction of novel methods and approaches. Foundational surveys published up to 2020 laid the groundwork for multimodal integration and remain complementary to this work, but they mainly centered on traditional deep learning architectures such as CNNs. As a result, concepts such as attention mechanisms and Transformer-based fusion are not sufficiently addressed in those works. Table 1 provides a comparative overview of our work in relation to the existing literature.

1.4. Structure of the Paper

This paper is structured to provide a comprehensive exploration of the advancements and innovations in autonomous vehicle technology, with a particular focus on the integration of deep learning-based AI and computer vision. The methodological approach followed for this review is presented in Section 2. Section 3 explores foundational deep learning architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Vision Transformers (ViTs), Deep Belief Networks (DBNs), Autoencoders, and Graph Neural Networks (GNNs). Section 4 examines the core task of perception and scene understanding in autonomous driving. Section 5 explores multimodal sensor fusion in autonomous vehicles. Section 6 discusses the challenges and limitations in perception for autonomous driving, as well as open problems and benchmarking gaps. Section 7 presents a summary of key findings and a comparative analysis of all autonomous driving approaches examined throughout the study, followed by future research directions.

2. Materials and Methods

Methodological Approach

A qualitative, thematic review methodology was adopted for this study. The selected literature was categorized by functional domain and further organized according to architectural approaches. This allowed for a comparative analysis of different methodologies and a synthesis of their respective benefits, challenges and limitations. Emphasis was placed on recent developments published between 2020 and 2025, with a focus on topics such as sensor technologies and neural network architectures (e.g., CNNs, RNNs, Transformers).

To ensure comprehensive coverage, the review spanned over 100 academic and industrial sources identified through targeted keyword searches across several databases, including Google Scholar, IEEE Xplore, ScienceDirect and Scopus, among others. The keywords and search strings used in the search process are listed in Table 2. Initially, we identified 326 sources; removing 18 duplicates left 308 sources for title and abstract screening. This initial screening excluded 104 records. Subsequently, 204 reports were sought for retrieval, of which nine could not be retrieved, leaving 195 reports to be assessed for eligibility through full-text assessment. After applying the exclusion criteria defined in Table 3, 72 reports were excluded and 123 sources were retained based on chronological relevance and their contribution to the field. Beyond a qualitative review, this study performs a quantitative meta-analysis of the selected 123 sources to identify statistical shifts in architectural preferences and sensor strategies between 2020 and 2025. The literature review process is illustrated in the PRISMA flow diagram of Figure 1.
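The screening counts reported above can be checked with simple arithmetic. The short sketch below only restates the figures given in the text; the variable names are illustrative and not part of the PRISMA protocol itself.

```python
# Counts reported in the text for each stage of the literature selection.
identified = 326
duplicates_removed = 18
excluded_on_title_abstract = 104
not_retrieved = 9
excluded_on_full_text = 72

# Each stage is the previous stage minus the records dropped at that stage.
screened = identified - duplicates_removed
sought_for_retrieval = screened - excluded_on_title_abstract
assessed_full_text = sought_for_retrieval - not_retrieved
included = assessed_full_text - excluded_on_full_text

for stage, count in [
    ("screened (titles and abstracts)", screened),
    ("sought for retrieval", sought_for_retrieval),
    ("assessed in full text", assessed_full_text),
    ("included in the review", included),
]:
    print(f"{stage}: {count}")
```

The four derived values match the 308, 204, 195 and 123 records stated in the paragraph.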

To ensure the integrity of the analysis, we established the criteria outlined in Table 3, focusing exclusively on high-fidelity, peer-reviewed research from 2020–2025. In addition, to mitigate selection bias, we used a multi-database approach (IEEE, Scopus, ScienceDirect, etc.), avoiding reliance on a single publisher. Since we also recognized the tendency for journals to prioritize positive results, we include in our analysis examples of failure cases and technical trade-offs. Finally, the use of the PRISMA protocol (Figure 1) ensures the transparency of the methodology.

3. Background

Starting from an overview of key principles of autonomous driving, this section presents foundational deep learning architectures that constitute the backbone of these systems.

3.1. Autonomous Driving

The existing autonomous driving architectures typically employ a hierarchical structure, encompassing perception, localization, planning, and control modules [9]. The perception module, which is examined in this study, is responsible for gathering and interpreting data from various sensors, such as cameras, LiDAR (Light Detection and Ranging), radar, and ultrasonic sensors, to create a comprehensive representation of the vehicle’s surroundings [10]. These sensors are used to gather data, allowing the vehicle to create an accurate depiction of its surroundings [11].

Vehicle automation is separated into six levels, as formally standardized by SAE International, ranging from 0, denoting no automation, to 5, representing full automation under all driving conditions [2]. Achieving Level 5 automation, where the vehicle can handle all driving tasks in all conditions without human intervention, remains the ultimate goal for many car manufacturers and researchers [2]. As technology matures, autonomous vehicles are poised to reshape urban landscapes, revolutionize logistics, and redefine personal mobility, paving the way for a more connected, efficient, and sustainable transportation ecosystem [12].

3.2. Foundational Deep Learning Architectures for Perception

3.2.1. Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have emerged as a fundamental building block for perception tasks in autonomous vehicles, demonstrating exceptional performance in image and object recognition [13,14]. These neural networks exploit the inherent spatial structure of visual data, allowing them to hierarchically extract features and understand complex relationships within scenes. Initially, they start by extracting low-level features such as edges and textures in the shallow layers, and progress to high-level features like object parts and shapes in the deeper layers [15,16].

Figure 2 presents the basic architecture of a deep Convolutional Neural Network (CNN) used for image recognition and classification. Beyond this architecture, three basic concepts lie at the heart of a CNN: receptive fields, shared weights and translation invariance. Each neuron in a convolutional layer has a specific “field of view”, and input outside this field of view cannot alter the activation of the neuron. In other words, each neuron is only responsible for processing data from a specific region of the input image, which is known as the receptive field [17]. The receptive field of a neuron therefore refers to the region of the input that significantly influences that neuron’s output, and it corresponds to the region of the feature map covered by the kernel that produced that unit’s value. This approach allows the network to focus on local patterns and features, which is particularly useful for visual data [18].
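To make the notion of a receptive field concrete, the following minimal sketch computes how many input pixels along one axis can influence a single unit after a stack of convolutional layers. It uses the standard recurrence for kernel size and stride and ignores dilation and padding; the layer stacks in the example are hypothetical and not taken from any network discussed in this review.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels, along one axis) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    size = 1  # a single input pixel only "sees" itself
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump  # each extra kernel tap widens the view by one jump
        jump *= stride                    # striding spreads later units further apart
    return size


print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # three stacked 3x3 convolutions -> 7
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # stride 2 in the middle layer -> 9
```

Deeper layers and strided layers both enlarge the receptive field, which is why units in deep layers can respond to object parts and shapes rather than only to edges.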

The practice of weight-sharing, where different connections share the same weights (parameters), is a widely adopted and effective technique in neural networks and deep learning. There are two primary yet distinct purposes typically associated with weight-sharing in deep learning architectures. The first is to reduce the number of free parameters that need to be stored or updated during learning, thereby reducing training time and cost as well as helping against overfitting. The second function of weight-sharing is to apply the same operation at different locations of the input data in order to process the data uniformly and provide a basis for translation-invariant recognition in CNN architectures. Weight-sharing, in conjunction with back propagation, which is the process of tuning the weights of neural networks, has proven to be very useful in computer vision [19].
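The parameter savings from weight-sharing can be illustrated with a short count. The sketch below compares a convolutional layer, whose kernel is reused at every image location, with a fully connected layer that maps the same input to an output of the same size. The image size and channel counts are hypothetical values chosen only for illustration.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a 2D convolution; the kernel is shared across all locations."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def dense_params(height, width, in_channels, out_channels):
    """Weights plus biases of a fully connected layer producing an output map of the same spatial size."""
    inputs = height * width * in_channels
    outputs = height * width * out_channels
    return inputs * outputs + outputs


# Hypothetical layer: 32x32 RGB input, 16 output channels, 3x3 kernel.
print(conv_params(3, 3, 16))        # 448
print(dense_params(32, 32, 3, 16))  # 50348032
```

In this example the shared-kernel layer needs several orders of magnitude fewer parameters, which is the first purpose of weight-sharing described above.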

Finally, translation invariance, which is regarded either as an inherent property of CNNs or as an ability that can be learned [20], equips them with the ability to recognize patterns regardless of their spatial location within an image. This characteristic, which is of paramount importance in autonomous driving tasks, enhances their robustness in processing real-world visual data where objects can appear at varying positions [21]. CNNs approach translation invariance by applying the same filters across the entire input image, enabling the network to detect features regardless of their location [19,21]. Strictly speaking, shared filters alone make the feature maps shift together with the input (translation equivariance); invariance emerges once those responses are aggregated, for example by pooling.
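A one-dimensional toy example shows how applying the same filter everywhere relates to translation invariance. In this sketch, which uses a hypothetical hand-picked kernel and signal, the filter response moves together with the pattern, and a global maximum over positions then gives the same value wherever the pattern sits. It is a simplification; real CNNs use learned 2D filters and local pooling.

```python
def correlate_valid(signal, kernel):
    """Slide the same kernel over every position of the signal (no padding)."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel))) for i in range(n)]


kernel = [1, 2, 1]                       # hypothetical pattern detector
signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]     # pattern near the left
shifted = [0, 0, 0, 0, 1, 2, 1, 0, 0]    # same pattern moved two steps to the right

resp = correlate_valid(signal, kernel)            # [1, 4, 6, 4, 1, 0, 0]
resp_shifted = correlate_valid(shifted, kernel)   # [0, 0, 1, 4, 6, 4, 1]

# The response map shifts with the input, while its global maximum stays the same.
print(resp_shifted[2:] == resp[:-2])     # True
print(max(resp) == max(resp_shifted))    # True
```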

Numerous CNN-based methods are used in self-driving vehicles owing to their ability to extract features effectively [22]. Depending on the application and the dataset used for training, each type of CNN achieves a different level of performance [23]. A crucial component of these systems, which affects the overall performance of a model, is the backbone architecture, which extracts features from sensor data (camera images, LiDAR point clouds) for 2D and 3D object detection [15].

For 2D object detection tasks, VGG, ResNet, and Inception are popular backbone networks that extract features from sensor data and excel at capturing spatial relationships within images [15]. Similarly, MobileNet, designed for mobile and embedded vision applications, reduces the number of parameters and increases computational efficiency. This backbone network is widely used in SSD (Single Shot MultiBox Detector) for real-time object detection, while its successors, MobileNetV2 and MobileNetV3, further enhance efficiency and performance, rendering them well-suited for edge computing in autonomous vehicles [15]. Last but not least, EfficientNet focuses on balancing accuracy and computational efficiency and is well suited to real-time applications in autonomous vehicles [15].

Similarly, in 3D object detection, point cloud and voxel-based 3D convolutional neural networks have played a crucial role in autonomous driving. Among point cloud networks, PointNet directly processes raw point cloud data using shared multi-layer perceptrons to extract features, while its successor, PointNet++, introduces a hierarchical feature learning approach that captures the local structures within point clouds through a multi-scale representation. PointCNN employs the X-convolution technique to capture the local geometric structures of point cloud data, enabling the learning of feature representations from unstructured point sets [15]. Among voxel-based networks, VoxelNet transforms point cloud data into a voxel grid and then applies 3D convolutional layers to extract features. In this way, pointwise features within each voxel are combined for robust detection. Finally, SECOND considerably decreases computational complexity and improves processing speed through the use of sparse 3D convolutions [15].

CNNs can achieve high accuracy, yet their speed in real-time scenarios remains a concern due to high computational costs. However, there are lightweight models that can achieve superior computational speed and inference times. In addition, while the performance of cameras (which are the main input sensor for CNNs) degrades significantly in darkness or adverse weather, CNNs remain robust against image transformations like translation and rotation. Finally, deep architectures increase training time and complexity, but libraries like TensorFlow and the use of GPUs/TPUs enable rapid training and inference, making deployment on autonomous vehicles feasible [22].

3.2.2. Recurrent Neural Networks

As autonomous vehicles rely on the perception of the constantly evolving surrounding environment, it is crucial to obtain a more comprehensive representation by storing and tracking all relevant information from the past. Recurrent Neural Networks can be employed efficiently to address this issue, as they are primarily utilized to capture the temporal dynamics of a sequence [24].

Recurrent Neural Networks (RNNs) were specifically designed to process temporal information, in contrast to traditional machine learning techniques and Convolutional Neural Networks, which are proficient at handling spatial information [25]. The basic architecture of RNNs, presented in Figure 3, consists of an input layer, a hidden layer and an output layer with recurrent connections, allowing information to cycle within the network [26].

As opposed to feedforward neural networks, RNNs have the distinctive capability of retaining a memory of previous inputs by using their internal state to process sequences of inputs. This makes them well suited to cases where the temporal order of data is critical [26]. However, when dealing with a large number of time steps, the gradient of the RNN’s loss function can either vanish or grow exponentially. In the first case the network is prevented from learning long-term dependencies effectively, while in the latter case the weight updates become excessively large and training becomes unstable or diverges. To address this issue, gated RNNs such as the Long Short-Term Memory (LSTM) network and the Gated Recurrent Unit (GRU) have been introduced [25,26]. LSTM networks can maintain and update their internal state over long periods, which makes them effective for tasks requiring the modeling of long-term dependencies. An LSTM cell, as presented in Figure 4, consists of the input gate, the forget gate and the output gate, which regulate the cell state and hidden state.
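The role of the three LSTM gates can be illustrated with a minimal, dependency-free sketch of a single time step. It follows the commonly used LSTM formulation; the parameter names (`Wi`, `Wf`, `Wo`, `Wg` and their biases) and the hand-set values are assumptions made for this example, whereas in practice all of them are learned. The gates are deliberately saturated so that the cell state is carried almost unchanged across time steps, which is the mechanism that supports long-term memory.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    xh = x + h_prev                                           # concatenate input and previous hidden state
    i = [sigmoid(v) for v in affine(p["Wi"], xh, p["bi"])]    # input gate
    f = [sigmoid(v) for v in affine(p["Wf"], xh, p["bf"])]    # forget gate
    o = [sigmoid(v) for v in affine(p["Wo"], xh, p["bo"])]    # output gate
    g = [math.tanh(v) for v in affine(p["Wg"], xh, p["bg"])]  # candidate cell update
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


# Hypothetical, hand-set parameters: forget gate ~1, input gate ~0, output gate ~1.
hidden, inputs = 2, 1
zeros = [[0.0] * (inputs + hidden) for _ in range(hidden)]
params = {
    "Wi": zeros, "bi": [-20.0] * hidden,
    "Wf": zeros, "bf": [20.0] * hidden,
    "Wo": zeros, "bo": [20.0] * hidden,
    "Wg": zeros, "bg": [0.0] * hidden,
}

h, c = [0.0] * hidden, [0.5, -0.25]
for x_t in ([1.0], [-3.0], [2.0]):
    h, c = lstm_step(x_t, h, c, params)

print(c)  # the cell state stays very close to its initial value [0.5, -0.25]
```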

The GRU architecture, as presented in Figure 5, consists of two gates, namely the update gate and the reset gate, which control the flow of information and ensure that relevant information is preserved while irrelevant information is excluded. Because they use fewer gates than LSTMs and keep no separate cell state, GRUs have fewer parameters, which makes them suitable for tasks where computational resources are limited or when faster training is necessary [26].
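A GRU step can be sketched in the same dependency-free style. The example assumes one common convention, in which the update gate `z` interpolates between the previous hidden state and a candidate state (some libraries swap the roles of `z` and `1 - z`). The parameter names and hand-set values are hypothetical and serve only to show the two extreme behaviors of the update gate.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def gru_step(x, h_prev, p):
    """One GRU time step; returns the new hidden state."""
    xh = x + h_prev
    z = [sigmoid(v) for v in affine(p["Wz"], xh, p["bz"])]   # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], xh, p["br"])]   # reset gate
    gated = x + [rv * hv for rv, hv in zip(r, h_prev)]       # reset gate filters the old state
    cand = [math.tanh(v) for v in affine(p["Wh"], gated, p["bh"])]
    return [(1.0 - zv) * hv + zv * cv for zv, hv, cv in zip(z, h_prev, cand)]


hidden, inputs = 2, 1
zeros = [[0.0] * (inputs + hidden) for _ in range(hidden)]
base = {
    "Wz": zeros, "Wr": zeros, "br": [0.0] * hidden,
    "Wh": [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]], "bh": [0.0] * hidden,
}

keep = dict(base, bz=[-20.0] * hidden)       # update gate ~0: preserve the old state
overwrite = dict(base, bz=[20.0] * hidden)   # update gate ~1: replace it with the candidate

h_keep = gru_step([2.0], [0.3, -0.3], keep)
h_over = gru_step([2.0], [0.3, -0.3], overwrite)
print(h_keep)  # close to [0.3, -0.3]
print(h_over)  # close to [tanh(2), tanh(-2)]
```

A GRU uses three weight blocks (update, reset, candidate) where an LSTM uses four, which is one way to see why it is the lighter of the two.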

RNNs offer high accuracy in long-term predictions as they effectively model temporal dependencies. However, their sequential computation nature can increase latency compared to other architectures. Furthermore, similar to CNNs, sensor limitations significantly impact their robustness in adverse weather conditions [25]. Finally, their complexity and “black-box” nature lead to low explainability, creating significant obstacles for the validation and approval of these systems for autonomous driving deployment [25,26].

3.2.3. Transformers

Another paradigm of foundational deep learning architectures for autonomous driving is Transformers, which have revolutionized the field of Natural Language Processing and are currently being applied to many fields related to image and video analysis, point clouds, sound and time series data [27]. Their success stems from their ability to handle long-range dependencies and parallelize computations, overcoming the limitations of RNNs [28,29]. In addition, the use of attention mechanisms to aggregate sets of features across an entire image or within local neighborhoods allows the model to focus on the most relevant parts of the input [30].

The two main components that compose the architecture of the Transformer are the Encoder and the Decoder. The Encoder component processes the input embeddings with the use of a Multi-Head Attention mechanism and Feed-Forward Networks, while these two operations are further enhanced by Layer Normalization and residual connections. Subsequently, the Decoder component, which is structurally akin to the Encoder, directs its focus on the Encoder output, generating the final output sequence. Finally, the crucial task of recognizing the sequence order in this architecture is achieved through the use of positional encodings [31]. Another core component of the Transformer architecture is the Self-Attention mechanism, which, along with the Multi-Head Attention mechanism, is responsible for the assessment of the relation among various segments of the input sequence regardless of their respective distance [31,32]. In addition, the aforementioned Feed-Forward Network helps the model learn high-level representations, while Skip connections, another integral part of Transformer models, mitigate the vanishing gradient problem and facilitate stable learning if combined with Layer Normalization. Finally, the output layer is of paramount importance as it converts the processed data into interpretable outputs that are vital in various tasks [31].
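The Self-Attention computation and the positional encodings described above can be sketched without any external library. The example shows a single attention head using scaled dot-product attention and the sinusoidal encoding of the original Transformer, which is one common choice among several. The identity projection matrices and the three toy tokens are assumptions made for readability; real models learn the projections and run several heads in parallel.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows (tokens) of X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            angle = pos / (10000 ** (2 * (j // 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Three hypothetical tokens with four features each.
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

pe = positional_encoding(3, 4)
tokens = [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]

out, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Every token attends to every other token in one step, regardless of their distance in the sequence, and all rows can be computed in parallel.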

As far as computer vision is concerned, transformers have recently marked a significant breakthrough with the development of the Vision Transformer (ViT) which is a modified version of the Transformer architecture specifically tailored for visual tasks [33]. As presented in Figure 6, contrary to traditional CNNs, the ViT treats an image as a sequence of feature vectors derived from image patches. This approach enables the model to utilize the Transformer’s strength in managing sequential data and capturing long-range dependencies across the image. The ViT architecture closely resembles that of the original Transformer, consisting of multiple identical layers, each containing two main components, a Multi-Head Self-Attention mechanism and a Feed-Forward Network. The Self-Attention mechanism helps the model capture the significance of various image patches and understand the relationships among them, while the Feed-Forward network further processes this information using two linear transformations and a non-linear activation function [34].
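The way a ViT turns an image into a sequence can be sketched as a simple patch extraction. The example below splits a small image into non-overlapping square patches and flattens each one into a vector. The 8x8 image and 4x4 patch size are hypothetical, and the learned linear projection and position embeddings that a ViT applies to each flattened patch are omitted here.

```python
def image_to_patches(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append all channels of this pixel
            patches.append(flat)
    return patches


# Hypothetical 8x8 image with 3 channels; each pixel stores [row, col, row + col].
image = [[[r, c, r + c] for c in range(8)] for r in range(8)]
patches = image_to_patches(image, 4)

print(len(patches), len(patches[0]))  # 4 patches, each of length 4 * 4 * 3 = 48
```

With the same scheme, a 224x224 image cut into 16x16 patches yields (224 / 16)^2 = 196 tokens, which then play the role that words play in a language Transformer.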

In the context of autonomous driving, the ViT architecture has demonstrated outstanding performance across a range of computer vision tasks, including image classification, object detection, tracking and image segmentation, which are essential for comprehensive environmental perception [31,34]. Additionally, ViTs can achieve global scene understanding: they function as advanced feature extractors and offer a distinct advantage over traditional CNNs by integrating information across wider visual fields, while their ability to process data in parallel contributes to computational efficiency, which is crucial for real-time operations in autonomous vehicles [31]. Furthermore, the Self-Attention mechanism inherent to ViTs handles spatial-temporal data and enables the detection of complex, long-range dependencies, which supports safe navigation [31]. Moreover, ViTs are well suited to trajectory prediction, behavior prediction, and motion planning tasks, thanks to the encoder-decoder structure that they incorporate [35]. Finally, hybrid models that combine convolutional and Self-Attention layers allow ViTs to take part in the decision-making tasks of autonomous driving, since they can process complex visual data efficiently [31].

The application of ViTs in autonomous driving has led to the development of a great number of approaches for different tasks, including perception, prediction, planning and decision making. In 2D perception, models like PersFormer [36] and CurveFormer [37] show enhanced performance in lane detection tasks. In addition, Panoptic SegFormer offers a comprehensive solution to panoptic segmentation with Transformers by unifying semantic and instance segmentation within a single framework [38]. Last but not least, VectorMapNet can be used for high-definition map generation, enabling explicit modeling of spatial relationships between map elements [39].

In 3D perception, approaches such as BEVFormer have contributed significantly to the field. Specifically, BEVFormer employs a spatiotemporal ViT architecture, seamlessly integrating spatial and temporal data to achieve unified BEV (Bird’s-Eye View) representations [40]. Finally, 3D object tracking for autonomous vehicles has advanced through approaches like MOTR (End-to-End Multiple-Object Tracking with Transformer), which introduces a “track query” mechanism that models temporal variations across video sequences, avoiding the need for conventional heuristic-based methods [41].

Finally, in the field of prediction, planning and decision-making numerous approaches have been proposed such as the LaneTransformer [42], a high efficiency trajectory prediction model, which significantly decreases computational time while preserving accuracy, as well as TransFuser [43], a sophisticated approach for planning and decision making, which utilizes multiple Transformer modules for comprehensive data processing and fusion.

Despite their superior accuracy in complex scene recognition, ViTs entail high computational demands, making real-time processing a significant challenge. Furthermore, as models become more complex, they require innovative solutions and techniques to balance computational requirements with deployment feasibility on embedded devices. As a result, transformer-based models in autonomous driving remain comparatively less deployment-mature than established CNN-based pipelines, particularly in latency-constrained and resource-limited embedded settings [31].

Table 4 summarizes the performance trade-offs of CNNs, RNNs and ViTs that have been analyzed above.

3.2.4. Autoencoders

During the autonomous driving process, a large volume of images is collected, resulting in high-dimensional data that hinders computational efficiency. To address this issue, Autoencoders are commonly employed, mainly for dimensionality reduction and feature extraction [24]. An autoencoder is a type of neural network designed to compress input data into a lower-dimensional, meaningful representation and then reconstruct it as closely as possible to the original [44]. As shown in Figure 7, the Autoencoder architecture consists of two main components, an Encoder that compresses the input and a Decoder that reconstructs it, aiming to minimize the reconstruction error [25]. The goal is to ensure that reconstruction is sufficiently accurate while also producing a latent representation that is both useful and meaningful [45]. However, traditional Autoencoders have been criticized for their limited generative capabilities and their tendency to just “memorize” the training data, resulting in overfitting. In contrast, the Variational Autoencoder (VAE) introduces a probabilistic approach, allowing for a regularized latent space and generative modeling across the entire data distribution. VAEs optimize both a reconstruction loss and a similarity loss (typically a Kullback-Leibler divergence between the learned latent distribution and a prior) to ensure that the latent space is continuous and meaningful [25].
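The two-term VAE objective can be written down compactly. The sketch below assumes the standard formulation, in which the similarity term is the Kullback-Leibler divergence between a diagonal Gaussian produced by the encoder and a standard normal prior, and the reconstruction term is a squared error. The function names and the weighting factor `beta` are illustrative choices, not taken from the cited works.

```python
import math


def reconstruction_loss(x, x_hat):
    """Sum of squared differences between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, exp(log_var)) and N(0, 1), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction term plus the (optionally weighted) regularization term."""
    return reconstruction_loss(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# A latent code that already matches the prior incurs no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # prints -0.0, i.e. zero penalty
# Moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0], [0.0]))            # 0.5
```

The KL term pulls every encoded input toward the same prior, which is what keeps the latent space continuous instead of letting the network memorize isolated codes.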

3.2.5. Graph Neural Networks

Graph Neural Networks (GNNs) are a specialized class of neural networks designed to process data represented in graph structures. Graphs, which are mathematical representations consisting of nodes and edges, are well-suited for capturing complex relationships and dependencies within data. GNNs operate with the use of a message-passing mechanism through which nodes exchange information with their neighbors, updating their state based on the received messages and their own features [46]. This mechanism allows GNNs to effectively learn from and reason about graph-structured data, supporting a wide range of applications [46]. In this context, GNNs provide a powerful way to model the relationships between different entities in a scene, such as vehicles, pedestrians, and traffic signals [47]. There are various types of GNNs including Graph Convolutional Networks (GCNs), Graph Recurrent Networks (GRNs) and Graph Attention Networks (GATs) [48].
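The message-passing mechanism can be illustrated with one round of neighbor aggregation on a small scene graph. In this sketch each node averages its neighbors' feature vectors and mixes the result with its own features using a fixed 50/50 weighting. That fixed rule is an assumption made for clarity; GCNs, GATs and related models replace it with learned weights, attention coefficients and non-linearities. The scene entities and feature values are hypothetical.

```python
def message_passing_step(features, edges):
    """One round of mean-aggregation message passing on an undirected graph."""
    neighbors = {node: [] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = {}
    for node, own in features.items():
        messages = [features[n] for n in neighbors[node]]
        if messages:
            aggregated = [sum(values) / len(messages) for values in zip(*messages)]
        else:
            aggregated = [0.0] * len(own)  # isolated node: nothing to aggregate
        # Update the node state from its own features and the aggregated messages.
        updated[node] = [0.5 * o + 0.5 * a for o, a in zip(own, aggregated)]
    return updated


# Hypothetical scene graph: nodes are scene entities, edges mark interactions.
features = {
    "ego": [0.0, 0.0],
    "car": [4.0, 0.0],
    "pedestrian": [0.0, 2.0],
    "traffic_light": [8.0, 8.0],
}
edges = [("ego", "car"), ("ego", "pedestrian"), ("car", "traffic_light")]

updated = message_passing_step(features, edges)
print(updated["ego"])            # [1.0, 0.5]
print(updated["traffic_light"])  # [6.0, 4.0]
```

Stacking several such rounds lets information travel further through the graph, so that, for instance, the ego vehicle's state eventually reflects the traffic light it is not directly connected to.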

Graph-based approaches in computer vision have proven to be highly effective in various autonomous driving applications, including, among others, behavior recognition [48], traffic prediction [49], scene input representation [50], scene understanding [51], motion prediction [47] and motion planning [52]. For instance, Shi and Rajkumar introduced a GNN-based framework called Point-GNN, designed for object detection using LiDAR point cloud data [53]. Additionally, for the task of trajectory prediction, Sheng et al. introduced a Graph-based Spatio-Temporal Convolutional Network (GSTCN) that predicts the future trajectory distributions of surrounding vehicles based on their past movements [54]. Last but not least, Klimke et al. employed GNNs to develop a cooperative motion planning approach for multiple vehicles navigating urban intersections [55].

4. Perception and Scene Understanding

Recent advances in deep learning have significantly enhanced the performance of various perception tasks, such as object detection, semantic segmentation, and depth estimation, which are essential for enabling autonomous vehicles to perceive and comprehend their surroundings and make safe and informed decisions [12,56]. On this basis, this section provides a technical overview of deep learning methods for perception tasks in autonomous driving. It also explores the state of the art, along with fundamental deep learning concepts and architectures for autonomous vehicle perception, and discusses novel deep learning approaches that are critical for autonomous vehicles.

4.1. Object Detection and Classification

Object detection is a crucial subtask within the domain of computer vision, often closely intertwined with the task of object classification. However, object classification involves identifying the different object classes present in an image, while object detection goes a step further by determining the precise locations of those objects with the use of bounding boxes [57].

In general, two-stage object detection models demonstrate higher accuracy than one-stage detectors. Furthermore, two-stage approaches excel at identifying small vehicles due to their effective candidate region proposal mechanisms. As a result, many detection algorithms in autonomous driving leverage two-stage network architectures. However, because real-time performance is a prerequisite in autonomous driving systems, one-stage detection methods are expected to predominate in the future. As for the comparison between anchor-based and anchor-free architectures, the latter exhibit greater computational efficiency [57,58].

4.2. Semantic Segmentation

Semantic segmentation techniques assign a label or category to each individual pixel within an image. In contrast to target-level object detection methods, semantic segmentation-based approaches appear to exhibit greater accuracy and precision, as they recognize clusters of pixels that correspond to distinct categories. Furthermore, these methods demonstrate a proficiency in representing the location and contour of objects, which is highly crucial for perceiving the environment in autonomous vehicle applications [58]. Semantic segmentation techniques can be broadly classified into two distinct categories, encoder–decoder structure models and modified convolution structure models, each exhibiting its own characteristics and functions [59].
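The per-pixel labeling described above can be sketched as follows. This is a minimal illustration, assuming a network has already produced a score array of shape (classes, height, width); the function names are hypothetical and the sketch is not tied to any specific model from the cited works.

```python
import numpy as np

def logits_to_label_map(logits):
    """Turn per-class scores of shape (C, H, W) into an (H, W) label map."""
    # Each pixel receives the index of its highest-scoring class.
    return np.argmax(logits, axis=0)

def class_pixel_counts(label_map, num_classes):
    """Count how many pixels were assigned to each class."""
    return np.bincount(label_map.ravel(), minlength=num_classes)
```

The label map has the same spatial size as the input, which is what lets segmentation describe object contours rather than only boxes.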

In recent years, the Transformer-based architecture has been increasingly utilized as a powerful feature extractor for semantic vehicle detection [57]. For instance, the SETR model leverages the ViT as its backbone while integrating multiple CNN decoders to enlarge feature resolution [60]. Likewise, SegFormer introduces a novel hierarchical structured Transformer block to capture multiscale features and then employs MLPs (Multi-Layer Perceptrons) to efficiently ensemble the features from different layers for decoding [61]. Furthermore, the SeaFormer architecture uses a squeeze axial and detail-enhanced attention module to achieve an optimal balance between segmentation accuracy and latency, particularly for deployment on ARM-based mobile platforms [62].

Semantic segmentation-based vehicle detection methods often demand high computational resources, leading to slower inference speeds compared to other vehicle detection approaches. Consequently, the design and deployment of computationally efficient models that balance speed and accuracy is a critical consideration for future development. Recent research has focused on lightweight vehicle semantic segmentation architectures, such as ESPNet, which employs optimized convolutional modules and is 22 times faster and 180 times smaller than existing state-of-the-art models. Similarly, DFANet starts with a lightweight backbone and progressively consolidates discriminative features through a cascade of sub-networks and sub-stages, while LEDNet utilizes an asymmetric encoder-decoder structure for real-time semantic segmentation. It is therefore apparent that the development of lightweight, high-performance models will be a key priority for researchers in the field of autonomous driving [57,58].

4.3. Instance Segmentation

While semantic segmentation assigns class labels to each pixel in an image, instance segmentation extends this capability by distinguishing individual object instances. This distinction is particularly important in autonomous driving, where recognizing not only the type of objects but also their spatial relationships is essential for safe navigation. Notably, instance segmentation is valuable for assessing the motion state of dynamic obstacles, such as pedestrians and vehicles. Instance segmentation approaches are generally categorized into two main types: region proposal-based methods and masking-based methods [59].

In conclusion, region proposal-based methods generally achieve higher accuracy compared to masking-based approaches. However, significant challenges persist in instance segmentation, particularly in accurately detecting small objects and developing efficient end-to-end models along with effective training strategies [59].

4.4. 3D Perception and Depth Estimation

To address the limitations of 2D image-based perception, researchers have investigated approaches for 3D scene reconstruction and depth estimation by utilizing sensor data from LiDAR, radar, and stereo cameras. These modalities enable more accurate and reliable environmental representations. In this context, 3D object detection enhances an autonomous vehicle’s ability to perceive and recognize surrounding objects with greater precision, as data from cameras, LiDAR and radar offer valuable depth cues [63]. In other words, 3D object detectors utilize data from multiple sensors to generate three-dimensional bounding boxes, which include spatial coordinates (x, y, z), object dimensions (height, width, length) and yaw information [15].
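The box parameterization above (center, dimensions and yaw) can be expanded into explicit corner points. The sketch below is illustrative only and assumes a z-up coordinate frame, a yaw rotation about the vertical axis, and (x, y, z) denoting the box center; datasets differ on these conventions, and the function name is hypothetical.

```python
import numpy as np

def box3d_corners(x, y, z, length, width, height, yaw):
    """Return the 8 corners (8, 3) of a yaw-rotated 3D bounding box."""
    dx, dy, dz = length / 2.0, width / 2.0, height / 2.0

    # Corners in the box's own frame, centered at the origin.
    corners = np.array([
        [sx * dx, sy * dy, sz * dz]
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)
    ])

    # Rotation about the vertical (z) axis by the yaw angle.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])

    # Rotate, then translate to the box center.
    return corners @ rot.T + np.array([x, y, z])
```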

According to Wang et al., 3D object detection methods can be broadly categorized into three main groups depending on the type of sensor data they utilize: camera-based approaches, LiDAR-based techniques, and multi-sensor fusion strategies [64]. Mao et al. recognize four more distinctive classifications, namely transformer-based 3D Object Detection, temporal 3D Object Detection, label-efficient 3D Object Detection and 3D Object Detection in Driving Systems [65]. In this section, camera-based, LiDAR-based as well as transformer-based approaches will be explored.

4.4.1. Camera-Based Approaches

Camera-based methods can be further categorized into monocular, stereo-based, and multi-view 3D object detection. These approaches estimate depth and detect 3D objects with the use of monocular or stereo images. However, this task remains an ill-posed inverse problem, as identical objects in varying 3D poses can produce significantly different visual appearances in the image plane, thereby complicating the learning of reliable representations [66].

The next main class of camera-based 3D object detection is stereo-based methods. Stereo-based 3D object detection leverages a pair of images captured by two spatially separated cameras to mimic human binocular vision. By analyzing the disparities between corresponding points in the stereo image pairs, algorithms can compute accurate depth information that enables the detection of 3D objects [67]. Compared to monocular images, stereo image pairs offer additional geometric constraints that can be leveraged to infer depth information more accurately. Consequently, stereo-based methods generally demonstrate superior detection performance when compared to monocular-based approaches [65]. On the other hand, the use of stereo cameras in real-world applications is often hindered by the requirement for accurate calibration, which can be challenging to achieve in practice [65].
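The disparity-to-depth relation behind stereo methods can be written compactly. The following sketch assumes a rectified stereo pair, a focal length expressed in pixels and a baseline in meters, so that depth equals focal length times baseline divided by disparity; the parameter names are illustrative.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert stereo disparity (pixels) to metric depth for a rectified pair."""
    disparity = np.atleast_1d(np.asarray(disparity_px, dtype=float))

    # Zero or negative disparity has no finite depth; mark it as infinity.
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

Because depth is inversely proportional to disparity, a fixed matching error of one pixel costs far more accuracy at long range than at short range, which is one reason accurate calibration matters.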

Although stereo camera-based methods have demonstrated promising outcomes, they can be susceptible to challenges presented by low-texture areas, occlusions and depth discontinuities, which may result in inaccurate depth estimations and consequently introduce errors in 3D object detection. These methods can struggle to maintain reliable performance in the face of such complex real-world scenarios [15]. In addition, compared to monocular-based approaches, stereo-based methods leverage the depth and disparity information derived from stereo image matching, which significantly enhances their overall detection performance. However, this advantage comes at the cost of additional computational requirements from the auxiliary stereo matching network. Finally, although stereo-based 3D detection offers a more affordable solution for autonomous driving applications compared to LiDAR-based methods, a non-negligible performance gap still exists between the two approaches [65].

The last category of camera-based 3D object detection methods, namely multi-view fusion, utilizes multiple RGB images captured from different viewpoints to jointly interpret the 3D scene. By integrating information from various perspectives, these approaches aim to address the limitations of monocular and stereo techniques, enabling more precise 3D object localization and detection [15]. A state-of-the-art paradigm of multi-view 3D object detection is PointPainting, which enhances LiDAR data by incorporating semantic information from RGB images, effectively improving accuracy [68].

These methodologies, while promising, often necessitate substantial computational resources and intricate calibration procedures, which may restrict their applicability in real-time autonomous navigation [51].

4.4.2. LiDAR-Based Approaches

LiDAR sensors, which are an indispensable component of 3D detection tasks in autonomous driving, emit laser pulses and measure the time taken for them to reflect off surrounding objects, accurately calculating distances and generating precise 3D point cloud data. This detailed environmental mapping is crucial for advanced driver assistance systems. Moreover, unlike cameras, LiDAR operates reliably across different lighting conditions, as it does not depend on external light sources. As a result, this technology presents higher detection accuracy in comparison to camera-based methods [64].
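The time-of-flight principle described above reduces to a short calculation. The sketch below is a simplified illustration that assumes an ideal sensor reporting a round-trip time together with the beam's azimuth and elevation angles; the function names and the angle convention (x forward, z up) are assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def tof_to_range(round_trip_s):
    """Range in meters from a round-trip pulse time in seconds."""
    # The pulse travels to the target and back, hence the division by two.
    return SPEED_OF_LIGHT * round_trip_s / 2.0

def polar_to_xyz(range_m, azimuth_rad, elevation_rad):
    """Convert one range measurement and its beam angles to a 3D point."""
    horizontal = range_m * math.cos(elevation_rad)
    x = horizontal * math.cos(azimuth_rad)
    y = horizontal * math.sin(azimuth_rad)
    z = range_m * math.sin(elevation_rad)
    return (x, y, z)
```

Repeating this conversion for every emitted pulse yields the point cloud that the methods discussed in this subsection operate on.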

Despite their advantages, LiDAR systems also face a number of limitations, the primary one being point cloud sparsity: as distance increases, laser pulse energy weakens, reducing the density and accuracy of point cloud data. Likewise, the divergence of laser beams spreads points more widely at greater distances, which further increases point cloud sparsity. Furthermore, LiDAR’s insensitivity to color and texture restricts its effectiveness in detailed scene understanding and object recognition [64]. LiDAR-based approaches employ a variety of techniques, including point-based, grid-based, point-voxel-based and range-based methods, to detect and localize objects within the environment [65,69].

Voxel-based methods transform irregular and sparse point cloud data into a structured volumetric representation known as voxels. This process, called voxelization, divides the 3D space into a grid of equally sized volumetric cells, where each point is assigned to a corresponding voxel based on its spatial coordinates. Organizing the point cloud into a 3D grid structure, where each voxel represents a small volume within the 3D space, enables the application of traditional CNNs, which are well-suited for processing regular, grid-like data. Through 3D convolutions, voxel-based methods effectively capture spatial relationships and local features within the voxelized representation. By processing voxelized data with CNNs, voxel-based methods can effectively exploit the spatially local correlations that CNNs are inherently designed to capture [69].
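The voxelization step described above can be sketched with NumPy. This is a minimal illustration, assuming an (N, 3) array of points, a single cubic voxel size and a grid origin; it stores only occupied voxels and summarizes each one by the centroid of its points, which is one of several possible per-voxel features. The function name is hypothetical.

```python
import numpy as np

def voxelize(points, voxel_size, origin):
    """Group (N, 3) points into occupied voxels of equal size.

    Returns integer voxel indices, per-voxel centroids and point counts.
    """
    # Integer grid coordinates of the voxel containing each point.
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)

    # Keep only occupied voxels; `inverse` maps each point to its voxel row.
    unique_idx, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)

    # Accumulate point coordinates per voxel, then average.
    sums = np.zeros((len(unique_idx), 3))
    np.add.at(sums, inverse, points)
    centroids = sums / counts[:, None]
    return unique_idx, centroids, counts
```

Halving `voxel_size` raises the number of potential cells by a factor of eight, which illustrates the memory trade-off associated with finer grids.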

Voxel-based methods have been widely adopted for 3D object detection and have demonstrated promising performance [69]. Their primary advantages include reducing the dimensionality of point cloud data and improving the handling of occlusions, as voxelized representations capture spatial relationships between objects more effectively. Despite their advantages, voxel-based methods also present certain limitations. In particular, these methods can be computationally expensive and memory-intensive, given the large voxel grids and the exponential growth of computation with the increase in empty voxels. As a result, careful consideration of voxel size is crucial for achieving accurate detection, since smaller voxels offer higher resolution but significantly increase memory demands [70]. Moreover, the discretization process inherent to voxelization can result in information loss, potentially reducing detection accuracy. Nevertheless, voxel-based approaches have proven highly effective in capturing spatial structures and detecting objects within point cloud data, maintaining their importance in the field of 3D object detection [69].

In the case of the third category of LiDAR-based approaches, namely Point-voxel-based methods, a hybrid architecture that combines both point and voxel representations for 3D object detection is utilized [65]. These methods are designed to represent raw point clouds in a multi-feature format, offering a balance between capturing the detailed information of point clouds and the computational efficiency provided by voxel-based representations. This hybrid approach enhances the overall performance of 3D object detection by leveraging the strengths of both representations [64,70].

To conclude, by integrating point-based local feature learning with voxel-based global feature extraction, point-voxel methods strive to balance the capture of fine-grained details with the modeling of broader spatial relationships within a 3D scene. This hybrid strategy combines the efficiency and structured nature of voxel-based approaches with the flexibility of point-based methods for handling local variations [69]. Overall, the point-voxel based detection methods demonstrate superior detection accuracy compared to their pure voxel-based counterparts, albeit at the expense of increased inference time [65].

In range-based methods, the core component is the range image, a dense and compact 2D representation where each pixel encodes 3D distance information rather than RGB values. These methods tackle the detection task through two main strategies: developing models and operators specifically designed for range images and selecting appropriate viewpoints for detection. Due to the inherent characteristics of range images, conventional or specialized 2D convolutions can be efficiently applied. However, range view detection remains susceptible to occlusions and scale variations. As a result, combining feature extraction from the range view with object detection from the bird’s-eye view has become the most practical approach for range-based 3D object detection [65].
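A range image of the kind described above can be built by a spherical projection of the point cloud. The sketch below is illustrative and assumes a spinning sensor at the origin with a known vertical field of view; the image size, field-of-view parameters and function name are assumptions, and when several points fall into one pixel the closest range is kept.

```python
import numpy as np

def points_to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project (N, 3) points into an (height, width) image of ranges."""
    r = np.linalg.norm(points, axis=1)
    keep = r > 0
    points, r = points[keep], r[keep]

    yaw = np.arctan2(points[:, 1], points[:, 0])   # horizontal angle
    pitch = np.arcsin(points[:, 2] / r)            # vertical angle

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)

    # Map angles to fractional pixel coordinates, then to integer cells.
    u = 0.5 * (1.0 - yaw / np.pi) * width
    v = (fov_up - pitch) / (fov_up - fov_down) * height
    cols = np.clip(np.floor(u), 0, width - 1).astype(int)
    rows = np.clip(np.floor(v), 0, height - 1).astype(int)

    # Empty pixels stay at infinity; occupied pixels keep the nearest return.
    image = np.full((height, width), np.inf)
    np.minimum.at(image, (rows, cols), r)
    return image
```

The result is a dense 2D array on which ordinary 2D convolutions can be applied, which is the property range-based detectors exploit.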

4.4.3. Transformer-Based Approaches

Transformer architectures have been widely incorporated in every category of 3D object detection [65]. For instance, Pointformer has been proposed for point-based 3D object detection. Specifically, it introduces a Local Transformer module to capture interactions within local regions, a Global Transformer to model broader scene-level context and a Local-Global Transformer to bridge the two previous scales, thereby improving the quality of object proposal generation and overall detection accuracy [71]. In the field of occupancy prediction, TPVFormer, which is used for 3D segmentation in autonomous driving, is a representative paradigm. It reduces the computational load by representing the volume with three orthogonal planes (a tri-perspective view that extends the BEV plane), while still maintaining high accuracy in semantic occupancy predictions [72].

In voxel-based 3D detection, SWFormer introduces a Sparse Window Transformer framework that converts 3D points into sparse voxels and groups them into variable-length windows, which are then processed efficiently using a bucketing mechanism. In order to enhance detection from sparse inputs, a novel voxel diffusion technique is applied, allowing the model to generate more accurate object representations [73]. In the case of point-voxel based 3D object detection, CT3D is a two-stage 3D object detection framework that combines a high-quality region proposal network with a Channel-wise Transformer architecture. It focuses on enhancing point feature representation within each proposal through proposal-aware embedding and channel-wise context aggregation to improve detection accuracy with minimal manual design [74].

Apart from LiDAR-based 3D detection, transformer-based approaches have also been adopted in both monocular and multi-view 3D object detection. For instance, MonoDTR is an end-to-end transformer-based framework designed for monocular 3D object detection, featuring the Depth-Aware Feature Enhancement (DFE) module and the Depth-Aware Transformer (DTR) module. MonoDTR incorporates depth information directly into the transformer’s attention mechanism, thus improving detection accuracy [75]. Finally, DETR3D is a transformer-based framework for multi-view 3D object detection which, unlike traditional methods that rely on monocular depth estimation, directly performs predictions in 3D space [76].

5. Multimodal Sensor Fusion in Autonomous Vehicles

The role of multimodal sensor fusion is critical in enhancing the perception capabilities of autonomous vehicles. In this regard, this section begins with an overview of the multimodal fusion approach and the key sensors used in autonomous driving, followed by a discussion on fusion methodologies.

5.1. Overview

In real-world autonomous driving, relying on a single sensor modality for 3D object detection is insufficient due to the inherent limitations of each sensor type. For instance, camera-based systems offer rich visual information, in regard to shape and texture properties, but lack accurate depth information and are highly sensitive to lighting and weather changes. Moreover, the computational cost grows as camera resolution increases [56,77]. On the other hand, LiDAR sensors provide precise spatial measurements and superior 3D geometry, but they suffer from reduced resolution at long distances and produce sparse point clouds, especially when detecting small objects. In addition, although single-sensor approaches are relatively easy to implement and have shown competitive results in benchmark datasets, they typically lack the robustness required for real-world deployment. LiDAR systems, while often achieving higher detection accuracy than cameras, are limited by high deployment costs. As a result, given the complexity and variability of driving environments, single-sensor systems are often inadequate for ensuring reliable performance across diverse real-world conditions [78].

To address this issue, researchers have focused on sensor fusion techniques that integrate data from multiple sensing modalities, such as cameras, LiDAR and radar. By leveraging the complementary strengths of each sensor type, fusion-based approaches significantly extend the perception range, improve detection accuracy, and enhance system robustness. As a result, multi-sensor fusion plays a critical role in enabling the safe and effective operation of fully autonomous driving systems [79].

5.2. Sensors in Autonomous Vehicles

The modern autonomous vehicle perceives the world through a suite of sensors, including cameras, radar and LiDAR, generating a deluge of raw data that must be processed and interpreted in real-time (Figure 8) [80].

Object detection systems typically rely on multiple sensors to compensate for the limitations of individual sensing technologies. These sensors can be broadly categorized into two types: passive sensors, such as monocular and stereo cameras, which capture environmental data without emitting signals and active sensors, such as LiDAR, radar, and ultrasonic devices, which actively emit signals to detect and interpret their surroundings [81]. The following paragraphs focus on LiDAR and radar, which are the two primary non-visual sensing modalities considered in the present multimodal fusion discussion.

LiDAR: LiDAR uses laser pulses to detect objects by measuring the time it takes for the laser beam to reflect back from surfaces. This allows for highly accurate distance measurements and the construction of detailed 3D representations. Unlike cameras, LiDAR is not affected by lighting conditions, making it a reliable sensor for perception tasks in autonomous driving. Despite its advantages, such as high ranging accuracy and insensitivity to lighting changes, LiDAR also presents notable limitations, including high manufacturing costs, limited adaptability in adverse weather conditions and an inability to capture color and texture information. However, when integrated with additional sensors, LiDAR can fully leverage its strengths and improve system adaptability across diverse scenarios, solidifying its role as a critical component in multi-sensor fusion [79].

Radar: Radar systems operate by emitting electromagnetic waves to detect the range, velocity and direction of multiple objects in the environment. Most automotive radar units employ the linear Frequency Modulated Continuous Wave (FMCW) technique to enable simultaneous range and velocity estimation. Compared to LiDAR, radar demonstrates greater resilience to varying lighting and adverse weather conditions. However, radar sensors typically suffer from low elevation resolution, which can hinder their ability to accurately classify objects. Additionally, radar systems may be susceptible to signal interference from nearby radars or communication devices, potentially affecting their reliability [81].
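The simultaneous range and velocity estimation enabled by linear FMCW can be illustrated with the standard textbook relations. The sketch below assumes an idealized sawtooth chirp in which the beat frequency is attributed entirely to range and the Doppler shift is measured separately across chirps; the function and parameter names are illustrative, and real radar pipelines involve considerably more signal processing.

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def beat_to_range(beat_hz, chirp_duration_s, bandwidth_hz):
    """Range from the beat frequency of a linear chirp."""
    # Chirp slope is bandwidth / duration; the echo delay is 2R / c.
    return SPEED_OF_LIGHT * beat_hz * chirp_duration_s / (2.0 * bandwidth_hz)

def doppler_to_velocity(doppler_hz, carrier_hz):
    """Radial velocity from the Doppler shift at a given carrier frequency."""
    return SPEED_OF_LIGHT * doppler_hz / (2.0 * carrier_hz)

def range_resolution(bandwidth_hz):
    """Smallest range separation that the chirp bandwidth can resolve."""
    return SPEED_OF_LIGHT / (2.0 * bandwidth_hz)
```

For example, under these assumptions a 150 MHz sweep resolves targets roughly one meter apart in range.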

5.3. Multimodal Sensor Fusion Design Methodologies

When developing a deep neural network for multimodal perception, three fundamental questions must be considered: “what to fuse”, “how to fuse” and “when to fuse”. The “what to fuse” question is related to the selection of sensing modalities, such as camera, LiDAR or radar and how to appropriately represent and process them. The “how to fuse” question involves determining the most effective fusion strategies or operations to integrate the information. Lastly, the “when to fuse” question addresses the stage within the neural network at which multimodal features should be combined [13].

The first case of “what to fuse” design methodology addresses the form or representation in which multimodal data are introduced into the fusion module. This design choice is highly diverse and reflects the distinct strategies of each fusion architecture. In LiDAR-camera fusion, for example, the fusion input may consist of raw sensor data, intermediate feature representations or even output-level results from the respective image or point cloud processing branches. More specifically, LiDAR data can be represented as raw point clouds, voxel grids or projections onto Bird’s Eye View (BEV). Similarly, camera data may be provided as image feature maps, segmentation masks or pseudo-LiDAR point clouds derived from depth estimation techniques [56].
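One of the representations mentioned above, the pseudo-LiDAR point cloud, can be obtained by back-projecting a depth map through a pinhole camera model. The following is a minimal sketch, assuming known intrinsics (fx, fy, cx, cy) and a depth map in meters where non-positive values mark invalid pixels; the function name is hypothetical and the depth estimator itself is outside the sketch.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to an (M, 3) camera-frame point cloud."""
    h, w = depth.shape

    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Invert the pinhole projection for every pixel.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Drop pixels without a valid depth estimate.
    return points[points[:, 2] > 0]
```

The resulting points can then be voxelized or projected to BEV in the same way as real LiDAR returns.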

The “how to fuse” design methodology focuses on determining the level of granularity in which data from different modalities are integrated. There are usually three fusion strategies: Region of Interest (RoI)-level, voxel-level and point-level fusion, with point-level representing the highest granularity. In general, the choice of fusion granularity plays a critical role in the complexity and effectiveness of the fusion framework. While finer-grained fusion generally demands greater computational resources, it often results in enhanced performance [56].

Finally, the “when to fuse” design methodology refers to the stage at which multimodal data fusion is performed within the processing pipeline and can be classified as early (or data-level), middle (or feature-level) and late (or decision-level) fusion [79,82]. There is also a fourth category identified, named deep (feature) fusion [81,83]. The categories of “when to fuse” design methodology are presented in Figure 9.

Early fusion, also referred to as data-level fusion, typically takes place at the input stage of each sensor branch. In this case, raw data from different modalities are mapped to the same space using data alignment and translation techniques. The fused data thus obtained are richer and more expressive [83]. Early fusion has the advantage of capturing how diverse types of data work together and complement each other at an early stage, for instance by matching the labels and visual patterns from a camera with the 3D shapes produced by LiDAR. However, this heterogeneous flow of raw information is also the core limitation of this fusion strategy, since simply merging these data can produce confusion or redundancy. Additionally, early fusion creates a huge amount of data to process, meaning that much larger models and a massive amount of training data are needed [84]. So, while early fusion enables the rapid and effective establishment of relationships between sensors, it generally entails higher computational demands [83].
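The mapping of raw data into a common space can be illustrated by attaching camera colors to LiDAR points. The sketch below is a simplified example of data-level fusion, assuming the points have already been transformed into the camera frame (that is, the extrinsic calibration has been applied) and that a 3x3 pinhole intrinsic matrix `K` is known; the function name is hypothetical.

```python
import numpy as np

def paint_points(points_cam, image, K):
    """Append RGB values to LiDAR points that project inside the image.

    points_cam: (N, 3) points in the camera frame (z pointing forward).
    image: (H, W, 3) array. K: (3, 3) pinhole intrinsic matrix.
    Returns an (M, 6) array of [x, y, z, r, g, b].
    """
    # Points behind the camera cannot be seen and are discarded.
    pts = points_cam[points_cam[:, 2] > 0]

    # Perspective projection to pixel coordinates.
    uvw = pts @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Keep only projections that land inside the image bounds.
    h, w = image.shape[:2]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    colors = image[v[inside], u[inside]].astype(float)
    return np.hstack([pts[inside], colors])
```

Any calibration or timing error shifts the sampled pixel, so fusion at this level depends heavily on accurate sensor alignment.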

Middle fusion, also known as feature-level fusion, involves combining data from multiple heterogeneous sensors after the initial feature extraction stage. In this approach, features such as edges, corners, shapes and motion patterns are first extracted from raw sensor data and then integrated into a single feature set. In this way, the system can leverage the different perspectives of the same target captured by different sensors, thereby improving object detection and recognition performance [82]. As a result, more detailed feature representations can be achieved, making this the most popular multimodal fusion strategy in the field of autonomous driving [85]. One major downside, however, is the possible omission of subtle details caused by the initial separate feature extraction. For this reason, cross-attention approaches are often used as a solution to this problem [84]. Finally, although feature-level fusion offers high flexibility, identifying the optimal way to fuse middle layers within a specific network architecture remains a significant challenge [13].
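The cross-attention idea mentioned above can be sketched as a single attention head in which features from one modality query features from another. This is a generic scaled dot-product formulation and not the design of any specific cited model; the projection matrices `w_q`, `w_k` and `w_v` stand in for learned parameters and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Let one modality's features attend to another modality's features.

    query_feats: (Nq, D), e.g. LiDAR features.
    context_feats: (Nc, D), e.g. camera features.
    w_q, w_k, w_v: (D, Dh) projection matrices (placeholders for learned weights).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v

    # Similarity between every query element and every context element.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)

    # Each query gathers a weighted mix of context values.
    return weights @ v, weights
```

Each row of `weights` sums to one, so every query element decides dynamically how much to draw from each element of the other modality.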

Deep fusion is performed during the feature extraction stage, where multimodal data are integrated within the feature space to produce a set of fused features. This approach allows the system to compensate for missing features in one modality by leveraging complementary information from others. The resulting fusion features are then used for classification or regression tasks during the prediction phase. In deep fusion, the level of granularity is slightly coarser compared to early fusion, which helps reduce equipment performance demands. However, a common drawback of this method is the issue of dimensionality explosion, in which feature dimensions become excessively large, thus after a particular level, model performance may degrade and the risk of information loss increases [83].

In late fusion, also known as decision-level fusion, each sensor operates independently, performing data collection, feature extraction and environmental perception. The final output is derived by combining or selecting among these independently generated outputs. This method allows each sensor to function autonomously, reducing dependency on specific data types and facilitating the integration of heterogeneous sensor types, while also enhancing the anti-interference capabilities of the system. However, an important limitation of this method is that it does not fully exploit all the raw data provided by each sensor, resulting in significant information loss. Furthermore, independent decision outputs may introduce redundancy or inconsistencies, requiring the development of sophisticated fusion algorithms. This complexity can negatively impact both system efficiency and maintainability [79].

In reality, many autonomous driving systems use a hybrid fusion strategy [84]. An example of such a system is HydraFusion, which adjusts the fusion strategy between early fusion and late fusion, as well as combinations in between, depending on the driving conditions (e.g., weather and lighting conditions, sensor obstruction), increasing robustness and efficiency in a diverse set of environments [86]. Finally, since each method has its pros and cons, deciding the best “when to fuse” strategy depends on the sensors that are used, the available computational power and the specific driving goal [84].

Beyond traditional fusion methods, there are advanced mechanisms that handle the interaction of multiple modalities. These include cross-attention fusion, which allows for dynamic weighting of sensor features [87], and BEV-space unification, which provides a common geometric plane for projection [84]. Furthermore, uncertainty-aware fusion helps the system prioritize the most reliable sensor in adverse conditions [88], while asynchronous fusion techniques address the temporal misalignment between sensors [89]. Finally, Table 5 presents the main trade-offs of each fusion strategy.

5.4. Applications of Multimodal Sensor Fusion

Multimodal sensor fusion plays a vital role in a wide range of applications in the context of autonomous driving. Specifically, object detection and tracking, localization and mapping, and scene segmentation are three major areas in which multimodal sensor fusion has a major impact [82]. This section focuses on object detection and tracking and on scene segmentation.

5.4.1. Object Detection and Tracking

In Object Detection and Tracking, a fusion approach combining millimeter-wave (MMW) radar and camera vision has been proposed for pedestrian tracking. In this method, radar-based tracking was performed using an Unscented Kalman Filter (UKF), while pedestrian detection and localization from camera data were achieved through the YOLOv5 algorithm in conjunction with DeepSORT. The error covariance from each sensing module was applied to perform fusion-based tracking [90].
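To illustrate how error covariances can drive the fusion of two sensing modules, the sketch below combines two position estimates by inverse-covariance weighting. This is a generic textbook formulation that assumes the two estimates have independent errors; it is not claimed to reproduce the exact fusion rule of the cited method, and the function name is hypothetical.

```python
import numpy as np

def fuse_estimates(x_a, P_a, x_b, P_b):
    """Fuse two estimates of the same state using their error covariances."""
    # Information matrices: a small covariance means high confidence.
    info_a = np.linalg.inv(P_a)
    info_b = np.linalg.inv(P_b)

    # The fused covariance is never larger than either input covariance.
    P = np.linalg.inv(info_a + info_b)

    # Each estimate is weighted by its own information matrix.
    x = P @ (info_a @ x_a + info_b @ x_b)
    return x, P
```

With a radar estimate that is precise in range but poor laterally and a camera estimate with the opposite profile, the fused position follows the radar along the range axis and the camera along the lateral axis.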

Furthermore, a radar-camera fusion approach for multi-target detection and tracking in intelligent transportation systems has been presented. Unlike traditional techniques that rely predominantly on one sensor, this method uses mutual referencing between MMW radar and camera data to deduce object positions. Specifically, this approach employs a position inference algorithm combined with an improved Extended Kalman Filter (EKF) to fuse data from radar and camera sensors for the detection and tracking of pedestrians and vehicles. Finally, the visual sensor, with its superior azimuth accuracy and detection rate, enhances lateral positioning and reduces missed detections [91].

Moreover, another state-of-the-art method introduces a novel deep learning approach for multi-object tracking (MOT) that fuses data from MMW radar and camera sensors to enhance the accuracy and robustness of perception in autonomous driving. The method utilizes a bidirectional long short-term memory (Bi-LSTM) network to capture long-term temporal dependencies and improve motion prediction. Additionally, an appearance feature model inspired by FaceNet is employed to maintain consistent object tracking across frames by effectively associating objects [92].

In the same manner, CAMO-MOT is a 3D MOT framework that fuses camera and LiDAR data to improve tracking performance in autonomous driving. It addresses two key challenges: occlusion and false detection. In addition, CAMO-MOT introduces a multi-category cost mechanism to implement MOT in multi-category scenes [93]. DeepFusion is another method that enhances 3D object detection in autonomous driving by effectively integrating LiDAR and camera data. Unlike traditional approaches that attach image features to raw LiDAR points, DeepFusion fuses image features with deep LiDAR features for improved performance. However, this method faces the challenge of aligning features from different modalities. To address this issue, two key techniques are used: InverseAug, which reverses geometric augmentations, and LearnableAlign, which employs cross-attention to dynamically match LiDAR and image features. These innovations enable more precise sensor fusion, resulting in a family of multimodal detection models that outperform earlier methods in accuracy and robustness [94].

5.4.2. Scene Segmentation

Scene segmentation based on multi-sensor fusion aims to accurately classify various elements of the environment by integrating data from multiple sensors. This approach enables precise segmentation and categorization of different regions, thereby enhancing the environmental awareness of autonomous vehicles and improving object detection and localization capabilities. In general, scenes can be categorized into structured environments, such as urban roads and parking lots, and unstructured environments, including rural areas or wilderness. Unstructured scenes are more complex due to the absence of clear lane markings and the influence of irregular lighting conditions, while on the other hand, structured scenes are easier to manage. Most existing large-scale open-source datasets primarily focus on structured environments, with only limited representation of unstructured scenes [82].

In this regard, an entropy-based adaptive multimodal fusion method aimed at addressing the challenges of night-time lane segmentation has been introduced. The proposed approach leverages attention mechanisms to model spatial relationships between modalities and illumination distribution, enabling adaptive fusion. As a result, the proposed lane feature enhancement module is used to strengthen both global and local lane features, thereby improving the network’s capability to detect lanes under low-light conditions [95].

In addition, taking into account that existing methods are often not directly applicable to scenes with varying structural characteristics, recent research has increasingly focused on segmentation in unstructured environments to support more robust autonomous driving systems [82]. In this context, the M2F2-Net has been introduced, which is an effective multimodal network designed for free space detection in unstructured off-road environments. The network incorporates a Multimodal Cross Fusion (MCF) module that integrates features from RGB images and surface normal maps, the latter being derived from LiDAR point clouds. This fusion strategy enhances the network’s ability to perceive and understand complex terrain structures commonly found in unstructured scenes [96].
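As a rough illustration of how a surface normal can be derived from LiDAR points, the following sketch computes the unit normal of the local surface spanned by a point and two of its neighbours. The neighbour selection and the function name `surface_normal` are assumptions made for this example and are not taken from M2F2-Net.

```python
import math

def surface_normal(p, p_right, p_down):
    """Unit normal of the plane through a point and two neighbouring points.

    The three points must not be collinear, otherwise the length is zero.
    """
    u = [b - a for a, b in zip(p, p_right)]
    v = [b - a for a, b in zip(p, p_down)]
    # Cross product u x v is perpendicular to both edge vectors.
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    length = math.sqrt(sum(c * c for c in n))
    return [c / length for c in n]

# Flat ground: the normal points straight up.
flat = surface_normal((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
# A 45-degree slope rising along x: the normal tilts away from vertical.
slope = surface_normal((0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 0.0))
```

Computing such a normal for every point yields a normal map in which flat, traversable terrain and steep obstacles are separated by the direction of the normal rather than by colour or texture.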

Furthermore, a fusion approach that combines RGB image and LiDAR data for improved drivable area classification has been proposed. Initially, a Grouped Attention Network (GA-Nav) is employed to distinguish between drivable and obstacle regions in RGB images, while at the same time the Patchwork++ algorithm segments the LiDAR point cloud into ground and non-ground areas. Lastly, a late fusion strategy is introduced to fuse the outputs of both modalities. This fusion process enhances classification accuracy by compensating for the limitations of the camera-based system. For instance, vegetation such as bushes that is mistakenly identified as drivable in the image is reclassified as an obstacle using LiDAR data [97].
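A late fusion step of this kind can be sketched as follows. The sketch assumes that both outputs have already been rasterized onto the same grid and that a cell is kept as drivable only when both modalities agree; this simple agreement rule and the name `late_fusion` are illustrative assumptions, not the exact rule used in [97].

```python
def late_fusion(camera_drivable, lidar_ground):
    """Keep a cell drivable only if the camera and the LiDAR both agree."""
    return [
        [cam and ground for cam, ground in zip(cam_row, ground_row)]
        for cam_row, ground_row in zip(camera_drivable, lidar_ground)
    ]

# Illustrative 2 x 3 grids. The camera labels the bush in the top-middle
# cell as drivable, but the LiDAR marks it as non-ground.
camera_drivable = [[True, True, False],
                   [True, True, True]]
lidar_ground = [[True, False, False],
                [True, True, True]]

fused = late_fusion(camera_drivable, lidar_ground)
```

In the fused result the bush cell is reclassified as an obstacle, while the cells on which both sensors agree remain unchanged.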

Last but not least, a multi-sensor fusion network that incorporates surface normals (SN) to enhance segmentation in unstructured scenes has been introduced. The proposed approach effectively integrates 3D geometric data from LiDAR with high-resolution color and texture features from RGB images. Additionally, by fusing point cloud representations, derived from point and range views, with multi-scale reweighted RGB images, the network improves scene understanding. Finally, surface normal features extracted from the LiDAR data are used to reweight the RGB inputs, thereby mitigating the influence of unreliable or inaccurate visual information, particularly under low-light environments [98].

Finally, within the category of Scene Segmentation, recent models like OccFusion demonstrate the transition from traditional semantic masks to 3D Semantic Occupancy. OccFusion greatly increases the accuracy and robustness of occupancy prediction, leading to high performance on the nuScenes benchmark dataset [99].

5.4.3. Transformer-Based Multimodal Fusion 3D Object Detection

Due to the ability of the transformers to model long-range dependencies and adapt to various datasets and tasks, vision transformers have gained significant attention in multimodal fusion for 3D object detection. Several approaches integrate CNNs for feature extraction, while leveraging transformers in the intermediate fusion process and detection heads [81]. For instance, TokenFusion is a multimodal fusion method designed for transformer-based vision tasks. This approach enables the transformer to effectively learn cross-modal correlations while preserving the single-modal transformer architecture, achieving state-of-the-art performance in segmentation, translation and 3D object detection tasks [100].

In addition, TransFusion is a transformer-based LiDAR-camera fusion method designed to enhance 3D object detection, particularly under poor image conditions. It employs convolutional backbones and a two-layer transformer decoder. TransFusion achieves state-of-the-art results on large-scale datasets, demonstrating both robustness and accuracy [101]. Finally, UVTR (Unifying Voxel-based Representation with Transformer) introduces a unified framework for multi-modality 3D object detection that integrates various sensor inputs in a voxel feature space. Unlike previous methods, it preserves the voxel space without height compression, reducing semantic ambiguity and maintaining spatial relationships. This approach leverages cross-modality interaction, which includes knowledge transfer and modality fusion to enhance performance. By integrating geometry-aware point cloud data and context-rich image features, UVTR improves performance and robustness [102].

Alternatively, some methods adopt a fully end-to-end design using transformer-based architectures for multimodal and multi-task fusion [81]. For example, BEVFusion is an end-to-end efficient multi-sensor fusion framework that unifies features from different modalities into a shared BEV representation. This space retains both geometric and semantic information. By optimizing BEV pooling, the method significantly reduces processing latency and supports various 3D perception tasks without requiring major architectural changes [103]. MotionTrack is an end-to-end transformer-based algorithm designed for multi-object tracking (MOT) using multimodal sensor inputs in autonomous driving. It integrates a transformer-based data association module and a transformer-based query enhancement module to perform MOT and multiple object detection (MOD) simultaneously, establishing a robust transformer baseline for tracking objects of various classes [104].
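The basic idea of pooling features into a shared BEV grid can be sketched as a scatter-sum over grid cells. This is only a conceptual sketch under simplifying assumptions (scalar features, a small hand-written grid, no GPU acceleration); the function name `bev_pool` is hypothetical and the code does not reproduce the optimized BEV pooling of BEVFusion.

```python
def bev_pool(points, features, x_range, y_range, cell_size):
    """Sum scalar features of 3D points into a 2D bird's-eye-view grid."""
    nx = int((x_range[1] - x_range[0]) / cell_size)
    ny = int((y_range[1] - y_range[0]) / cell_size)
    grid = [[0.0] * nx for _ in range(ny)]
    for (x, y, _z), feature in zip(points, features):
        # The height coordinate is ignored: all points above a cell share it.
        ix = int((x - x_range[0]) // cell_size)
        iy = int((y - y_range[0]) // cell_size)
        if 0 <= ix < nx and 0 <= iy < ny:
            grid[iy][ix] += feature
    return grid

# Illustrative input: the first two points fall into the same cell,
# and the last point lies outside the grid and is discarded.
points = [(0.5, 0.5, 1.0), (0.7, 0.2, 0.3), (3.9, 1.5, 0.0), (10.0, 0.0, 0.0)]
features = [1.0, 2.0, 5.0, 9.0]
bev = bev_pool(points, features, x_range=(0.0, 4.0), y_range=(0.0, 2.0), cell_size=1.0)
```

Because features from any modality can be scattered into the same grid once their 3D positions are known, the BEV plane serves as a common space in which camera and LiDAR features can be concatenated or fused.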

Finally, Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection that effectively fuses point clouds and image data. By projecting points to image patches, BrT enhances cross-modal interaction and achieves superior performance compared to existing state-of-the-art methods [105].

The field is now clearly shifting towards more specialized directions, such as BEV learning and occupancy prediction, as shown by the paradigms analyzed previously, as well as cross-modal attention [106], temporal fusion [107], and online calibration [108].

6. Challenges and Limitations in Perception for Autonomous Driving

Despite significant advancements in perception technologies for autonomous vehicles, several challenges and limitations continue to hinder the large-scale deployment of object detection systems. Consequently, the following sections outline both perception-specific and fusion related challenges currently limiting progress in the field. Finally, open problems and benchmarking gaps are analyzed.

6.1. Perception-Specific Challenges

Dynamic environments: Autonomous vehicles operate in dynamic and unpredictable environments, often surrounded by moving vehicles, pedestrians, cyclists, and other obstacles. Accurately detecting and predicting the movement of such entities in real time is essential, yet it remains a primary failure mode, especially in complex urban or highway settings where interactions can change rapidly. Moreover, occlusions, which are common in such environments, lead to missed detections or incorrect orientation estimates. Therefore, these dynamic conditions pose a significant challenge for real-time object detection and decision-making systems [67].

Limited computational resources: While deep learning models have greatly enhanced the accuracy of 3D object detection, their high demand on computational resources limits their applicability on real-time systems [67]. Additionally, training these models requires large, annotated datasets, which can be resource-intensive and time-consuming to acquire [15]. Furthermore, in real-world scenarios, inference speed is a critical factor in the performance of object detection systems. However, many models prioritize accuracy, often at the expense of efficiency, which leads to increased inference overhead. For instance, even minor delays as short as 400 ms can lead to safety-critical errors. Given that computational resources are often limited, achieving the same inference speed as in controlled environments is challenging. Consequently, inference speed and computational efficiency are closely interdependent, each significantly influencing the other [109].
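To put the 400 ms figure into perspective, the distance a vehicle covers while the perception system is still processing can be computed directly. The speeds below are illustrative assumptions and the function name `distance_during_delay` is hypothetical.

```python
def distance_during_delay(speed_kmh, delay_s):
    """Distance in metres travelled at a constant speed during a delay."""
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return speed_ms * delay_s

# Illustrative speeds for an urban road and a highway.
urban_m = distance_during_delay(speed_kmh=50, delay_s=0.4)
highway_m = distance_during_delay(speed_kmh=120, delay_s=0.4)
```

Under these assumptions the vehicle travels roughly 5.6 m at 50 km/h and 13.3 m at 120 km/h before a delayed detection becomes available, which is more than the length of a typical passenger car in both cases.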

Adverse weather and varying sensing conditions: Autonomous systems rely on sensors such as LiDAR and cameras to perceive their surroundings. However, adverse weather conditions such as heavy rain, fog, and snow can degrade sensor performance and reduce detection accuracy [110]. For instance, dense fog or heavy rain can block LiDAR detection almost entirely or create “fake obstacles” due to laser backscattering, while cameras are highly vulnerable to blockage and distortion from even a single water drop on the lens. In the case of snow, flakes that are whirled up from the ground or falling through the air cause voids in point clouds and shorten viewing distances, while snow accumulation can completely obscure road markings and lane lines. In the same manner, extreme temperatures can cause a time delay in LiDAR measurements of up to 6.8 ns, which corresponds to a ranging error of more than 1 m and lowers precision in the near field.
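The link between a timing delay and a ranging error follows from the time-of-flight principle: the laser pulse travels to the target and back, so an additional delay is converted into half that travel distance. A minimal sketch of this arithmetic is given below; the function name `range_error_from_delay` is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def range_error_from_delay(delay_s):
    """Ranging error in metres caused by a time-of-flight delay in seconds."""
    # The pulse covers the sensor-target distance twice, hence the factor 1/2.
    return SPEED_OF_LIGHT_M_PER_S * delay_s / 2

error_m = range_error_from_delay(6.8e-9)
```

For a delay of 6.8 ns this gives an error of approximately 1.02 m, which is consistent with the magnitude of more than 1 m quoted above.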

Additionally, direct strong light or glare from skyscrapers and other vehicles can blind conventional visible-light cameras, in the worst case reducing visibility to zero, and can also cause LiDAR signal loss, which appears as large black areas around the light source in the point cloud. Finally, road dirt on sensor emitter windows and smog can reduce LiDAR’s maximum range by as much as 75% [111]. Such degradation may result in low performance of point-based and camera-based models, leading to incorrect decisions and compromising driving safety. Consequently, addressing these limitations requires advances in sensor hardware and more robust vision algorithms that maintain performance across diverse weather conditions [110].

Insufficient training sets and unbalanced datasets: Deep learning models are often sensitive to changes in the data domain. For instance, a model trained in one country may not generalize well in another due to differences in road infrastructure, traffic signs or vehicle behavior. Domain adaptation is therefore critical for ensuring robustness across different environments. Furthermore, many datasets used in training are imbalanced, with some object classes significantly underrepresented. This bias can cause models to favor well-represented classes while neglecting rarer ones. Addressing this issue involves augmenting datasets with underrepresented classes, introducing loss functions that account for class imbalance, or generating synthetic data from simulators and combining it with real data to balance training distributions [112].
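One common way in which a loss function can account for class imbalance is to weight each class inversely to its frequency. The sketch below illustrates this with assumed class counts; it is a generic example and not the specific scheme proposed in [112], and the name `inverse_frequency_weights` is hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to the inverse of the class frequency.

    The weights are scaled so that the weighted number of samples equals
    the original number of samples.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Illustrative, heavily imbalanced label distribution.
labels = ["car"] * 90 + ["pedestrian"] * 8 + ["cyclist"] * 2
weights = inverse_frequency_weights(labels)
```

With these counts, errors on the rare cyclist class are weighted far more strongly than errors on the car class, which counteracts the tendency of the model to favor well-represented classes.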

6.2. Fusion-Related Challenges

Dataset Quality: A crucial challenge in multimodal detection is the limited availability of high-quality datasets. Common issues of existing datasets include small dataset sizes, imbalanced class distributions and labeling errors. In order to increase data volume and diversity, some datasets incorporate synthetic data. However, this introduces a domain gap between synthetic and real-world data, which can affect model performance. While techniques such as generative adversarial networks (GANs) have been proposed to address this gap, further research is needed. Additionally, imbalanced data may cause the model to rely solely on a particular modality, reducing the overall effectiveness of multimodal fusion. Therefore, ensuring balanced category representation is crucial [83]. Furthermore, there is a distinct lack of data regarding occluded, truncated, or small objects at a distance, which prevents models from accurately reflecting complex real-world driving scenarios. Even large-scale datasets like nuScenes, with 1000 scenes, are insufficient to capture the full variability of the real world, leading to a generalization gap when models are deployed in unobserved environments [78].

Data Noise: Effectively fusing multimodal information remains one of the main challenges in multimodal learning. The presence of multiple sensors introduces an information gap between the data provided by different modalities, leading to a lack of synchronization between the information streams. This issue introduces noise during feature fusion, which can negatively impact the quality of learned representations. For example, the use of two-stage detectors can lead to the incorporation of background features from the image during fusion, often due to the misalignment or differing dimensions of the Regions of Interest (ROI) relative to the 3D data. Recent studies have explored the use of BEV representations to align heterogeneous sensor data, offering a promising approach to mitigate these issues and providing a valuable direction for future research [113].

Unstructured data formats: Sensor data often lacks a consistent, organized structure, making it harder to process, interpret, or combine across different sources. Consequently, processing high volumes of real-time data from multiple sensors poses significant challenges. For instance, unlike image data, LiDAR data is unstructured, unordered and sparse, making it more difficult to process efficiently. As a result, it is crucial for the safety and effectiveness of autonomous vehicles that their computing systems can rapidly process and respond to this data in order to make timely driving decisions [15,67]. Additionally, the limited field of view covered by many open-source datasets hinders performance, suggesting that sensors with 360-degree coverage, such as those in the Waymo and nuScenes datasets, are essential for improving coverage in complex, unstructured environments [113].

Temporal synchronization: Temporal synchronization is a critical challenge in multimodal 3D object detection. Differences in the sensor sampling rates, operating modes and acquisition speeds can cause time discrepancies, leading to misaligned sensor data. This misalignment negatively impacts the accuracy and efficiency of detection. The issue arises either from inaccuracies in sensor timestamps or from delays or frame loss in the sensor data. Even when hardware is used for temporal synchronization, fully guaranteeing the consistency of timestamps is difficult and often requires expensive equipment. Consequently, software-based synchronization methods are frequently employed, such as timestamp interpolation, Kalman filter-based algorithms, or deep learning-based approaches. In addition, cache mechanisms can be used to manage delayed or missing data, while interpolation or extrapolation techniques can help fill in the missing parts of the data. In general, temporal synchronization in multimodal 3D object detection is a complex issue that requires a combination of strategies to resolve effectively [113].
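Timestamp interpolation, the simplest of the software-based methods mentioned above, can be sketched as follows. The example assumes a scalar measurement sampled at LiDAR timestamps that must be estimated at a camera timestamp lying between two sweeps; all values and the name `interpolate_at` are illustrative.

```python
from bisect import bisect_left

def interpolate_at(timestamps, values, t_query):
    """Linearly interpolate a sampled signal at an intermediate timestamp.

    timestamps must be sorted in increasing order.
    """
    if not timestamps[0] <= t_query <= timestamps[-1]:
        raise ValueError("query time lies outside the sampled interval")
    i = bisect_left(timestamps, t_query)
    if timestamps[i] == t_query:
        return values[i]
    t0, t1 = timestamps[i - 1], timestamps[i]
    weight = (t_query - t0) / (t1 - t0)
    return values[i - 1] + weight * (values[i] - values[i - 1])

# Illustrative data: LiDAR sweeps every 0.1 s, camera frame captured at 0.15 s.
lidar_times = [0.0, 0.1, 0.2]
lidar_ranges = [10.0, 12.0, 14.0]
range_at_camera_time = interpolate_at(lidar_times, lidar_ranges, 0.15)
```

Queries outside the sampled interval are rejected here rather than extrapolated, since extrapolation requires a motion model and is better handled by filter-based approaches.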

High Computation Complexity: Another key challenge in multimodal 3D object detection for autonomous driving is achieving fast and real-time performance. Processing data from multiple sensors increases the number of model parameters and computational requirements, resulting in longer training and inference times. This complexity often prevents such systems from meeting real-time demands. To address this, recent state-of-the-art methods like MVP and BEVFusion have begun incorporating Frames Per Second (FPS) as a primary model evaluation metric during experiments on datasets such as nuScenes. From an engineering perspective, these high computational loads often exceed the memory and power limits of embedded edge hardware. To reduce computational load, future research should investigate model pruning and quantization techniques. These methods aim to simplify model architectures and reduce parameter sizes, making real-time deployment more feasible [113].
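As an illustration of how quantization reduces parameter size, the following sketch applies symmetric 8-bit quantization to a small list of weights. It is a simplified, assumed scheme with a single scale factor per tensor; production toolchains typically use more elaborate calibration, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floating-point weights to integers in [-127, 127] with one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floating-point weights from the integer codes."""
    return [q * scale for q in quantized]

# Illustrative weights of a single layer.
weights = [0.8, -0.31, 0.05, -1.27, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error that is bounded by half of the scale factor.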

Data Alignment: Multi-sensor 3D object detection requires accurate alignment and fusion of data from different modalities. A primary challenge arises from the varying positions and viewing angles of sensors. For example, LiDAR is typically mounted on the roof of a vehicle, while cameras are installed at the front. Even when placed close together, their inherent imaging perspectives differ. The common solution involves using a projection matrix to align LiDAR points with image pixels. However, this approach requires precise calibration based on sensor parameters, positions and viewing angles, which is a process that is both time-consuming and prone to error. Additionally, real-world driving conditions, such as uneven roads and vehicle vibrations, can cause slight shifts in sensor positioning, making previously calibrated matrices invalid. One promising alternative is to integrate LiDAR and cameras into a single sensor unit, minimizing alignment errors. Beyond physical misalignment, the fundamental imaging principles of these sensors differ, since cameras capture 2D images using optical principles, while LiDAR directly senses the 3D structure of the environment. These differences, along with mismatched data dimensions, make sensor fusion complex and present ongoing challenges in aligning multimodal data accurately [78].
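The projection of a LiDAR point onto the image plane can be sketched with a pinhole camera model. The extrinsic rotation and translation, the intrinsic parameters and the axis conventions below are assumptions chosen for illustration (LiDAR: x forward, y left, z up; camera: x right, y down, z forward); in practice they come from sensor calibration.

```python
def project_to_image(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Project a 3D LiDAR point to pixel coordinates, or None if behind the camera."""
    # Extrinsics: transform the point from the LiDAR frame to the camera frame.
    x, y, z = (
        sum(r * p for r, p in zip(row, point_lidar)) + t
        for row, t in zip(rotation, translation)
    )
    if z <= 0:
        return None
    # Intrinsics: perspective division followed by scaling and offset.
    return (fx * x / z + cx, fy * y / z + cy)

# Assumed calibration: axis permutation only, no offset between the sensors.
ROTATION = [[0.0, -1.0, 0.0],
            [0.0, 0.0, -1.0],
            [1.0, 0.0, 0.0]]
TRANSLATION = [0.0, 0.0, 0.0]
INTRINSICS = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)

ahead = project_to_image((10.0, 0.0, 0.0), ROTATION, TRANSLATION, **INTRINSICS)
left = project_to_image((10.0, 1.0, 0.0), ROTATION, TRANSLATION, **INTRINSICS)
behind = project_to_image((-5.0, 0.0, 0.0), ROTATION, TRANSLATION, **INTRINSICS)
```

Under these assumptions a point straight ahead lands on the principal point, a point 1 m to the left at a distance of 10 m shifts by 100 pixels, and a point behind the vehicle has no valid projection. Any error in the rotation or translation shifts every projected point, which is why small changes in sensor mounting invalidate a previously calibrated matrix.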

Information Loss: Preserving multimodal information during fusion is a key challenge in 3D object detection. When data from different sensors are combined, some valuable information may be lost. For instance, during the fusion process, important semantic details from images can be lost when mapped to point cloud features. This issue limits the effective use of image data and can reduce overall model performance [113]. Specifically, a major engineering constraint arises from the dimensionality explosion during integration. For example, mapping 3D point cloud data to 2D Bird’s-Eye View (BEV) for fusion with 2D data increases feature dimensions tremendously, requiring complex reduction steps that often lead to further geometric information loss [83]. To address this problem, state-of-the-art models in multimodal learning offer promising strategies for improving sensor fusion. Therefore, exploring new fusion techniques and neural network architectures that retain as much information as possible from each modality is essential for enhancing detection accuracy and robustness [113].

6.3. Open Problems and Benchmarking Gaps

Safety, reliability and interpretability: Safety and reliability are paramount in autonomous driving systems. Even minor errors in pose or scene estimation can lead to catastrophic failures. Another major concern with deep learning models is their lack of transparency, often referred to as the “black-box” problem, which exacerbates safety concerns in mission-critical tasks. Uncertainty estimation has emerged as a valuable tool for addressing this issue, offering a belief metric that reflects how much the predictions of the model can be trusted. By identifying predictions with high uncertainty, the system can selectively mitigate or avoid them, thus enhancing overall robustness and operational safety [6,114].
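A simple form of uncertainty estimation is the entropy of the predicted class probabilities. The sketch below flags a prediction as untrustworthy when its entropy exceeds a fraction of the maximum possible entropy; the threshold of 0.5 and the function names are assumptions made for this example, and other estimators (for instance ensembles or Monte Carlo dropout) are equally common.

```python
import math

def predictive_entropy(probabilities):
    """Shannon entropy (in nats) of a categorical prediction."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def is_trustworthy(probabilities, max_entropy_fraction=0.5):
    """Accept a prediction only if its entropy is low relative to the maximum."""
    max_entropy = math.log(len(probabilities))
    return predictive_entropy(probabilities) <= max_entropy_fraction * max_entropy

# Illustrative softmax outputs over three classes.
confident = [0.97, 0.02, 0.01]
ambiguous = [0.40, 0.35, 0.25]

keep_confident = is_trustworthy(confident)
keep_ambiguous = is_trustworthy(ambiguous)
```

The confident prediction is accepted, whereas the ambiguous one is flagged so that a downstream module can fall back to a more conservative behavior.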

Data sampling: Another persistent issue lies in data sampling. Great care must be taken when sampling data in order to address the long-tailed distribution of rare traffic scenarios effectively. Both naive training on the entire dataset and overemphasis of rare instances can lead to decreased performance. Instead, it is crucial to sample the data in a manner that enables generalization even for rare data points [115,116].
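One possible compromise between the two extremes is tempered sampling, in which scenario counts are raised to a power between 0 and 1 before normalization. This is a generic technique shown under assumed scenario counts, not the method of [115,116]; the name `sampling_probabilities` and the exponent value are hypothetical choices.

```python
from collections import Counter

def sampling_probabilities(scenario_labels, alpha=0.5):
    """Sampling probability per scenario type with tempered frequencies.

    alpha = 1 reproduces the natural frequencies (naive training), while
    alpha = 0 samples all scenario types equally often (strong overemphasis
    of rare scenarios).
    """
    counts = Counter(scenario_labels)
    tempered = {name: count ** alpha for name, count in counts.items()}
    total = sum(tempered.values())
    return {name: value / total for name, value in tempered.items()}

# Illustrative long-tailed scenario distribution.
scenarios = ["cruising"] * 900 + ["cut-in"] * 90 + ["animal crossing"] * 10
probabilities = sampling_probabilities(scenarios, alpha=0.5)
```

With an exponent of 0.5, the rare scenario is sampled more often than its natural frequency of 1% but less often than under uniform sampling, so that common scenarios still dominate training.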

Evaluation: The growing availability of real-world datasets and advanced simulation platforms has accelerated progress in using supervised deep learning for autonomous driving. In particular, models utilizing non-standard cameras or intermediate visual representations have shown promise compared to traditional approaches relying solely on video frames. However, a major obstacle in evaluating model performance lies in the lack of consistency in benchmarks, datasets, and evaluation metrics. This diversity prevents fair comparisons between models and limits the ability to draw conclusive insights about their performance. As a result, establishing standard benchmarks across multiple datasets and simulators would be a valuable step toward addressing this gap [117].

Scalability: Although deep learning-based systems have shown promising results on benchmark datasets, they often present limited generalizability. For instance, most models are trained and evaluated in structured environments such as urban streets or highways. However, their performance in less structured areas, such as rural or forested regions, remains largely unexplored. This lack of scalability across different domains and environments restricts their broader deployment in real-world driving conditions [114].

7. Discussion and Future Directions

While substantial progress has been made across various components of the autonomous driving pipeline, ongoing research remains essential to address persistent challenges and further refine system capabilities. This section presents an in-depth comparative analysis and outlines key areas where continuous innovation is needed to ensure safer, more reliable, and more intelligent autonomous systems in the years to come.

7.1. Comparative Analysis

In this section, a dual-level evaluation is performed to highlight the technological shift in the field. Initially, a quantitative meta-analysis is conducted in order to identify trends in the literature. Subsequently, a critical technical synthesis is presented that analyzes the trade-offs and performance evolution of the presented architectures.

7.1.1. Quantitative Meta-Analysis

As shown in Figure 10, a significant technological transition has occurred between two distinct periods, the former spanning from 2020 to 2022 and the latter from 2023 to 2025. Our analysis of primary sources (studies and state-of-the-art publications) showed that the adoption of multi-sensor fusion rose from 28% to 68%, while Transformer-based architectures maintained a strong presence, increasing from 55% to 69%.

Following a similar trajectory, the analysis of secondary sources (consisting of reviews) showed that the inclusion of transformers rose sharply from 13% to 63%, while the appearance of multimodal sensor fusion reached 90% in the second period. These data suggest that attention-driven multimodal perception has become established as an academic and research standard.

7.1.2. Critical Technical Synthesis and Trade-Off Analysis

The landscape of autonomous driving is defined by a variety of deep learning paradigms, each tailored to address efficiently the task of perception. This section presents a comparative analysis among the primary perception approaches in autonomous driving presented throughout this study.

2D perception vs. 3D perception: In 2D perception the primary sensors are cameras, used to identify and locate objects within a 2D image plane. Object detection architectures use 2D bounding boxes, while semantic and instance segmentation approaches recognize distinct categories and instances of objects at pixel level. Although 2D perception models are highly advanced and rely on cost-effective and widely available camera sensors, the lack of direct depth estimation that is inherent to these approaches is a huge drawback which is reflected in the lower mAP values [15,57,58].

In contrast, 3D perception models have the ability to identify and locate objects in three-dimensional space, since they provide depth information. These approaches use a plethora of sensors, including LiDAR, Radar, and depth-inferring camera systems, which, by using 3D bounding boxes, provide significantly higher accuracy and reliability for scene understanding across various environments and conditions. However, these approaches do not come without limitations since they use costly sensors in most cases, which produce data formats that lack a consistent and organized structure and are computationally expensive [15,65,78]. The above comparative analysis is presented in Table 6.

Classic vs. Transformer-based approaches: Classical approaches rely primarily on CNN-based feature extraction and, despite their architectural diversity, represent the more mature family from a deployment perspective. Their main strengths lie in their extensive optimization history, availability of lightweight backbones, and suitability for real-time inference on embedded automotive hardware. For this reason, CNN-based models remain highly relevant in production-oriented perception pipelines, particularly when low latency, computational efficiency, and implementation stability are primary requirements [57,58].

Transformer-based approaches, by contrast, have introduced a major advance in autonomous driving perception by modeling long-range dependencies and global context more effectively, often achieving stronger performance in complex scene understanding and multimodal fusion tasks. However, these methods should be viewed as comparatively less deployment-mature rather than simply superior replacements for CNNs. In many cases, they require larger training datasets, greater computational resources, and more extensive validation under heterogeneous real-world conditions before broad deployment can be considered reliable. Therefore, in the current autonomous driving landscape, CNN-based methods can be regarded as more operationally mature, whereas Transformer-based methods represent a rapidly advancing and highly promising direction, especially for next-generation perception and fusion systems [31,34]. The aforementioned comparative analysis is presented in Table 7.

Single-sensor vs. Multimodal approaches: Single-sensor approaches can achieve competitive results on benchmark datasets, present reduced system complexity, and are relatively easy to implement. However, each sensor type has specific inherent limitations, meaning that relying on only one sensor modality leads to limited reliability and robustness, and thus to insufficient object detection. For instance, camera-based methods lack accurate depth information, while LiDAR sensors suffer from reduced resolution at long distances and produce sparse point clouds. Their overall performance is further reduced by specific failure points, such as weather and lighting conditions, to which particular sensor types are vulnerable. As a result, single-sensor approaches offer limited environmental understanding, which, combined with the aforementioned vulnerabilities, renders them insufficient for real-world driving.

In contrast, by integrating data from multiple sensors, autonomous vehicles can achieve a more comprehensive and accurate understanding of their environment. Quantitative analysis confirms that multimodal fusion effectively addresses the inherent weaknesses of individual sensors, showing an increase of approximately 20% in mAP over single-sensor methods. As a result, multimodal sensor fusion offers several critical advantages over single-sensor approaches, as presented below:

Sensor Complementarity: Cameras excel at capturing high-resolution visual information and are particularly effective for recognizing features such as lane markings, traffic signs, and object shapes. However, their performance can degrade under adverse weather conditions, including heavy rain, fog or low-light environments. In contrast, radar sensors are highly reliable in such conditions and are particularly effective at estimating the speed and distance of surrounding objects. As a result, by fusing data from two or more modalities, sensor fusion systems can offset the limitations of individual sensors, thereby enhancing perception accuracy and system reliability [120].

Redundancy and Reliability: The inclusion of redundant sensors significantly enhances the reliability and safety of autonomous driving systems. Specifically, in situations where one sensor fails or delivers inaccurate data, other sensors can serve as a means of validation or correction. This redundancy minimizes the likelihood of perception errors and improves the ability of the system to maintain safe operation [120].

Object Detection and Tracking: The integration of diverse sensor data facilitates more precise object detection and tracking within the driving environment. For example, while a camera may identify the presence of a pedestrian, radar can simultaneously measure the pedestrian’s speed and distance. By fusing these complementary data sources, the autonomous system can more accurately anticipate the pedestrian’s movement, thereby enabling safer and more informed decision-making [120].

Filling Sensor Blind Spots: Cameras often exhibit blind spots, particularly around the perimeter of the vehicle, which can limit their ability to provide complete situational awareness. Other sensors such as radar can effectively compensate for these limitations by detecting objects outside the camera’s field of view, such as vehicles in adjacent lanes or approaching from behind [120].

However, the main weakness of multimodal approaches is the complexity of the fusion algorithms, which leads to the need for more sophisticated designs. As a result, this complexity can affect the overall efficiency and maintainability of the system [78,120]. A summary of the aforementioned comparative analysis is presented in Table 8.

7.2. Future Research Directions

The field of autonomous driving is advancing rapidly, driven by continuous innovations in deep learning and computer vision. This section outlines key areas where continued innovation is needed to ensure safer, more reliable, and more intelligent autonomous systems in the years to come. To this end, we have organized the identified future directions into a strategic roadmap that, in our estimation, reflects the logical progression of autonomous perception, starting from the immediate need for operational stability and reliability, moving toward system scalability and finally aiming for long-term evolution.

7.2.1. Operational Stability and Reliability

Robustness to Sensor, Scene and Weather Variations: Autonomous driving systems must operate reliably across diverse environments, which often differ significantly in visual characteristics, traffic dynamics and sensor inputs. For instance, urban and rural settings present distinct challenges, while adverse weather conditions such as rain, fog or snow can further degrade perception performance, particularly in 3D object detection. Enhancing robustness to such variations is critical for dependable autonomous operation. Therefore, future research should focus on developing adaptive perception models capable of maintaining high performance under varying environmental and operational conditions [15].

Improvement of Computational Efficiency and Lightweight Models: With the growing need for real-time decision-making in autonomous vehicles, enhancing the computational efficiency of 3D object detection systems has become a critical research focus. Future work should prioritize the development of lightweight neural network architectures specifically designed for deployment on embedded systems, aiming to preserve high accuracy while minimizing resource consumption. Advanced techniques such as model compression, pruning, and quantization offer promising pathways to reduce computational demands without compromising performance. Furthermore, optimizing hardware accelerators, including GPUs, TPUs, and FPGAs (Field-Programmable Gate Arrays), will be essential to support the intensive parallel processing requirements of 3D perception tasks in real-time environments [67].

7.2.2. System Scalability

Unsupervised Domain Adaptation: In many real-world scenarios, such as rural environments, acquiring labeled data for new domains can be challenging or infeasible. To address this issue, future research should focus on unsupervised domain adaptation techniques, which allow models to learn from unlabeled data in the target domain by aligning its structural and feature distributions with those of the labeled source domain. This enables improved generalization and performance across diverse driving environments without the need for extensive manual annotation [15].
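One simple way to quantify how well source and target feature distributions are aligned is the squared distance between their mean feature vectors, which corresponds to the maximum mean discrepancy with a linear kernel. The sketch below uses made-up two-dimensional features; it is one generic alignment measure among many and is not taken from [15].

```python
def feature_mean(features):
    """Component-wise mean of a list of equally sized feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def mean_discrepancy(source_features, target_features):
    """Squared Euclidean distance between source and target feature means."""
    mean_source = feature_mean(source_features)
    mean_target = feature_mean(target_features)
    return sum((s - t) ** 2 for s, t in zip(mean_source, mean_target))

# Illustrative features: labeled urban source domain, unlabeled rural target.
source = [[1.0, 2.0], [3.0, 4.0]]
target_shifted = [[2.0, 2.0], [4.0, 4.0]]
target_aligned = [[3.0, 4.0], [1.0, 2.0]]

gap_before = mean_discrepancy(source, target_shifted)
gap_after = mean_discrepancy(source, target_aligned)
```

In an adaptation setting, a term of this kind would be added to the training objective so that the feature extractor is encouraged to reduce the gap, and no target labels are needed to compute it.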

Scalability: Deep learning-based methods have demonstrated promising performance on established benchmarks. However, their applicability remains largely constrained to specific environments, such as urban or roadway settings. The effectiveness of these techniques in more diverse contexts, such as rural, off-road or forested areas, remains an open research question. Furthermore, existing approaches are often limited to simplified scenarios. As a result, future research should investigate how to scale these methods to handle more complex, large-scale, and realistic environments [114].

7.2.3. Long-Term Evolution

Collaborative Perception: Collaborative perception involves the sharing and integration of information from multiple sensing systems or agents to enhance environmental awareness [109]. The realization of truly autonomous driving depends not only on advances in perception but also on the vehicle’s ability to communicate effectively with its surrounding environment [121]. Moreover, the deployment of 5G networks has helped in this direction, since it enables vehicles to share real-time data and coordinate movements, optimizing traffic flow and minimizing congestion [122]. From this point of view, collaborative perception is closely linked with Vehicle-to-Everything (V2X) communication, which is an umbrella term that includes vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I) and vehicle-to-pedestrian (V2P) communications [123] and enables real-time exchange of sensory and situational data between vehicles and infrastructure [109]. In the context of autonomous driving, collaborative perception typically refers to cooperation between vehicles or between distributed sensors to overcome the limitations of individual perception systems, such as occlusions, limited sensor range, or sparse data. By exchanging perception data, vehicles can achieve a more complete and accurate understanding of their surroundings, improving the detection of obstacles, traffic conditions, and other critical elements. This approach significantly enhances 3D object perception, particularly in complex or dynamic environments. As such, collaborative perception is considered a promising direction for the future advancement of autonomous driving technologies [109].
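The exchange of perception data described above can be sketched as a simple late-fusion step: detections shared by another vehicle are transformed into the ego vehicle's coordinate frame and merged with the ego vehicle's own detections. The sketch assumes 2D object centers, a known relative pose of the sender and a fixed matching radius; all names and values are hypothetical, and real V2X pipelines must additionally handle latency, pose error and message formats.

```python
import math


def to_ego_frame(point, sender_pose):
    """Map an (x, y) point from the sender's frame into the ego frame.

    sender_pose is the sender's (x, y, yaw) expressed in the ego frame.
    """
    px, py = point
    sx, sy, yaw = sender_pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (sx + c * px - s * py, sy + s * px + c * py)


def merge_detections(ego_dets, shared_dets, sender_pose, match_radius=1.5):
    """Keep ego detections and add shared ones that match no ego detection."""
    merged = list(ego_dets)
    for det in shared_dets:
        p = to_ego_frame(det, sender_pose)
        # A shared detection farther than match_radius from every ego
        # detection is treated as an object the ego vehicle could not see.
        if all(math.dist(p, q) > match_radius for q in ego_dets):
            merged.append(p)
    return merged


# The ego vehicle sees one object 10 m ahead.
ego_dets = [(10.0, 0.0)]
# The sender is 20 m ahead of the ego vehicle and faces it (yaw = pi).
sender_pose = (20.0, 0.0, math.pi)
# The sender reports the same object plus one located behind itself.
shared_dets = [(10.0, 0.0), (-5.0, 2.0)]

merged = merge_detections(ego_dets, shared_dets, sender_pose)
```

Here the first shared detection coincides with the object the ego vehicle already sees and is discarded as a duplicate, while the second lies beyond the sender, at roughly (25, -2) in the ego frame, and extends the ego vehicle's view past its own occlusion.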

Improvement of Sensors: Cameras, LiDAR and radar have distinct advantages and drawbacks. LiDAR provides accurate depth but is costly and less effective at close range. Cameras capture rich visual details but lack depth information, while radar is robust in poor weather but suffers from low resolution and signal interference. Future work should focus on improving sensor performance, reducing cost and enhancing multi-sensor fusion techniques to compensate for individual weaknesses [15].

Across the aforementioned trends, and as explained previously, the field is moving rapidly toward unified BEV learning and occupancy prediction, while future work should also prioritize cross-modal attention, temporal fusion, and online calibration. In the same manner, modern fusion techniques such as cross-attention fusion, BEV-space unification, uncertainty-aware fusion and asynchronous fusion should be the subject of future research.
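To make the notion of uncertainty-aware fusion more concrete, the following minimal sketch combines range estimates from several sensors by inverse-variance weighting, so that less certain sensors contribute less to the fused value. It assumes independent, unbiased Gaussian measurement errors, which is a simplification; the sensor readings and variances are hypothetical and do not reproduce any specific method from the cited literature.

```python
def fuse_estimates(estimates):
    """Fuse (value, variance) pairs by inverse-variance weighting.

    Returns the fused value and its variance, assuming independent errors.
    """
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total


# Hypothetical distance-to-object estimates in metres: (value, variance).
lidar = (25.0, 0.04)   # precise depth
radar = (25.6, 0.36)   # coarser resolution
camera = (24.0, 4.0)   # monocular depth, least certain

fused_value, fused_variance = fuse_estimates([lidar, radar, camera])
```

The fused estimate stays close to the LiDAR reading because it has the smallest variance, and the fused variance is lower than that of any single sensor. Learned fusion networks replace the fixed variances with predicted, input-dependent uncertainties.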

8. Conclusions

This study has presented a detailed investigation into the integration of deep learning-based AI and computer vision techniques in the field of autonomous driving. By analyzing perception, a core functional component of autonomous vehicle systems, the study has highlighted the transformative impact of advanced deep learning architectures. Beyond traditional approaches, this work explored the growing significance of multimodal sensor fusion, which greatly enhances the capabilities of 3D object detection in autonomous driving.

Rapid advancements in deep learning-based perception and multimodal fusion have undoubtedly brought the dream of full autonomy one step closer to reality. However, this work identifies several fundamental challenges that continue to limit the robustness, scalability and overall performance of autonomous driving systems. These include perception-specific issues, such as dynamic environments, varying weather and sensing conditions, and dataset limitations, as well as fusion-related challenges, such as data noise and misalignment, temporal synchronization, and high computational demands. In addition, broader concerns related to safety, interpretability, benchmarking inconsistencies and scalability across diverse environments further hinder real-world deployment.

Furthermore, a comparative analysis of various autonomous driving approaches was conducted to examine and clarify their strengths, limitations and suitability across different deployment scenarios. In particular, three types of comparisons were carried out, namely 2D versus 3D perception, classic versus transformer-based approaches and single-sensor versus multimodal approaches. The findings extracted from this comparative analysis showcase how recent advances in perception and multimodal fusion address the limitations and challenges that have been identified above and support more reliable autonomous driving systems.

Finally, the study identified key areas for future research, including collaborative perception, improving sensors and robustness across diverse environments, developing computationally efficient models for embedded deployment, enhancing dataset scalability and diversity and focusing on unsupervised domain adaptation.

In conclusion, despite the remarkable progress in deep learning technologies, the widespread deployment of fully autonomous vehicles capable of unsupervised driving in complex scenarios remains a formidable challenge. In this regard, this work contributes a thorough evaluation of current advancements in deep learning-based autonomous driving while offering a roadmap for future research.

Author Contributions

Conceptualisation, S.N. and P.K.; Methodology, S.N.; Software, S.N.; Validation, S.N. and P.K.; Formal analysis, S.N.; Investigation, S.N. and P.K.; Resources, S.N.; Data curation, S.N.; Writing—original draft, S.N.; Writing—review & editing, S.N. and P.K.; Visualisation, S.N.; Supervision, P.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ARM	Advanced RISC Machines
BEV	Bird’s-Eye View
CNN	Convolutional Neural Network
DETR	Detection Transformer
DFE	Depth-Aware Feature Enhancement
DTR	Depth-Aware Transformer
FMCW	Frequency Modulated Continuous Wave
FPGA	Field-Programmable Gate Array
GAN	Generative Adversarial Network
GAT	Graph Attention Network
GCN	Graph Convolutional Network
GNN	Graph Neural Network
GNSS	Global Navigation Satellite System
GPU	Graphics Processing Unit
GRN	Graph Recurrent Network
GRU	Gated Recurrent Unit
GSTCN	Graph-based Spatio-Temporal Convolutional Network
LiDAR	Light Detection and Ranging
LSTM	Long Short-Term Memory
R-CNN	Region-Based Convolutional Neural Network
RGB	Red-Green-Blue
RNN	Recurrent Neural Network
RoI	Region of Interest
SAE	Society of Automotive Engineers (SAE International)
V2I	Vehicle-to-Infrastructure
V2P	Vehicle-to-Pedestrian
V2V	Vehicle-to-Vehicle
V2X	Vehicle-to-Everything
VAE	Variational Autoencoder
ViT	Vision Transformer

References

Levy, F.; Murnane, R.J. The New Division of Labor: How Computers Are Creating the Next Job Market; Princeton University Press: Princeton, NJ, USA, 2004. [Google Scholar] [CrossRef]
Bathla, G.; Bhadane, K.; Singh, R.K.; Kumar, R.; Aluvalu, R.; Krishnamurthi, R.; Kumar, A.; Thakur, R.N.; Basheer, S. Autonomous Vehicles and Intelligent Automation: Applications, Challenges, and Opportunities. Mob. Inf. Syst. 2022, 2022, 7632892. [Google Scholar] [CrossRef]
Zablocki, É.; Ben-Younes, H.; Pérez, P.; Cord, M. Explainability of deep vision-based autonomous driving systems: Review and challenges. Int. J. Comput. Vis. 2022, 130, 2425–2452. [Google Scholar] [CrossRef]
Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A Survey on Multimodal Large Language Models for Autonomous Driving. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 1–6 January 2024; pp. 958–979. [Google Scholar] [CrossRef]
Fourati, S.; Jaafar, W.; Baccar, N.; Alfattani, S. XLM for Autonomous Driving Systems: A Comprehensive Review. arXiv 2024, arXiv:2409.10484. [Google Scholar] [CrossRef]
Fayyad, J.; Jaradat, M.A.; Gruyer, D.; Najjaran, H. Deep Learning Sensor Fusion for Autonomous Vehicle Perception and Localization: A Review. Sensors 2020, 20, 4220. [Google Scholar] [CrossRef]
Xiang, C.; Feng, C.; Xie, X.; Shi, B.; Lu, H.; Lv, Y.; Yang, M.; Niu, Z. Multi-Sensor Fusion and Cooperative Perception for Autonomous Driving: A Review. IEEE Intell. Transp. Syst. Mag. 2023, 15, 36–58. [Google Scholar] [CrossRef]
Huang, K.; Shi, B.; Li, X.; Li, X.; Huang, S.; Li, Y. Multi-modal Sensor Fusion for Auto Driving Perception: A Survey. arXiv 2024, arXiv:2202.02703. [Google Scholar]
Charroud, A.; El Moutaouakil, K.; Palade, V.; Yahyaouy, A.; Onyekpe, U.; Eyo, E.U. Localization and Mapping for Self-Driving Vehicles: A Survey. Machines 2024, 12, 118. [Google Scholar] [CrossRef]
Wang, X.; Maleki, M.A.; Azhar, M.W.; Trancoso, P. Moving Forward: A Review of Autonomous Driving Software and Hardware Systems. arXiv 2024, arXiv:2411.10291. [Google Scholar] [CrossRef]
Sadaf, M.; Iqbal, Z.; Javed, A.R.; Saba, I.; Krichen, M.; Majeed, S.; Raza, A. Connected and Automated Vehicles: Infrastructure, Applications, Security, Critical Challenges, and Future Aspects. Technologies 2023, 11, 117. [Google Scholar] [CrossRef]
Narisetty, V.S.C.P.; Maddineni, T. Revolutionizing Mobility: The Latest Advancements in Autonomous Vehicle Technology. arXiv 2024, arXiv:2412.20688. [Google Scholar] [CrossRef]
Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Gläser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1341–1360. [Google Scholar] [CrossRef]
Huang, Y.; Chen, Y. Autonomous Driving with Deep Learning: A Survey of State-of-Art Technologies. arXiv 2020, arXiv:2006.06091. [Google Scholar]
Pravallika, A.; Hashmi, M.F.; Gupta, A. Deep Learning Frontiers in 3D Object Detection: A Comprehensive Review for Autonomous Driving. IEEE Access 2024, 12, 173936–173980. [Google Scholar] [CrossRef]
Fernandes, D.; Silva, A.; Névoa, R.; Simões, C.; Gonzalez, D.; Guevara, M.; Novais, P.; Monteiro, J.; Melo-Pinto, P. Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy. Inf. Fusion 2021, 68, 161–191. [Google Scholar] [CrossRef]
Koutini, K.; Eghbal-zadeh, H.; Widmer, G. Receptive Field Regularization Techniques for Audio Classification and Tagging with Deep Convolutional Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1987–2000. [Google Scholar] [CrossRef]
Bruton, J.; Wang, H. Translated Skip Connections—Expanding the Receptive Fields of Fully Convolutional Neural Networks. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 631–635. [Google Scholar] [CrossRef]
Ott, J.; Linstead, E.; LaHaye, N.; Baldi, P. Learning in the machine: To share or not to share? Neural Netw. 2020, 126, 235–249. [Google Scholar] [CrossRef] [PubMed]
Biscione, V.; Bowers, J.S. Convolutional Neural Networks Are Not Invariant to Translation, but They Can Learn to Be. J. Mach. Learn. Res. 2021, 22, 1–28. [Google Scholar]
Audinys, R.; Šlikas, Ž.; Radkevičius, J.; Šutas, M.; Ostreika, A. Deep Reinforcement Learning for a Self-Driving Vehicle Operating Solely on Visual Information. Electronics 2025, 14, 825. [Google Scholar] [CrossRef]
Gupta, A.; Anpalagan, A.; Guan, L.; Khwaja, A.S. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array 2021, 10, 100057. [Google Scholar] [CrossRef]
Khanum, A.; Lee, C.Y.; Yang, C.S. Involvement of Deep Learning for Vision Sensor-Based Autonomous Driving Control: A Review. IEEE Sens. J. 2023, 23, 15321–15341. [Google Scholar] [CrossRef]
Ni, J.; Chen, Y.; Chen, Y.; Zhu, J.; Ali, D.; Cao, W. A Survey on Theories and Applications for Self-Driving Cars Based on Deep Learning Methods. Appl. Sci. 2020, 10, 2749. [Google Scholar] [CrossRef]
Bharilya, V.; Kumar, N. Machine learning for autonomous vehicle’s trajectory prediction: A comprehensive survey, challenges, and future research directions. Veh. Commun. 2024, 46, 100733. [Google Scholar] [CrossRef]
Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications. Information 2024, 15, 517. [Google Scholar] [CrossRef]
Cholakov, R.; Kolev, T. Transformers predicting the future. Applying attention in next-frame and time series forecasting. arXiv 2021, arXiv:2108.08224. [Google Scholar] [CrossRef]
Topal, M.O.; Bas, A.; van Heerden, I. Exploring Transformers in Natural Language Generation: GPT, BERT, and XLNet. arXiv 2021, arXiv:2102.08036. [Google Scholar] [CrossRef]
Wang, Y.; Deng, Y.; Zheng, Y.; Chattopadhyay, P.; Wang, L. Vision Transformers for Image Classification: A Comparative Survey. Technologies 2025, 13, 32. [Google Scholar] [CrossRef]
Aggarwal, C.C. Attention Mechanisms and Transformers. In Machine Learning for Text; Springer: Berlin/Heidelberg, Germany, 2022; pp. 369–391. [Google Scholar]
Lai-Dang, Q.V. A Survey of Vision Transformers in Autonomous Driving: Current Trends and Future Directions. arXiv 2024, arXiv:2403.07542. [Google Scholar] [CrossRef]
Sengar, S.S.; Hasan, A.B.; Kumar, S.; Carroll, F. Generative artificial intelligence: A systematic review and applications. Multimed. Tools Appl. 2025, 84, 23661–23700. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
Bavirisetti, D.P.; Martinsen, H.R.; Kiss, G.H.; Lindseth, F. A Multi-Task Vision Transformer for Segmentation and Monocular Depth Estimation for Autonomous Vehicles. IEEE Open J. Intell. Transp. Syst. 2023, 4, 909–928. [Google Scholar] [CrossRef]
Guo, S.; Wang, S.; Yang, Z.; Wang, L.; Zhang, H.; Guo, P.; Gao, Y.; Guo, J. A Review of Deep Learning-Based Visual Multi-Object Tracking Algorithms for Autonomous Driving. Appl. Sci. 2022, 12, 10741. [Google Scholar] [CrossRef]
Chen, L.; Sima, C.; Li, Y.; Zheng, Z.; Xu, J.; Geng, X.; Li, H.; He, C.; Shi, J.; Qiao, Y.; et al. PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 550–567. [Google Scholar]
Bai, Y.; Chen, Z.; Fu, Z.; Peng, L.; Liang, P.; Cheng, E. CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7062–7068. [Google Scholar] [CrossRef]
Li, Z.; Wang, W.; Xie, E.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P.; Lu, T. Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 1270–1279. [Google Scholar] [CrossRef]
Liu, Y.; Yuan, T.; Wang, Y.; Wang, Y.; Zhao, H. VectorMapNet: End-to-end Vectorized HD Map Learning. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 22352–22369. [Google Scholar]
Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.; Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17830–17839. [Google Scholar] [CrossRef]
Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-End Multiple-Object Tracking with Transformer. In Proceedings of the Computer Vision–ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 659–675. [Google Scholar]
Wang, Z.; Guo, J.; Hu, Z.; Zhang, H.; Zhang, J.; Pu, J. Lane Transformer: A High-Efficiency Trajectory Prediction Model. IEEE Open J. Intell. Transp. Syst. 2023, 4, 2–13. [Google Scholar] [CrossRef]
Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12878–12895. [Google Scholar] [CrossRef]
Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. arXiv 2021, arXiv:2003.05991. [Google Scholar]
Michelucci, U. An Introduction to Autoencoders. arXiv 2022, arXiv:2201.03898. [Google Scholar] [CrossRef]
Khemani, B.; Patil, S.; Kotecha, K.; Tanwar, S. A review of graph neural networks: Concepts, architectures, techniques, challenges, datasets, applications, and future directions. J. Big Data 2024, 11, 18. [Google Scholar] [CrossRef]
Carrasco Limeros, S.; Majchrowska, S.; Johnander, J.; Petersson, C.; Fernández Llorca, D. Towards explainable motion prediction using heterogeneous graph representations. Transp. Res. Part C Emerg. Technol. 2023, 157, 104405. [Google Scholar] [CrossRef]
Krzywda, M.; Łukasik, S.; Gandomi, A.H. Graph Neural Networks in Computer Vision—Architectures, Datasets and Common Approaches. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–10. [Google Scholar] [CrossRef]
Meyer, E.; Brenner, M.; Zhang, B.; Schickert, M.; Musani, B.; Althoff, M. Geometric Deep Learning for Autonomous Driving: Unlocking the Power of Graph Neural Networks with CommonRoad-Geometric. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–8. [Google Scholar] [CrossRef]
Huang, R.; Zhuo, G.; Xiong, L.; Lu, S.; Tian, W. A Review of Deep Learning-Based Vehicle Motion Prediction for Autonomous Driving. Sustainability 2023, 15, 14716. [Google Scholar] [CrossRef]
Wang, L.; Huang, Y. A Survey of 3D Point Cloud and Deep Learning-Based Approaches for Scene Understanding in Autonomous Driving. IEEE Intell. Transp. Syst. Mag. 2022, 14, 135–154. [Google Scholar] [CrossRef]
Rahmani, S.; Baghbani, A.; Bouguila, N.; Patterson, Z. Graph Neural Networks for Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2023, 24, 8846–8885. [Google Scholar] [CrossRef]
Shi, W.; Rajkumar, R.R. Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1708–1716. [Google Scholar]
Sheng, Z.; Xu, Y.; Xue, S.; Li, D. Graph-Based Spatial-Temporal Convolutional Network for Vehicle Trajectory Prediction in Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17654–17665. [Google Scholar] [CrossRef]
Klimke, M.; Volz, B.; Buchholz, M. Cooperative Behavior Planning for Automated Driving Using Graph Neural Networks. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 5–9 June 2022; IEEE: New York, NY, USA, 2022; pp. 167–174. [Google Scholar] [CrossRef]
Wang, Y.; Mao, Q.; Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Li, H.; Zhang, Y. Multi-modal 3d object detection in autonomous driving: A survey. Int. J. Comput. Vis. 2023, 131, 2122–2152. [Google Scholar] [CrossRef]
Liang, L.; Ma, H.; Zhao, L.; Xie, X.; Hua, C.; Zhang, M.; Zhang, Y. Vehicle Detection Algorithms for Autonomous Driving: A Review. Sensors 2024, 24, 3088. [Google Scholar] [CrossRef]
Karangwa, J.; Liu, J.; Zeng, Z. Vehicle Detection for Autonomous Driving: A Review of Algorithms and Datasets. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11568–11594. [Google Scholar] [CrossRef]
Guo, Z.; Huang, Y.; Hu, X.; Wei, H.; Zhao, B. A Survey on Deep Learning Based Approaches for Scene Understanding in Autonomous Driving. Electronics 2021, 10, 471. [Google Scholar] [CrossRef]
Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking Semantic Segmentation From a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 12077–12090. [Google Scholar]
Wan, Q.; Huang, Z.; Lu, J.; Yu, G.; Zhang, L. Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Liu, X.; Li, J.; Ma, J.; Sun, H.; Xu, Z.; Zhang, T.; Yu, H. Deep transfer learning for intelligent vehicle perception: A survey. Green Energy Intell. Transp. 2023, 2, 100125. [Google Scholar] [CrossRef]
Wang, Y.; Wang, S.; Li, Y.; Liu, M. A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions. arXiv 2024, arXiv:2408.16530. [Google Scholar] [CrossRef]
Mao, J.; Shi, S.; Wang, X.; Li, H. 3D Object Detection for Autonomous Driving: A Comprehensive Survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
Qian, R.; Lai, X.; Li, X. 3D Object Detection for Autonomous Driving: A Survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
Fawole, O.A.; Rawat, D.B. Recent Advances in 3D Object Detection for Self-Driving Vehicles: A Survey. AI 2024, 5, 1255–1285. [Google Scholar] [CrossRef]
Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4603–4611. [Google Scholar] [CrossRef]
Nagiub, A.S.; Fayez, M.; Khaled, H.; Ghoniemy, S. 3D Object Detection for Autonomous Driving: A Comprehensive Review. In Proceedings of the 2024 6th International Conference on Computing and Informatics (ICCI), Cairo, Egypt, 6–7 March 2024; pp. 1–11. [Google Scholar] [CrossRef]
Aung, N.H.H.; Sangwongngam, P.; Jintamethasawat, R.; Shah, S.; Wuttisittikulkij, L. A Review of LiDAR-Based 3D Object Detection via Deep Learning Approaches Towards Robust Connected and Autonomous Vehicles. IEEE Trans. Intell. Veh. 2025, 10, 526–547. [Google Scholar] [CrossRef]
Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3D Object Detection with Pointformer. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7459–7468. [Google Scholar] [CrossRef]
Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9223–9232. [Google Scholar] [CrossRef]
Sun, P.; Tan, M.; Wang, W.; Liu, C.; Xia, F.; Leng, Z.; Anguelov, D. SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 426–442. [Google Scholar]
Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J. Improving 3D Object Detection with Channel-wise Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2723–2732. [Google Scholar] [CrossRef]
Huang, K.; Wu, T.; Su, H.; Hsu, W.H. MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 4002–4011. [Google Scholar] [CrossRef]
Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the 5th Conference on Robot Learning, PMLR, London, UK, 8–11 December 2022; pp. 180–191. [Google Scholar]
Tang, Y.; He, H.; Wang, Y.; Mao, Z.; Wang, H. Multi-modality 3D object detection in autonomous driving: A review. Neurocomputing 2023, 553, 126587. [Google Scholar] [CrossRef]
Wang, X.; Li, K.; Chehri, A. Multi-Sensor Fusion Technology for 3D Object Detection in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2024, 25, 1148–1165. [Google Scholar] [CrossRef]
Tian, D.; Li, J.; Lei, J. Multi-sensor information fusion in Internet of Vehicles based on deep learning: A review. Neurocomputing 2025, 614, 128886. [Google Scholar] [CrossRef]
Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef]
Alaba, S.Y.; Gurbuz, A.C.; Ball, J.E. Emerging Trends in Autonomous Vehicle Perception: Multimodal Fusion for 3D Object Detection. World Electr. Veh. J. 2024, 15, 20. [Google Scholar] [CrossRef]
Yang, B.; Li, J.; Zeng, T. A Review of Environmental Perception Technology Based on Multi-Sensor Information Fusion in Autonomous Driving. World Electr. Veh. J. 2025, 16, 20. [Google Scholar] [CrossRef]
Jiao, T.; Guo, C.; Feng, X.; Chen, Y.; Song, J. A Comprehensive Survey on Deep Learning Multi-Modal Fusion: Methods, Technologies and Applications. Comput. Mater. Contin. 2024, 80, 1–35. [Google Scholar] [CrossRef]
Qian, H.; Wang, M.; Zhu, M.; Wang, H. A Review of Multi-Sensor Fusion in Autonomous Driving. Sensors 2025, 25, 6033. [Google Scholar] [CrossRef] [PubMed]
Wei, C.; Qin, Z.; Zhang, Z.; Wu, G.; Barth, M.J. Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles. In Proceedings of the 2025 IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania, 22–25 June 2025; pp. 1817–1824. [Google Scholar] [CrossRef]
Malawade, A.V.; Mortlock, T.; Al Faruque, M.A. HydraFusion: Context-Aware Selective Sensor Fusion for Robust and Efficient Autonomous Vehicle Perception. In Proceedings of the 2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS), Milano, Italy, 4–6 May 2022; pp. 68–79. [Google Scholar] [CrossRef]
Bi, J.; Wei, H.; Zhang, G.; Yang, K.; Song, Z. DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion. IEEE Lat. Am. Trans. 2024, 22, 106–112. [Google Scholar] [CrossRef]
Shao, Z.; Wang, H.; Cai, Y.; Chen, L.; Li, Y. UA-Fusion: Uncertainty-Aware Multimodal Data Fusion Framework for 3-D Object Detection of Autonomous Vehicles. IEEE Trans. Instrum. Meas. 2025, 74, 1–16. [Google Scholar] [CrossRef]
Hayes, S.; Sharma, S.; Eising, C. Velocity driven vision: Asynchronous sensor fusion birds eye view models for autonomous vehicles. IET Conf. Proc. 2024, 2024, 23–30. [Google Scholar] [CrossRef]
Yang, C.; Huan, S.; Wu, L.; Weng, Q.; Xiong, W. Fusion of Millimeter-Wave Radar and Camera Vision for Pedestrian Tracking. In Proceedings of the 2023 5th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 14–16 April 2023; pp. 317–321. [Google Scholar] [CrossRef]
Deng, J.; Zhu, B.; Chu, X.; Wang, L.; Lu, Z.; Hu, Z. Robust Target Detection, Position Deducing and Tracking Based on Radar Camera Fusion in Transportation Scenarios. In Proceedings of the 2022 IEEE 95th Vehicular Technology Conference: (VTC2022-Spring), Helsinki, Finland, 19–22 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
Cheng, L.; Sengupta, A.; Cao, S. Deep Learning-Based Robust Multi-Object Tracking via Fusion of mmWave Radar and Camera Sensors. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17218–17233. [Google Scholar] [CrossRef]
Wang, L.; Zhang, X.; Qin, W.; Li, X.; Gao, J.; Yang, L.; Li, Z.; Li, J.; Zhu, L.; Wang, H.; et al. CAMO-MOT: Combined Appearance-Motion Optimization for 3D Multi-Object Tracking with Camera-LiDAR Fusion. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11981–11996. [Google Scholar] [CrossRef]
Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Wu, B.; Lu, Y.; Zhou, D.; et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection. arXiv 2022, arXiv:2203.08195. [Google Scholar]
Zhang, X.; Yin, X.; Gao, X.; Qiu, T.; Wang, L.; Yu, G.; Wang, Y.; Zhang, G.; Li, J. Adaptive Entropy Multi-Modal Fusion for Nighttime Lane Segmentation. IEEE Trans. Intell. Veh. 2024, 9, 6990–7002. [Google Scholar] [CrossRef]
Ye, H.; Mei, J.; Hu, Y. M2F2-Net: Multi-Modal Feature Fusion for Unstructured Off-Road Freespace Detection. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–7. [Google Scholar] [CrossRef]
Duraisamy, P.; Natarajan, S. Multi-Sensor Fusion Based Off-Road Drivable Region Detection and Its ROS Implementation. In Proceedings of the 2023 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), Chennai, India, 29–31 March 2023; pp. 1–5. [Google Scholar] [CrossRef]
Feng, Y.; Li, X.; Ni, P.; Liu, X.; Jiang, T. Multisensor Fusion Network for Unstructured Scene Segmentation with Surface Normal Incorporated. IEEE Sens. J. 2024, 24, 13589–13603. [Google Scholar] [CrossRef]
Ming, Z.; Stephany Berrio, J.; Shan, M.; Worrall, S. OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction. IEEE Trans. Intell. Veh. 2025, 10, 3421–3433.
Wang, Y.; Chen, X.; Cao, L.; Huang, W.; Sun, F.; Wang, Y. Multimodal Token Fusion for Vision Transformers. arXiv 2022, arXiv:2204.08721.
Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. arXiv 2022, arXiv:2203.11496.
Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying Voxel-based Representation with Transformer for 3D Object Detection. arXiv 2022, arXiv:2206.00630.
Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781.
Zhang, C.; Zhang, C.; Guo, Y.; Chen, L.; Happold, M. MotionTrack: End-to-End Transformer-based Multi-Object Tracing with LiDAR-Camera Fusion. arXiv 2023, arXiv:2306.17000.
Wang, Y.; Ye, T.; Cao, L.; Huang, W.; Sun, F.; He, F.; Tao, D. Bridged Transformer for Vision and Point Cloud 3D Object Detection. arXiv 2022, arXiv:2210.01391.
Shi, H.; Wang, X.; Zhao, J.; Hua, X. A Cross-Modal Attention-Driven Multi-Sensor Fusion Method for Semantic Segmentation of Point Clouds. Sensors 2025, 25, 2474.
Luo, Y.; Sun, A.; Hong, J. Autonomous Driving Decision-Making Method Based on Spatial-Temporal Fusion Trajectory Prediction. Appl. Sci. 2024, 14, 11913.
Han, X.; Luo, J.; Wei, X.; Wang, Y. Online Calibration Method of LiDAR and Camera Based on Fusion of Multi-Scale Cost Volume. Information 2025, 16, 223.
Zhu, M.; Gong, Y.; Tian, C.; Zhu, Z. A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends. Drones 2024, 8, 412.
Dong, X.; Cappuccio, M.L. Applications of computer vision in autonomous vehicles: Methods, challenges and future directions. arXiv 2023, arXiv:2311.09093.
Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177.
Alaba, S.Y.; Ball, J.E. Deep Learning-Based Image 3-D Object Detection for Autonomous Driving: Review. IEEE Sens. J. 2023, 23, 3378–3394.
Wang, L.; Zhang, X.; Song, Z.; Bi, J.; Zhang, G.; Wei, H.; Tang, L.; Yang, L.; Li, J.; Jia, C.; et al. Multi-Modal 3D Object Detection in Autonomous Driving: A Survey and Taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798.
Chen, C.; Wang, B.; Lu, C.X.; Trigoni, N.; Markham, A. Deep Learning for Visual Localization and Mapping: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 17000–17020.
Tampuu, A.; Matiisen, T.; Semikin, M.; Fishman, D.; Muhammad, N. A Survey of End-to-End Driving: Architectures and Training Methods. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1364–1384.
Chen, L.; Wu, P.; Chitta, K.; Jaeger, B.; Geiger, A.; Li, H. End-to-End Autonomous Driving: Challenges and Frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10164–10183.
Ly, A.O.; Akhloufi, M. Learning to Drive by Imitation: An Overview of Deep Behavior Cloning Methods. IEEE Trans. Intell. Veh. 2021, 6, 195–209.
Wang, T.; Zhu, X.; Pang, J.; Lin, D. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 913–922.
Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers. In Proceedings of the Computer Vision – ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 1–18.
Singh, A. Transformer-Based Sensor Fusion for Autonomous Driving: A Survey. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 3304–3309.
Sonko, S.; Etukudoh, E.A.; Ibekwe, K.I.; Ilojianya, V.I.; Daudu, C.D. A comprehensive review of embedded systems in autonomous vehicles: Trends, challenges, and future directions. World J. Adv. Res. Rev. 2024, 21, 2009–2020.
Damaj, I.W.; Yousafzai, J.K.; Mouftah, H.T. Future Trends in Connected and Autonomous Vehicles: Enabling Communications and Processing Technologies. IEEE Access 2022, 10, 42334–42345.
Adnan Yusuf, S.; Khan, A.; Souissi, R. Vehicle-to-everything (V2X) in the autonomous vehicles domain – A technical review of communication, sensor, and AI technologies for road user safety. Transp. Res. Interdiscip. Perspect. 2024, 23, 100980.

Figure 1. PRISMA diagram for implementation of the literature review process.

Figure 2. CNN architecture used for image recognition and classification.

Figure 3. Basic architecture of Recurrent Neural Network.

Figure 4. Basic architecture of an LSTM Network.

Figure 5. Basic architecture of GRU Network.

Figure 6. Basic architecture of the Vision Transformer.

Figure 7. Basic structure of an autoencoder.

Figure 8. Type and position of autonomous vehicle sensors.

Figure 9. “When to fuse” design methodology: (a) Early (Data-level) fusion, (b) Late (Decision-level) fusion, (c) Middle (Feature-level) fusion, and (d) Deep (Feature) fusion.

Figure 10. Quantitative distribution and evolution of perception trends (2020–2025).

Table 1. Comparison with existing literature.

Study	Year	Deep Learning Focus	Sensor Type	State-of-the-Art
Fayyad et al. [6]	2020	CNN Fusion	Camera/LiDAR/Radar	No
Xiang et al. [7]	2023	CNN Fusion	Camera/LiDAR	Limited
Huang et al. [8]	2024	CNN Fusion	Camera/LiDAR	Limited
This study	2026	Transformer/Hybrid Fusion	Camera/LiDAR/Radar	Yes

Table 2. Methodology for source collection.

Criteria	Details
Sources	IEEE Xplore, Clarivate, Scopus, Springer Nature, ScienceDirect, MDPI, arXiv, Google Scholar
Keywords	Deep Learning, Autonomous Driving, Perception
Search Strings	"Deep Learning in Autonomous Driving", "Multimodal sensor fusion in Autonomous driving"

Table 3. Inclusion and exclusion criteria.

Inclusion Criteria	Exclusion Criteria
Articles published in peer-reviewed or otherwise reputed journals and in conference proceedings	Editorial pieces, prefaces, summaries, book reviews and other non-peer-reviewed materials
Studies focusing on deep learning-based autonomous driving technologies	Articles not relevant to the targeted area of deep learning-based autonomous driving
Publications in the English language	Non-English articles
Articles published between 2020 and 2025

Table 4. Performance trade-offs evaluation of CNN, RNN, and ViT architectures.

Evaluation Metric	CNNs	RNNs	ViTs
Accuracy	High	Medium	Superior
Latency	Lowest	Medium/High	Highest
Temporal Modeling	Low	High	Medium/High
Deployment Maturity	High	Medium	Medium
Explainability	Medium	Low	Low

Table 5. Comparison of multimodal fusion strategies.

Fusion Strategy	Point of Integration	Key Advantage	Core Limitation
Early	Input stage	Integration of diverse data types. Richer and more expressive fused data.	High computational demands. Potential confusion or redundancy.
Middle	After initial feature extraction	Leverages different perspectives. Detailed feature representations.	Possible omission of subtle details. Difficulty in optimization.
Deep	During feature extraction	Compensates for missing features in one modality.	Dimensionality explosion. Performance degradation risk.
Late	Output stage	High anti-interference. Reducing dependency on specific data types.	Significant information loss. Potential redundancy or inconsistencies.
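
The four integration points compared in Table 5 can be illustrated with a minimal, purely didactic sketch. Everything here is an assumption made for illustration: `toy_encoder` and `toy_head` are hypothetical stand-ins for real feature extractors and detection heads, the inputs are short lists of numbers rather than images or point clouds, and the deep-fusion variant assumes both modalities have the same feature length. It is not the method of any cited work.

```python
from typing import List, Sequence

Vector = List[float]


def toy_encoder(x: Sequence[float], scale: float) -> Vector:
    """Hypothetical feature extractor: scale each value, then apply ReLU."""
    return [max(0.0, scale * v) for v in x]


def toy_head(features: Sequence[float]) -> float:
    """Hypothetical prediction head: mean feature squashed into [0, 1)."""
    s = sum(features) / len(features)
    return s / (1.0 + s)  # valid because ReLU features are non-negative


def early_fusion(camera: Sequence[float], lidar: Sequence[float]) -> float:
    # Input stage: concatenate raw data, then run one shared encoder and head.
    return toy_head(toy_encoder(list(camera) + list(lidar), scale=1.0))


def middle_fusion(camera: Sequence[float], lidar: Sequence[float]) -> float:
    # After initial feature extraction: encode separately, fuse features once.
    fused = toy_encoder(camera, 1.0) + toy_encoder(lidar, 0.5)
    return toy_head(fused)


def deep_fusion(camera: Sequence[float], lidar: Sequence[float],
                layers: int = 2, mix: float = 0.5) -> float:
    # During feature extraction: branches exchange features at every layer.
    cam, lid = list(camera), list(lidar)
    for _ in range(layers):
        cam_f, lid_f = toy_encoder(cam, 1.0), toy_encoder(lid, 0.5)
        cam = [(1 - mix) * c + mix * l for c, l in zip(cam_f, lid_f)]
        lid = [(1 - mix) * l + mix * c for c, l in zip(cam_f, lid_f)]
    return toy_head(cam + lid)


def late_fusion(camera: Sequence[float], lidar: Sequence[float]) -> float:
    # Output stage: fully independent pipelines, then average the decisions.
    cam_score = toy_head(toy_encoder(camera, 1.0))
    lidar_score = toy_head(toy_encoder(lidar, 0.5))
    return 0.5 * (cam_score + lidar_score)


if __name__ == "__main__":
    camera, lidar = [0.2, 0.8], [1.0, 0.4]
    for fuse in (early_fusion, middle_fusion, deep_fusion, late_fusion):
        print(fuse.__name__, round(fuse(camera, lidar), 4))
```

The only structural difference between the functions is where the two branches meet, which mirrors the "Point of Integration" column: the later the merge, the less each branch depends on the other, at the price of discarding cross-modal detail before the decision is made.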

Table 6. Comparison between 2D and 3D perception approaches. The reported mAP ranges are based on the MS COCO dataset for pure 2D tasks and the nuScenes dataset for 2D-applied-to-3D and pure 3D cases.

Approach	Strengths	Limitations	mAP
2D perception	Cost effective and widely available camera sensors. Highly advanced models.	Lack of direct depth estimation. Restricted to identifying objects within a 2D image plane.	0.22–0.54 [57]; 0.23–0.34 applied on 3D perception [118]
3D perception	Higher accuracy and reliability for scene understanding. Identify and locate objects in three-dimensional space with depth information.	Costly sensors. Computationally expensive. Data formats lack a consistent and organized structure.	0.40–0.62 [109]

Table 7. Comparison between classic and transformer-based techniques based on the nuScenes dataset.

Approach	Strengths	Limitations	mAP
Classic Approaches (CNNs)	Mature and varied field. Can achieve high accuracy (e.g., Faster R-CNN). Can achieve high computational speed and low inference time (e.g., YOLO).	Cannot stand as a standalone framework in the modern ecosystem of autonomous driving.	0.36–0.58 [119]
Transformer-Based Techniques	Operate in a direct, end-to-end manner. Highly adaptable. Promising in multimodal fusion.	Less deployment-mature, often higher computational and training demands, broader real-world validation still needed.	0.40–0.62 [109]

Table 8. Comparison between single-sensor and multimodal sensor fusion techniques based on the nuScenes dataset.

Approach	Strengths	Limitations	mAP
Single-Sensor Approaches	Reduced system complexity. Relatively easy to implement.	Limited reliability and robustness. Vulnerable to inherent limitations of the specific sensor. Reduced performance in diverse environmental conditions. Insufficient for real-world driving.	0.40–0.62 [109]
Multimodal Sensor Fusion	High accuracy. Sensor complementarity. High reliability and redundancy. High detection and tracking precision.	Complexity of fusion algorithms. Need for more sophisticated designs. Efficiency and maintainability limitations.	0.66–0.73 [109]
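
For readers unfamiliar with the mAP column in Tables 6–8, the following is a minimal sketch of how a mean average precision figure is typically formed: detections of one class are ranked by confidence, a precision-recall curve is traced, its interpolated area gives the class AP, and the mean over classes gives mAP. This is a generic illustration under stated assumptions, not the official evaluation code of either benchmark. Benchmarks differ in how a detection is matched to ground truth (for example, box overlap in MS COCO versus center distance in nuScenes) and in how thresholds are averaged; here the match decision is simply assumed to be given as a boolean, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence score, matched a ground-truth object?)


def average_precision(detections: List[Detection], num_gt: int) -> float:
    """Area under the interpolated precision-recall curve for one class."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    # Trace (recall, precision) after each detection, highest score first.
    points: List[Tuple[float, float]] = []
    true_positives = 0
    for rank, (_, is_match) in enumerate(ranked, start=1):
        if is_match:
            true_positives += 1
        points.append((true_positives / num_gt, true_positives / rank))

    # Interpolate: precision at a recall level is the best precision
    # achieved at that recall or any higher recall.
    best = 0.0
    interpolated: List[Tuple[float, float]] = []
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    interpolated.reverse()

    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in interpolated:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Unweighted mean of per-class AP values."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    car = ([(0.9, True), (0.8, False), (0.7, True), (0.6, False)], 2)
    pedestrian = ([(0.95, True), (0.5, True)], 2)
    print(round(mean_average_precision({"car": car, "pedestrian": pedestrian}), 4))
```

Because the matching rule and threshold averaging are benchmark-specific, mAP values computed on different datasets, such as the MS COCO and nuScenes ranges in Table 6, are not directly comparable.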

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the World Electric Vehicle Association. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
