Review

A Survey of Deep Learning-Based 3D Object Detection Methods for Autonomous Driving Across Different Sensor Modalities

by Miguel Valverde 1,*, Alexandra Moutinho 2,* and João-Vitor Zacchi 3

1 Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
2 Instituto de Engenharia Mecânica, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
3 Fraunhofer IKS, 80686 Munich, Germany
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(17), 5264; https://doi.org/10.3390/s25175264
Submission received: 7 July 2025 / Revised: 29 July 2025 / Accepted: 19 August 2025 / Published: 24 August 2025
(This article belongs to the Special Issue Sensors and Sensor Fusion Technology in Autonomous Vehicles)

Abstract

This paper presents a comprehensive survey of deep learning-based methods for 3D object detection in autonomous driving, focusing on their use of diverse sensor modalities, including monocular cameras, stereo vision, LiDAR, radar, and multi-modal fusion. To systematically organize the literature, a structured taxonomy is proposed that categorizes methods by input modality. The review also outlines the chronological evolution of these approaches, highlighting major architectural developments and paradigm shifts. Furthermore, the surveyed methods are quantitatively compared using standard evaluation metrics across benchmark datasets in autonomous driving scenarios. Overall, this work provides a detailed and modality-agnostic overview of the current landscape of deep learning approaches for 3D object detection in autonomous driving. The results of this work are available in an open GitHub repository.

1. Introduction

1.1. Overview of Autonomous Vehicles

Autonomous vehicles (AVs) have emerged as a transformative paradigm in mobility, with the potential to reshape how people and goods are transported and how societies function. Over the past decade, significant advances in sensor technology, computational power, and machine learning have enabled the transition from early concepts to real-world deployment. AVs are now being tested and operated in urban areas, industrial settings, and controlled environments worldwide. Governments and companies continue to invest heavily in their development, driven by the promise of reducing traffic accidents, lowering emissions, and improving effectiveness and efficiency in the transportation of people.
According to the Society of Automotive Engineers’ (SAE) standard for driving automation [1], there are six distinct levels (illustrated in Figure 1), ranging from level 0, where the driver is in command of the car, to level 5, where the vehicle assumes complete control over all driving aspects. Levels 1 and 2 of automation, which include advanced driver assistance systems (ADASs) such as braking assistance, cruise control, and lane switching [2], have now practically become standard in many cars. Safety concerns primarily drove their adoption but also paved the way for the development of more advanced systems.
For a long time, the automotive industry was stuck at SAE Level 2; however, vehicles with Level 3 automation have recently started entering the market. Notably, the Mercedes S-Class achieved Level 3 certification in Germany and in the United States, while the Honda Legend received the same certification in Japan [3]. Moreover, robotaxis are already operating at Level 4 [4], though some argue they can be considered Level 5 due to their full autonomy in constrained environments [5].
Figure 1. SAE levels of automation [1].
Safety is a key driver of AD growth. Studies such as [6,7] highlight that approximately 90% to 94% of accidents are caused by human error. AVs aim to mitigate these risks by eliminating unsafe driver behaviours like distraction and speeding. Additionally, AVs are expected to reduce carbon emissions, smooth traffic flow, increase productivity, and have a positive economic impact. A study from America’s Workforce and Self-Driving Future [8] predicts that widespread adoption of AVs could lead to nearly $800 billion in annual social and economic benefits by 2050. Another study [9] states that autonomous technologies can reduce urban travel time by a third, decrease greenhouse gas emissions by two-thirds, decrease the number of vehicles in crowded cities by 30%, and decrease the need for parking spaces by 40%.
Autonomous driving (AD), computer vision (CV), robotics, and machine learning (ML) are among the most prominent areas of research at the moment [10]. AVs are the next step in the evolution of the automobile, as the industry is noticeably moving towards a world where less or even no human interaction is required. Figure 2a illustrates the trajectory of publications in AD since 1964, demonstrating a notable surge at the beginning of the 21st century. Figure 2b zooms in on the last two decades (2004 to 2024), revealing consistent annual publication growth. Over this period, the number of scientific publications on AD per year increased from 2828 to 78,602, marking an almost 30-fold increase.
Figure 2. Number of publications per year on autonomous vehicles. Source: Scopus—Elsevier (https://www.scopus.com/home.uri, advanced search using the (self-driving car) OR (autonomous car) keywords on 25 April 2025).
The anticipated benefits of a global shift toward full or partial AVs are widely recognized. However, achieving large-scale deployment remains a complex challenge and is far from imminent. Despite substantial investment and rapid technological progress, fully automated vehicles still face significant obstacles across multiple domains. Concerns over AV safety remain widespread, particularly among vulnerable groups such as individuals with disabilities [11]. Moreover, documented instances of algorithmic bias, including decisions influenced by race, further complicate public trust and acceptance [12].
Not only ethical but also technical challenges persist. Ensuring reliable performance under adverse weather conditions, such as heavy rain, snow, or fog, and in difficult lighting environments, including night-time driving, tunnels, and shaded areas, remains difficult. These scenarios can impair sensor accuracy and disrupt vehicle control systems, underscoring the need for more robust algorithms and advanced sensor fusion techniques to guarantee dependable operation across varied environments [10]. Given the substantial societal and safety benefits AVs promise, it is critical to address both technical and ethical barriers to enable their successful integration into modern transportation systems.
An AV must perform all the functions a human driver would, including perceiving the environment, determining its position, anticipating the actions of other road users, planning a trajectory, and executing control commands to follow that trajectory. To achieve this, AV software is typically organized into distinct modules and submodules. While variations exist, a common framework follows the Sense–Plan–Act paradigm [13], illustrated in Figure 3, and includes the following components: Sensors, Perception, Planning, Control, and Car.
The Sensors module is responsible for acquiring data from a range of sensors, including cameras, Radio Detection and Ranging (radar), Light Detection and Ranging (LiDAR), Global Navigation Satellite Systems (GNSSs), and inertial sensors. The Perception module interprets these data to detect objects and estimate the state of the vehicle. Based on the output of perception, the vehicle plans a trajectory and generates control actions, such as steering and acceleration, to follow the planned path [14].
Some of the most complex challenges in autonomous driving stem from the perception stage, which relies on a combination of advanced sensors and state-of-the-art (SoA) algorithms. As an initial stage in the AV software pipeline, the performance of the Perception module directly affects all downstream tasks. Therefore, a thorough understanding of the various techniques used in perception is essential to advance reliable and safe autonomous driving systems.
Figure 3. Typical autonomous vehicle pipeline [15].

1.2. Scope, Aims, and Outline

Three-dimensional object detection plays a central role in autonomous driving by enabling the vehicle to understand and interact with its environment in three dimensions. It involves estimating the position, dimensions, and class of objects in the environment using sensor data. Modern perception systems typically rely on multimodal inputs, such as images from cameras, point clouds from LiDAR, or a combination of both, to extract geometric and semantic information from the scene. With the rapid progress of deep learning (DL), a wide range of 3D object detection methods have emerged, each tailored to specific sensor modalities and data representations. However, these methods are often developed and evaluated in isolation, lacking a unified perspective that spans different input types and architectural paradigms. This fragmentation makes it difficult to assess their relative strengths and limitations, particularly in the context of real-world autonomous driving scenarios. To address this gap, this work presents a structured and comparative review of 3D object detection methods across all major sensor modalities, aiming to clarify performance trends, highlight paradigm shifts, and guide future developments in the field.
This work aims to provide a comprehensive review of 3D object detection methods tailored for AD, offering an in-depth analysis and structured comparison of different approaches and associated sensor modalities. Unlike existing surveys, our review includes more recent developments and evaluates methods across all types of sensors and application contexts. Although some previous works, such as [16,17,18], present similar structures, they were published before 2023 and cover older methods. More recent surveys like [19,20,21] either lack breadth or fail to benchmark and compare methods on standard datasets. Other studies focus narrowly on specific modalities, such as [22] for image-based methods, [23] for point cloud-based DL methods, and [24] for multi-modal techniques. Transformer-based approaches are explored in [25,26].
This survey presents the first unified and structured review of recent 3D object detection methods using DL across all major sensor modalities. It introduces a refined taxonomy and highlights performance trends and paradigm shifts, to support fair and systematic evaluation in future research. The survey covers developments over the past decade, focusing particularly on methods published in top-tier computer vision venues. Alongside technical insights into 3D detectors, the paper discusses taxonomies, benchmark datasets, evaluation protocols, and open challenges. More specifically, the following contributions are provided:
  • The context for the task of 3D object detection, including its formulation, the sensor modalities involved, and the benchmark datasets with their respective evaluation metrics.
  • A comprehensive literature review of camera-based, LiDAR-based, radar-based, and multi-modal 3D perception methods, including an updated taxonomy and a discussion of their evolution.
  • A performance and speed benchmark of selected 3D object detectors using standard datasets and evaluation metrics.
An updated project page listing all methods covered in this work is also maintained at (https://3d-object-detection-hub.github.io/, accessed on 25 April 2025).
The structure of this paper is organized as follows. Section 2 introduces the 3D object detection problem, followed by typical data representations and different sensor types. Then, Section 3 summarizes the commonly used datasets and metrics for 3D object detection. Section 4 presents the taxonomy of the methods considered and highlights the major paradigm shifts and developments. Finally, a speed and performance analysis is provided in Section 5. Section 6 offers concluding remarks and discusses future research trends.

2. Background

2.1. Problem Definition

Object detection (OD) is a fundamental task in CV, aiming to identify and localize objects of interest from sensory data such as images or point clouds (PCs) [27]. It typically involves a two-stage process: classifying object types and regressing bounding box parameters. While 2D OD has achieved strong performance in benchmarks like KITTI [28], its lack of depth information limits its usefulness in autonomous driving, particularly for critical tasks such as motion planning and collision avoidance [29].
Three-dimensional OD addresses these shortcomings by incorporating depth, allowing objects to be located and characterized in three-dimensional space. This additional dimensionality improves spatial reasoning but introduces greater computational demands and requires high-quality sensory inputs. Unlike its 2D counterpart, 3D OD still faces major challenges, including fewer large-scale annotated datasets and less standardized model architectures [29]. Additionally, estimating depth from monocular images remains an inherently ill-posed problem, as multiple 3D scenes can project to similar 2D views [16,30].
Formally, 3D OD aims to infer a set of bounding boxes $\mathcal{B} = \{B_1, B_2, \ldots, B_N\}$ from sensory input $I_{\text{sensor}}$, using a detection model $f_{\text{det}}$:

$$\mathcal{B} = f_{\text{det}}(I_{\text{sensor}}).$$

Each detected object $B_i$ is typically represented as

$$B_i = [x_c, y_c, z_c, l, w, h, \theta, c],$$

where $(x_c, y_c, z_c)$ denote the centre coordinates, $(l, w, h)$ the bounding box length, width, and height, $\theta$ is the rotation angle around the $z$-axis in the world frame, and $c$ represents the object category [18]. The bounding boxes are usually parametrized in sensor-centric or vehicle-centric coordinate frames to support downstream tasks such as object tracking, motion prediction, and trajectory planning [16].
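To make this parametrization concrete, the following is a minimal sketch of such a box structure together with a helper that recovers its eight corners; the class name and the NumPy-based helper are illustrative assumptions rather than the convention of any specific dataset or detector.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    """3D bounding box B_i = [x_c, y_c, z_c, l, w, h, theta, c]."""
    x_c: float    # centre x (m), sensor- or vehicle-centric frame
    y_c: float    # centre y (m)
    z_c: float    # centre z (m)
    l: float      # length (m)
    w: float      # width (m)
    h: float      # height (m)
    theta: float  # yaw around the z-axis (rad)
    c: str        # object category, e.g. "car"

    def corners(self) -> np.ndarray:
        """Return the 8 box corners as an (8, 3) array in the reference frame."""
        # Axis-aligned corners centred at the origin.
        x = self.l / 2 * np.array([1, 1, 1, 1, -1, -1, -1, -1])
        y = self.w / 2 * np.array([1, -1, -1, 1, 1, -1, -1, 1])
        z = self.h / 2 * np.array([1, 1, -1, -1, 1, 1, -1, -1])
        pts = np.stack([x, y, z], axis=1)
        # Rotate around z by theta, then translate to the box centre.
        c, s = np.cos(self.theta), np.sin(self.theta)
        R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
        return pts @ R.T + np.array([self.x_c, self.y_c, self.z_c])
```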

2.2. Data Representation

Three-dimensional data provides richer spatial structure than 2D imagery and can be represented either implicitly, via depth cues embedded in images, or explicitly, through direct geometric encoding. Implicit representations include multi-view images, RGB-D formats that append depth maps to RGB channels, and light fields, which encode light rays as functions of spatial and angular dimensions [31].
Explicit representations are more common in 3D OD and include the following:
  • Voxels, which discretize 3D space into volumetric grids;
  • Point clouds, composed of unordered 3D points (x, y, z), sometimes augmented with intensity or reflectance values;
  • Meshes, which represent object surfaces through vertices, edges, and faces [31].
Among these, point clouds, particularly those obtained from LiDAR, are the most widely used input for 3D OD due to their high geometric fidelity. However, their sparse and irregular structure poses challenges for conventional DL models. Voxelization addresses this by converting point clouds into regular 3D grids, enabling the use of 3D convolutions. Yet, this comes with trade-offs: small voxels increase memory and computation costs cubically, while large voxels reduce spatial resolution and accuracy [16,32].
To balance efficiency and detail, pillar-based representations simplify voxelization by collapsing the vertical dimension, effectively projecting the 3D point cloud into a pseudo-image format. This allows 2D convolutional neural networks to be used while preserving horizontal spatial structure, significantly improving computational efficiency without sacrificing detection accuracy [31].
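To illustrate these representations, the sketch below assigns each point of an (N, 3) cloud to a voxel index on a regular grid and groups points per voxel; the grid extents and voxel size are arbitrary assumptions, and dropping the vertical index in the grouping key yields the pillar (pseudo-image) layout described above.

```python
import numpy as np
from collections import defaultdict

def voxelize(points: np.ndarray,
             lo=(0.0, -40.0, -3.0), hi=(70.4, 40.0, 1.0),
             voxel=(0.2, 0.2, 0.4)):
    """Group an (N, 3) point cloud by voxel index on a regular 3D grid.
    Returns a dict mapping (ix, iy, iz) -> array of points in that voxel."""
    lo, hi, voxel = map(np.asarray, (lo, hi, voxel))
    inside = np.all((points >= lo) & (points < hi), axis=1)
    pts = points[inside]
    idx = np.floor((pts - lo) / voxel).astype(int)   # (M, 3) voxel indices

    voxels = defaultdict(list)
    for p, (ix, iy, iz) in zip(pts, idx):
        # Pillar variant: use the key (ix, iy) instead of (ix, iy, iz),
        # collapsing each vertical column of voxels into a single BEV cell.
        voxels[(ix, iy, iz)].append(p)
    return {k: np.stack(v) for k, v in voxels.items()}
```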

2.3. Sensors

Several types of sensors can provide raw data for 3D OD. The most widely adopted sensors in AVs are radar, LiDAR, and cameras. Their integration is essential for environmental perception, directly impacting system safety and reliability.
Based on their interaction with the environment, sensors are divided into active and passive types. Active sensors, such as radar and LiDAR, emit energy signals and measure the reflected responses. Passive sensors, such as cameras, capture ambient energy without emitting any signal. Alternatively, sensors can also be categorized based on the type of information they measure:
  • Exteroceptive sensors measure external variables and observe the surrounding environment. Examples include stereo, flash, infrared, and thermal cameras, as well as radar, LiDAR, and sonar [33].
  • Proprioceptive sensors measure variables related to the vehicle state, providing information about its position, velocity, orientation, and acceleration. Examples include global navigation satellite systems (GNSSs), inertial measurement units (IMUs), ground speed sensors (GSSs), encoders, gyroscopes, and accelerometers [33].
While some proprioceptive information may support perception tasks, these sensors are primarily used for vehicle state estimation. In contrast, exteroceptive sensors are mainly responsible for detecting static and dynamic objects in the environment. Table 1 summarizes the main categories of exteroceptive sensors and compares them based on data from [2,33,34,35].
It is important to note that Table 1 provides only a qualitative overview. The classification depends on the specific application and the relative comparison between the considered sensors. Moreover, performance can vary significantly across different models and manufacturers. Therefore, the table should serve only as a general guideline for assessing trade-offs between different sensing modalities. In the following, these trade-offs are further examined and discussed in detail for each individual sensor modality.
  • Monocular cameras are passive sensors that capture rich appearance information, including texture and colour, at low cost and high resolution. They produce images $I_{\text{cam}} \in \mathbb{R}^{W \times H \times 3}$, but cannot directly recover depth, limiting their 3D localization capabilities [2,18]. Their performance deteriorates under adverse lighting conditions such as night-time, glare, fog, or rain [29].
  • Stereo vision systems estimate depth by triangulating points based on the disparity between images captured by two horizontally aligned cameras, enhancing 3D understanding (a minimal depth-from-disparity sketch is given after this list). However, they require precise calibration and are sensitive to low-texture regions and lighting variations [16,29]. Other systems, such as Time-of-Flight cameras, infer depth using infrared pulses but offer lower resolution, while RGB-D sensors like Kinect combine colour and depth for a more complete spatial view.
  • Infrared cameras detect infrared radiation, including thermal emissions and other wavelengths of the electromagnetic spectrum. They enable perception in dark or low-visibility conditions but typically provide low resolution and are less effective for detailed object classification.
  • Sonar and ultrasonic sensors emit sound waves to detect nearby obstacles. They are compact, inexpensive, and reliable at short range, but provide low spatial resolution and are unsuitable for complex 3D perception tasks [2].
  • Radar systems emit electromagnetic waves and detect their reflections to measure the position and relative velocity of objects using the Doppler effect. They provide long-range robustness and function reliably in adverse weather, although their angular resolution is lower, making fine-grained object detection more challenging [33].
  • LiDAR sensors actively scan the environment with laser beams to generate detailed 3D PCs. A typical LiDAR unit emitting $m$ beams over $n$ rotations produces a range image $I_{\text{range}} \in \mathbb{R}^{m \times n \times 3}$, which can be converted into a PC $I_{\text{point}} \in \mathbb{R}^{N \times 3}$ [18,29]. LiDAR provides high spatial accuracy independent of light conditions, although it remains relatively expensive and can be affected by environmental factors like fog, rain, or snow [16].
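As referenced in the stereo bullet above, disparity converts to depth through simple triangulation: for a rectified pair with focal length $f$ (in pixels) and baseline $b$ (in metres), a pixel with disparity $d$ lies at depth $Z = f \cdot b / d$. The following is a minimal sketch of this conversion; the function name and the convention of marking invalid disparities with infinity are illustrative assumptions.

```python
import numpy as np

def depth_from_disparity(disparity: np.ndarray, focal_px: float,
                         baseline_m: float) -> np.ndarray:
    """Per-pixel depth Z = f * b / d for a rectified stereo pair.
    Pixels with no valid match (d <= 0) are assigned infinite depth."""
    depth = np.full(disparity.shape, np.inf, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```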
Each sensing modality presents unique trade-offs. Monocular cameras offer high semantic richness at a low cost, but they lack direct depth sensing and are sensitive to environmental conditions [2,29]. Stereo systems incorporate depth information but are still vulnerable to lighting and texture challenges [16,29]. Infrared cameras are advantageous in darkness but typically offer lower resolution, while sonar sensors are reliable for close-range detection yet inadequate for detailed 3D mapping [2]. Infrared and sonar sensors are often used in Unmanned Aerial Vehicles or as complementary components rather than as primary sensors for object detection in autonomous vehicles.
For long-range and high-precision sensing, radar and LiDAR are the primary choices. Radar ensures reliable detection under adverse weather while providing velocity information via the Doppler effect [33]. LiDAR generates highly detailed 3D spatial data, but at a higher cost and with some sensitivity to atmospheric disturbances [16,29].
Although the actual computational cost depends on the specific algorithms used, sensors can, for the most commonly adopted methods, be broadly classified by the computational load their data impose. Monocular and stereo cameras typically incur high computational costs due to high-resolution image processing, making real-time edge-device deployment more challenging, whereas sensors such as sonar, radar, and LiDAR often impose lower-to-medium computational overhead, better suiting resource-constrained applications.
In the context of AVs, the most common perception stack combines cameras, radar, and LiDAR. Sensor fusion is therefore crucial for merging the semantic understanding of cameras with the geometric precision of ranging sensors, maximizing robustness against adverse conditions while exploiting the advantages of each sensor [16]. Reliable sensor fusion demands precise calibration [29] and careful balancing between cost, complexity, and robustness. By integrating complementary strengths, fusion also introduces redundancy, essential for ensuring safe autonomous operation [33,34].

3. Datasets and Evaluation Metrics

3.1. Benchmark Datasets

Numerous datasets have been developed to support the training and evaluation of DL models in CV. These benchmarks have shaped the progression of the field by enabling more diverse training as well as standardized comparisons between methods. In what follows, the most relevant datasets for this problem are reviewed. Most of these datasets were collected in AD scenarios by equipping vehicles with multi-modal sensors and manually annotating the collected data [18].
KITTI [28] was among the first 3D OD datasets and remains widely used due to its accessibility and standardized evaluation protocols. It features real-world driving data captured in Germany using stereo cameras and a Velodyne LiDAR sensor. More recent datasets, such as nuScenes [36] and Waymo Open [37], offer greater scene diversity, richer annotations, and broader environmental conditions. nuScenes includes multi-modal data from six cameras, one LiDAR and five radars, covering 1000 driving scenes. The Waymo Open Dataset provides synchronized data from five LiDARs and five cameras, with over 230,000 annotated frames.
Although KITTI played a foundational role in the development of 3D perception methods, its limited scale and lack of environmental variety have driven a shift toward more comprehensive datasets such as nuScenes and Waymo. These newer benchmarks better support real-world deployment by exposing models to rare events, night-time conditions, and adverse weather scenarios. Due to their scale and greater diversity, nuScenes and Waymo are currently the most commonly used datasets. Nevertheless, class imbalance remains a common issue, reflecting the natural distribution of object types in driving environments [16].
Other notable datasets include ApolloScape [38], ArgoVerse [39], Lyft Level 5 [40], and H3D [41]. Table 2 summarizes the main characteristics of these datasets, based on information from [20,24,26,42].
  • KITTI: The KITTI dataset (https://www.cvlibs.net/datasets/kitti/, accessed on 25 April 2025), developed by the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago, remains one of the most widely used benchmarks for AD. It provides stereo RGB images, LiDAR PCs, and calibration files. The dataset includes 7481 training and 7518 testing frames, with over 80,000 annotated 3D bounding boxes. Objects are labelled as easy, moderate, or hard based on occlusion, truncation, and object size [29]. Data were collected using a Velodyne LiDAR, stereo cameras, and GNSS/IMU sensors across 22 scenes in urban and highway environments [32].
  • nuScenes: The nuScenes dataset (https://www.nuscenes.org/, accessed on 25 April 2025), developed by Motional (https://motional.com/), comprises 1000 20-s driving scenes recorded at 2 Hz. Each scene contains annotations for 23 categories. The sensor suite includes six cameras, a 32-beam LiDAR, and five radars. In total, the dataset features over 390,000 LiDAR sweeps and 1.4 million annotated bounding boxes [24].
  • Waymo Open: The Waymo Open Dataset (https://waymo.com/open/, accessed on 25 April 2025) includes approximately 230,000 annotated frames and over 12 million 3D bounding boxes. It provides synchronized data from five LiDAR sensors and five cameras, spanning 798 training, 202 validation, and 150 test segments. Annotated classes cover vehicles, pedestrians, cyclists, and traffic signs [24].

3.2. Evaluation Metrics

3.2.1. General Metrics

To quantitatively evaluate 3D OD methods, a variety of evaluation metrics have been proposed. These metrics extend foundational concepts from 2D OD by incorporating depth, orientation, and volumetric estimation. While the core ideas, such as precision, recall, and average precision (AP), remain central, they are adapted to operate on 3D bounding boxes.
The major difference between 2D and 3D detection metrics lies in the matching criteria between ground truths and predictions when calculating precision and recall. The quality of a detection is commonly measured using the Intersection over Union (IoU), which computes the ratio of the overlapping volume between a predicted and ground-truth 3D bounding box. A prediction is typically considered a true positive (TP) if the IoU exceeds a predefined threshold (usually 0.5); otherwise, it is classified as a false positive (FP). Missed detections are treated as false negatives (FNs), and correctly ignored negatives are true negatives (TNs). These categories form the basis of the confusion matrix:
  • True Positives (TP): Correctly predicted positives.
  • False Positives (FP): Incorrectly predicted positives.
  • False Negatives (FN): Missed predictions.
  • True Negatives (TN): Correctly predicted negatives.
Based on these, two key performance metrics are defined:
$$\text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}.$$
Precision quantifies the correctness of predicted positives, while recall captures the detector’s ability to find all relevant objects. These metrics are often visualized through precision–recall (P-R) curves, from which the average precision (AP) is derived. In theory, both precision and recall should be as close to 1 as possible; in practice, however, they often conflict, and improving one may degrade the other.
The AP for a given class is computed as the area under the P-R curve:

$$AP = \int_0^1 \max_{\tilde{r} \geq r} p(\tilde{r}) \, dr,$$

where $p(r)$ is the precision at recall $r$ and the maximum implements the usual precision interpolation.
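To make the computation concrete, here is a minimal sketch that approximates this integral from precision and recall arrays obtained by sweeping the detector’s confidence threshold; the 41-point recall sampling is an illustrative choice rather than the exact protocol of any particular benchmark.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray,
                      num_samples: int = 41) -> float:
    """Approximate the area under the P-R curve using the interpolated
    precision max{p(r') : r' >= r} at evenly spaced recall points."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_samples):
        above = recall >= r
        ap += precision[above].max() if above.any() else 0.0
    return ap / num_samples
```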
Once the AP is computed for each object class, the mean average precision (mAP) is obtained by averaging across all classes. The setting of the IoU threshold plays a critical role here, as it affects the strictness and comprehensiveness of the evaluation. For instance, the COCO benchmark computes mAP over multiple IoU thresholds, ranging from 0.5 to 0.95 in steps of 0.05, to provide a robust evaluation of localization and classification performance.
IoU is computed using the volumes of the bounding boxes:
$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|},$$
where A and B represent the predicted and ground-truth 3D bounding boxes, respectively. This volumetric IoU is more sensitive to spatial misalignment than its 2D counterpart, making evaluation more stringent.
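For intuition, the sketch below computes this volumetric IoU for two axis-aligned boxes given as (x_c, y_c, z_c, l, w, h); benchmark implementations additionally account for the yaw angle θ, which is omitted here for simplicity.

```python
import numpy as np

def iou_3d_axis_aligned(a: np.ndarray, b: np.ndarray) -> float:
    """Volumetric IoU |A ∩ B| / |A ∪ B| for axis-aligned 3D boxes
    parametrized as (x_c, y_c, z_c, l, w, h); rotation is ignored."""
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Per-axis overlap, clipped to zero when the boxes do not intersect.
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min),
                      0.0, None)
    inter = float(overlap.prod())
    union = float(a[3:].prod() + b[3:].prod()) - inter
    return inter / union if union > 0 else 0.0
```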
While AP and mAP remain the principal metrics for assessing detection quality, additional metrics are crucial in real-time or resource-constrained scenarios. These include the number of model parameters and floating-point operations per second (FLOPs), which reflect computational complexity, as well as latency and frames per second (FPS), which assess runtime efficiency. Together, these metrics offer a holistic view of a model’s accuracy, efficiency, and deployability.

3.2.2. Dataset-Specific Metrics

  • KITTI: The KITTI benchmark [28] remains one of the most commonly used datasets for 3D OD. It evaluates model performance using three primary metrics [16]:
  • AP2D: Average precision computed by projecting the predicted 3D bounding boxes into the 2D image plane and calculating 2D IoU.
  • AP3D: Average precision computed using the full 3D bounding box IoU.
  • APBEV: Average precision computed from a bird’s-eye view (BEV) projection of the 3D bounding box.
In addition to these metrics, KITTI also reports the Average Orientation Similarity (AOS), which assesses the alignment between predicted and ground-truth object orientations. Different object classes have distinct IoU thresholds for evaluation: for instance, 0.7 for cars and 0.5 for pedestrians and cyclists. This accounts for the relative difficulty of localizing small or occluded objects.
  • nuScenes: The nuScenes benchmark [36] proposes a more comprehensive evaluation scheme that moves beyond traditional IoU-based matching. The authors argue that using IoU alone does not capture all relevant aspects of detection quality in complex urban environments. Instead, nuScenes introduces centre-based matching, where predicted objects are associated with the ground truth based on their 2D centre distance on the ground plane. The newly introduced scores quantify how closely the predicted objects align with the ground truth not just in terms of location, but also shape, pose, and dynamic behaviour. The final nuScenes Detection Score (NDS) aggregates the mean average precision (mAP) and the mean TP metrics (mTP) into a single holistic score:
$$NDS = \frac{1}{10} \left[ 5 \cdot mAP + \sum_{i=1}^{5} \bigl( 1 - \min(1, mTP_i) \bigr) \right]$$
This score captures both detection accuracy and multi-attribute consistency, making it a more comprehensive evaluation metric than mAP alone.
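The aggregation can be written down directly from the formula above; in the sketch below, the five mTP error terms (translation, scale, orientation, velocity, and attribute errors in the nuScenes convention) are assumed to have been computed beforehand.

```python
def nuscenes_detection_score(m_ap: float, m_tp_errors: list[float]) -> float:
    """NDS = (1/10) * (5 * mAP + sum_i (1 - min(1, mTP_i)))."""
    assert len(m_tp_errors) == 5
    tp_term = sum(1.0 - min(1.0, err) for err in m_tp_errors)
    return 0.1 * (5.0 * m_ap + tp_term)

# Example: mAP of 0.45 with moderate TP errors yields NDS = 0.575.
print(nuscenes_detection_score(0.45, [0.30, 0.25, 0.40, 0.35, 0.20]))
```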
  • Waymo Open Dataset: The Waymo benchmark [37] evaluates detection at two levels:
  • Level 1 (L1): Objects with at least five LiDAR points inside the bounding box.
  • Level 2 (L2): All annotated objects, including sparse detections.
To capture different aspects of model performance, the dataset defines multiple variants of average precision. The standard AP measures detection accuracy using IoU-based matching, while APH extends this by incorporating heading angle accuracy, evaluating how well the model predicts object orientation. Additionally, Waymo introduces a variant based on the Hungarian algorithm, which performs optimal one-to-one assignment between predictions and ground truth, particularly beneficial when multiple nearby detections could otherwise result in ambiguity.

4. Taxonomy and Review

Artificial Intelligence (AI), and in particular neural networks (NNs), have become foundational to many modern technological applications. DL, a subfield of ML and AI, enables models to learn hierarchical and abstract representations from large volumes of data, significantly improving prediction accuracy across various domains. Its flexibility and high capacity for pattern recognition have made DL especially valuable in AVs, where many SoA systems rely heavily on deep CNNs.
CNNs have demonstrated exceptional performance in 2D image-based tasks, including classification and object detection. However, their extension to 3D perception, especially OD in PCs, introduces specific challenges. Unlike structured image grids, PCs are inherently unordered, irregular, and sparse, making standard convolution operations unsuitable without significant modification. Specialized architectures and preprocessing techniques are thus required to adapt CNN-based methods to 3D inputs.
The widespread adoption of DL has been driven by two key enablers: the availability of large-scale annotated datasets and advances in Graphics Processing Unit (GPU) hardware, which support high-throughput parallel computation [31]. DL models are composed of multiple layers of artificial neurons that transform input data using learnable weights and biases. These weights are optimized during training by minimizing a loss function that quantifies the error between predictions and ground truth [27]. The methods reviewed in this work are primarily based on supervised learning, where models are trained using labelled datasets to learn relevant spatial and semantic patterns.
This section presents an overview of the main families of 3D OD approaches, organized according to the type of sensor input and processing strategy. For each modality or hybrid configuration, we highlight the key trends and most influential methods that have shaped the field.

4.1. Taxonomy of 3D Object Detection

Compared to 2D OD, 3D OD introduces distinct challenges, particularly in terms of depth perception and spatial representation. When working with image data alone, the primary limitation is the lack of explicit depth information. Two common strategies are used to address this: (i) decoupling the task into 2D OD followed by depth estimation, often using geometric constraints or priors to reconstruct 3D positions; and (ii) employing DL models that directly infer 3D object properties from image inputs. In this work, we focus on the latter approach, where monocular, stereo, and multi-view camera configurations are used to extract depth-aware features through learned representations.
In contrast, when only PC data are available, typically acquired using LiDAR, the main challenge becomes learning from data that are sparse, unordered, and non-uniform. Various strategies have emerged to process such data effectively, and these are commonly grouped into four categories: point-based methods, which operate directly on raw PC coordinates; voxel-based methods, which discretize the space into regular 3D grids; point–voxel hybrid approaches, which combine the advantages of both representations; and projection-based methods, which transform 3D data into 2D representations to leverage mature 2D CNN architectures.
Radar-based 3D detection methods are also gaining attention due to radar’s robustness in adverse weather and lighting conditions. However, the relatively low spatial resolution and noisy measurements of radar data pose unique challenges. Most radar-based approaches either project radar points into bird’s-eye view representations or fuse them with other modalities to compensate for their limitations.
Multi-modal fusion methods attempt to harness the complementary strengths of different sensor modalities, most commonly combining camera and LiDAR data. These methods are generally categorized based on the stage at which the fusion occurs within the perception pipeline: early fusion combines raw sensor data before feature extraction; mid-level fusion integrates intermediate features from each modality; and late fusion merges the outputs of independent modality-specific detectors.
Although there is some agreement in the literature regarding the categorization of 3D OD methods, variations in terminology and classification criteria remain. To address this, a high-level taxonomy was developed to capture the dominant and most widely adopted strategies. Figure 4 illustrates the proposed taxonomy and maps out the representative methods covered in this review.
Figure 4. Taxonomy of 3D object detection methods categorized by input modality.

4.2. Camera-Based Methods

Three-dimensional OD methods often build upon principles inherited from 2D detectors. However, the transition from 2D to 3D introduces increased sensitivity to localization errors. While a few pixels of deviation in 2D detection may still result in acceptable IoU scores, small spatial misalignments in 3D, even at the decimetre scale, can significantly degrade detection accuracy. This section presents the most prominent camera-based strategies for 3D OD, including monocular, stereo, and multi-view or multi-camera approaches.

4.2.1. Monocular-Based Methods

Monocular 3D OD aims to predict object depth, orientation, and size using only a single RGB image. Its minimal hardware requirements make it attractive for large-scale deployment, particularly in cost-sensitive or lightweight platforms [29].
One of the earliest works, 3DVP [43], introduced a viewpoint-aware classification model that leveraged sub-category structures. Mono3D [44] advanced the field by generating 3D proposals on a learned ground plane, providing a robust geometry-aware baseline. Deep3DBox [45] streamlined the pipeline with a single network predicting object orientation and dimensions end-to-end.
Subsequent efforts focused on improving both accuracy and efficiency. M3D-RPN [46] was the first to introduce a two-stage Region Proposal Network tailored for monocular 3D detection. MonoDIS [47] disentangled depth, shape, and orientation learning into separate branches, improving training stability. Lightweight single-shot architectures like SS3D [48] and SMOKE [49] removed the need for dense proposals, achieving real-time inference.
Recent developments leverage auxiliary knowledge sources. MonoDistill [50] and MonoSKD [51] showed how student models can benefit from distilled supervision by larger teacher networks. MonoNeRD [52] incorporated neural radiance field priors to model fine-grained object appearance and geometry. MonoCD [53] achieved SoA results without relying on LiDAR or dense proposal mechanisms.

4.2.2. Stereo-Based Methods

Stereo-based 3D OD estimates depth using the disparity between stereo image pairs. The resulting depth maps are fused with 2D detection outputs to localize objects in 3D. Compared to monocular approaches, stereo methods benefit from geometric priors and disparity cues that enable more accurate depth perception [19,24].
Early approaches like 3DOP relied on hand-crafted features and sliding-window proposals. Stereo R-CNN [54] marked a shift toward DL-based stereo detection by integrating a disparity-aware ROI head into a Faster R-CNN backbone. RT3D-Stereo further improved efficiency by introducing a lightweight cost volume and a 3D-aware regression module.
DSGN [55] constructed a scene-level representation over stereo cost volumes and refined object predictions using message passing. YOLOStereo3D [56] adopted a single-shot architecture to streamline the detection process. LIGA-Stereo [57] combined sparse LiDAR signals with stereo features to guide attention. More recently, StereoDistill [58] demonstrated that compact student networks can replicate the performance of larger stereo teacher models through structured distillation.

4.2.3. Multi-View/Multi-Camera-Based Methods

Multi-camera systems are frequently used in AVs to provide 360° perception of the environment. These methods aggregate multiple views to construct a unified 3D scene representation, often by projecting image features into a bird’s-eye view (BEV) space [19]. However, this projection introduces ambiguity due to the absence of explicit depth information, which complicates accurate feature alignment.
To mitigate this, recent approaches have adopted transformer-based architectures that enhance spatial reasoning and cross-view feature fusion. DETR3D [59] introduced 3D object queries that attend to relevant regions across all views. PETR [60] enriched positional encoding to improve object localization. BEVFormer [61] leveraged temporal consistency across video frames to enhance BEV features. SparseBEV [62] and RoPETR [63] proposed attention-efficient alternatives to reduce computational cost while maintaining high spatial fidelity.

4.2.4. Discussion

Camera-based methods are attractive due to their affordability and ability to capture rich semantic features such as colour, texture, and appearance. Many 3D detectors for camera inputs borrow architectural principles from well-established 2D detection models [32]. However, monocular methods continue to face performance limitations. Small depth estimation errors can lead to significant inaccuracies in 3D localization, which is particularly problematic in safety-critical scenarios like AVs.
To address these limitations, pseudo-LiDAR methods have been proposed, which convert estimated depth into synthetic point clouds. While this approach brings monocular setups closer to LiDAR-like reasoning, its performance remains constrained by depth prediction quality and calibration accuracy. Stereo methods offer improved depth estimation through geometric constraints but require high computational resources for disparity matching and cost volume construction.
Multi-view approaches enable comprehensive scene understanding by leveraging multiple overlapping views. However, without depth supervision, aligning features across views remains a key challenge. Transformer-based methods have made significant progress in this area, particularly in benchmarks like nuScenes that provide complete camera coverage. These models continue to evolve toward more efficient cross-view fusion and real-time inference.
In terms of evaluation, traditional 2D metrics such as AP2D or orientation similarity scores are insufficient to fully assess 3D detection performance. The most relevant metric for these models is AP3D, often accompanied by APBEV to capture localization accuracy in the ground plane. To support fair comparison, Table 3 summarizes the performance of key camera-based 3D detectors, reporting AP2D, AP3D, and APBEV on KITTI (for easy (E), moderate (M), and hard (H) difficulty levels), as well as mAP and NDS on nuScenes, and L1/L2 mAP on Waymo. Inference times and hardware details are also included.

4.3. LiDAR-Based Methods

Unlike images, where pixels are regularly distributed and can be processed using conventional CNNs, PCs are sparse, unordered, and irregular 3D representations. LiDAR sensors generate such PCs by emitting laser pulses and recording their return time, capturing the high-resolution geometric structure of the environment with strong spatial accuracy [64]. However, this structure introduces unique challenges for DL models, which are typically designed for dense and grid-like data.
Traditional LiDAR processing relied on geometric and topological cues to extract object-level information, for example by detecting abrupt changes in surface orientation or reflectivity to infer object boundaries [65]. While computationally efficient and interpretable, such classical methods often struggle in complex scenes with occlusions, clutter, or sparse measurements.
Modern DL approaches, by contrast, have been built around more structured or learned representations of the PC. To make LiDAR data compatible with NN architectures, the raw points are typically transformed into one of several possible intermediate formats. Most works in the literature categorize these into four main classes: point-based, voxel-based, hybrid point–voxel, and projection-based representations. Other alternatives, such as range-view, graph-based, and transformer-based models, have also emerged. Each representation encodes spatial information differently, leading to trade-offs in runtime, memory consumption, and geometric fidelity.

4.3.1. Projection-Based Methods

Projection-based methods convert 3D point clouds into 2D representations to leverage efficient and mature 2D CNNs. Common projection types include front view, spherical view, and most prominently, BEV, which projects points onto the xy-plane. BEV is especially popular in AD due to the constrained ground motion of vehicles.
BEV reduces occlusions, enables consistent metric scaling, and avoids overlapping object bounding boxes. The process usually involves discretizing x and y positions into pixel grids, while collapsing the height dimension [66]. The resulting pseudo-image encodes statistics such as point density, intensity, and height per cell.
Early works like PIXOR [67], Complex-YOLO [68], and BirdNet [69] adopted standard 2D CNNs to process BEV inputs, showing the feasibility of this approach. Complex-YOLO refined anchor definitions and introduced orientation modelling in polar coordinates. Despite their speed and simplicity, projection-based methods discard vertical detail and are sensitive to sparsity and resolution settings. Nonetheless, they offer an attractive balance between speed and accuracy in real-time scenarios.
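A minimal sketch of such a BEV pseudo-image is given below; the grid extents, cell size, and the choice of density, maximum-height, and mean-intensity channels are illustrative assumptions rather than the exact encoding of any particular detector.

```python
import numpy as np

def bev_pseudo_image(points: np.ndarray,
                     x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                     cell=0.16) -> np.ndarray:
    """Encode an (N, 4) point cloud (x, y, z, intensity) as a BEV image
    with per-cell point density, maximum height, and mean intensity."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((ny, nx, 3), dtype=np.float32)

    inside = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
              (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[inside]
    u = ((pts[:, 0] - x_range[0]) / cell).astype(int)
    v = ((pts[:, 1] - y_range[0]) / cell).astype(int)

    for col, row, p in zip(u, v, pts):
        z, intensity = p[2], p[3]
        if bev[row, col, 0] == 0:
            bev[row, col, 1] = z                        # first point sets max height
        else:
            bev[row, col, 1] = max(bev[row, col, 1], z) # running max height
        bev[row, col, 0] += 1.0                         # point count (density)
        bev[row, col, 2] += intensity                   # intensity sum
    bev[..., 2] /= np.maximum(bev[..., 0], 1.0)         # mean intensity per cell
    return bev
```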

4.3.2. Point-Based Methods

Point-based methods operate directly on raw 3D coordinates without intermediate discretization. These approaches preserve fine-grained geometry and avoid quantization artefacts, but must account for the unordered nature of point sets.
PointNet [70] was a foundational work in this space, introducing permutation-invariant architectures via shared MLPs and symmetric aggregation functions. However, it mainly captured global context, limiting its performance on local geometric structures. PointNet++ [71] addressed this limitation through hierarchical feature extraction on local neighbourhoods.
Subsequent methods such as PointRCNN [72], 3DSSD, STD, IA-SSD [73], and PointFormer [74] improved accuracy by incorporating attention mechanisms, local spatial encoding, and direct 3D box regression. These models maintain high geometric fidelity and excel in preserving object boundaries. However, they often involve expensive neighbourhood search operations and are computationally demanding, making real-time deployment more difficult.

4.3.3. Voxel-Based Methods

Voxel-based methods discretize the 3D space into a structured grid of equally sized volumetric elements (voxels), allowing the application of 3D convolutions. Each voxel can encode occupancy, intensity, and local statistics. While this enables efficient parallel processing, voxelization introduces resolution–memory trade-offs and can result in the loss of fine geometric detail.
VoxelNet [75] was the first end-to-end model to apply voxel-based feature learning. SECOND [76] introduced sparse 3D convolutions to reduce computational cost, while PointPillars [77] collapsed the vertical dimension, forming pseudo-pillars to enable fast 2D convolution on BEV-like features.
More recent voxel-based models have incorporated transformer architectures and attention mechanisms to improve performance. HotSpotNet [78], Voxel R-CNN [79], VoTr-TSD [80], and TED [81] represent this line of progress. TED currently holds state-of-the-art performance among LiDAR-only methods on the KITTI benchmark. These methods strike a balance between regular structure and accuracy but remain constrained by cubic memory growth with increasing resolution. Notably, only around 1% of voxels are occupied in KITTI and 3% in Waymo [16], highlighting inefficiencies in voxel-based representations.

4.3.4. Point–Voxel Hybrid Methods

Hybrid methods aim to combine the geometric richness of point-based models with the efficiency of voxel-based networks. Typically, voxelized backbones are used for coarse region proposals, which are then refined using point-level features for more precise bounding box prediction.
Fast Point R-CNN [82], PV-RCNN [83], and PDV [84] exemplify this family of detectors. These models perform repeated point-to-voxel and voxel-to-point transformations, enabling fine-grained multi-scale feature fusion. While they achieve high accuracy and robust object boundary modelling, their bidirectional feature interaction comes at the cost of additional memory usage and runtime complexity.

4.3.5. Other Representations

Beyond the dominant four categories, other representations have emerged. Range-based methods treat LiDAR scans as 2D range images, enabling the use of efficient 2D convolutions [70]. Graph-based methods model PCs as node–edge structures to capture spatial topology. More recently, transformer-based architectures have gained traction for their ability to model long-range dependencies and support parallel computation, making them increasingly popular in AD perception pipelines [85].

4.3.6. Discussion

LiDAR-based 3D OD has matured significantly, supported by diverse representations and architectural innovations. Each strategy introduces specific advantages and trade-offs.
Point-based models preserve raw geometry and deliver precise localization but remain computationally demanding and hard to scale to large scenes. Voxel-based models offer regular structure and efficient parallelization, but their memory usage scales cubically with resolution and suffers from data sparsity. Projection-based approaches enable real-time processing via 2D CNNs and are dominant in deployment settings, though their performance is bounded by the quality of projection and resolution settings. Hybrid methods strike a strong accuracy–runtime balance, but their complexity and synchronization cost limit scalability.
Ultimately, the choice of representation depends on the application: point-based for highest fidelity, voxel- or projection-based for efficient real-time inference, and hybrid models for balanced performance. A summary of the most representative LiDAR-based detection models, categorized by representation and benchmark performance, is provided in Table 4. This enables direct comparison with camera-based approaches in Table 3, and illustrates how architectural choices impact detection performance across the chosen datasets.

4.4. Radar-Based Methods

Unlike vision-based systems, radar offers consistent performance in challenging environmental conditions, making it well-suited for 3D object detection. The first DL approach, Radar-PointGNN (2021) [86], modelled the sparse range–Doppler point cloud as a graph and applied graph neural networks to capture spatial relationships. In 2022, K-Radar [87] introduced a grid-based fusion of multi-view radar snippets, enabling dense feature extraction across time, and KPConvPillars [88] adapted the KPConv convolution to radar pillars, significantly improving detection resolution and angular precision.
More recent methods leverage teacher–student distillation and large-scale pretraining to overcome data scarcity. RadarDistill (2024) [89] distilled LiDAR-based feature representations into a radar-only student network, achieving substantial gains in both range and velocity estimation. In 2025, RADLER [90] presented a multimodal foundation model that pretrains on synchronized radar–camera–LiDAR streams, demonstrating that rich cross-modal pretraining can substantially elevate pure radar detection performance. Table 5 aggregates the most important radar-based methods, compared using the few metric values reported on benchmark datasets.

4.5. Multi-Modal-Based Methods

Multi-modal methods integrate multiple sensor modalities to exploit their complementary characteristics in 3D OD. The most prevalent fusion strategy combines LiDAR point clouds and RGB images, as these are the primary sensors in AD platforms. While LiDAR provides accurate geometric structure in 3D, images offer dense semantic and texture-rich cues. When fused effectively, these modalities compensate for each other’s limitations, such as occlusions, sparse depth, or poor lighting, and improve detection robustness [24,32].
Single-modality systems are often insufficient for real-world deployment. Cameras, while rich in appearance information, lack direct depth perception. Conversely, LiDAR sensors offer spatial accuracy but no visual appearance or colour. Fusion thus enhances the overall reliability and accuracy of perception systems, particularly in adverse conditions or sensor failures. Most autonomous vehicles are equipped with both cameras and LiDAR, making sensor fusion a practical and effective design choice.
Fusion can be performed at different stages of the detection pipeline, broadly categorized into three levels [29,32]: early fusion, where raw or low-level features are combined; mid-level fusion, where intermediate features are merged after being learned independently; and late fusion, where modality-specific decisions are integrated at the final stage.

4.5.1. Early Fusion Methods

Early fusion strategies integrate raw sensor inputs or low-level features before any deep processing. A common approach is to project LiDAR points onto the image plane and enhance them with pixel-level semantic information. For instance, PointPainting [91] enriches LiDAR points with semantic segmentation scores from an image-based network, effectively augmenting 3D input with 2D semantic priors. Similarly, Frustum PointNet [92] first uses 2D detectors to define frustums in 3D space and then applies a point-based network to the subset of points within each frustum. This pipeline was further extended by F-ConvNet [93] and Frustum PointPillars [94], which replace the 3D backbone to improve detection. More recently, VirConvNet [95] has achieved SoA performance on KITTI by employing virtual convolutions across fused representations.
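As a rough sketch of the painting idea described above, the code below projects LiDAR points into the image plane with a 3×4 camera projection matrix and appends the per-pixel class scores of an image segmentation network to each point; the matrix convention and score format are assumptions, and the actual PointPainting pipeline differs in its details.

```python
import numpy as np

def paint_points(points_cam: np.ndarray, seg_scores: np.ndarray,
                 P: np.ndarray) -> np.ndarray:
    """Append per-pixel semantic scores to LiDAR points.

    points_cam: (N, 3) point coordinates in the camera frame
    seg_scores: (H, W, C) softmax scores from an image segmentation network
    P:          (3, 4) camera projection matrix
    Returns an (M, 3 + C) array of painted points that project into the image.
    """
    H, W, _ = seg_scores.shape
    hom = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])  # (N, 4)
    uvw = hom @ P.T                                                   # (N, 3)
    front = uvw[:, 2] > 1e-6                 # keep points in front of the camera
    u = np.round(uvw[front, 0] / uvw[front, 2]).astype(int)
    v = np.round(uvw[front, 1] / uvw[front, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return np.hstack([points_cam[front][inside],
                      seg_scores[v[inside], u[inside]]])
```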
While early fusion offers tight cross-modal interaction and benefits from direct semantic guidance, it is highly sensitive to calibration errors and temporal misalignment. The requirement for accurate extrinsic and intrinsic calibration means any deviation can degrade performance. Additionally, the preprocessing stages, including 2D segmentation or detection, add computational overhead and increase latency during both training and inference.

4.5.2. Mid-Level Fusion Methods

Mid-level, or deep fusion, combines intermediate features extracted independently from each sensor modality. These features are often spatially aligned in a common space, typically BEV, frustum, or image views, before fusion. This allows the network to learn cross-modal interactions while retaining modality-specific inductive biases.
MV3D [96] was one of the earliest examples, combining features from LiDAR BEV, LiDAR front view, and image view during the proposal refinement stage. AVOD [97] extended this by integrating LiDAR and image features for both proposal generation and refinement. ContFuse [98] introduced continuous convolutions to merge multi-scale features in a pixel-wise manner, although its performance can degrade with sparse or noisy inputs. MMF [99] incorporated auxiliary tasks such as ground plane estimation and depth completion to improve feature fusion robustness.
Later models introduced more advanced fusion mechanisms. EPNet [100] constructs image-aware proposal features by conditioning LiDAR features on image context. Transformer-based methods such as TransFusion [101] and FUTR3D [102] introduced cross-modal attention modules, improving feature alignment and information flow across views and modalities.
Mid-level fusion offers a compelling balance between semantic richness and spatial precision. Unlike early fusion, it avoids the need for raw data alignment, and unlike late fusion, it allows rich interaction between feature spaces. However, aligning intermediate features remains challenging due to differing receptive fields, sampling rates, and spatial resolutions across modalities.

4.5.3. Late Fusion Methods

Late fusion approaches process each modality independently through separate detection branches and merge their outputs during post-processing. This may involve strategies such as confidence-weighted fusion, geometric consistency checks, or non-maximum suppression (NMS) across modalities.
CLOCs [103] represents a modular late-fusion system that refines and merges 2D and 3D detections based on semantic and geometric constraints. Fast-CLOCs [104] optimized this pipeline for real-time deployment. While late fusion is computationally efficient and allows for reuse of pretrained single-modality detectors, it lacks the fine-grained interaction and synergy of feature-level fusion. As a result, its performance tends to lag behind early and mid-level methods.
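The sketch below illustrates one simple late-fusion rule in the spirit of such pipelines: a 3D detection whose image-plane projection is supported by a 2D detection of the same class gets its confidence boosted, while unsupported detections are down-weighted. The scoring rule, dictionary layout, and thresholds are assumptions for illustration and do not reproduce the CLOCs formulation.

```python
def late_fuse(dets_3d, dets_2d, iou_2d, iou_thresh=0.5, boost=0.1):
    """Confidence-weighted late fusion of independent 3D and 2D detections.

    dets_3d: list of dicts {"cls": str, "score": float, "box2d": (x1, y1, x2, y2)},
             where box2d is the image-plane projection of the 3D box
    dets_2d: list of dicts with the same keys from a 2D detector
    iou_2d:  callable returning the IoU of two 2D boxes
    """
    fused = []
    for d3 in dets_3d:
        support = 0.0
        for d2 in dets_2d:
            if d2["cls"] == d3["cls"] and iou_2d(d3["box2d"], d2["box2d"]) >= iou_thresh:
                support = max(support, d2["score"])
        # Boost geometrically consistent detections; down-weight unsupported ones.
        if support > 0:
            score = min(1.0, d3["score"] + boost * support)
        else:
            score = 0.5 * d3["score"]
        fused.append({**d3, "score": score})
    return fused
```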

4.5.4. Discussion

Multi-modal fusion consistently achieves SoA performance by exploiting the strengths of each sensor: LiDAR ensures precise spatial localization, while images enrich detections with contextual and semantic information. Fusion pipelines are often LiDAR-centric, using LiDAR data to generate proposals and images to refine object categories or adjust box parameters. This configuration has become the dominant architecture in modern systems.
However, fusion introduces its own set of challenges. Misalignment due to unsynchronized timestamps, calibration drift, and different acquisition frequencies can lead to incorrect correspondences between modalities. Resolution mismatches are also a concern, as many LiDAR points may fall into a single pixel, and vice versa. Additionally, some objects may be visible in only one modality, creating further ambiguity during training and inference. Effective data augmentation must be jointly applied across sensors to avoid inconsistencies [24,29].
While camera–LiDAR fusion dominates the literature, radar has gained attention as a complementary sensor. Radar systems are robust under adverse weather, offer long-range detection, and can directly measure velocity. However, radar data are typically sparse and noisy with poor spatial resolution, making stand-alone 3D detection difficult. Moreover, the lack of large-scale radar datasets has limited its development. Instead, radar is usually fused with LiDAR or image data.
Fusion methods like RICCARDO [106] and CenterFusion [107] combine radar and camera data, while frameworks such as RadarNet [105] explore the fusion of LiDAR and radar data. Surveys like [108] for camera–radar fusion and [109] for LiDAR–radar fusion provide greater detail on these types of methods. Despite its potential, radar fusion remains underexplored compared to camera–LiDAR fusion, and radar is often used as a supplementary modality.
Table 6 summarizes the most representative fusion-based 3D OD models, categorized by fusion strategy and sensor combination. It includes detection performance across benchmarks such as KITTI and nuScenes, as well as inference runtimes and year of publication. This taxonomy complements the earlier tables, allowing a systematic comparison of fusion approaches against LiDAR-only and camera-based detectors.
Table 3. Camera-based 3D object detection results on KITTI car test set, nuScenes test set, and Waymo validation set (for easy (E), moderate (M) and hard (H) difficulty levels).
Method | Year | AP2D (E/M/H) | AP3D (E/M/H) | APBEV (E/M/H) | nuScenes (mAP/NDS) | Waymo (L1 mAP/L2 mAP) | Time (s) | Hardware | Code Available
Monocular Camera:
3DVP [43]201584.9576.9865.78408 cores @ 3.5 GHz (Matlab + C/C++)
Mono3D [44]201680.3067.2962.234.2GPU @ 2.5 GHz (Matlab + C/C++)
SubCNN [110]201694.2689.9879.782GPU @ 3.5 GHz (Python + C/C++)
Deep3DBox [45]201694.7190.1976.821.5GPU @ 2.5 GHz (C/C++)
Deep MANTA [111]201798.8993.5083.210.7GPU @ 2.5 GHz (Python + C/C++)×
3D-RCNN [112]201890.0289.3980.29
ROI-10D [113]201876.5670.1661.154.322.021.469.784.913.740.20GPU @ 3.5 GHz (Python)×
MF3D [114]201890.4387.3376.787.085.184.6813.739.628.22
MonoGRNet [115]201888.6577.9463.319.615.744.2518.1911.178.730.04NVIDIA P40
GS3D [116]201986.2376.3562.674.472.902.478.416.084.9421 core @ 2.5 GHz (C/C++)×
Mono3D-PLiDAR [117]201980.8553.3644.8010.767.506.1021.2713.9211.250.10NVIDIA GeForce 1080 (pytorch)×
AM3D [118]201992.5588.7177.8816.5010.749.5227.9122.2418.620.40GPU @ 2.5 GHz (Python + C/C++)×
Deep Optics [119]201916.8613.8213.2626.7119.8719.11×
CenterNet [120]201933.8040.00
FQNet [121]201994.7290.1776.782.771.511.015.403.232.460.501 core @ 2.5 GHz (Python)×
Shift R-CNN [122]201994.0788.4878.346.883.872.8311.846.825.270.25GPU @ 1.5 GHz (Python)×
MonoFENet [123]201991.6886.6376.718.355.144.1017.0311.039.050.151 core @ 3.5 GHz (Python)×
MonoDIS [47]201990.3187.5876.8510.377.946.4018.8019.0817.4130.4038.40
MonoPSR [124]201993.6388.5073.3610.767.255.8518.3312.589.910.20GPU @ 3.5 GHz (Python)
MoVi-3D [125]201915.1910.909.2622.7617.0314.85×
RefinedMPL [126]201988.2965.2453.2018.0911.148.9428.0817.6013.950.15GPU @ 2.5 GHz (Python + C/C++)×
M3D-RPN [46]201989.0485.0869.2614.769.717.4221.0213.6710.230.16GPU @ 1.5 GHz (Python)
SS3D [48]202092.7284.9270.3510.787.686.5116.3311.529.930.048Tesla V100
MonoPair [127]202096.6193.5583.5513.049.998.6519.2814.8312.890.06GPU @ 2.5 GHz (Python + C/C++)×
RTM3D [128]202091.8286.9377.4114.4110.348.7719.1714.2011.990.05GPU @ 1.0 GHz (Python)
SMOKE [49]202093.2187.5177.6614.039.767.8420.8314.4912.750.03GPU @ 2.5 GHz (Python)
PatchNet [129]202015.6811.1210.1722.9716.8614.970.390.380.41 core @ 2.5 GHz (C/C++)×
IAFA [130]202093.0889.4679.8317.8112.0110.6125.8817.8815.350.041 core @ 2.5 GHz (C/C++)×
Kinematic3D [131]202089.6771.7354.9719.0712.729.1726.6917.5213.100.121 core @ 1.5 GHz (C/C++)
KM3D [132]202096.4491.0781.1916.7311.459.9223.4416.2014.470.031 core @ 2.5 GHz (Python)
DDMP-3D [133]202191.1581.7063.1219.7112.789.8028.0817.8913.440.181 core @ 2.5 GHz (Python)
MonoRUn [134]202195.4887.9178.1019.6512.3010.5827.9417.3415.240.07GPU @ 2.5 GHz (Python + C/C++)
GrooMeD-NMS [135]202190.1480.2863.7818.1012.329.6526.1918.2714.050.121 core @ 2.5 GHz (Python)
MonoDLE [136]202193.8390.8180.9317.2312.2610.2924.7918.8916.000.04GPU @ 2.5 GHz (Python)
CaDDN [137]202193.6180.7371.0919.1713.4111.4627.9418.9117.190.63GPU @ 2.5 GHz (Python)
MonoFlex [138]202196.0191.0283.3819.9413.8912.0728.2319.7516.890.03GPU @ 2.5 GHz (Python)
MonoRCNN [139]202191.9086.4866.7118.3612.6510.0325.4818.1114.100.07GPU @ 2.5 GHz (Python)
FCOS3D [140]202135.8042.80
MonoEF [141]202196.3290.8883.2721.2913.8711.7129.0319.7017.260.031 core @ 2.5 GHz (Python)
GUPNet [142]202194.1586.4574.1822.2615.0213.1230.2921.1918.201 core @ 2.5 GHz (Python + C/C++)
PGD [143]202192.0480.5869.6719.0511.769.3926.8916.5113.4938.6044.800.031 core @ 2.5 GHz (C/C++)
Aug3D-RPN [144]202185.5777.8861.1617.8212.999.7826.0017.8914.180.081 core @ 2.5 GHz (C/C++)×
DD3D [145]202194.6993.9989.3723.1916.8714.3632.3523.4120.4241.8047.701 core @ 2.5 GHz (C/C++)
PCT [146]202196.4588.7878.8521.0013.3711.3129.6519.0315.920.890.660.0451 core @ 2.5 GHz (Python)
Autoshape [147]202186.5177.6064.4022.4714.1711.3630.6620.0815.950.041 core @ 2.5 GHz (C/C++)
DLE [148]202194.6684.4562.1024.2314.3310.3031.0919.0514.130.06NVIDIA Tesla V100×
MonoCon [149]202122.5016.4613.9531.1222.1019.00
MonoDistill [50]202222.9716.0313.6031.8722.5919.72
MonoDTR [150]202293.9088.4176.2021.9915.3912.7328.5920.3817.140.041 core @ 2.5 GHz (C/C++)
MonoDETR [151]202293.9986.1776.1924.5216.2613.9332.2021.4518.680.041 core @ 2.5 GHz (Python)
MonoJSG [152]202224.6916.1413.6432.5921.2618.18
HomoLoss [153]202295.9290.6980.9121.7514.9413.0729.6020.6817.810.041 core @ 2.5 GHz (Python)×
MonoDDE [154]202296.7689.1981.6024.9317.1415.1033.5823.4620.370.041 core @ 2.5 GHz (Python)×
Mix-Teaching [155]202296.3591.0283.4126.8918.5415.7935.7424.2320.80301 core @ 2.5 GHz (C/C++)
DCD [156]202296.4490.9383.3623.8115.9013.2132.5521.5018.250.031 core @ 2.5 GHz (C/C++)
DEVIANT [157]202294.4286.6476.6921.8814.4611.8929.6520.4417.430.041 GPU (Python)
Cube R-CNN [158]202295.7892.7284.8123.5915.0112.0631.7021.2018.430.05GPU @ 2.5 GHz (Python)
MoGDE [159]202227.0717.8815.6638.3825.6022.91×
ADD [160]202294.8289.5381.6025.6116.8113.7935.2023.5820.080.101 core @ 2.5 GHz (Python)×
CMKD [161]202295.1490.2883.9128.5518.6916.7738.9825.8222.800.101 core @ 2.5 GHz (C/C++)
MonoPGC [162]202324.6817.1714.1432.5023.1420.30×
MonoATT [163]202324.7217.3715.0036.8724.4221.88×
NeurOCS [164]202396.3991.0881.2029.8918.9415.9037.2724.4920.890.10GPU @ 2.5 GHz (Python)×
MonoNerd [52]202394.6086.8977.2322.7517.1315.6331.1323.4620.97NA1 core @ 2.5 GHz (Python)
MonoSKD [51]202396.6891.3483.6928.4317.3515.0137.1224.0820.370.041 core @ 2.5 GHz (Python)
ODM3D [165]202329.7519.0916.9339.4126.0222.76
MonoUNI [166]202394.3088.9678.9524.7516.7313.4933.2823.0519.390.041 core @ 2.5 GHz (Python)×
MonoDSSM [167]202493.9688.3176.1521.4714.5511.7828.2919.5916.340.021 core @ 2.5 GHz (Python + C/C++)×
MonoCD [53]202496.4392.9185.5525.5316.5914.5333.4122.8119.57NA1 core @ 2.5 GHz (Python)
MonoMAE [168]202425.6018.8416.7834.1524.9321.76×
MonoDiff [169]202430.1821.0218.16×
MonoDFNet [170]202425.7119.0715.9633.5624.5221.09
DPL [171]202424.1916.6713.8333.1622.1218.74×
Dp-M3D [172]202523.4113.6512.9132.3820.1316.58×
MonoDINO-DETR [173]2025
Pseudo-LiDAR2D [174]2025×
Stereo Camera:
3DOP [175]201592.9689.5579.383.00GPU @ 2.5 GHz (Matlab + C/C++)×
Pseudo-LiDAR [176]201885.4067.7958.5054.5334.0528.2567.3045.0038.400.40GPU @ 2.5 GHz (Python + C/C++)
Stereo R-CNN [54]201993.9885.9871.2547.5830.2323.7261.9241.3133.420.30GPU @ 2.5 GHz (Python)
TLNet [177]201976.9263.5354.587.644.373.7413.717.696.730.101 core @ 2.5 GHz (Python)
Pseudo-LiDAR++ [178]201994.4682.9075.4561.1142.4336.9978.3158.0151.250.40GPU @ 2.5 GHz (Python)
RT3D-Stereo [179]201956.5345.8137.6329.9023.2818.9658.8146.8238.380.08GPU @ 2.5 GHz (C/C++)
DSGN [55]202095.5386.4378.7573.5052.1845.1482.9065.0556.600.67NVIDIA Tesla V100
OC-Stereo [180]202087.3974.6062.5655.1537.6030.2568.8951.4742.970.351 core @ 2.5 GHz (Python + C/C++)
ZoomNet [181]202094.2283.9269.0055.9838.6430.9772.9454.9144.140.301 core @ 2.5 GHz (C/C++)
Disp R-CNN [182]202093.4582.6470.4568.2145.7837.7379.7658.6247.730.387GPU @ 2.5 GHz (Python + C/C++)
Pseudo-LiDAR E2E [183]202064.7543.9238.1479.6058.8052.10
CDN [184]202095.8587.1979.4374.5254.2246.3683.3266.2457.650.60GPU @ 2.5 GHz (Python)
CG-Stereo [185]202096.3190.3882.8074.3953.5846.5085.2966.4458.950.57GeForce RTX 2080 Ti×
RTS3D [186]202058.5137.3831.1272.1745.2238.48
RT3D-GMP [187]202062.4151.9539.1416.2311.4110.1269.1459.0045.490.06GPU @ 2.5 GHz (Python + C/C++)×
YOLOStereo3D [56]202194.8182.1562.1765.6841.2530.4276.1050.2836.860.10GPU 1080Ti
SIDE [188]202147.6930.8225.68×
LIGA-Stereo [57]202196.4393.8286.1981.3964.6657.2288.1576.7867.400.401 core @ 2.5 GHz (Python + C/C++)
StereoCenterNet [189]202196.6191.2793.5049.4431.3025.6262.9742.1235.370.04GPU @ 2.5 GHz (Python)×
ESGN [190]202144.0932.6029.1065.8046.3938.4278.1058.1249.280.06GPU @ 2.5 GHz (Python + C/C++)×
Pseudo-Stereo [191]202295.7590.2782.3223.7417.7415.1432.6423.7620.640.251 core @ 2.5 GHz (C/C++)
DSGN++ [192]202298.0895.7088.2783.2167.3759.9188.5578.9469.740.20GeForce RTX 2080 Ti
DID-M3D [193]202294.2991.0481.3124.4016.2913.7532.9522.7619.830.041 core @ 2.5 GHz (Python)
DMF [194]202289.5085.4982.5277.5567.3362.4484.6480.2976.050.201 core @ 2.5 GHz (Python + C/C++)×
StereoDistill [58]202397.6193.4387.7181.6666.3957.3989.0378.5969.340.401 core @ 2.5 GHz (Python)×
PS-SVDM [195]202394.4987.5578.2129.2218.1315.3538.1824.8220.891.001 core @ 2.5 GHz (Python)×
Multi-View Camera:
3DOMV [96]2017×
MVRA [196]201995.8794.9882.525.193.272.499.055.844.500.18GPU @ 2.5 GHz (Python)×
DETR3D [59]202141.247.9×
BEVDet [197]202139.846.3×
BEVDepth [198]20225260.9×
ImVoxelNet [199]202289.9079.0969.4517.1510.979.1525.1916.3713.5841.247.90.20GPU @ 2.5 GHz (Python)
PETR [60]202244.550.4×
STS [200]202242.252.9×
BEVerse [201]202239.353.1×
BEVFormer [61]202248.156.9×
SOLOFusion202254.061.9×
PolarFormer [202]202245.654.3×
FocalPETR [203]202246.551.6×
BEV Distill [204]202249.659.4×
HoP [205]202362.468.5×
SparseBEV [62]202360.367.5×
StreamPETR [206]202355.063.1×
PolarBEVDet [207]202455.863.5×
RoPETR [63]202564.870.9×
Table 4. LiDAR-based 3D object detection results on KITTI car test set, nuScenes test set, and Waymo validation set.
Method | Year | AP2D (E/M/H) | AP3D (E/M/H) | APBEV (E/M/H) | nuScenes (mAP/NDS) | Waymo (L1 mAP/L2 mAP) | Time (s) | Hardware | Code Available
Projection-Based:
C-YOLO [68]201867.7264.0063.0185.8977.4077.33
TopNet [208]201858.0445.8541.1112.679.287.9580.1668.1663.430.01NVIDIA GeForce 1080 Ti (TF-GPU)×
BirdNet [69]201879.3057.1255.1640.9927.2625.3284.1759.8357.350.11Titan Xp (Caffe)
PIXOR [67]201981.7077.0572.95
FVNet [209]201986.1477.1969.2765.4357.3451.8578.0465.0357.89
MODet [210]201966.0662.5460.0490.8087.5682.690.05GTX1080Ti×
HDNet [211]202089.1486.5778.32
PIXOR++ [211]202093.2886.0180.11×
BirdNet+ [212]202192.6186.7381.8076.1564.0459.7987.4381.8575.360.11Titan Xp (Caffe)
MGTANet [213]202267.5072.70
GPA3D [214]2023
Voxel-Based:
Vote3D [215]2015×
VoxelNet [75]201777.8264.1757.5187.9578.3971.29×
SECOND [76]201883.1373.6666.2089.3983.7778.59
PointPillars [77]201894.0091.1988.1782.5874.3168.9990.0786.5682.8140.1055.000.0161080 Ti + Intel i7
HotSpotNet [78]201996.2192.8189.8087.6078.3173.3494.0688.0983.2459.3066.000.041 core @ 2.5 GHz (Py + C/C++)×
Voxel R-CNN [79]202096.4995.1192.4590.9081.6277.0694.8588.8386.1375.5966.590.04GPU @ 3.0 GHz (C/C++)
VoTr-TSD [80]202195.9594.8192.2489.9082.0979.1494.0390.3486.1474.9565.910.071 core @ 2.5 GHz (C/C++)
TED [81]202296.6496.0393.3591.6185.2880.6895.4492.0587.300.101 core @ 2.5 GHz (C/C++)
VoxSeT [216]202296.1695.2390.4988.5382.0677.4692.7089.0786.290.0331 core @ 2.5 GHz (C/C++)
FocalsConv [217]202296.3095.2892.6990.5582.2877.5992.6789.0086.330.101 core @ 2.5 GHz (C/C++)
PillarNet [218]202266.0071.4083.2376.09
SWFormer [219]202277.869.2×
PV-GNN [220]202491.6482.4977.2895.0992.3887.44×
Point-Based:
iPOD [221]201890.2089.3087.3771.4053.4648.3486.9383.9877.85×
PointRCNN [72]201895.9291.9087.1186.9675.6470.7092.1387.3982.720.10GPU @ 2.5 GHz (Py + C/C++)
STD [222]201996.1493.2290.5387.9579.7175.0994.7489.1986.420.08GPU @ 2.5 GHz (Py + C/C++)
PointRGCN [223]201996.1992.6787.6685.9775.7370.6091.6387.4990.730.26GPU @ V100 (Python)
3DSSD [224]202097.6995.1092.1888.3679.5774.5592.6689.0285.8642.6056.400.04GPU @ 2.5 GHz (Py + C/C++)
Point-GNN [225]202096.5893.5088.3588.3379.4772.2993.1189.1783.900.60GPU @ 2.5 GHz (Python)
PointFormer [74]202087.1377.0669.2553.60
EPNet++ [226]202196.7395.1792.1091.3781.9676.7195.4189.0085.730.10GPU @ 2.5 GHz (Python)
SASA [227]202296.0195.3592.4288.7682.1677.1692.8789.5186.350.041 core @ 2.5 GHz (Py + C/C++)
IA-SSD [73]202296.1093.5690.6888.2780.3275.1092.7989.3384.350.0141 core @ 2.5 GHz (C/C++)
DFAF3D [228]202396.5893.3290.2488.5979.3772.2193.1489.4584.221 core @ 2.5 GHz (Python)×
HINTED [229]202495.1690.9785.5584.0074.1367.0390.6186.0179.290.041 core @ 2.5 GHz (C/C++)
Point–Voxel Hybrid:
PVCNN [230]2019
Fast Point R-CNN [82]201996.1393.1887.6885.2977.4070.2490.8787.8480.520.06GPU @ 2.5 GHz (Py + C/C++)×
PV-RCNN [83]201998.1794.7092.0490.2581.4376.8294.9890.6586.1477.510.081 core @ 2.5 GHz (Py + C/C++)
SA-SSD [231]202097.9295.1690.1588.7579.7974.1695.0391.0385.960.041 core @ 2.5 GHz (Python)
BADet [232]202198.6595.3490.2889.2881.6176.5995.2391.3286.480.141 core @ 2.5 GHz (C/C++)
Pyramid-PV [233]202195.8895.1392.6288.3982.0877.4992.1988.8486.210.071 core @ 2.5 GHz (C/C++)
DVFENet [234]202195.3594.5791.7786.2079.1874.5890.9387.6884.600.051 core @ 2.5 GHz (Py + C/C++)×
PDV [84]202296.0795.0092.4490.4381.8677.3694.5690.4886.230.101 core @ 2.5 GHz (C/C++)
EQ-PVRCNN [235]202298.2395.3292.6590.1382.0177.5394.5589.0986.400.20GPU @ 2.5 GHz (Py + C/C++)
PVT-SSD [236]202396.7595.9090.6990.6582.2976.8595.2391.6386.430.051 core @ 2.5 GHz (Py + C/C++)×
PG-RCNN [237]202396.6695.4090.5589.3882.1377.3393.3989.4686.540.06GPU @ 1.5 GHz (Python)
Uni3DETR [238]202391.1482.2677.58
Table 5. Radar-based 3D object detection results on KITTI car test set, nuScenes test set, and Waymo validation set.
Method | Year | AP2D (E/M/H) | AP3D (E/M/H) | APBEV (E/M/H) | nuScenes (mAP/NDS) | Waymo (L1/L2) | Time (s) | Hardware | Code Available
Radar-PointGNN [86]2021-0.53.4×
K-Radar [87]2022
KPConvPillars [88]20224.913.9×
Dual Radar [239]2023×
CenterRadarNet [240]2024×
RadarDistill [89]2024-20.543.7
RADLER [90]2025
Table 6. Multi-modal-based 3D object detection results on KITTI car test set, nuScenes test set, and Waymo validation set.
Method | Year | AP2D (E/M/H) | AP3D (E/M/H) | APBEV (E/M/H) | nuScenes (mAP/NDS) | Waymo (L1/L2) | Time (s) | Hardware | Code Available
Early Fusion:
F-PointNet [92]201795.8595.1785.4282.1969.7960.5991.1784.6774.770.17GPU @ 3.0 GHz (Python)
F-ConvNet [93]201995.8592.1980.0987.3676.3966.6991.5185.8476.110.47GPU @ 2.5 GHz (Python + C/C++)
RoarNet [241]2018×
Complexer-YOLO [242]201991.9284.1679.6255.9347.3442.6077.2468.9664.950.06GPU @ 3.5 GHz (C/C++)
PointPainting [91]201998.3992.5889.7182.1171.7067.0892.4588.1183.360.40GPU @ 2.5 GHz (Python + C/C++)×
FusionPainting [243]2021×
MVP [244]2021×
F-PointPillars [94]20210.064 cores @ 3.0 GHz (Python)
PointAugmenting [245]202189.1486.5778.32
VirConvNet [95]202398.0097.2794.5392.4887.2082.4595.9993.5290.380.091 core @ 2.5 GHz (C/C++)
HDF [246]2025×
Mid-Level Fusion:
MV3D [96]201696.4790.8378.6374.9763.6354.0086.6278.9369.800.36GPU @ 2.5 GHz (Python + C/C++)
AVOD [97]201795.1789.8882.8376.3966.4760.2389.7584.9578.320.08Titan X (Pascal)
PointFusion [247]2017×
ContFuse [98]201883.6868.7861.6794.0785.3575.880.06GPU @ 2.5 GHz (Python)×
MVXNet [248]201983.2072.7065.20×
PI-RCNN [249]201996.1792.6687.6884.3774.8270.0391.4485.8181.000.101 core @ 2.5 GHz (Python)×
MCF3D [250]2019×
MMF [99]202097.4194.2591.8088.4077.4370.2293.6788.2181.990.08GPU @ 2.5 GHz (Python)×
3D-CVF [251]202096.7893.3686.1189.2080.0573.1193.5289.5682.450.061 core @ 2.5 GHz (C/C++)
EPNet [100]202096.1594.4489.9989.8179.2874.5994.2288.4783.690.101 core @ 2.5 GHz (Python + C/C++)
EPNet++ [226]202196.7395.1792.1091.3781.9676.7195.4189.0085.730.10GPU @ 2.5 GHz (Python)×
TransFusion [101]202268.9071.70×
BEVFusion [252]2022×
FUTR3D [102]202269.4072.10×
DeepFusion [253]2022×
MSMDFusion [254]2022×
CAT-Det [255]202295.9794.7192.0789.8781.3276.6892.5990.0785.820.30GPU @ 2.5 GHz (Python + C/C++)×
HMFI [256]202296.2995.1692.4588.9081.9377.3093.0489.1786.370.101 core @ 2.5 GHz (C/C++)
LoGoNet [257]202396.6095.5593.0791.8085.0680.7495.4891.5287.090.101 core @ 2.5 GHz (C/C++)
SDVRF [258]2023×
SupFusion [259]2023×
FGFusion [260]2023×
VCD [261]2023×
UniTR [262]202370.9074.50×
Late Fusion:
CLOCS [103]202096.7796.0791.1189.1682.2877.2392.9189.4886.420.101 core @ 2.5 GHz (Python)×
Fast-CLOCS [104]202296.6995.7590.9589.1080.3576.9993.0389.4986.4063.1068.700.10GPU @ 2.5 GHz (Python)

5. Evaluation and Discussion

Camera-based methods are lightweight, cost-effective, and easy to deploy, making them particularly attractive for applications with tight cost and hardware budgets. They preserve rich semantic cues such as texture, colour, and object appearance, which are advantageous for visual understanding and object classification. However, due to the inherent challenge of regressing depth from 2D projections, camera-based detectors often underperform in spatial metrics such as 3D IoU, localization precision, and orientation accuracy [29].
Stereo and multi-view systems partially mitigate these issues by leveraging geometric constraints, enabling disparity estimation and improved depth perception. Despite this, their accuracy still lags behind that of LiDAR-based systems, especially in large-scale outdoor environments. Multi-camera configurations, while offering broader spatial coverage, require complex calibration and are mostly reported on datasets such as nuScenes, which support full 360° camera rings.
LiDAR sensors, by contrast, directly capture 3D spatial coordinates, providing accurate and dense measurements of the environment. This makes them particularly effective for estimating object position, size, and orientation. LiDAR-based methods consistently outperform camera-only approaches in 3D detection tasks and show greater robustness under varying lighting conditions [24]. However, they come with their own drawbacks: LiDAR hardware is significantly more expensive, bulkier, and prone to performance degradation in adverse weather (e.g., fog, rain, or snow) [32]. Additionally, point density decreases with distance, reducing resolution for far-away objects. Radars solve some of these problems but still lack the accuracy and resolution of LiDAR.
To overcome the individual limitations of single-sensor systems, multi-modal fusion has become an increasingly prevalent approach. By integrating LiDAR’s spatial precision with the semantic richness of images, fusion-based models yield improved scene understanding, higher detection accuracy, and better robustness to occlusions or sensor dropout [29]. Nonetheless, these methods introduce new complexities: precise extrinsic calibration is required, and data from different sensors must be carefully synchronized and spatially aligned. Fusion also increases computational burden, making real-time inference more challenging.
Table 3, Table 4, Table 5 and Table 6 provide a comprehensive evaluation of 3D OD performance across a wide range of methods, organized by sensor modality, representation type, and publication year. Each table reports accuracy metrics on the KITTI, nuScenes, and Waymo datasets, with additional inference time metrics (reported for KITTI) and public code availability. This taxonomy enables direct comparisons between monocular, stereo, LiDAR-only, radar-only, and fusion-based pipelines.
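Since most of the reported accuracy values are average-precision scores computed from IoU-matched detections, the short sketch below shows, under simplifying assumptions, how such a value can be obtained once matching has been performed. It follows the common 40-recall-point interpolation in the spirit of KITTI's AP|R40; per-benchmark details such as class-specific IoU thresholds, difficulty filtering, and the nuScenes centre-distance matching behind mAP/NDS are intentionally left out.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt, recall_points=40):
    """Generic interpolated AP: detections are ranked by confidence, precision
    is taken at equally spaced recall positions, and the mean of those
    precisions is returned. Which detections count as true positives (e.g.,
    at a 0.7 3D IoU) is assumed to have been decided beforehand."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)

    ap = 0.0
    for r in np.linspace(1.0 / recall_points, 1.0, recall_points):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / recall_points
```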
A consistent hierarchy emerges: LiDAR-based methods outperform monocular and stereo-based ones in 3D localization, while multi-modal methods tend to achieve the best overall results. These trends underscore the inherent difficulty of monocular depth estimation and highlight the advantages of incorporating complementary sensory cues.
From early 2D-proposal-based monocular models to state-of-the-art transformer-based fusion frameworks, the field of 3D OD has progressively shifted toward hybrid pipelines that balance spatial accuracy, semantic reasoning, and computational efficiency. This evolution is reflected not only in architectural choices but also in quantitative improvements on benchmark datasets.
An important trend highlighted in the tables is the trade-off between inference speed and detection accuracy. While transformer-based and two-stage architectures dominate in terms of AP and 3D IoU scores, single-stage detectors and projection-based models have gained traction for real-time deployment. For instance, models such as PointPillars, PIXOR, and SMOKE offer reduced latency while maintaining competitive accuracy. Fusion-based methods like TransFusion and EPNet achieve high accuracy but are computationally heavier, whereas late fusion methods like Fast-CLOCs enable faster inference with minimal architectural change.
Recent models such as TED, PV-RCNN, and CenterFusion exemplify the growing ability to combine high accuracy with low latency, even in multi-modal setups. TED currently holds state-of-the-art performance among LiDAR-only methods on the KITTI benchmark, while PV-RCNN combines voxel-grid structure with point-level refinement for balanced performance. BEV and pillar-based projections dominate low-latency deployment scenarios, especially in AD settings where speed is critical.
Focusing specifically on camera-based methods, steady gains in AP have been observed over time. Monocular detectors, though inherently limited by depth ambiguity, have benefited from advances in deep regression, geometric priors, and knowledge distillation. Stereo and multi-view approaches improve upon monocular baselines by leveraging depth cues, with many of the highest-performing models being reported on the nuScenes benchmark due to its extensive camera coverage. Inference times have also improved over time, making real-time deployment increasingly feasible.
On the other hand, radar-only methods still show limited accuracy in the few metrics reported on benchmark datasets. They lack the spatial resolution and semantic richness required for full 3D scene understanding. However, they can provide information that other sensors cannot, such as direct velocity measurements, and can operate in extreme environments where other sensors are less suitable, thereby greatly reinforcing a system’s robustness and reliability when fused. Still, works like [263] highlight how radar-only 3D object detection can succeed in specific applications and how this area holds significant potential for further advancement.
In summary, there is no universally optimal modality or representation for 3D OD. Each approach entails trade-offs between semantic richness, geometric fidelity, latency, cost, and robustness. The ideal method depends on application-specific constraints such as sensor availability, environmental variability, and real-time requirements. Future research will likely focus on three main directions: improving robustness to sensor degradation, optimizing fusion architectures for scalable deployment, and developing more generalizable representations that support cross-domain transfer.
As foundation models and self-supervised learning continue to mature, they will likely play a central role in the next generation of 3D OD systems. Scalable, modality-agnostic perception frameworks that unify sparse and dense data sources will be critical in enabling safe and reliable autonomous operation across diverse scenarios.
Table 3, Table 4, Table 5 and Table 6 constitute the core of this survey. Together, they provide a structured and comprehensive summary of the most influential 3D OD methods developed over the past decade. A total of 205 methods were reviewed and are reported in these tables, spanning a wide range of input modalities, data representations, and architectural paradigms. These methods are categorized by sensor type (monocular, stereo, multi-camera, LiDAR, and fusion-based), ordered chronologically by publication year, and benchmarked using standardized metrics from the KITTI, nuScenes, and Waymo datasets. Inference time (when available) is also included for the KITTI benchmark, along with an indicator of public code availability.
This structured compilation enables readers to trace the chronological and methodological evolution of the field, compare performance trends across modalities, and identify prevailing design trade-offs. It also serves as a reference point for researchers seeking to position new contributions within the broader landscape.
To facilitate wider access and enable dynamic exploration of 3D OD methods, a dedicated online repository was created: (https://3d-object-detection-hub.github.io/, accessed on 25 April 2025). This interactive website contains an extended collection of detection models, grouped by input modality, data representation, and publication year. The platform is designed to be continuously updated and searchable, supporting users in exploring methods across monocular-, stereo-, LiDAR-, radar-, and fusion-based pipelines.
The construction of this survey involved a multi-stage methodology. Initially, existing review papers and survey studies were consulted to establish a foundational taxonomy of input modalities, processing strategies, and detection paradigms. These sources provided both historical context and the initial references for influential models. From there, models were categorized according to the taxonomy developed in Section 4, distinguishing between representation types such as point-based, voxel-based, projection-based and hybrid models.
To obtain reliable performance metrics, official benchmark leaderboards, such as those of KITTI, nuScenes, and Waymo, were systematically reviewed. For each method, metrics were gathered from the benchmark websites whenever publicly available. When benchmark entries were missing or incomplete, values were extracted directly from the original publications. Inference time was also recorded, with particular emphasis on KITTI runtime to compare real-time capability. All data presented in Table 3, Table 4, Table 5 and Table 6 were manually verified to ensure consistency, dataset compatibility, and fair metric alignment across methods. Where necessary, metadata were aggregated from supplementary material or code repositories to fill gaps in the original publications.
Each table entry is annotated by sensor modality (e.g., monocular, stereo, LiDAR, multi-modal) and representation type (e.g., voxel, point, projection, hybrid). Methods are listed chronologically by year of publication to illustrate architectural progression. Accuracy metrics are reported separately for the KITTI, nuScenes, and Waymo datasets, providing cross-benchmark comparability. An additional column denotes whether the method’s codebase is publicly available, enabling practical reproducibility and further experimentation.
This work thus aims to offer one of the most comprehensive and up-to-date surveys of the 3D OD methods available. In contrast to prior surveys that focus narrowly on single-modality pipelines, this review takes a modality-agnostic perspective and includes both classical and SoA methods up to May 2025, capturing recent advances that have not yet appeared in other reviews or benchmark summary papers.

6. Conclusions

This work presented a comprehensive survey of 3D object detection methods, spanning monocular, stereo, multi-camera, radar, LiDAR, and multi-modal fusion approaches. A consistent taxonomy was proposed to organize the field, covering different input modalities and data representations. The survey traced the historical development of these methods and benchmarked them using metrics from publicly available datasets such as KITTI, nuScenes, and Waymo. In total, 205 methods were analysed, and their performance was synthesized across accuracy, inference time, and implementation availability.
Monocular methods are the most lightweight and scalable but are limited by their lack of direct depth sensing. Stereo systems partially address this with geometric priors, though their effectiveness declines at longer distances and under occlusion. LiDAR-based methods excel in spatial accuracy and robustness but are hindered by high cost, sparse point clouds at long ranges, and degraded performance in adverse weather. Radar-based methods can be a good choice for specific use cases but are usually coupled with other sensors. Multi-modal fusion strategies consistently achieve superior accuracy and resilience by combining complementary sensor inputs, but they also introduce additional complexity in calibration, synchronization, and training.
Beyond benchmarking, the survey consolidates fragmented knowledge from recent papers, official leaderboards, and existing reviews into a unified and accessible framework. The dedicated website accompanying this survey serves as a continuously updated repository of methods and results, supporting reproducibility and further research. Overall, this work contributes a modality-agnostic, data-driven, and taxonomically structured reference for 3D perception in autonomous systems.
The field of 3D object detection continues to evolve rapidly, with new architectural paradigms and sensing modalities expanding its capabilities. Recent trends show a gradual transition from purely CNN-based architectures to Transformer-based models that better capture long-range dependencies and global context. Other emerging paradigms such as knowledge distillation, NeRF-based scene representations, and foundation models trained on large-scale multimodal data are also gaining traction. The field is moving toward scalable, robust real-time solutions suitable for deployment in real-world AV systems. Heavy focus is being placed on enhancing robustness under challenging conditions and ensuring scalability for large-scale AV deployments. Techniques that reduce latency, energy consumption, and hardware requirements, without sacrificing detection accuracy, will be key enablers of future perception systems. The groundwork laid by this work provides a strong foundation for further exploration into robust, efficient, and generalizable 3D detection pipelines for autonomous applications.

Author Contributions

Conceptualization, M.V., A.M. and J.-V.Z.; methodology, M.V., A.M. and J.-V.Z.; investigation, M.V.; implementation and evaluation, M.V.; writing—original draft preparation, M.V.; writing—review and editing, M.V., A.M. and J.-V.Z.; supervision, A.M. and J.-V.Z. All authors have read and agreed to the published version of the manuscript.

Funding

A.M. acknowledges Fundação para a Ciência e a Tecnologia (FCT) for its financial support via the project LAETA Base Funding (DOI: 10.54499/UIDB/50022/2020). For the purpose of Open Access, the authors have applied a CC-BY public copyright license to any Author's Accepted Manuscript (AAM) version arising from this submission.

Data Availability Statement

The original data presented in this study will be made openly available on GitHub upon publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Autonomous Driving
ADAS: Advanced Driver Assistance Systems
AI: Artificial Intelligence
AP: Average Precision
AV: Autonomous Vehicle
BEV: Bird’s-Eye View
CNN: Convolutional Neural Network
CV: Computer Vision
DL: Deep Learning
GNSS: Global Navigation Satellite System
IoU: Intersection over Union
IMU: Inertial Measurement Unit
LiDAR: Light Detection and Ranging
mAP: Mean Average Precision
ML: Machine Learning
NDS: nuScenes Detection Score
NMS: Non-Maximum Suppression
NN: Neural Network
OD: Object Detection
PC: Point Cloud
Radar: Radio Detection and Ranging
SAE: Society of Automotive Engineers
SoA: State-of-the-Art
Sonar: Sound Navigation and Ranging
SLAM: Simultaneous Localization and Mapping
ToF: Time-of-Flight

References

  1. SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles; SAE International; On-Road Automated Driving (ORAD) Committee: Warrendale, PA, USA, 2021. [Google Scholar]
  2. Van Brummelen, J.; O’brien, M.; Gruyer, D.; Najjaran, H. Autonomous vehicle perception: The technology of today and tomorrow. Transp. Res. Part C Emerg. Technol. 2018, 89, 384–406. [Google Scholar] [CrossRef]
  3. Jeffs, J.; He, M.X. Autonomous Cars, Robotaxis and Sensors 2024–2044; IDTechEx: Cambridge, UK, 2023. [Google Scholar]
  4. Waymo LLC. On the Road to Fully Self-Driving; Waymo Safety Report; Waymo LLC.: Mountain View, CA, USA, 2021. [Google Scholar]
  5. Ackerman, E. What Full Autonomy Means for the Waymo Driver. IEEE Spectrum. 2021. Available online: https://spectrum.ieee.org/full-autonomy-waymo-driver (accessed on 4 March 2021).
  6. Dingus, T.A.; Guo, F.; Lee, S.; Antin, J.F.; Perez, M.; Buchanan-King, M.; Hankey, J. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proc. Natl. Acad. Sci. USA 2016, 113, 2636–2641. [Google Scholar] [CrossRef]
  7. Singh, S. Critical Reasons for Crashes Investigated in the National Motor Vehicle Crash Causation Survey; National Highway Traffic Safety Administration: Washington, DC, USA, 2015.
  8. Montgomery, W.; Mudge, R.; Groshen, E.L.; Helper, S.; MacDuffie, J.P.; Carson, C. America’s Workforce Self-Driving Future: Realizing Productivity Gains and Spurring Economic Growth; Securing America’s Future Energy: Washington, DC, USA, 2018. [Google Scholar]
  9. Chehri, A.; Mouftah, H.T. Autonomous vehicles in the sustainable cities, the beginning of a green adventure. Sustain. Cities Soc. 2019, 51, 101751. [Google Scholar] [CrossRef]
  10. Dhall, A.; Dai, D.; Van Gool, L. Real-time 3D traffic cone detection for autonomous driving. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 494–501. [Google Scholar]
  11. Hudson, J.; Orviska, M.; Hunady, J. People’s attitudes to autonomous vehicles. Transp. Res. Part A Policy Pract. 2019, 121, 164–176. [Google Scholar] [CrossRef]
  12. Hulse, L.M.; Xie, H.; Galea, E.R. Relationships with road users, risk, gender and age. Saf. Sci. 2018, 102, 1–13. [Google Scholar] [CrossRef]
  13. Srivastava, A. Sense-Plan-Act in Robotic Applications. In Proceedings of the Intelligent Robotics Seminar, Macao, China, 4–8 November 2019. [Google Scholar] [CrossRef]
  14. Betz, J.; Wischnewski, A.; Heilmeier, A.; Nobis, F.; Stahl, T.; Hermansdorfer, L.; Lohmann, B.; Lienkamp, M. What can we learn from autonomous level-5 motorsport? In Proceedings of the 9th International Munich Chassis Symposium 2018, Munich, Germany, 12–13 June 2018; Springer: Wiesbaden, Germany, 2019. [Google Scholar]
  15. Betz, J.; Zheng, H.; Liniger, A.; Rosolia, U.; Karle, P.; Behl, M.; Krovi, V.; Mangharam, R. Autonomous Vehicles on the Edge: A Survey on Autonomous Vehicle Racing. IEEE Open J. Intell. Transp. Syst. 2022, 3, 458–488. [Google Scholar] [CrossRef]
  16. Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
  17. Wen, L.H.; Jo, K.H. Deep learning-based perception systems for autonomous driving: A comprehensive survey. Neurocomputing 2022, 489, 255–270. [Google Scholar] [CrossRef]
  18. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  19. Chen, W.; Li, Y.; Tian, Z.; Zhang, F. 2D and 3D object detection algorithms from images: A Survey. Array 2023, 19, 100305. [Google Scholar] [CrossRef]
  20. Pravallika, A.; Hashmi, M.F.; Gupta, A. Deep Learning Frontiers in 3D Object Detection: A Comprehensive Review for Autonomous Driving. IEEE Access 2024, 12, 173936–173980. [Google Scholar] [CrossRef]
  21. Zhang, X.; Wang, H.; Dong, H. A Survey of Deep Learning-Driven 3D Object Detection: Sensor Modalities, Technical Architectures, and Applications. Sensors 2025, 25, 3668. [Google Scholar] [CrossRef]
  22. Ma, X.; Ouyang, W.; Simonelli, A.; Ricci, E. 3d object detection from images for autonomous driving: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3537–3556. [Google Scholar] [CrossRef] [PubMed]
  23. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef]
  24. Wang, Y.; Mao, Q.; Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Li, H.; Zhang, Y. Multi-modal 3d object detection in autonomous driving: A survey. Int. J. Comput. Vis. 2023, 131, 2122–2152. [Google Scholar] [CrossRef]
  25. Lahoud, J.; Cao, J.; Khan, F.S.; Cholakkal, H.; Anwer, R.M.; Khan, S.; Yang, M.H. 3D vision with transformers: A survey. arXiv 2022, arXiv:2208.04309. [Google Scholar] [CrossRef]
  26. Zhu, M.; Gong, Y.; Tian, C.; Zhu, Z. A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends. Drones 2024, 8, 412. [Google Scholar] [CrossRef]
  27. Calvo, E.L.; Taveira, B.; Kahl, F.; Gustafsson, N.; Larsson, J.; Tonderski, A. Timepillars: Temporally-recurrent 3d lidar object detection. arXiv 2023, arXiv:2312.17260. [Google Scholar]
  28. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  29. Arnold, E.; Al-Jarrah, O.Y.; Dianati, M.; Fallah, S.; Oxtoby, D.; Mouzakitis, A. A survey on 3d object detection methods for autonomous driving applications. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3782–3795. [Google Scholar] [CrossRef]
  30. Nagiub, A.S.; Fayez, M.; Khaled, H.; Ghoniemy, S. 3D object detection for autonomous driving: A comprehensive review. In Proceedings of the 2024 6th International Conference on Computing and Informatics (ICCI), Cairo, Egypt, 6–7 March 2024; pp. 1–11. [Google Scholar]
  31. Gao, W.; Li, G. Deep Learning for 3D Point Clouds; Springer: Singapore, 2025. [Google Scholar]
  32. Liang, W.; Xu, P.; Guo, L.; Bai, H.; Zhou, Y.; Chen, F. A survey of 3D object detection. Multimed. Tools Appl. 2021, 80, 29617–29641. [Google Scholar] [CrossRef]
  33. Fayyad, J.; Jaradat, M.A.; Gruyer, D.; Najjaran, H. Deep Learning Sensor Fusion for Autonomous Vehicle Perception and Localization: A Review. Sensors 2020, 20, 4220. [Google Scholar] [CrossRef]
  34. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469. [Google Scholar] [CrossRef]
  35. Li, H.; Zhao, Y.; Zhong, J.; Wang, B.; Sun, C.; Sun, F. Delving into the Secrets of BEV 3D Object Detection in Autonomous Driving: A Comprehensive Survey. Authorea Prepr. 2025. [Google Scholar] [CrossRef]
  36. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  37. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
  38. Wang, P.; Huang, X.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The apolloscape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719. [Google Scholar] [CrossRef] [PubMed]
  39. Chang, M.F.; Lambert, J.; Sangkloy, P.; Singh, J.; Bak, S.; Hartnett, A. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8748–8757. [Google Scholar]
  40. Houston, J.; Zuidhof, G.; Bergamini, L.; Ye, Y.; Chen, L.; Jain, A.; Omari, S.; Iglovikov, V.; Ondruska, P. One thousand and one hours: Self-driving motion prediction dataset. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 409–418. [Google Scholar]
  41. Patil, A.; Malla, S.; Gang, H.; Chen, Y.T. The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9552–9557. [Google Scholar]
  42. Zamanakos, G.; Tsochatzidis, L.; Amanatiadis, A.; Pratikakis, I. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving. Comput. Graph. 2021, 99, 153–181. [Google Scholar] [CrossRef]
  43. Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Data-driven 3d voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1903–1911. [Google Scholar]
  44. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2147–2156. [Google Scholar]
  45. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082. [Google Scholar]
  46. Brazil, G.; Liu, X. M3d-rpn: Monocular 3d region proposal network for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9287–9296. [Google Scholar]
  47. Simonelli, A.; Bulo, S.R.; Porzi, L.; López-Antequera, M.; Kontschieder, P. Disentangling monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1991–1999. [Google Scholar]
  48. Limaye, A.; Mathew, M.; Nagori, S.; Swami, P.K.; Maji, D.; Desappan, K. SS3D: Single shot 3D object detector. arXiv 2020, arXiv:2004.14674. [Google Scholar] [CrossRef]
  49. Liu, Z.; Wu, Z.; Tóth, R. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 996–997. [Google Scholar]
  50. Chong, Z.; Ma, X.A.A. Monodistill: Learning spatial features for monocular 3d object detection. arXiv 2022, arXiv:2201.10830. [Google Scholar] [CrossRef]
  51. Wang, S.; Zheng, J. MonoSKD: General distillation framework for monocular 3D object detection via Spearman correlation coefficient. arXiv 2023, arXiv:2310.11316. [Google Scholar] [CrossRef]
  52. Xu, J.; Peng, L.; Cheng, H.; Li, H.; Qian, W.; Li, K.; Wang, W.; Cai, D. Mononerd: Nerf-like representations for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6814–6824. [Google Scholar]
  53. Yan, L.; Yan, P.; Xiong, S.; Xiang, X.; Tan, Y. Monocd: Monocular 3d object detection with complementary depths. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 10248–10257. [Google Scholar]
  54. Li, P.; Chen, X.; Shen, S. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15 –20 June 2019. [Google Scholar]
  55. Chen, Y.; Liu, S.; Shen, X.; Jia, J. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  56. Liu, Y.; Wang, L.; Liu, M. Yolostereo3d: A step back to 2d for efficient stereo 3d detection. In Proceedings of the 2021 International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13018–13024. [Google Scholar]
  57. Guo, X.; Shi, S.; Wang, X.; Li, H. Liga-stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3153–3163. [Google Scholar]
  58. Liu, Z.; Ye, X.; Tan, X.; Ding, E.; Bai, X. Stereodistill: Pick the cream from lidar for distilling stereo-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1790–1798. [Google Scholar] [CrossRef]
  59. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022; pp. 180–191. [Google Scholar]
  60. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. Petr: Position embedding transformation for multi-view 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 531–548. [Google Scholar]
  61. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T. Bevformer: Learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
  62. Liu, H.; Teng, Y.; Lu, T.; Wang, H.; Wang, L. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18580–18590. [Google Scholar]
  63. Ji, H.; Ni, T.; Huang, X.; Luo, T.; Zhan, X.; Chen, J. RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding. arXiv 2025, arXiv:2504.12643. [Google Scholar]
  64. Liu, W.; Sun, J.; Li, W.; Hu, T.; Wang, P. Deep Learning on Point Clouds and Its Application: A Survey. Sensors 2019, 19, 4188. [Google Scholar] [CrossRef]
  65. Nguyen, A.; Jo, K. 3D Point Cloud Segmentation: A survey. In Proceedings of the IEEE Conference on Robotics, Automation and Mechatronics, Kagawa, Japan, 4–7 August 2013. [Google Scholar]
  66. Xuan, Y.; Qu, Y. Multimodal Data Fusion for BEV Perception. Master’s Thesis, University of Gothenburg, Gothenburg, Sweden, 2024. [Google Scholar]
  67. Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
  68. Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-yolo: Real-time 3d object detection on point clouds. arXiv 2018, arXiv:1803.06199. [Google Scholar]
  69. Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; García, F.; De La Escalera, A. Birdnet: A 3d object detection framework from lidar information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems, Maui, HI, USA, 4–7 November 2018; pp. 3517–3523. [Google Scholar]
  70. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  71. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
  72. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  73. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962. [Google Scholar]
  74. Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3d object detection with pointformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7463–7472. [Google Scholar]
  75. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  76. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  77. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  78. Chen, Q.; Sun, L.; Wang, Z.; Jia, K.; Yuille, A. Object as hotspots: An anchor-free 3d object detection approach via firing of hotspots. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 68–84. [Google Scholar]
  79. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar] [CrossRef]
  80. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3164–3173. [Google Scholar]
  81. Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-equivariant 3d object detection for autonomous driving. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2795–2802. [Google Scholar] [CrossRef]
  82. Chen, Y.; Liu, S.; Shen, X.; Jia, J. Fast point r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9775–9784. [Google Scholar]
  83. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  84. Hu, J.S.; Kuai, T.; Waslander, S.L. Point density-aware voxels for lidar 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8469–8478. [Google Scholar]
  85. Lai-Dang, Q.V. A survey of vision transformers in autonomous driving: Current trends and future directions. arXiv 2024, arXiv:2403.07542. [Google Scholar] [CrossRef]
  86. Svenningsson, P.; Fioranelli, F.; Yarovoy, A. Radar-pointgnn: Graph based object recognition for unstructured radar point-cloud data. In Proceedings of the 2021 IEEE Radar Conference (RadarConf21), Atlanta, GA, USA, 8–14 May 2021; pp. 1–6. [Google Scholar]
  87. Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-radar: 4d radar object detection for autonomous driving in various weather conditions. Adv. Neural Inf. Process. Syst. 2022, 35, 3819–3829. [Google Scholar]
  88. Ulrich, M.; Braun, S.; Köhler, D.; Niederlöhner, D.; Faion, F.; Gläser, C.; Blume, H. Improved orientation estimation and detection with hybrid object detection networks for automotive radar. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 111–117. [Google Scholar]
  89. Bang, G.; Choi, K.; Kim, J.; Kum, D.; Choi, J.W. Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15491–15500. [Google Scholar]
  90. Luo, Y.; Hoffmann, R.; Xia, Y.; Wysocki, O.; Schwab, B.; Kolbe, T.H.; Cremers, D. RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 4452–4461. [Google Scholar]
  91. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar]
  92. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  93. Wang, Z.; Jia, K. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1742–1749. [Google Scholar]
  94. Paigwar, A.; Sierra-Gonzalez, D.; Erkent, Ö.; Laugier, C. Frustum-pointpillars: A multi-stage approach for 3d object detection using rgb camera and lidar. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2926–2933. [Google Scholar]
  95. Wu, H.; Wen, C.; Shi, S.; Li, X.; Wang, C. Virtual sparse convolution for multimodal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21653–21662. [Google Scholar]
  96. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  97. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  98. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 10–13 September 2018; pp. 641–656. [Google Scholar]
  99. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7345–7353. [Google Scholar]
  100. Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer: Cham, Switzerland, 2020; pp. 35–52. [Google Scholar]
  101. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
  102. Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. Futr3d: A unified sensor fusion framework for 3d detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 172–181. [Google Scholar]
  103. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10386–10393. [Google Scholar]
  104. Pang, S.; Morris, D.; Radha, H. Fast-CLOCs: Fast camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 187–196. [Google Scholar]
  105. Yang, B.; Guo, R.; Liang, M.; Casas, S.; Urtasun, R. Radarnet: Exploiting radar for robust perception of dynamic objects. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 496–512. [Google Scholar]
  106. Long, Y.; Kumar, A.; Liu, X.; Morris, D. RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 22276–22285. [Google Scholar]
  107. Nabati, R.; Qi, H. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1527–1536. [Google Scholar]
  108. Shi, K.; He, S.; Shi, Z.; Chen, A.; Xiong, Z.; Chen, J.; Luo, J. Radar and camera fusion for object detection and tracking: A comprehensive survey. arXiv 2024, arXiv:2410.19872. [Google Scholar] [CrossRef]
  109. Giuffrida, L.; Masera, G.; Martina, M. A survey of automotive radar and lidar signal processing and architectures. Chips 2023, 2, 243–261. [Google Scholar] [CrossRef]
  110. Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Subcategory-aware convolutional neural networks for object proposals and detection. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 924–933. [Google Scholar]
  111. Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teuliere, C.; Chateau, T. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2040–2049. [Google Scholar]
  112. Kundu, A.; Li, Y.; Rehg, J.M. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3559–3568. [Google Scholar]
  113. Manhardt, F.; Kehl, W.; Gaidon, A. Roi-10d: Monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2069–2078. [Google Scholar]
  114. Xu, B.; Chen, Z. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2345–2353. [Google Scholar]
  115. Qin, Z.; Wang, J.; Lu, Y. Monogrnet: A geometric reasoning network for monocular 3d object localization. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8851–8858. [Google Scholar] [CrossRef]
  116. Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. Gs3d: An efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1019–1028. [Google Scholar]
  117. Weng, X.; Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  118. Ma, X.; Wang, Z.; Li, H.; Zhang, P.; Ouyang, W.; Fan, X. Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6851–6860. [Google Scholar]
  119. Chang, J.; Wetzstein, G. Deep optics for monocular depth estimation and 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10193–10202. [Google Scholar]
  120. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  121. Liu, L.; Lu, J.; Xu, C.; Tian, Q.; Zhou, J. Deep fitting degree scoring network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1057–1066. [Google Scholar]
  122. Naiden, A.; Paunescu, V.; Kim, G.; Jeon, B.; Leordeanu, M. Shift r-cnn: Deep monocular 3d object detection with closed-form geometric constraints. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 61–65. [Google Scholar]
  123. Bao, W.; Xu, B.; Chen, Z. Monofenet: Monocular 3d object detection with feature enhancement networks. IEEE Trans. Image Process. 2019, 29, 2753–2765. [Google Scholar] [CrossRef]
  124. Ku, J.; Pon, A.D.; Waslander, S.L. Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11867–11876. [Google Scholar]
  125. Simonelli, A.; Bulo, S.R.; Porzi, L.; Ricci, E.; Kontschieder, P. Towards generalization across depth for monocular 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Cham, Switzerland, 2020; pp. 767–782. [Google Scholar]
  126. Vianney, J.M.U.; Aich, S.; Liu, B. Refinedmpl: Refined monocular pseudolidar for 3d object detection in autonomous driving. arXiv 2019, arXiv:1911.09712. [Google Scholar]
  127. Chen, Y.; Tai, L.; Sun, K.; Li, M. Monopair: Monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12093–12102. [Google Scholar]
  128. Li, P.; Zhao, H.; Liu, P.; Cao, F. Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 644–660. [Google Scholar]
  129. Ma, X.; Liu, S.; Xia, Z.; Zhang, H.; Zeng, X.; Ouyang, W. Rethinking pseudo-lidar representation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. Springer: Cham, Switzerland, 2020; pp. 311–327. [Google Scholar]
  130. Zhou, D.; Song, X.; Dai, Y.; Yin, J.; Lu, F.; Liao, M.; Fang, J.; Zhang, L. Iafa: Instance-aware feature aggregation for 3d object detection from a single image. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  131. Brazil, G.; Pons-Moll, G.; Liu, X.; Schiele, B. Kinematic 3d object detection in monocular video. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIII 16. Springer: Cham, Switzerland, 2020; pp. 135–152. [Google Scholar]
  132. Li, P.; Zhao, H. Monocular 3d detection with geometric constraint embedding and semi-supervised training. IEEE Robot. Autom. Lett. 2021, 6, 5565–5572. [Google Scholar] [CrossRef]
  133. Wang, L.; Du, L.; Ye, X.; Fu, Y.; Guo, G.; Xue, X.; Feng, J.; Zhang, L. Depth-conditioned dynamic message propagation for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 454–463. [Google Scholar]
  134. Chen, H.; Huang, Y.; Tian, W.; Gao, Z.; Xiong, L. Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10379–10388. [Google Scholar]
  135. Kumar, A.; Brazil, G.; Liu, X. Groomed-nms: Grouped mathematically differentiable nms for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8973–8983. [Google Scholar]
  136. Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving into localization errors for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4721–4730. [Google Scholar]
  137. Reading, C.; Harakeh, A.; Chae, J.; Waslander, S.L. Categorical depth distribution network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8555–8564. [Google Scholar]
  138. Zhang, Y.; Lu, J.; Zhou, J. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3289–3298. [Google Scholar]
  139. Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T.K. Geometry-based distance decomposition for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15172–15181. [Google Scholar]
  140. Wang, T.; Zhu, X.; Pang, J.; Lin, D. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 913–922. [Google Scholar]
  141. Zhou, Y.; He, Y.; Zhu, H.; Wang, C.; Li, H.; Jiang, Q. Monocular 3d object detection: An extrinsic parameter free approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7556–7566. [Google Scholar]
  142. Lu, Y.; Ma, X.; Yang, L.; Zhang, T.; Liu, Y.; Chu, Q.; Yan, J.; Ouyang, W. Geometry uncertainty projection network for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3111–3121. [Google Scholar]
  143. Wang, T.; Zhu, X.; Pang, J.; Lin, D. Probabilistic and geometric depth: Detecting objects in perspective. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 1475–1485. [Google Scholar]
  144. He, C.; Huang, J.; Hua, X.S.; Zhang, L. Aug3d-rpn: Improving monocular 3d object detection by synthetic images with virtual depth. arXiv 2021, arXiv:2107.13269. [Google Scholar]
  145. Park, D.; Ambrus, R.; Guizilini, V.; Li, J.; Gaidon, A. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3142–3152. [Google Scholar]
  146. Wang, L.; Zhang, L.; Zhu, Y.; Zhang, Z.; He, T.; Li, M.; Xue, X. Progressive coordinate transforms for monocular 3d object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 13364–13377. [Google Scholar]
  147. Liu, Z.; Zhou, D.; Lu, F.; Fang, J.; Zhang, L. Autoshape: Real-time shape-aware monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15641–15650. [Google Scholar]
  148. Liu, C.; Gu, S.; Van Gool, L.; Timofte, R. Deep line encoding for monocular 3d object detection and depth prediction. In Proceedings of the 32nd British Machine Vision Conference (BMVC 2021), Virtual, 22–25 November 2021; BMVA Press: Durham, UK, 2021; p. 354. [Google Scholar]
  149. Liu, X.; Xue, N.; Wu, T. Learning auxiliary monocular contexts helps monocular 3d object detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1810–1818. [Google Scholar] [CrossRef]
  150. Huang, K.C.; Wu, T.H.; Su, H.T.; Hsu, W.H. Monodtr: Monocular 3d object detection with depth-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4012–4021. [Google Scholar]
  151. Zhang, R.; Qiu, H.; Wang, T.; Guo, Z.; Cui, Z.; Qiao, Y.; Li, H.; Gao, P. Monodetr: Depth-guided transformer for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9155–9166. [Google Scholar]
  152. Lian, Q.; Li, P.; Chen, X. Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1070–1079. [Google Scholar]
  153. Gu, J.; Wu, B.; Fan, L.; Huang, J.; Cao, S.; Xiang, Z.; Hua, X.S. Homography loss for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1080–1089. [Google Scholar]
  154. Li, Z.; Qu, Z.; Zhou, Y.; Liu, J.; Wang, H.; Jiang, L. Diversity matters: Fully exploiting depth clues for reliable monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2791–2800. [Google Scholar]
  155. Yang, L.; Zhang, X.; Li, J.; Wang, L.; Zhu, M.; Zhang, C.; Liu, H. Mix-teaching: A simple, unified and effective semi-supervised learning framework for monocular 3d object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6832–6844. [Google Scholar] [CrossRef]
  156. Li, Y.; Chen, Y.; He, J.; Zhang, Z. Densely constrained depth estimator for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 718–734. [Google Scholar]
  157. Kumar, A.; Brazil, G.; Corona, E.; Parchami, A.; Liu, X. Deviant: Depth equivariant network for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 664–683. [Google Scholar]
  158. Brazil, G.; Kumar, A.; Straub, J.; Ravi, N.; Johnson, J.; Gkioxari, G. Omni3d: A large benchmark and model for 3d object detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13154–13164. [Google Scholar]
  159. Zhou, Y.; Liu, Q.; Zhu, H.; Li, Y.; Chang, S.; Guo, M. Mogde: Boosting mobile monocular 3d object detection with ground depth estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 2033–2045. [Google Scholar]
  160. Wu, Z.; Wu, Y.; Pu, J.; Li, X.; Wang, X. Attention-based depth distillation with 3d-aware positional encoding for monocular 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2892–2900. [Google Scholar] [CrossRef]
  161. Hong, Y.; Dai, H.; Ding, Y. Cross-modality knowledge distillation network for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 87–104. [Google Scholar]
  162. Wu, Z.; Gan, Y.; Wang, L.; Chen, G.; Pu, J. Monopgc: Monocular 3d object detection with pixel geometry contexts. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 4842–4849. [Google Scholar]
  163. Zhou, Y.; Zhu, H.; Liu, Q.; Chang, S.; Guo, M. Monoatt: Online monocular 3d object detection with adaptive token transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17493–17503. [Google Scholar]
  164. Min, Z.; Zhuang, B.; Schulter, S.; Liu, B.; Dunn, E.; Chandraker, M. Neurocs: Neural nocs supervision for monocular 3d object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21404–21414. [Google Scholar]
  165. Zhang, W.; Liu, D.; Ma, C.; Cai, W. Alleviating foreground sparsity for semi-supervised monocular 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 7542–7552. [Google Scholar]
  166. Jinrang, J.; Li, Z.; Shi, Y. Monouni: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. Adv. Neural Inf. Process. Syst. 2023, 36, 11703–11715. [Google Scholar]
  167. Vu, K.D.; Tran, T.T.; Nguyen, D.D. MonoDSSMs: Efficient Monocular 3D Object Detection with Depth-Aware State Space Models. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3883–3900. [Google Scholar]
  168. Jiang, X.; Jin, S.; Zhang, X.; Shao, L.; Lu, S. MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders. arXiv 2024, arXiv:2405.07696. [Google Scholar]
  169. Ranasinghe, Y.; Hegde, D.; Patel, V.M. Monodiff: Monocular 3d object detection and pose estimation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10659–10670. [Google Scholar]
  170. Gao, Y.; Wang, P.; Li, X.; Sun, M.; Di, R.; Li, L.; Hong, W. MonoDFNet: Monocular 3D Object Detection with Depth Fusion and Adaptive Optimization. Sensors 2025, 25, 760. [Google Scholar] [CrossRef]
  171. Zhang, J.; Li, J.; Lin, X.; Zhang, W.; Tan, X.; Han, J.; Ding, E.; Wang, J.; Li, G. Decoupled pseudo-labeling for semi-supervised monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16923–16932. [Google Scholar]
  172. Shi, P.; Dong, X.; Ge, R.; Liu, Z.; Yang, A. Dp-M3D: Monocular 3D object detection algorithm with depth perception capability. Knowl.-Based Syst. 2025, 318, 113539. [Google Scholar] [CrossRef]
  173. Kim, J.; Moon, S.; Nah, S.; Shim, D.H. MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model. arXiv 2025, arXiv:2502.00315. [Google Scholar]
  174. Gao, R.; Kim, J.; Phuong, M.C.; Cho, K. Pseudo-LiDAR with Two-Dimensional Instance for Monocular Three-Dimensional Object Tracking. IEEE Access 2025, 13, 45771–45783. [Google Scholar] [CrossRef]
  175. Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3d object proposals for accurate object class detection. Adv. Neural Inf. Process. Syst. 2015, 28, 424–432. [Google Scholar]
  176. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
  177. Qin, Z.; Wang, J.; Lu, Y. Triangulation learning network: From monocular to stereo 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7615–7623. [Google Scholar]
  178. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv 2019, arXiv:1906.06310. [Google Scholar]
  179. Königshof, H.; Salscheider, N.O.; Stiller, C. Realtime 3d object detection for automated driving using stereo vision and semantic information. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 1405–1410. [Google Scholar]
  180. Li, C.; Ku, J.; Waslander, S.L. Confidence guided stereo 3D object detection with split depth estimation. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 5776–5783. [Google Scholar]
  181. Xu, Z.; Zhang, W.; Ye, X.; Tan, X.; Yang, W.; Wen, S.; Ding, E.; Meng, A.; Huang, L. Zoomnet: Part-aware adaptive zooming neural network for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12557–12564. [Google Scholar] [CrossRef]
  182. Sun, J.; Chen, L.; Xie, Y.; Zhang, S.; Jiang, Q.; Zhou, X.; Bao, H. Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10548–10557. [Google Scholar]
  183. Qian, R.; Garg, D.; Wang, Y.; You, Y.; Belongie, S.; Hariharan, B.; Campbell, M.; Weinberger, K.Q.; Chao, W.L. End-to-end pseudo-lidar for image-based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5881–5890. [Google Scholar]
  184. Garg, D.; Wang, Y.; Hariharan, B.; Campbell, M.; Weinberger, K.Q.; Chao, W.L. Wasserstein distances for stereo disparity estimation. Adv. Neural Inf. Process. Syst. 2020, 33, 22517–22529. [Google Scholar]
  185. Pon, A.D.; Ku, J.; Li, C.; Waslander, S.L. Object-centric stereo matching for 3d object detection. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8383–8389. [Google Scholar]
  186. Li, P.; Su, S.; Zhao, H. Rts3d: Real-time stereo 3d detection from 4d feature-consistency embedding space for autonomous driving. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1930–1939. [Google Scholar] [CrossRef]
  187. Königshof, H.; Stiller, C. Learning-based shape estimation with grid map patches for realtime 3D object detection for automated driving. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6. [Google Scholar]
  188. Peng, X.; Zhu, X.; Wang, T.; Ma, Y. Side: Center-based stereo 3d detector with structure-aware instance depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 119–128. [Google Scholar]
  189. Shi, Y.; Guo, Y.; Mi, Z.; Li, X. Stereo CenterNet-based 3D object detection for autonomous driving. Neurocomputing 2022, 471, 219–229. [Google Scholar] [CrossRef]
  190. Gao, A.; Pang, Y.; Nie, J.; Shao, Z.; Cao, J.; Guo, Y.; Li, X. ESGN: Efficient stereo geometry network for fast 3D object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 34, 2000–2009. [Google Scholar] [CrossRef]
  191. Chen, Y.N.; Dai, H.; Ding, Y. Pseudo-stereo for monocular 3d object detection in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 887–897. [Google Scholar]
  192. Chen, Y.; Huang, S.; Liu, S.; Yu, B.; Jia, J. Dsgn++: Exploiting visual-spatial relation for stereo-based 3d detectors. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4416–4429. [Google Scholar] [CrossRef]
  193. Peng, L.; Wu, X.; Yang, Z.; Liu, H.; Cai, D. Did-m3d: Decoupling instance depth for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 71–88. [Google Scholar]
  194. Chen, J.; Wang, Q.; Peng, W.; Xu, H.; Li, X.; Xu, W. Disparity-based multiscale fusion network for transportation detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18855–18863. [Google Scholar] [CrossRef]
  195. Shi, Y. Svdm: Single-view diffusion model for pseudo-stereo 3d object detection. arXiv 2023, arXiv:2307.02270. [Google Scholar]
  196. Choi, H.M.; Kang, H.; Hyun, Y. Multi-view reprojection architecture for orientation estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2357–2366. [Google Scholar]
  197. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  198. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1477–1485. [Google Scholar] [CrossRef]
  199. Rukhovich, D.; Vorontsova, A.; Konushin, A. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2397–2406. [Google Scholar]
  200. Wang, Z.; Min, C.; Ge, Z.; Li, Y.; Li, Z.; Yang, H.; Huang, D. Sts: Surround-view temporal stereo for multi-view 3d detection. arXiv 2022, arXiv:2208.10145. [Google Scholar]
  201. Zhang, Y.; Zhu, Z.; Zheng, W.; Huang, J.; Huang, G.; Zhou, J.; Lu, J. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv 2022, arXiv:2205.09743. [Google Scholar]
  202. Jiang, Y.; Zhang, L.; Miao, Z.; Zhu, X.; Gao, J.; Hu, W.; Jiang, Y.G. Polarformer: Multi-camera 3d object detection with polar transformer. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1042–1050. [Google Scholar] [CrossRef]
  203. Wang, S.; Jiang, X.; Li, Y. Focal-petr: Embracing foreground for efficient multi-camera 3d object detection. IEEE Trans. Intell. Veh. 2023, 9, 1481–1489. [Google Scholar] [CrossRef]
  204. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Bevdistill: Cross-modal bev distillation for multi-view 3d object detection. arXiv 2022, arXiv:2211.09386. [Google Scholar]
  205. Park, J.; Xu, C.; Yang, S.; Keutzer, K.; Kitani, K.; Tomizuka, M.; Zhan, W. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv 2022, arXiv:2210.02443. [Google Scholar] [CrossRef]
  206. Wang, S.; Liu, Y.; Wang, T.; Li, Y.; Zhang, X. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3621–3631. [Google Scholar]
  207. Yu, Z.; Liu, Q.; Wang, W.; Zhang, L.; Zhao, X. PolarBEVDet: Exploring Polar Representation for Multi-View 3D Object Detection in Bird’s-Eye-View. arXiv 2024, arXiv:2408.16200. [Google Scholar]
  208. Wirges, S.; Fischer, T.; Stiller, C.; Frias, J.B. Object detection and classification in occupancy grid maps using deep convolutional networks. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3530–3535. [Google Scholar]
  209. Zhou, J.; Tan, X.; Shao, Z.; Ma, L. FVNet: 3D front-view proposal generation for real-time object detection from point clouds. In Proceedings of the 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Huaqiao, China, 19–21 October 2019; pp. 1–8. [Google Scholar]
  210. Zhang, Y.; Xiang, Z.; Qiao, C.; Chen, S. Accurate and Real-Time Object Detection Based on Bird’s Eye View on 3D Point Clouds. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 214–221. [Google Scholar]
  211. Yang, B.; Liang, M.; Urtasun, R. Hdnet: Exploiting hd maps for 3d object detection. In Proceedings of the Conference on Robot Learning, Zürich, Switzerland, 29–31 October 2018; pp. 146–155. [Google Scholar]
  212. Barrera, A.; Beltran, J.; Guindel, C.; Iglesias, J.A.; Garcia, F. Birdnet+: Two-stage 3d object detection in lidar through a sparsity-invariant bird’s eye view. IEEE Access 2021, 9, 160299–160316. [Google Scholar] [CrossRef]
  213. Koh, J.; Lee, J.; Lee, Y.; Kim, J.; Choi, J.W. Mgtanet: Encoding sequential lidar points using long short-term motion-guided temporal attention for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1179–1187. [Google Scholar] [CrossRef]
  214. Li, Z.; Guo, J.; Cao, T.; Bingbing, L.; Yang, W. Gpa-3d: Geometry-aware prototype alignment for unsupervised domain adaptive 3d object detection from point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6394–6403. [Google Scholar]
  215. Wang, D.Z.; Posner, I. Voting for voting in online point cloud object detection. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; Volume 1, pp. 10–15. [Google Scholar]
  216. He, C.; Li, R.; Li, S.; Zhang, L. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8417–8427. [Google Scholar]
  217. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437. [Google Scholar]
  218. Shi, G.; Li, R.; Ma, C. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 35–52. [Google Scholar]
  219. Sun, P.; Tan, M.; Wang, W.; Liu, C.; Xia, F.; Leng, Z.; Anguelov, D. Swformer: Sparse window transformer for 3d object detection in point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 426–442. [Google Scholar]
  220. Fei, H.; Zhao, J.; Zhang, Z.; Wang, H.; Huang, X. PV-GNN: Point-Voxel 3D Object Detection based on Graph Neural Network. Res. Sq. 2024. [Google Scholar] [CrossRef]
  221. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Ipod: Intensive point-based object detector for point cloud. arXiv 2018, arXiv:1812.05276. [Google Scholar] [CrossRef]
  222. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
  223. Zarzar, J.; Giancola, S.; Ghanem, B. PointRGCN: Graph convolution networks for 3D vehicles detection refinement. arXiv 2019, arXiv:1911.12236. [Google Scholar] [CrossRef]
  224. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
  225. Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
  226. Liu, Z.; Huang, T.; Li, B.; Chen, X.; Wang, X.; Bai, X. EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 8324–8341. [Google Scholar] [CrossRef]
  227. Chen, C.; Chen, Z.; Zhang, J.; Tao, D. Sasa: Semantics-augmented set abstraction for point-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 221–229. [Google Scholar] [CrossRef]
  228. Tang, Q.; Bai, X.; Guo, J.; Pan, B.; Jiang, W. DFAF3D: A dual-feature-aware anchor-free single-stage 3D detector for point clouds. Image Vis. Comput. 2023, 129, 104594. [Google Scholar] [CrossRef]
  229. Xia, Q.; Ye, W.; Wu, H.; Zhao, S.; Xing, L.; Huang, X.; Deng, J.; Li, X.; Wen, C.; Wang, C. Hinted: Hard instance enhanced detector with mixed-density feature fusion for sparsely-supervised 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15321–15330. [Google Scholar]
  230. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3d deep learning. Adv. Neural Inf. Process. Syst. 2019, 32, 965–975. [Google Scholar]
  231. He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882. [Google Scholar]
  232. Qian, R.; Lai, X.; Li, X. BADet: Boundary-aware 3D object detection from point clouds. Pattern Recognit. 2022, 125, 108524. [Google Scholar] [CrossRef]
  233. Mao, J.; Niu, M.; Bai, H.; Liang, X.; Xu, H.; Xu, C. Pyramid r-cnn: Towards better performance and adaptability for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2723–2732. [Google Scholar]
  234. He, Y.; Xia, G.; Luo, Y.; Su, L.; Zhang, Z.; Li, W.; Wang, P. DVFENet: Dual-branch voxel feature extraction network for 3D object detection. Neurocomputing 2021, 459, 201–211. [Google Scholar] [CrossRef]
  235. Yang, Z.; Jiang, L.; Sun, Y.; Schiele, B.; Jia, J. A unified query-based paradigm for point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8541–8551. [Google Scholar]
  236. Yang, H.; Wang, W.; Chen, M.; Lin, B.; He, T.; Chen, H.; He, X.; Ouyang, W. Pvt-ssd: Single-stage 3d object detector with point-voxel transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13476–13487. [Google Scholar]
  237. Koo, I.; Lee, I.; Kim, S.H.; Kim, H.S.; Jeon, W.J.; Kim, C. Pg-rcnn: Semantic surface point generation for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18142–18151. [Google Scholar]
  238. Wang, Z.; Li, Y.L.; Chen, X.; Zhao, H.; Wang, S. Uni3detr: Unified 3d detection transformer. Adv. Neural Inf. Process. Syst. 2023, 36, 39876–39896. [Google Scholar]
  239. Zhang, X.; Wang, L.; Chen, J.; Fang, C.; Yang, G.; Wang, Y.; Yang, L.; Song, Z.; Liu, L.; Zhang, X.; et al. Dual radar: A multi-modal dataset with dual 4d radar for autonomous driving. Sci. Data 2025, 12, 439. [Google Scholar] [CrossRef] [PubMed]
  240. Cheng, J.H.; Kuan, S.Y.; Liu, H.I.; Latapie, H.; Liu, G.; Hwang, J.N. Centerradarnet: Joint 3d object detection and tracking framework using 4d fmcw radar. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 998–1004. [Google Scholar]
  241. Shin, K.; Kwon, Y.P.; Tomizuka, M. Roarnet: A robust 3d object detection based on region approximation refinement. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2510–2515. [Google Scholar]
  242. Simon, M.; Amende, K.; Kraus, A.; Honer, J.; Samann, T.; Kaulbersch, H.; Milz, S.; Gross, H.M. Complexer-yolo: Real-time 3d object detection and tracking on semantic point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  243. Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 3047–3054. [Google Scholar]
  244. Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3d detection. Adv. Neural Inf. Process. Syst. 2021, 34, 16494–16507. [Google Scholar]
  245. Wang, C.; Ma, C.; Zhu, M.; Yang, X. Pointaugmenting: Cross-modal augmentation for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11794–11803. [Google Scholar]
  246. Li, J.; Chen, L.; Li, Z. Height-Adaptive Deformable Multi-Modal Fusion for 3D Object Detection. IEEE Access 2025, 13, 52385–52396. [Google Scholar] [CrossRef]
  247. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 244–253. [Google Scholar]
  248. Sindagi, V.A.; Zhou, Y.; Tuzel, O. Mvx-net: Multimodal voxelnet for 3d object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
  249. Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12460–12467. [Google Scholar] [CrossRef]
  250. Wang, J.; Zhu, M.; Sun, D.; Wang, B.; Gao, W.; Wei, H. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection. IEEE Access 2019, 7, 90801–90814. [Google Scholar] [CrossRef]
  251. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16. Springer: Cham, Switzerland, 2020; pp. 720–736. [Google Scholar]
  252. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
  253. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191. [Google Scholar]
  254. Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.G. Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21643–21652. [Google Scholar]
  255. Zhang, Y.; Chen, J.; Huang, D. Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 908–917. [Google Scholar]
  256. Li, X.; Shi, B.; Hou, Y.; Wu, X.; Ma, T.; Li, Y.; He, L. Homogeneous multi-modal feature fusion and interaction for 3D object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 691–707. [Google Scholar]
  257. Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y.; et al. Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17524–17534. [Google Scholar]
  258. Ren, B.; Yin, J. SDVRF: Sparse-to-Dense Voxel Region Fusion for Multi-modal 3D Object Detection. arXiv 2023, arXiv:2304.08304. [Google Scholar]
  259. Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 22014–22024. [Google Scholar]
  260. Yin, Z.; Sun, H.; Liu, N.; Zhou, H.; Shen, J. Fgfusion: Fine-grained lidar-camera fusion for 3d object detection. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; Springer: Singapore, 2023; pp. 505–517. [Google Scholar]
  261. Huang, L.; Li, Z.; Sima, C.; Wang, W.; Wang, J.; Qiao, Y.; Li, H. Leveraging vision-centric multi-modal expertise for 3d object detection. Adv. Neural Inf. Process. Syst. 2023, 36, 38504–38519. [Google Scholar]
  262. Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6792–6802. [Google Scholar]
  263. Viadero-Monasterio, F.; Alonso-Rentería, L.; Pérez-Oria, J.; Viadero-Rueda, F. Radar-based pedestrian and vehicle detection and identification for driving assistance. Vehicles 2024, 6, 1185–1199. [Google Scholar] [CrossRef]
Table 1. Exteroceptive sensors performance comparison.
| Sensor | Range | Accuracy | Cost | Comput. Cost | Size | Depth | Colour | Affected by Illumination | Affected by Weather |
|---|---|---|---|---|---|---|---|---|---|
| Monocular Camera | Medium | Medium | Low | High | Small | No | Yes | Yes | Yes |
| Stereo Camera | Medium | Medium | Medium | High | Medium | Yes | Yes | Yes | Yes |
| Infrared Camera | Medium | Medium | Low | Medium | Small | No | No | No | Yes |
| Sonar/Ultrasonic | Low | Low | Low | Low | Small | Yes | No | No | No |
| Radar | High | Medium | Medium | Medium | Medium | Yes | No | No | No |
| LiDAR | High | High | High | Medium | Large | Yes | No | No | Yes |
Table 2. Common datasets for 3D object detection in autonomous vehicles.
| Dataset | Year | # Cameras | # LiDARs | # Scenes | # Classes | Locations | Night | Rain | Annotated 3D BBoxes | Annotated Frames |
|---|---|---|---|---|---|---|---|---|---|---|
| KITTI [28] | 2012 | 2 | 1 | 22 | 3 | Germany | No | No | 80k | 15k |
| ApolloScape [38] | 2018 | 2 | 2 | 73 | 27 | China | Yes | No | 70k | 80k |
| nuScenes [36] | 2019 | 6 | 1 | 1000 | 23 | USA/Singapore | Yes | Yes | 1.4M | 40k |
| ArgoVerse [39] | 2019 | 9 | 2 | 113 | 15 | USA | Yes | Yes | 993K | 22k |
| Waymo Open [37] | 2019 | 5 | 5 | 1150 | 4 | USA | Yes | Yes | 12M | 230k |
| Lyft Level 5 [40] | 2019 | 7 | 3 | 366 | 9 | USA | No | No | 1.3M | 46k |
| H3D [41] | 2019 | 3 | 1 | 160 | 8 | USA | No | No | 1.1M | 27k |
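Most of the datasets in Table 2 distribute their 3D bounding-box annotations as per-frame label files. As a concrete illustration, the minimal sketch below parses one line of a KITTI-style label file; the 15-field order (class, truncation, occlusion, alpha, 2D box, 3D dimensions, 3D location, rotation) follows the publicly documented KITTI devkit format, while the class and function names are illustrative choices rather than part of any official API.

```python
# Minimal sketch: parsing one line of a KITTI-style 3D object label file.
# Field order follows the KITTI devkit readme; names below are our own.

from dataclasses import dataclass
from typing import List


@dataclass
class Object3D:
    cls: str             # object class, e.g. "Car", "Pedestrian", "Cyclist"
    truncation: float    # 0 (fully visible) .. 1 (fully truncated)
    occlusion: int       # 0..3, increasing occlusion level
    alpha: float         # observation angle in [-pi, pi]
    bbox2d: List[float]  # [x1, y1, x2, y2] in image pixels
    dims: List[float]    # [height, width, length] in metres
    loc: List[float]     # [x, y, z] in the camera frame, metres
    rot_y: float         # rotation around the camera Y axis in [-pi, pi]


def parse_kitti_label_line(line: str) -> Object3D:
    f = line.strip().split()
    return Object3D(
        cls=f[0],
        truncation=float(f[1]),
        occlusion=int(float(f[2])),
        alpha=float(f[3]),
        bbox2d=[float(v) for v in f[4:8]],
        dims=[float(v) for v in f[8:11]],
        loc=[float(v) for v in f[11:14]],
        rot_y=float(f[14]),
    )


if __name__ == "__main__":
    sample = ("Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 "
              "1.65 1.67 3.64 -0.65 1.71 46.70 1.59")
    obj = parse_kitti_label_line(sample)
    print(obj.cls, obj.dims, obj.loc, obj.rot_y)
```

Datasets such as nuScenes and Waymo Open use their own SDKs and annotation schemas, so a comparable parser would look different for each; the sketch above is only meant to show the kind of per-object 3D ground truth summarized in the last two columns of Table 2.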