A Survey of Deep Learning-Based 3D Object Detection Methods for Autonomous Driving Across Different Sensor Modalities
Abstract
1. Introduction
1.1. Overview of Autonomous Vehicles
1.2. Scope, Aims, and Outline
- The context for the task of 3D object detection, covering its formulation, the sensor modalities it relies on, and the benchmark datasets with their respective evaluation metrics.
- A comprehensive literature review of camera-based, LiDAR-based, radar-based, and multimodal 3D perception methods, including an updated taxonomy and a discussion of their evolution.
- A performance and speed benchmark of selected 3D object detectors using standard datasets and evaluation metrics.
2. Background
2.1. Problem Definition
2.2. Data Representation
- Voxels, which discretize 3D space into volumetric grids (see the voxelization sketch after this list);
- Point clouds, composed of unordered 3D points, sometimes augmented with intensity or reflectance values;
- Meshes, which represent object surfaces through vertices, edges, and faces [31].
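To make the voxel representation concrete, the following is a minimal NumPy sketch of mapping an unordered point cloud onto a volumetric grid. The detection range and voxel size are illustrative values in the spirit of common KITTI configurations, not a specific method's settings.

```python
# Minimal voxelization sketch (assumes NumPy; range/voxel-size values are illustrative).
import numpy as np

def voxelize(points, voxel_size=0.2,
             x_range=(0.0, 70.4), y_range=(-40.0, 40.0), z_range=(-3.0, 1.0)):
    """Map an (N, 3) point cloud to integer voxel indices; return occupied voxels."""
    lo = np.array([x_range[0], y_range[0], z_range[0]])
    hi = np.array([x_range[1], y_range[1], z_range[1]])
    mask = np.all((points >= lo) & (points < hi), axis=1)    # drop out-of-range points
    idx = np.floor((points[mask] - lo) / voxel_size).astype(np.int32)  # discretize
    return np.unique(idx, axis=0)                            # one entry per occupied voxel

pc = np.random.uniform(low=[0.0, -40.0, -3.0], high=[70.4, 40.0, 1.0], size=(1000, 3))
print(voxelize(pc).shape)  # (number of occupied voxels, 3)
```

The resulting occupied-voxel coordinates are what voxel-based detectors (Section 4.3.3) typically consume, usually via sparse 3D convolutions or pillar encoders.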
2.3. Sensors
- Exteroceptive sensors measure external variables and observe the surrounding environment. Examples include stereo, flash, infrared, and thermal cameras, as well as radar, LiDAR, and sonar [33].
- Proprioceptive sensors measure variables related to the vehicle state, providing information about its position, velocity, orientation, and acceleration. Examples include global navigation satellite systems (GNSSs), inertial measurement units (IMUs), ground speed sensors (GSSs), encoders, gyroscopes, and accelerometers [33].
- Monocular cameras are passive sensors that capture rich appearance information, including texture and colour, at low cost and high resolution. They produce 2D images but cannot directly recover depth, limiting their 3D localization capabilities [2,18]. Their performance deteriorates under adverse lighting conditions such as night-time, glare, fog, or rain [29].
- Stereo vision systems estimate depth by triangulating points from the disparity between images captured by two horizontally aligned cameras (for a point with disparity d, depth follows Z = fB/d, where f is the focal length and B the camera baseline), enhancing 3D understanding. However, they require precise calibration and are sensitive to low-texture regions and lighting variations [16,29]. Other systems, such as Time-of-Flight cameras, infer depth from the round-trip time of infrared pulses but offer lower resolution, while RGB-D sensors such as the Kinect combine colour and depth for a more complete spatial view.
- Infrared cameras detect radiation in the infrared band of the electromagnetic spectrum, including thermal emissions. They enable perception in dark or low-visibility conditions but typically provide low resolution and are less effective for detailed object classification.
- Sonar and ultrasonic sensors emit sound waves to detect nearby obstacles. They are compact, inexpensive, and reliable at short range, but provide low spatial resolution and are unsuitable for complex 3D perception tasks [2].
- Radar systems emit electromagnetic waves and detect their reflections to measure the position and relative velocity of objects using the Doppler effect. They provide long-range robustness and function reliably in adverse weather, although their angular resolution is lower, making fine-grained object detection more challenging [33].
- LiDAR sensors actively scan the environment with laser beams to generate detailed 3D PCs. A typical LiDAR unit emitting m beams over n rotation steps produces an m × n range image, which can be converted into a PC [18,29] (see the sketch after this list). LiDAR provides high spatial accuracy independent of lighting conditions, although it remains relatively expensive and can be affected by environmental factors like fog, rain, or snow [16].
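As a concrete illustration of the range-image-to-PC conversion mentioned in the LiDAR item above, the sketch below inverts the spherical projection. The beam elevation angles are hypothetical values loosely modelled on a 64-beam spinning LiDAR, not any specific sensor's calibration.

```python
# Minimal sketch (NumPy, hypothetical angles) of converting an m x n LiDAR
# range image into a point cloud via spherical-to-Cartesian coordinates.
import numpy as np

def range_image_to_pc(rng, elev_deg):
    """rng: (m, n) ranges in metres; elev_deg: (m,) per-beam elevation angles."""
    m, n = rng.shape
    azim = np.linspace(-np.pi, np.pi, n, endpoint=False)  # one azimuth per column
    elev = np.deg2rad(np.asarray(elev_deg))[:, None]      # (m, 1) elevations
    x = rng * np.cos(elev) * np.cos(azim)                 # broadcast to (m, n)
    y = rng * np.cos(elev) * np.sin(azim)
    z = rng * np.sin(elev)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)    # (m*n, 3) points

rng = np.full((64, 2048), 20.0)            # toy 64-beam scan, all returns at 20 m
elev = np.linspace(-24.8, 2.0, 64)         # elevations resembling a 64-beam unit
print(range_image_to_pc(rng, elev).shape)  # (131072, 3)
```

In practice, each return also carries an intensity value, and pixels with no return are masked out before the conversion.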
3. Datasets and Evaluation Metrics
3.1. Benchmark Datasets
- KITTI: The KITTI dataset (https://www.cvlibs.net/datasets/kitti/, accessed on 25 April 2025), developed by the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago, remains one of the most widely used benchmarks for AD. It provides stereo RGB images, LiDAR PCs, and calibration files. The dataset includes 7481 training and 7518 testing frames, with over 80,000 annotated 3D bounding boxes. Objects are labelled as easy, moderate, or hard based on occlusion, truncation, and object size [29] (see the label-parsing sketch after this list). Data were collected using a Velodyne LiDAR, stereo cameras, and GNSS/IMU sensors across 22 scenes in urban and highway environments [32].
- nuScenes: The nuScenes dataset (https://www.nuscenes.org/, accessed on 25 April 2025), developed by Motional (https://motional.com/), comprises 1000 driving scenes of 20 s each, with keyframes annotated at 2 Hz. Each scene contains annotations for 23 object categories. The sensor suite includes six cameras, a 32-beam LiDAR, and five radars. In total, the dataset features over 390,000 LiDAR sweeps and 1.4 million annotated bounding boxes [24].
- Waymo Open: The Waymo Open Dataset (https://waymo.com/open/, accessed on 25 April 2025) includes approximately 230,000 annotated frames and over 12 million 3D bounding boxes. It provides synchronized data from five LiDAR sensors and five cameras, spanning 798 training, 202 validation, and 150 test segments. Annotated classes cover vehicles, pedestrians, cyclists, and traffic signs [24].
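For reference, the sketch below parses a single object annotation in KITTI's label format (one plain-text line per object; field order follows the KITTI object development kit). The sample line is illustrative.

```python
# Minimal sketch of parsing one line of a KITTI 3D object label file.
def parse_kitti_label(line):
    f = line.split()
    return {
        "type": f[0],                                # e.g. 'Car', 'Pedestrian', 'Cyclist'
        "truncated": float(f[1]),                    # 0 (visible) .. 1 (fully truncated)
        "occluded": int(f[2]),                       # 0..3 occlusion state
        "alpha": float(f[3]),                        # observation angle in [-pi, pi]
        "bbox_2d": [float(v) for v in f[4:8]],       # left, top, right, bottom (pixels)
        "dimensions": [float(v) for v in f[8:11]],   # height, width, length (m)
        "location": [float(v) for v in f[11:14]],    # x, y, z in camera coordinates (m)
        "rotation_y": float(f[14]),                  # yaw around the camera Y axis
    }

sample = ("Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 "
          "1.65 1.67 3.64 -0.65 1.71 46.70 -1.59")   # illustrative values
print(parse_kitti_label(sample)["dimensions"])       # [1.65, 1.67, 3.64]
```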
3.2. Evaluation Metrics
3.2.1. General Metrics
- True Positives (TP): Correctly predicted positives.
- False Positives (FP): Incorrectly predicted positives.
- False Negatives (FN): Ground-truth objects that were missed.
- True Negatives (TN): Correctly predicted negatives.
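From these counts, precision (TP/(TP + FP)) and recall (TP/(TP + FN)) follow directly; average precision (AP) is then the area under the precision–recall curve. A minimal sketch of the two definitions:

```python
# Minimal sketch of the precision/recall definitions built from the counts above.
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Fraction of ground-truth positives that were found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(precision(tp=80, fp=20), recall(tp=80, fn=40))  # 0.8 0.666...
```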
3.2.2. Dataset-Specific Metrics
- AP2D: Average precision computed by projecting the predicted 3D bounding boxes into the 2D image plane and calculating 2D IoU.
- AP3D: Average precision computed using the full 3D bounding box IoU.
- APBEV: Average precision computed from a bird’s-eye view (BEV) projection of the 3D bounding box (an IoU sketch for these AP variants follows this list).
- nuScenes: The nuScenes benchmark [36] proposes a more comprehensive evaluation scheme that moves beyond traditional IoU-based matching. The authors argue that IoU alone does not capture all relevant aspects of detection quality in complex urban environments. Instead, nuScenes introduces centre-based matching, where predicted objects are associated with the ground truth based on their 2D centre distance on the ground plane. The newly introduced scores quantify how closely the predicted objects align with the ground truth not just in location, but also in shape, pose, and dynamic behaviour. The final nuScenes Detection Score (NDS) aggregates the mean average precision (mAP) and the mean true-positive metrics (mTP) into a single holistic score: NDS = (1/10) [5 · mAP + Σ_{mTP ∈ TP} (1 − min(1, mTP))], where TP denotes the set of five mean true-positive error metrics (translation, scale, orientation, velocity, and attribute).
- Waymo Open Dataset: The Waymo benchmark [37] evaluates detection at two levels:
- Level 1 (L1): Objects with more than five LiDAR points inside the bounding box.
- Level 2 (L2): All annotated objects, including sparse detections.
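All of the KITTI-style AP variants above reduce to an IoU test between a predicted and a ground-truth box (in the image plane, in BEV, or in full 3D). The sketch below handles the axis-aligned case only; rotated 3D boxes additionally require polygon clipping for the ground-plane overlap, and nuScenes replaces this test with the centre-distance matching described above.

```python
# Minimal sketch of the IoU computation underlying AP2D/AP3D/APBEV matching
# (axis-aligned boxes only; rotated boxes need polygon clipping).
def iou_axis_aligned(a, b):
    """a, b: (x_min, y_min, x_max, y_max) in image or BEV coordinates."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap extent along x
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap extent along y
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

print(iou_axis_aligned((0, 0, 4, 2), (2, 0, 6, 2)))  # 0.333...
```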
4. Taxonomy and Review
4.1. Taxonomy of 3D Object Detection
4.2. Camera-Based Methods
4.2.1. Monocular-Based Methods
4.2.2. Stereo-Based Methods
4.2.3. Multi-View/Multi-Camera-Based Methods
4.2.4. Discussion
4.3. LiDAR-Based Methods
4.3.1. Projection-Based Methods
4.3.2. Point-Based Methods
4.3.3. Voxel-Based Methods
4.3.4. Point–Voxel Hybrid Methods
4.3.5. Other Representations
4.3.6. Discussion
4.4. Radar-Based Methods
4.5. Multi-Modal-Based Methods
4.5.1. Early Fusion Methods
4.5.2. Mid-Level Fusion Methods
4.5.3. Late Fusion Methods
4.5.4. Discussion
Method | Year | AP2D | AP3D | APBEV | nuScenes | Waymo | Time (s) | Hardware | Code Available | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
E | M | H | E | M | H | E | M | H | mAP | NDS | L1 mAP | L2 mAP | |||||
Monocular Camera: | |||||||||||||||||
3DVP [43] | 2015 | 84.95 | 76.98 | 65.78 | – | – | – | – | – | – | – | – | – | – | 40 | 8 cores @ 3.5 GHz (Matlab + C/C++) | ✓ |
Mono3D [44] | 2016 | 80.30 | 67.29 | 62.23 | – | – | – | – | – | – | – | – | – | – | 4.2 | GPU @ 2.5 GHz (Matlab + C/C++) | ✓ |
SubCNN [110] | 2016 | 94.26 | 89.98 | 79.78 | – | – | – | – | – | – | – | – | – | – | 2 | GPU @ 3.5 GHz (Python + C/C++) | ✓ |
Deep3DBox [45] | 2016 | 94.71 | 90.19 | 76.82 | – | – | – | – | – | – | – | – | – | – | 1.5 | GPU @ 2.5 GHz (C/C++) | ✓ |
Deep MANTA [111] | 2017 | 98.89 | 93.50 | 83.21 | – | – | – | – | – | – | – | – | – | – | 0.7 | GPU @ 2.5 GHz (Python + C/C++) | × |
3D-RCNN [112] | 2018 | 90.02 | 89.39 | 80.29 | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
ROI-10D [113] | 2018 | 76.56 | 70.16 | 61.15 | 4.32 | 2.02 | 1.46 | 9.78 | 4.91 | 3.74 | – | – | – | – | 0.20 | GPU @ 3.5 GHz (Python) | × |
MF3D [114] | 2018 | 90.43 | 87.33 | 76.78 | 7.08 | 5.18 | 4.68 | 13.73 | 9.62 | 8.22 | – | – | – | – | – | – | ✓ |
MonoGRNet [115] | 2018 | 88.65 | 77.94 | 63.31 | 9.61 | 5.74 | 4.25 | 18.19 | 11.17 | 8.73 | – | – | – | – | 0.04 | NVIDIA P40 | ✓ |
GS3D [116] | 2019 | 86.23 | 76.35 | 62.67 | 4.47 | 2.90 | 2.47 | 8.41 | 6.08 | 4.94 | – | – | – | – | 2 | 1 core @ 2.5 GHz (C/C++) | × |
Mono3D-PLiDAR [117] | 2019 | 80.85 | 53.36 | 44.80 | 10.76 | 7.50 | 6.10 | 21.27 | 13.92 | 11.25 | – | – | – | – | 0.10 | NVIDIA GeForce 1080 (pytorch) | × |
AM3D [118] | 2019 | 92.55 | 88.71 | 77.88 | 16.50 | 10.74 | 9.52 | 27.91 | 22.24 | 18.62 | – | – | – | – | 0.40 | GPU @ 2.5 GHz (Python + C/C++) | × |
Deep Optics [119] | 2019 | – | – | – | 16.86 | 13.82 | 13.26 | 26.71 | 19.87 | 19.11 | – | – | – | – | – | – | × |
CenterNet [120] | 2019 | – | – | – | – | – | – | – | – | – | 33.80 | 40.00 | – | – | – | – | ✓ |
FQNet [121] | 2019 | 94.72 | 90.17 | 76.78 | 2.77 | 1.51 | 1.01 | 5.40 | 3.23 | 2.46 | – | – | – | – | 0.50 | 1 core @ 2.5 GHz (Python) | × |
Shift R-CNN [122] | 2019 | 94.07 | 88.48 | 78.34 | 6.88 | 3.87 | 2.83 | 11.84 | 6.82 | 5.27 | – | – | – | – | 0.25 | GPU @ 1.5 GHz (Python) | × |
MonoFENet [123] | 2019 | 91.68 | 86.63 | 76.71 | 8.35 | 5.14 | 4.10 | 17.03 | 11.03 | 9.05 | – | – | – | – | 0.15 | 1 core @ 3.5 GHz (Python) | × |
MonoDIS [47] | 2019 | 90.31 | 87.58 | 76.85 | 10.37 | 7.94 | 6.40 | 18.80 | 19.08 | 17.41 | 30.40 | 38.40 | – | – | – | – | ✓ |
MonoPSR [124] | 2019 | 93.63 | 88.50 | 73.36 | 10.76 | 7.25 | 5.85 | 18.33 | 12.58 | 9.91 | – | – | – | – | 0.20 | GPU @ 3.5 GHz (Python) | ✓ |
MoVi-3D [125] | 2019 | – | – | – | 15.19 | 10.90 | 9.26 | 22.76 | 17.03 | 14.85 | – | – | – | – | – | – | × |
RefinedMPL [126] | 2019 | 88.29 | 65.24 | 53.20 | 18.09 | 11.14 | 8.94 | 28.08 | 17.60 | 13.95 | – | – | – | – | 0.15 | GPU @ 2.5 GHz (Python + C/C++) | × |
M3D-RPN [46] | 2019 | 89.04 | 85.08 | 69.26 | 14.76 | 9.71 | 7.42 | 21.02 | 13.67 | 10.23 | – | – | – | – | 0.16 | GPU @ 1.5 GHz (Python) | ✓ |
SS3D [48] | 2020 | 92.72 | 84.92 | 70.35 | 10.78 | 7.68 | 6.51 | 16.33 | 11.52 | 9.93 | – | – | – | – | 0.048 | Tesla V100 | ✓ |
MonoPair [127] | 2020 | 96.61 | 93.55 | 83.55 | 13.04 | 9.99 | 8.65 | 19.28 | 14.83 | 12.89 | – | – | – | – | 0.06 | GPU @ 2.5 GHz (Python + C/C++) | × |
RTM3D [128] | 2020 | 91.82 | 86.93 | 77.41 | 14.41 | 10.34 | 8.77 | 19.17 | 14.20 | 11.99 | – | – | – | – | 0.05 | GPU @ 1.0 GHz (Python) | ✓ |
SMOKE [49] | 2020 | 93.21 | 87.51 | 77.66 | 14.03 | 9.76 | 7.84 | 20.83 | 14.49 | 12.75 | – | – | – | – | 0.03 | GPU @ 2.5 GHz (Python) | ✓ |
PatchNet [129] | 2020 | – | – | – | 15.68 | 11.12 | 10.17 | 22.97 | 16.86 | 14.97 | – | – | 0.39 | 0.38 | 0.4 | 1 core @ 2.5 GHz (C/C++) | × |
IAFA [130] | 2020 | 93.08 | 89.46 | 79.83 | 17.81 | 12.01 | 10.61 | 25.88 | 17.88 | 15.35 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (C/C++) | × |
Kinematic3D [131] | 2020 | 89.67 | 71.73 | 54.97 | 19.07 | 12.72 | 9.17 | 26.69 | 17.52 | 13.10 | – | – | – | – | 0.12 | 1 core @ 1.5 GHz (C/C++) | ✓ |
KM3D [132] | 2020 | 96.44 | 91.07 | 81.19 | 16.73 | 11.45 | 9.92 | 23.44 | 16.20 | 14.47 | – | – | – | – | 0.03 | 1 core @ 2.5 GHz (Python) | ✓ |
DDMP-3D [133] | 2021 | 91.15 | 81.70 | 63.12 | 19.71 | 12.78 | 9.80 | 28.08 | 17.89 | 13.44 | – | – | – | – | 0.18 | 1 core @ 2.5 GHz (Python) | ✓ |
MonoRUn [134] | 2021 | 95.48 | 87.91 | 78.10 | 19.65 | 12.30 | 10.58 | 27.94 | 17.34 | 15.24 | – | – | – | – | 0.07 | GPU @ 2.5 GHz (Python + C/C++) | ✓ |
GrooMeD-NMS [135] | 2021 | 90.14 | 80.28 | 63.78 | 18.10 | 12.32 | 9.65 | 26.19 | 18.27 | 14.05 | – | – | – | – | 0.12 | 1 core @ 2.5 GHz (Python) | ✓ |
MonoDLE [136] | 2021 | 93.83 | 90.81 | 80.93 | 17.23 | 12.26 | 10.29 | 24.79 | 18.89 | 16.00 | – | – | – | – | 0.04 | GPU @ 2.5 GHz (Python) | ✓ |
CaDDN [137] | 2021 | 93.61 | 80.73 | 71.09 | 19.17 | 13.41 | 11.46 | 27.94 | 18.91 | 17.19 | – | – | – | – | 0.63 | GPU @ 2.5 GHz (Python) | ✓ |
MonoFlex [138] | 2021 | 96.01 | 91.02 | 83.38 | 19.94 | 13.89 | 12.07 | 28.23 | 19.75 | 16.89 | – | – | – | – | 0.03 | GPU @ 2.5 GHz (Python) | ✓ |
MonoRCNN [139] | 2021 | 91.90 | 86.48 | 66.71 | 18.36 | 12.65 | 10.03 | 25.48 | 18.11 | 14.10 | – | – | – | – | 0.07 | GPU @ 2.5 GHz (Python) | ✓ |
FCOS3D [140] | 2021 | – | – | – | – | – | – | – | – | – | 35.80 | 42.80 | – | – | – | – | ✓ |
MonoEF [141] | 2021 | 96.32 | 90.88 | 83.27 | 21.29 | 13.87 | 11.71 | 29.03 | 19.70 | 17.26 | – | – | – | – | 0.03 | 1 core @ 2.5 GHz (Python) | ✓ |
GUPNet [142] | 2021 | 94.15 | 86.45 | 74.18 | 22.26 | 15.02 | 13.12 | 30.29 | 21.19 | 18.20 | – | – | – | – | – | 1 core @ 2.5 GHz (Python + C/C++) | ✓ |
PGD [143] | 2021 | 92.04 | 80.58 | 69.67 | 19.05 | 11.76 | 9.39 | 26.89 | 16.51 | 13.49 | 38.60 | 44.80 | – | – | 0.03 | 1 core @ 2.5 GHz (C/C++) | ✓ |
Aug3D-RPN [144] | 2021 | 85.57 | 77.88 | 61.16 | 17.82 | 12.99 | 9.78 | 26.00 | 17.89 | 14.18 | – | – | – | – | 0.08 | 1 core @ 2.5 GHz (C/C++) | × |
DD3D [145] | 2021 | 94.69 | 93.99 | 89.37 | 23.19 | 16.87 | 14.36 | 32.35 | 23.41 | 20.42 | 41.80 | 47.70 | – | – | – | 1 core @ 2.5 GHz (C/C++) | ✓ |
PCT [146] | 2021 | 96.45 | 88.78 | 78.85 | 21.00 | 13.37 | 11.31 | 29.65 | 19.03 | 15.92 | – | – | 0.89 | 0.66 | 0.045 | 1 core @ 2.5 GHz (Python) | ✓ |
Autoshape [147] | 2021 | 86.51 | 77.60 | 64.40 | 22.47 | 14.17 | 11.36 | 30.66 | 20.08 | 15.95 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (C/C++) | ✓ |
DLE [148] | 2021 | 94.66 | 84.45 | 62.10 | 24.23 | 14.33 | 10.30 | 31.09 | 19.05 | 14.13 | – | – | – | – | 0.06 | NVIDIA Tesla V100 | × |
MonoCon [149] | 2021 | – | – | – | 22.50 | 16.46 | 13.95 | 31.12 | 22.10 | 19.00 | – | – | – | – | – | – | ✓ |
MonoDistill [50] | 2022 | – | – | – | 22.97 | 16.03 | 13.60 | 31.87 | 22.59 | 19.72 | – | – | – | – | – | – | ✓ |
MonoDTR [150] | 2022 | 93.90 | 88.41 | 76.20 | 21.99 | 15.39 | 12.73 | 28.59 | 20.38 | 17.14 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (C/C++) | ✓ |
MonoDETR [151] | 2022 | 93.99 | 86.17 | 76.19 | 24.52 | 16.26 | 13.93 | 32.20 | 21.45 | 18.68 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | ✓ |
MonoJSG [152] | 2022 | – | – | – | 24.69 | 16.14 | 13.64 | 32.59 | 21.26 | 18.18 | – | – | – | – | – | – | ✓ |
HomoLoss [153] | 2022 | 95.92 | 90.69 | 80.91 | 21.75 | 14.94 | 13.07 | 29.60 | 20.68 | 17.81 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | × |
MonoDDE [154] | 2022 | 96.76 | 89.19 | 81.60 | 24.93 | 17.14 | 15.10 | 33.58 | 23.46 | 20.37 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | × |
Mix-Teaching [155] | 2022 | 96.35 | 91.02 | 83.41 | 26.89 | 18.54 | 15.79 | 35.74 | 24.23 | 20.80 | – | – | – | – | 30 | 1 core @ 2.5 GHz (C/C++) | ✓ |
DCD [156] | 2022 | 96.44 | 90.93 | 83.36 | 23.81 | 15.90 | 13.21 | 32.55 | 21.50 | 18.25 | – | – | – | – | 0.03 | 1 core @ 2.5 GHz (C/C++) | ✓ |
DEVIANT [157] | 2022 | 94.42 | 86.64 | 76.69 | 21.88 | 14.46 | 11.89 | 29.65 | 20.44 | 17.43 | – | – | – | – | 0.04 | 1 GPU (Python) | ✓ |
Cube R-CNN [158] | 2022 | 95.78 | 92.72 | 84.81 | 23.59 | 15.01 | 12.06 | 31.70 | 21.20 | 18.43 | – | – | – | – | 0.05 | GPU @ 2.5 GHz (Python) | ✓ |
MoGDE [159] | 2022 | – | – | – | 27.07 | 17.88 | 15.66 | 38.38 | 25.60 | 22.91 | – | – | – | – | – | – | × |
ADD [160] | 2022 | 94.82 | 89.53 | 81.60 | 25.61 | 16.81 | 13.79 | 35.20 | 23.58 | 20.08 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (Python) | × |
CMKD [161] | 2022 | 95.14 | 90.28 | 83.91 | 28.55 | 18.69 | 16.77 | 38.98 | 25.82 | 22.80 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (C/C++) | ✓ |
MonoPGC [162] | 2023 | – | – | – | 24.68 | 17.17 | 14.14 | 32.50 | 23.14 | 20.30 | – | – | – | – | – | – | × |
MonoATT [163] | 2023 | – | – | – | 24.72 | 17.37 | 15.00 | 36.87 | 24.42 | 21.88 | – | – | – | – | – | – | × |
NeurOCS [164] | 2023 | 96.39 | 91.08 | 81.20 | 29.89 | 18.94 | 15.90 | 37.27 | 24.49 | 20.89 | – | – | – | – | 0.10 | GPU @ 2.5 GHz (Python) | × |
MonoNerd [52] | 2023 | 94.60 | 86.89 | 77.23 | 22.75 | 17.13 | 15.63 | 31.13 | 23.46 | 20.97 | – | – | – | – | NA | 1 core @ 2.5 GHz (Python) | ✓ |
MonoSKD [51] | 2023 | 96.68 | 91.34 | 83.69 | 28.43 | 17.35 | 15.01 | 37.12 | 24.08 | 20.37 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | ✓ |
ODM3D [165] | 2023 | – | – | – | 29.75 | 19.09 | 16.93 | 39.41 | 26.02 | 22.76 | – | – | – | – | – | – | ✓ |
MonoUNI [166] | 2023 | 94.30 | 88.96 | 78.95 | 24.75 | 16.73 | 13.49 | 33.28 | 23.05 | 19.39 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | × |
MonoDSSM [167] | 2024 | 93.96 | 88.31 | 76.15 | 21.47 | 14.55 | 11.78 | 28.29 | 19.59 | 16.34 | – | – | – | – | 0.02 | 1 core @ 2.5 GHz (Python + C/C++) | × |
MonoCD [53] | 2024 | 96.43 | 92.91 | 85.55 | 25.53 | 16.59 | 14.53 | 33.41 | 22.81 | 19.57 | – | – | – | – | NA | 1 core @ 2.5 GHz (Python) | ✓ |
MonoMAE [168] | 2024 | – | – | – | 25.60 | 18.84 | 16.78 | 34.15 | 24.93 | 21.76 | – | – | – | – | – | – | × |
MonoDiff [169] | 2024 | – | – | – | 30.18 | 21.02 | 18.16 | – | – | – | – | – | – | – | – | – | × |
MonoDFNet [170] | 2024 | – | – | – | 25.71 | 19.07 | 15.96 | 33.56 | 24.52 | 21.09 | – | – | – | – | – | – | ✓ |
DPL [171] | 2024 | – | – | – | 24.19 | 16.67 | 13.83 | 33.16 | 22.12 | 18.74 | – | – | – | – | – | – | × |
Dp-M3D [172] | 2025 | – | – | – | 23.41 | 13.65 | 12.91 | 32.38 | 20.13 | 16.58 | – | – | – | – | – | – | × |
MonoDINO-DETR [173] | 2025 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
Pseudo-LiDAR2D [174] | 2025 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
Stereo Camera: | |||||||||||||||||
3DOP [175] | 2015 | 92.96 | 89.55 | 79.38 | – | – | – | – | – | – | – | – | – | – | 3.00 | GPU @ 2.5 GHz (Matlab + C/C++) | × |
Pseudo-LiDAR [176] | 2018 | 85.40 | 67.79 | 58.50 | 54.53 | 34.05 | 28.25 | 67.30 | 45.00 | 38.40 | – | – | – | – | 0.40 | GPU @ 2.5 GHz (Python + C/C++) | ✓ |
Stereo R-CNN [54] | 2019 | 93.98 | 85.98 | 71.25 | 47.58 | 30.23 | 23.72 | 61.92 | 41.31 | 33.42 | – | – | – | – | 0.30 | GPU @ 2.5 GHz (Python) | ✓ |
TLNet [177] | 2019 | 76.92 | 63.53 | 54.58 | 7.64 | 4.37 | 3.74 | 13.71 | 7.69 | 6.73 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (Python) | ✓ |
Pseudo-LiDAR++ [178] | 2019 | 94.46 | 82.90 | 75.45 | 61.11 | 42.43 | 36.99 | 78.31 | 58.01 | 51.25 | – | – | – | – | 0.40 | GPU @ 2.5 GHz (Python) | ✓ |
RT3D-Stereo [179] | 2019 | 56.53 | 45.81 | 37.63 | 29.90 | 23.28 | 18.96 | 58.81 | 46.82 | 38.38 | – | – | – | – | 0.08 | GPU @ 2.5 GHz (C/C++) | ✓ |
DSGN [55] | 2020 | 95.53 | 86.43 | 78.75 | 73.50 | 52.18 | 45.14 | 82.90 | 65.05 | 56.60 | – | – | – | – | 0.67 | NVIDIA Tesla V100 | ✓ |
OC-Stereo [180] | 2020 | 87.39 | 74.60 | 62.56 | 55.15 | 37.60 | 30.25 | 68.89 | 51.47 | 42.97 | – | – | – | – | 0.35 | 1 core @ 2.5 GHz (Python + C/C++) | ✓ |
ZoomNet [181] | 2020 | 94.22 | 83.92 | 69.00 | 55.98 | 38.64 | 30.97 | 72.94 | 54.91 | 44.14 | – | – | – | – | 0.30 | 1 core @ 2.5 GHz (C/C++) | ✓ |
Disp R-CNN [182] | 2020 | 93.45 | 82.64 | 70.45 | 68.21 | 45.78 | 37.73 | 79.76 | 58.62 | 47.73 | – | – | – | – | 0.387 | GPU @ 2.5 GHz (Python + C/C++) | ✓ |
Pseudo-LiDAR E2E [183] | 2020 | – | – | – | 64.75 | 43.92 | 38.14 | 79.60 | 58.80 | 52.10 | – | – | – | – | – | – | ✓ |
CDN [184] | 2020 | 95.85 | 87.19 | 79.43 | 74.52 | 54.22 | 46.36 | 83.32 | 66.24 | 57.65 | – | – | – | – | 0.60 | GPU @ 2.5 GHz (Python) | ✓ |
CG-Stereo [185] | 2020 | 96.31 | 90.38 | 82.80 | 74.39 | 53.58 | 46.50 | 85.29 | 66.44 | 58.95 | – | – | – | – | 0.57 | GeForce RTX 2080 Ti | × |
RTS3D [186] | 2020 | – | – | – | 58.51 | 37.38 | 31.12 | 72.17 | 45.22 | 38.48 | – | – | – | – | – | – | ✓ |
RT3D-GMP [187] | 2020 | 62.41 | 51.95 | 39.14 | 16.23 | 11.41 | 10.12 | 69.14 | 59.00 | 45.49 | – | – | – | – | 0.06 | GPU @ 2.5 GHz (Python + C/C++) | × |
YOLOStereo3D [56] | 2021 | 94.81 | 82.15 | 62.17 | 65.68 | 41.25 | 30.42 | 76.10 | 50.28 | 36.86 | – | – | – | – | 0.10 | GPU 1080Ti | ✓ |
SIDE [188] | 2021 | – | – | – | 47.69 | 30.82 | 25.68 | – | – | – | – | – | – | – | – | – | × |
LIGA-Stereo [57] | 2021 | 96.43 | 93.82 | 86.19 | 81.39 | 64.66 | 57.22 | 88.15 | 76.78 | 67.40 | – | – | – | – | 0.40 | 1 core @ 2.5 GHz (Python + C/C++) | ✓ |
StereoCenterNet [189] | 2021 | 96.61 | 91.27 | 93.50 | 49.44 | 31.30 | 25.62 | 62.97 | 42.12 | 35.37 | – | – | – | – | 0.04 | GPU @ 2.5 GHz (Python) | × |
ESGN [190] | 2021 | 44.09 | 32.60 | 29.10 | 65.80 | 46.39 | 38.42 | 78.10 | 58.12 | 49.28 | – | – | – | – | 0.06 | GPU @ 2.5 GHz (Python + C/C++) | × |
Pseudo-Stereo [191] | 2022 | 95.75 | 90.27 | 82.32 | 23.74 | 17.74 | 15.14 | 32.64 | 23.76 | 20.64 | – | – | – | – | 0.25 | 1 core @ 2.5 GHz (C/C++) | ✓ |
DSGN++ [192] | 2022 | 98.08 | 95.70 | 88.27 | 83.21 | 67.37 | 59.91 | 88.55 | 78.94 | 69.74 | – | – | – | – | 0.20 | GeForce RTX 2080 Ti | ✓ |
DID-M3D [193] | 2022 | 94.29 | 91.04 | 81.31 | 24.40 | 16.29 | 13.75 | 32.95 | 22.76 | 19.83 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | ✓ |
DMF [194] | 2022 | 89.50 | 85.49 | 82.52 | 77.55 | 67.33 | 62.44 | 84.64 | 80.29 | 76.05 | – | – | – | – | 0.20 | 1 core @ 2.5 GHz (Python + C/C++) | × |
StereoDistill [58] | 2023 | 97.61 | 93.43 | 87.71 | 81.66 | 66.39 | 57.39 | 89.03 | 78.59 | 69.34 | – | – | – | – | 0.40 | 1 core @ 2.5 GHz (Python) | × |
PS-SVDM [195] | 2023 | 94.49 | 87.55 | 78.21 | 29.22 | 18.13 | 15.35 | 38.18 | 24.82 | 20.89 | – | – | – | – | 1.00 | 1 core @ 2.5 GHz (Python) | × |
Multi-View Camera: | |||||||||||||||||
3DOMV [96] | 2017 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
MVRA [196] | 2019 | 95.87 | 94.98 | 82.52 | 5.19 | 3.27 | 2.49 | 9.05 | 5.84 | 4.50 | – | – | – | – | 0.18 | GPU @ 2.5 GHz (Python) | × |
DETR3D [59] | 2021 | – | – | – | – | – | – | – | – | – | 41.2 | 47.9 | – | – | – | – | × |
BEVDet [197] | 2021 | – | – | – | – | – | – | – | – | – | 39.8 | 46.3 | – | – | – | – | × |
BEVDepth [198] | 2022 | – | – | – | – | – | – | – | – | – | 52 | 60.9 | – | – | – | – | × |
ImVoxelNet [199] | 2022 | 89.90 | 79.09 | 69.45 | 17.15 | 10.97 | 9.15 | 25.19 | 16.37 | 13.58 | 41.2 | 47.9 | – | – | 0.20 | GPU @ 2.5 GHz (Python) | ✓ |
PETR [60] | 2022 | – | – | – | – | – | – | – | – | – | 44.5 | 50.4 | – | – | – | – | × |
STS [200] | 2022 | – | – | – | – | – | – | – | – | – | 42.2 | 52.9 | – | – | – | – | × |
BEVerse [201] | 2022 | – | – | – | – | – | – | – | – | – | 39.3 | 53.1 | – | – | – | – | × |
BEVFormer [61] | 2022 | – | – | – | – | – | – | – | – | – | 48.1 | 56.9 | – | – | – | – | × |
SOLOFusion | 2022 | – | – | – | – | – | – | – | – | – | 54.0 | 61.9 | – | – | – | – | × |
PolarFormer [202] | 2022 | – | – | – | – | – | – | – | – | – | 45.6 | 54.3 | – | – | – | – | × |
FocalPETR [203] | 2022 | – | – | – | – | – | – | – | – | – | 46.5 | 51.6 | – | – | – | – | × |
BEV Distill [204] | 2022 | – | – | – | – | – | – | – | – | – | 49.6 | 59.4 | – | – | – | – | × |
HoP [205] | 2023 | – | – | – | – | – | – | – | – | – | 62.4 | 68.5 | – | – | – | – | × |
SparseBEV [62] | 2023 | – | – | – | – | – | – | – | – | – | 60.3 | 67.5 | – | – | – | – | × |
StreamPETR [206] | 2023 | – | – | – | – | – | – | – | – | – | 55.0 | 63.1 | – | – | – | – | × |
PolarBEVDet [207] | 2024 | – | – | – | – | – | – | – | – | – | 55.8 | 63.5 | – | – | – | – | × |
RoPETR [63] | 2025 | – | – | – | – | – | – | – | – | – | 64.8 | 70.9 | – | – | – | – | × |
Projection-Based: | |||||||||||||||||
C-YOLO [68] | 2018 | – | – | – | 67.72 | 64.00 | 63.01 | 85.89 | 77.40 | 77.33 | – | – | – | – | – | – | ✓ |
TopNet [208] | 2018 | 58.04 | 45.85 | 41.11 | 12.67 | 9.28 | 7.95 | 80.16 | 68.16 | 63.43 | – | – | – | – | 0.01 | NVIDIA GeForce 1080 Ti (TF-GPU) | × |
BirdNet [69] | 2018 | 79.30 | 57.12 | 55.16 | 40.99 | 27.26 | 25.32 | 84.17 | 59.83 | 57.35 | – | – | – | – | 0.11 | Titan Xp (Caffe) | ✓ |
PIXOR [67] | 2019 | – | – | – | – | – | – | 81.70 | 77.05 | 72.95 | – | – | – | – | – | – | ✓ |
FVNet [209] | 2019 | 86.14 | 77.19 | 69.27 | 65.43 | 57.34 | 51.85 | 78.04 | 65.03 | 57.89 | – | – | – | – | – | – | ✓ |
MODet [210] | 2019 | 66.06 | 62.54 | 60.04 | – | – | – | 90.80 | 87.56 | 82.69 | – | – | – | – | 0.05 | GTX1080Ti | × |
HDNet [211] | 2020 | – | – | – | – | – | – | 89.14 | 86.57 | 78.32 | – | – | – | – | – | – | ✓ |
PIXOR++ [211] | 2020 | – | – | – | – | – | – | 93.28 | 86.01 | 80.11 | – | – | – | – | – | – | × |
BirdNet+ [212] | 2021 | 92.61 | 86.73 | 81.80 | 76.15 | 64.04 | 59.79 | 87.43 | 81.85 | 75.36 | – | – | – | – | 0.11 | Titan Xp (Caffe) | ✓ |
MGTANet [213] | 2022 | – | – | – | – | – | – | – | – | – | 67.50 | 72.70 | – | – | – | – | ✓ |
GPA3D [214] | 2023 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
Voxel-Based: | |||||||||||||||||
Vote3D [215] | 2015 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
VoxelNet [75] | 2017 | – | – | – | 77.82 | 64.17 | 57.51 | 87.95 | 78.39 | 71.29 | – | – | – | – | – | – | × |
SECOND [76] | 2018 | – | – | – | 83.13 | 73.66 | 66.20 | 89.39 | 83.77 | 78.59 | – | – | – | – | – | – | ✓ |
PointPillars [77] | 2018 | 94.00 | 91.19 | 88.17 | 82.58 | 74.31 | 68.99 | 90.07 | 86.56 | 82.81 | 40.10 | 55.00 | – | – | 0.016 | 1080 Ti + Intel i7 | ✓ |
HotSpotNet [78] | 2019 | 96.21 | 92.81 | 89.80 | 87.60 | 78.31 | 73.34 | 94.06 | 88.09 | 83.24 | 59.30 | 66.00 | – | – | 0.04 | 1 core @ 2.5 GHz (Py + C/C++) | × |
Voxel R-CNN [79] | 2020 | 96.49 | 95.11 | 92.45 | 90.90 | 81.62 | 77.06 | 94.85 | 88.83 | 86.13 | – | – | 75.59 | 66.59 | 0.04 | GPU @ 3.0 GHz (C/C++) | ✓ |
VoTr-TSD [80] | 2021 | 95.95 | 94.81 | 92.24 | 89.90 | 82.09 | 79.14 | 94.03 | 90.34 | 86.14 | – | – | 74.95 | 65.91 | 0.07 | 1 core @ 2.5 GHz (C/C++) | ✓ |
TED [81] | 2022 | 96.64 | 96.03 | 93.35 | 91.61 | 85.28 | 80.68 | 95.44 | 92.05 | 87.30 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (C/C++) | ✓ |
VoxSeT [216] | 2022 | 96.16 | 95.23 | 90.49 | 88.53 | 82.06 | 77.46 | 92.70 | 89.07 | 86.29 | – | – | – | – | 0.033 | 1 core @ 2.5 GHz (C/C++) | ✓ |
FocalsConv [217] | 2022 | 96.30 | 95.28 | 92.69 | 90.55 | 82.28 | 77.59 | 92.67 | 89.00 | 86.33 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (C/C++) | ✓ |
PillarNet [218] | 2022 | – | – | – | – | – | – | – | – | – | 66.00 | 71.40 | 83.23 | 76.09 | – | – | ✓ |
SWFormer [219] | 2022 | – | – | – | – | – | – | – | – | – | – | – | 77.8 | 69.2 | – | – | × |
PV-GNN [220] | 2024 | – | – | – | 91.64 | 82.49 | 77.28 | 95.09 | 92.38 | 87.44 | – | – | – | – | – | – | × |
Point-Based: | |||||||||||||||||
iPOD [221] | 2018 | 90.20 | 89.30 | 87.37 | 71.40 | 53.46 | 48.34 | 86.93 | 83.98 | 77.85 | – | – | – | – | – | – | × |
PointRCNN [72] | 2018 | 95.92 | 91.90 | 87.11 | 86.96 | 75.64 | 70.70 | 92.13 | 87.39 | 82.72 | – | – | – | – | 0.10 | GPU @ 2.5 GHz (Py + C/C++) | ✓ |
STD [222] | 2019 | 96.14 | 93.22 | 90.53 | 87.95 | 79.71 | 75.09 | 94.74 | 89.19 | 86.42 | – | – | – | – | 0.08 | GPU @ 2.5 GHz (Py + C/C++) | ✓ |
PointRGCN [223] | 2019 | 96.19 | 92.67 | 87.66 | 85.97 | 75.73 | 70.60 | 91.63 | 87.49 | 90.73 | – | – | – | – | 0.26 | GPU @ V100 (Python) | ✓ |
3DSSD [224] | 2020 | 97.69 | 95.10 | 92.18 | 88.36 | 79.57 | 74.55 | 92.66 | 89.02 | 85.86 | 42.60 | 56.40 | – | – | 0.04 | GPU @ 2.5 GHz (Py + C/C++) | ✓ |
Point-GNN [225] | 2020 | 96.58 | 93.50 | 88.35 | 88.33 | 79.47 | 72.29 | 93.11 | 89.17 | 83.90 | – | – | – | – | 0.60 | GPU @ 2.5 GHz (Python) | ✓ |
PointFormer [74] | 2020 | – | – | – | 87.13 | 77.06 | 69.25 | – | – | – | 53.60 | – | – | – | – | – | ✓ |
EPNet++ [226] | 2021 | 96.73 | 95.17 | 92.10 | 91.37 | 81.96 | 76.71 | 95.41 | 89.00 | 85.73 | – | – | – | – | 0.10 | GPU @ 2.5 GHz (Python) | ✓ |
SASA [227] | 2022 | 96.01 | 95.35 | 92.42 | 88.76 | 82.16 | 77.16 | 92.87 | 89.51 | 86.35 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Py + C/C++) | ✓ |
IA-SSD [73] | 2022 | 96.10 | 93.56 | 90.68 | 88.27 | 80.32 | 75.10 | 92.79 | 89.33 | 84.35 | – | – | – | – | 0.014 | 1 core @ 2.5 GHz (C/C++) | ✓ |
DFAF3D [228] | 2023 | 96.58 | 93.32 | 90.24 | 88.59 | 79.37 | 72.21 | 93.14 | 89.45 | 84.22 | – | – | – | – | – | 1 core @ 2.5 GHz (Python) | × |
HINTED [229] | 2024 | 95.16 | 90.97 | 85.55 | 84.00 | 74.13 | 67.03 | 90.61 | 86.01 | 79.29 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (C/C++) | ✓ |
Point–Voxel Hybrid: | |||||||||||||||||
PVCNN [230] | 2019 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
Fast Point R-CNN [82] | 2019 | 96.13 | 93.18 | 87.68 | 85.29 | 77.40 | 70.24 | 90.87 | 87.84 | 80.52 | – | – | – | – | 0.06 | GPU @ 2.5 GHz (Py + C/C++) | × |
PV-RCNN [83] | 2019 | 98.17 | 94.70 | 92.04 | 90.25 | 81.43 | 76.82 | 94.98 | 90.65 | 86.14 | – | – | 77.51 | – | 0.08 | 1 core @ 2.5 GHz (Py + C/C++) | ✓ |
SA-SSD [231] | 2020 | 97.92 | 95.16 | 90.15 | 88.75 | 79.79 | 74.16 | 95.03 | 91.03 | 85.96 | – | – | – | – | 0.04 | 1 core @ 2.5 GHz (Python) | ✓ |
BADet [232] | 2021 | 98.65 | 95.34 | 90.28 | 89.28 | 81.61 | 76.59 | 95.23 | 91.32 | 86.48 | – | – | – | – | 0.14 | 1 core @ 2.5 GHz (C/C++) | ✓ |
Pyramid-PV [233] | 2021 | 95.88 | 95.13 | 92.62 | 88.39 | 82.08 | 77.49 | 92.19 | 88.84 | 86.21 | – | – | – | – | 0.07 | 1 core @ 2.5 GHz (C/C++) | ✓ |
DVFENet [234] | 2021 | 95.35 | 94.57 | 91.77 | 86.20 | 79.18 | 74.58 | 90.93 | 87.68 | 84.60 | – | – | – | – | 0.05 | 1 core @ 2.5 GHz (Py + C/C++) | × |
PDV [84] | 2022 | 96.07 | 95.00 | 92.44 | 90.43 | 81.86 | 77.36 | 94.56 | 90.48 | 86.23 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (C/C++) | ✓ |
EQ-PVRCNN [235] | 2022 | 98.23 | 95.32 | 92.65 | 90.13 | 82.01 | 77.53 | 94.55 | 89.09 | 86.40 | – | – | – | – | 0.20 | GPU @ 2.5 GHz (Py + C/C++) | ✓ |
PVT-SSD [236] | 2023 | 96.75 | 95.90 | 90.69 | 90.65 | 82.29 | 76.85 | 95.23 | 91.63 | 86.43 | – | – | – | – | 0.05 | 1 core @ 2.5 GHz (Py + C/C++) | × |
PG-RCNN [237] | 2023 | 96.66 | 95.40 | 90.55 | 89.38 | 82.13 | 77.33 | 93.39 | 89.46 | 86.54 | – | – | – | – | 0.06 | GPU @ 1.5 GHz (Python) | ✓ |
Uni3DETR [238] | 2023 | – | – | – | 91.14 | 82.26 | 77.58 | – | – | – | – | – | – | – | – | – | ✓ |
Method | Year | AP2D | AP3D | APBEV | nuScenes | Waymo | Time (s) | Hardware | Code Available | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
E | M | H | E | M | H | E | M | H | mAP | NDS | L1 | L2 | |||||
Radar-PointGNN [86] | 2021 | – | – | – | – | – | – | – | – | – | 0.5 | 3.4 | – | – | – | – | × |
K-Radar [87] | 2022 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
KPConvPillars [88] | 2022 | – | – | – | – | – | – | – | – | – | 4.9 | 13.9 | – | – | – | – | × |
Dual Radar [239] | 2023 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
CenterRadarNet [240] | 2024 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
RadarDistill [89] | 2024 | – | – | – | – | – | – | – | – | – | 20.5 | 43.7 | – | – | – | – | ✓ |
RADLER [90] | 2025 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | ✓ |
Method | Year | AP2D | AP3D | APBEV | nuScenes | Waymo | Time (s) | Hardware | Code Available | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
E | M | H | E | M | H | E | M | H | mAP | NDS | L1 | L2 | |||||
Early Fusion: | |||||||||||||||||
F-PointNet [92] | 2017 | 95.85 | 95.17 | 85.42 | 82.19 | 69.79 | 60.59 | 91.17 | 84.67 | 74.77 | – | – | – | – | 0.17 | GPU @ 3.0 GHz (Python) | ✓ |
F-ConvNet [93] | 2019 | 95.85 | 92.19 | 80.09 | 87.36 | 76.39 | 66.69 | 91.51 | 85.84 | 76.11 | – | – | – | – | 0.47 | GPU @ 2.5 GHz (Python + C/C++) | ✓ |
RoarNet [241] | 2018 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
Complexer-YOLO [242] | 2019 | 91.92 | 84.16 | 79.62 | 55.93 | 47.34 | 42.60 | 77.24 | 68.96 | 64.95 | – | – | – | – | 0.06 | GPU @ 3.5 GHz (C/C++) | ✓ |
PointPainting [91] | 2019 | 98.39 | 92.58 | 89.71 | 82.11 | 71.70 | 67.08 | 92.45 | 88.11 | 83.36 | – | – | – | – | 0.40 | GPU @ 2.5 GHz (Python + C/C++) | × |
FusionPainting [243] | 2021 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
MVP [244] | 2021 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
F-PointPillars [94] | 2021 | – | – | – | – | – | – | – | – | – | – | – | – | – | 0.06 | 4 cores @ 3.0 GHz (Python) | ✓ |
PointAugmenting [245] | 2021 | – | – | – | – | – | – | 89.14 | 86.57 | 78.32 | – | – | – | – | – | – | ✓ |
VirConvNet [95] | 2023 | 98.00 | 97.27 | 94.53 | 92.48 | 87.20 | 82.45 | 95.99 | 93.52 | 90.38 | – | – | – | – | 0.09 | 1 core @ 2.5 GHz (C/C++) | ✓ |
HDF [246] | 2025 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
Mid-Level Fusion: | |||||||||||||||||
MV3D [96] | 2016 | 96.47 | 90.83 | 78.63 | 74.97 | 63.63 | 54.00 | 86.62 | 78.93 | 69.80 | – | – | – | – | 0.36 | GPU @ 2.5 GHz (Python + C/C++) | ✓ |
AVOD [97] | 2017 | 95.17 | 89.88 | 82.83 | 76.39 | 66.47 | 60.23 | 89.75 | 84.95 | 78.32 | – | – | – | – | 0.08 | Titan X (Pascal) | ✓ |
PointFusion [247] | 2017 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
ContFuse [98] | 2018 | – | – | – | 83.68 | 68.78 | 61.67 | 94.07 | 85.35 | 75.88 | – | – | – | – | 0.06 | GPU @ 2.5 GHz (Python) | × |
MVXNet [248] | 2019 | – | – | – | 83.20 | 72.70 | 65.20 | – | – | – | – | – | – | – | – | – | × |
PI-RCNN [249] | 2019 | 96.17 | 92.66 | 87.68 | 84.37 | 74.82 | 70.03 | 91.44 | 85.81 | 81.00 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (Python) | × |
MCF3D [250] | 2019 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
MMF [99] | 2020 | 97.41 | 94.25 | 91.80 | 88.40 | 77.43 | 70.22 | 93.67 | 88.21 | 81.99 | – | – | – | – | 0.08 | GPU @ 2.5 GHz (Python) | × |
3D-CVF [251] | 2020 | 96.78 | 93.36 | 86.11 | 89.20 | 80.05 | 73.11 | 93.52 | 89.56 | 82.45 | – | – | – | – | 0.06 | 1 core @ 2.5 GHz (C/C++) | ✓ |
EPNet [100] | 2020 | 96.15 | 94.44 | 89.99 | 89.81 | 79.28 | 74.59 | 94.22 | 88.47 | 83.69 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (Python + C/C++) | ✓ |
EPNet++ [226] | 2021 | 96.73 | 95.17 | 92.10 | 91.37 | 81.96 | 76.71 | 95.41 | 89.00 | 85.73 | – | – | – | – | 0.10 | GPU @ 2.5 GHz (Python) | × |
TransFusion [101] | 2022 | – | – | – | – | – | – | – | – | – | 68.90 | 71.70 | – | – | – | – | × |
BEVFusion [252] | 2022 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
FUTR3D [102] | 2022 | – | – | – | – | – | – | – | – | – | 69.40 | 72.10 | – | – | – | – | × |
DeepFusion [253] | 2022 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
MSMDFusion [254] | 2022 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
CAT-Det [255] | 2022 | 95.97 | 94.71 | 92.07 | 89.87 | 81.32 | 76.68 | 92.59 | 90.07 | 85.82 | – | – | – | – | 0.30 | GPU @ 2.5 GHz (Python + C/C++) | × |
HMFI [256] | 2022 | 96.29 | 95.16 | 92.45 | 88.90 | 81.93 | 77.30 | 93.04 | 89.17 | 86.37 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (C/C++) | ✓ |
LoGoNet [257] | 2023 | 96.60 | 95.55 | 93.07 | 91.80 | 85.06 | 80.74 | 95.48 | 91.52 | 87.09 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (C/C++) | ✓ |
SDVRF [258] | 2023 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
SupFusion [259] | 2023 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
FGFusion [260] | 2023 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
VCD [261] | 2023 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | × |
UniTR [262] | 2023 | – | – | – | – | – | – | – | – | – | 70.90 | 74.50 | – | – | – | – | × |
Late Fusion: | |||||||||||||||||
CLOCS [103] | 2020 | 96.77 | 96.07 | 91.11 | 89.16 | 82.28 | 77.23 | 92.91 | 89.48 | 86.42 | – | – | – | – | 0.10 | 1 core @ 2.5 GHz (Python) | × |
Fast-CLOCS [104] | 2022 | 96.69 | 95.75 | 90.95 | 89.10 | 80.35 | 76.99 | 93.03 | 89.49 | 86.40 | 63.10 | 68.70 | – | – | 0.10 | GPU @ 2.5 GHz (Python) | ✓ |
5. Evaluation and Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AD | Autonomous Driving |
ADAS | Advanced Driver Assistance Systems |
AI | Artificial Intelligence |
AP | Average Precision |
AV | Autonomous Vehicle |
BEV | Bird’s-Eye View |
CNN | Convolutional Neural Network |
CV | Computer Vision |
DL | Deep Learning |
GNSS | Global Navigation Satellite System |
IoU | Intersection over Union |
IMU | Inertial Measurement Unit |
LiDAR | Light Detection and Ranging |
mAP | Mean Average Precision |
ML | Machine Learning |
NDS | nuScenes Detection Score |
NMS | Non-Maximum Suppression |
NN | Neural Network |
OD | Object Detection |
PC | Point Cloud |
Radar | Radio Detection and Ranging |
SAE | Society of Automotive Engineers |
SoA | State-of-the-Art |
Sonar | Sound Navigation and Ranging |
SLAM | Simultaneous Localization and Mapping |
ToF | Time-of-Flight |
References
- SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles; SAE International; On-Road Automated Driving (ORAD) Committee: Warrendale, PA, USA, 2021. [Google Scholar]
- Van Brummelen, J.; O’Brien, M.; Gruyer, D.; Najjaran, H. Autonomous vehicle perception: The technology of today and tomorrow. Transp. Res. Part C Emerg. Technol. 2018, 89, 384–406. [Google Scholar] [CrossRef]
- Jeffs, J.; He, M.X. Autonomous Cars, Robotaxis and Sensors 2024–2044; IDTechEx: Cambridge, UK, 2023. [Google Scholar]
- Waymo LLC. On the Road to Fully Self-Driving; Waymo Safety Report; Waymo LLC: Mountain View, CA, USA, 2021. [Google Scholar]
- Ackerman, E. What Full Autonomy Means for the Waymo Driver. IEEE Spectrum. 2021. Available online: https://spectrum.ieee.org/full-autonomy-waymo-driver (accessed on 4 March 2021).
- Dingus, T.A.; Guo, F.; Lee, S.; Antin, J.F.; Perez, M.; Buchanan-King, M.; Hankey, J. Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proc. Natl. Acad. Sci. USA 2016, 113, 2636–2641. [Google Scholar] [CrossRef]
- Singh, S. Critical Reasons for Crashes Investigated in the National Motor Vehicle Crash Causation Survey; National Highway Traffic Safety Administration: Washington, DC, USA, 2015.
- Montgomery, W.; Mudge, R.; Groshen, E.L.; Helper, S.; MacDuffie, J.P.; Carson, C. America’s Workforce Self-Driving Future: Realizing Productivity Gains and Spurring Economic Growth; Securing America’s Future Energy: Washington, DC, USA, 2018. [Google Scholar]
- Chehri, A.; Mouftah, H.T. Autonomous vehicles in the sustainable cities, the beginning of a green adventure. Sustain. Cities Soc. 2019, 51, 101751. [Google Scholar] [CrossRef]
- Dhall, A.; Dai, D.; Van Gool, L. Real-time 3D traffic cone detection for autonomous driving. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 494–501. [Google Scholar]
- Hudson, J.; Orviska, M.; Hunady, J. People’s attitudes to autonomous vehicles. Transp. Res. Part A Policy Pract. 2019, 121, 164–176. [Google Scholar] [CrossRef]
- Hulse, L.M.; Xie, H.; Galea, E.R. Relationships with road users, risk, gender and age. Saf. Sci. 2018, 102, 1–13. [Google Scholar] [CrossRef]
- Srivastava, A. Sense-Plan-Act in Robotic Applications. In Proceedings of the Intelligent Robotics Seminar, Macao, China, 4–8 November 2019. [Google Scholar] [CrossRef]
- Betz, J.; Wischnewski, A.; Heilmeier, A.; Nobis, F.; Stahl, T.; Hermansdorfer, L.; Lohmann, B.; Lienkamp, M. What can we learn from autonomous level-5 motorsport? In Proceedings of the 9th International Munich Chassis Symposium 2018, Munich, Germany, 12–13 June 2018; Springer: Wiesbaden, Germany, 2019. [Google Scholar]
- Betz, J.; Zheng, H.; Liniger, A.; Rosolia, U.; Karle, P.; Behl, M.; Krovi, V.; Mangharam, R. Autonomous Vehicles on the Edge: A Survey on Autonomous Vehicle Racing. IEEE Open J. Intell. Transp. Syst. 2022, 3, 458–488. [Google Scholar] [CrossRef]
- Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
- Wen, L.H.; Jo, K.H. Deep learning-based perception systems for autonomous driving: A comprehensive survey. Neurocomputing 2022, 489, 255–270. [Google Scholar] [CrossRef]
- Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
- Chen, W.; Li, Y.; Tian, Z.; Zhang, F. 2D and 3D object detection algorithms from images: A Survey. Array 2023, 19, 100305. [Google Scholar] [CrossRef]
- Pravallika, A.; Hashmi, M.F.; Gupta, A. Deep Learning Frontiers in 3D Object Detection: A Comprehensive Review for Autonomous Driving. IEEE Access 2024, 12, 173936–173980. [Google Scholar] [CrossRef]
- Zhang, X.; Wang, H.; Dong, H. A Survey of Deep Learning-Driven 3D Object Detection: Sensor Modalities, Technical Architectures, and Applications. Sensors 2025, 25, 3668. [Google Scholar] [CrossRef]
- Ma, X.; Ouyang, W.; Simonelli, A.; Ricci, E. 3d object detection from images for autonomous driving: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3537–3556. [Google Scholar] [CrossRef] [PubMed]
- Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef]
- Wang, Y.; Mao, Q.; Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Li, H.; Zhang, Y. Multi-modal 3d object detection in autonomous driving: A survey. Int. J. Comput. Vis. 2023, 131, 2122–2152. [Google Scholar] [CrossRef]
- Lahoud, J.; Cao, J.; Khan, F.S.; Cholakkal, H.; Anwer, R.M.; Khan, S.; Yang, M.H. 3D vision with transformers: A survey. arXiv 2022, arXiv:2208.04309. [Google Scholar] [CrossRef]
- Zhu, M.; Gong, Y.; Tian, C.; Zhu, Z. A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends. Drones 2024, 8, 412. [Google Scholar] [CrossRef]
- Calvo, E.L.; Taveira, B.; Kahl, F.; Gustafsson, N.; Larsson, J.; Tonderski, A. Timepillars: Temporally-recurrent 3d lidar object detection. arXiv 2023, arXiv:2312.17260. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Arnold, E.; Al-Jarrah, O.Y.; Dianati, M.; Fallah, S.; Oxtoby, D.; Mouzakitis, A. A survey on 3d object detection methods for autonomous driving applications. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3782–3795. [Google Scholar] [CrossRef]
- Nagiub, A.S.; Fayez, M.; Khaled, H.; Ghoniemy, S. 3D object detection for autonomous driving: A comprehensive review. In Proceedings of the 2024 6th International Conference on Computing and Informatics (ICCI), Cairo, Egypt, 6–7 March 2024; pp. 1–11. [Google Scholar]
- Gao, W.; Li, G. Deep Learning for 3D Point Clouds; Springer: Singapore, 2025. [Google Scholar]
- Liang, W.; Xu, P.; Guo, L.; Bai, H.; Zhou, Y.; Chen, F. A survey of 3D object detection. Multimed. Tools Appl. 2021, 80, 29617–29641. [Google Scholar] [CrossRef]
- Fayyad, J.; Jaradat, M.A.; Gruyer, D.; Najjaran, H. Deep Learning Sensor Fusion for Autonomous Vehicle Perception and Localization: A Review. Sensors 2020, 20, 4220. [Google Scholar] [CrossRef]
- Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469. [Google Scholar] [CrossRef]
- Li, H.; Zhao, Y.; Zhong, J.; Wang, B.; Sun, C.; Sun, F. Delving into the Secrets of BEV 3D Object Detection in Autonomous Driving: A Comprehensive Survey. Authorea Prepr. 2025. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
- Wang, P.; Huang, X.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The apolloscape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719. [Google Scholar] [CrossRef] [PubMed]
- Chang, M.F.; Lambert, J.; Sangkloy, P.; Singh, J.; Bak, S.; Hartnett, A. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8748–8757. [Google Scholar]
- Houston, J.; Zuidhof, G.; Bergamini, L.; Ye, Y.; Chen, L.; Jain, A.; Omari, S.; Iglovikov, V.; Ondruska, P. One thousand and one hours: Self-driving motion prediction dataset. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 409–418. [Google Scholar]
- Patil, A.; Malla, S.; Gang, H.; Chen, Y.T. The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9552–9557. [Google Scholar]
- Zamanakos, G.; Tsochatzidis, L.; Amanatiadis, A.; Pratikakis, I. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving. Comput. Graph. 2021, 99, 153–181. [Google Scholar] [CrossRef]
- Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Data-driven 3d voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1903–1911. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2147–2156. [Google Scholar]
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082. [Google Scholar]
- Brazil, G.; Liu, X. M3d-rpn: Monocular 3d region proposal network for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9287–9296. [Google Scholar]
- Simonelli, A.; Bulo, S.R.; Porzi, L.; López-Antequera, M.; Kontschieder, P. Disentangling monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1991–1999. [Google Scholar]
- Limaye, A.; Mathew, M.; Nagori, S.; Swami, P.K.; Maji, D.; Desappan, K. SS3D: Single shot 3D object detector. arXiv 2020, arXiv:2004.14674. [Google Scholar] [CrossRef]
- Liu, Z.; Wu, Z.; Tóth, R. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 996–997. [Google Scholar]
- Chong, Z.; Ma, X.; et al. Monodistill: Learning spatial features for monocular 3d object detection. arXiv 2022, arXiv:2201.10830. [Google Scholar] [CrossRef]
- Wang, S.; Zheng, J. MonoSKD: General distillation framework for monocular 3D object detection via Spearman correlation coefficient. arXiv 2023, arXiv:2310.11316. [Google Scholar] [CrossRef]
- Xu, J.; Peng, L.; Cheng, H.; Li, H.; Qian, W.; Li, K.; Wang, W.; Cai, D. Mononerd: Nerf-like representations for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6814–6824. [Google Scholar]
- Yan, L.; Yan, P.; Xiong, S.; Xiang, X.; Tan, Y. Monocd: Monocular 3d object detection with complementary depths. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 10248–10257. [Google Scholar]
- Li, P.; Chen, X.; Shen, S. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15 –20 June 2019. [Google Scholar]
- Chen, Y.; Liu, S.; Shen, X.; Jia, J. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Liu, Y.; Wang, L.; Liu, M. Yolostereo3d: A step back to 2d for efficient stereo 3d detection. In Proceedings of the 2021 International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13018–13024. [Google Scholar]
- Guo, X.; Shi, S.; Wang, X.; Li, H. Liga-stereo: Learning lidar geometry aware representations for stereo-based 3d detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3153–3163. [Google Scholar]
- Liu, Z.; Ye, X.; Tan, X.; Ding, E.; Bai, X. Stereodistill: Pick the cream from lidar for distilling stereo-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1790–1798. [Google Scholar] [CrossRef]
- Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 180–191. [Google Scholar]
- Liu, Y.; Wang, T.; Zhang, X.; Sun, J. Petr: Position embedding transformation for multi-view 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 531–548. [Google Scholar]
- Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T. Bevformer: Learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
- Liu, H.; Teng, Y.; Lu, T.; Wang, H.; Wang, L. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18580–18590. [Google Scholar]
- Ji, H.; Ni, T.; Huang, X.; Luo, T.; Zhan, X.; Chen, J. RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding. arXiv 2025, arXiv:2504.12643. [Google Scholar]
- Liu, W.; Sun, J.; Li, W.; Hu, T.; Wang, P. Deep Learning on Point Clouds and Its Application: A Survey. Sensors 2019, 19, 4188. [Google Scholar] [CrossRef]
- Nguyen, A.; Jo, K. 3D Point Cloud Segmentation: A survey. In Proceedings of the IEEE Conference on Robotics, Automation and Mechatronics, Kagawa, Japan, 4–7 August 2013. [Google Scholar]
- Xuan, Y.; Qu, Y. Multimodal Data Fusion for BEV Perception. Master’s Thesis, University of Gothenburg, Gothenburg, Sweden, 2024. [Google Scholar]
- Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
- Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-yolo: Real-time 3d object detection on point clouds. arXiv 2018, arXiv:1803.06199. [Google Scholar]
- Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; García, F.; De La Escalera, A. Birdnet: A 3d object detection framework from lidar information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems, Maui, HI, USA, 4–7 November 2018; pp. 3517–3523. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
- Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
- Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962. [Google Scholar]
- Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3d object detection with pointformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7463–7472. [Google Scholar]
- Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
- Chen, Q.; Sun, L.; Wang, Z.; Jia, K.; Yuille, A. Object as hotspots: An anchor-free 3d object detection approach via firing of hotspots. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 68–84. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar] [CrossRef]
- Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3164–3173. [Google Scholar]
- Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-equivariant 3d object detection for autonomous driving. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2795–2802. [Google Scholar] [CrossRef]
- Chen, Y.; Liu, S.; Shen, X.; Jia, J. Fast point r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9775–9784. [Google Scholar]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
- Hu, J.S.; Kuai, T.; Waslander, S.L. Point density-aware voxels for lidar 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8469–8478. [Google Scholar]
- Lai-Dang, Q.V. A survey of vision transformers in autonomous driving: Current trends and future directions. arXiv 2024, arXiv:2403.07542. [Google Scholar] [CrossRef]
- Svenningsson, P.; Fioranelli, F.; Yarovoy, A. Radar-pointgnn: Graph based object recognition for unstructured radar point-cloud data. In Proceedings of the 2021 IEEE Radar Conference (RadarConf21), Atlanta, GA, USA, 8–14 May 2021; pp. 1–6. [Google Scholar]
- Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-radar: 4d radar object detection for autonomous driving in various weather conditions. Adv. Neural Inf. Process. Syst. 2022, 35, 3819–3829. [Google Scholar]
- Ulrich, M.; Braun, S.; Köhler, D.; Niederlöhner, D.; Faion, F.; Gläser, C.; Blume, H. Improved orientation estimation and detection with hybrid object detection networks for automotive radar. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 111–117. [Google Scholar]
- Bang, G.; Choi, K.; Kim, J.; Kum, D.; Choi, J.W. Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15491–15500. [Google Scholar]
- Luo, Y.; Hoffmann, R.; Xia, Y.; Wysocki, O.; Schwab, B.; Kolbe, T.H.; Cremers, D. RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 4452–4461. [Google Scholar]
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
- Wang, Z.; Jia, K. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1742–1749. [Google Scholar]
- Paigwar, A.; Sierra-Gonzalez, D.; Erkent, Ö.; Laugier, C. Frustum-pointpillars: A multi-stage approach for 3d object detection using rgb camera and lidar. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2926–2933. [Google Scholar]
- Wu, H.; Wen, C.; Shi, S.; Li, X.; Wang, C. Virtual sparse convolution for multimodal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21653–21662. [Google Scholar]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 10–13 September 2018; pp. 641–656. [Google Scholar]
- Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7345–7353. [Google Scholar]
- Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer: Cham, Switzerland, 2020; pp. 35–52. [Google Scholar]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
- Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. Futr3d: A unified sensor fusion framework for 3d detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 172–181. [Google Scholar]
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10386–10393. [Google Scholar]
- Pang, S.; Morris, D.; Radha, H. Fast-CLOCs: Fast camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 187–196. [Google Scholar]
- Yang, B.; Guo, R.; Liang, M.; Casas, S.; Urtasun, R. Radarnet: Exploiting radar for robust perception of dynamic objects. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 496–512. [Google Scholar]
- Long, Y.; Kumar, A.; Liu, X.; Morris, D. RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 22276–22285. [Google Scholar]
- Nabati, R.; Qi, H. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1527–1536. [Google Scholar]
- Shi, K.; He, S.; Shi, Z.; Chen, A.; Xiong, Z.; Chen, J.; Luo, J. Radar and camera fusion for object detection and tracking: A comprehensive survey. arXiv 2024, arXiv:2410.19872. [Google Scholar] [CrossRef]
- Giuffrida, L.; Masera, G.; Martina, M. A survey of automotive radar and lidar signal processing and architectures. Chips 2023, 2, 243–261. [Google Scholar] [CrossRef]
- Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Subcategory-aware convolutional neural networks for object proposals and detection. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 924–933. [Google Scholar]
- Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teuliere, C.; Chateau, T. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2040–2049. [Google Scholar]
- Kundu, A.; Li, Y.; Rehg, J.M. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3559–3568. [Google Scholar]
- Manhardt, F.; Kehl, W.; Gaidon, A. Roi-10d: Monocular lifting of 2d detection to 6d pose and metric shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2069–2078. [Google Scholar]
- Xu, B.; Chen, Z. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2345–2353. [Google Scholar]
- Qin, Z.; Wang, J.; Lu, Y. Monogrnet: A geometric reasoning network for monocular 3d object localization. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8851–8858. [Google Scholar] [CrossRef]
- Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. Gs3d: An efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1019–1028. [Google Scholar]
- Weng, X.; Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Ma, X.; Wang, Z.; Li, H.; Zhang, P.; Ouyang, W.; Fan, X. Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6851–6860. [Google Scholar]
- Chang, J.; Wetzstein, G. Deep optics for monocular depth estimation and 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10193–10202. [Google Scholar]
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
- Liu, L.; Lu, J.; Xu, C.; Tian, Q.; Zhou, J. Deep fitting degree scoring network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1057–1066. [Google Scholar]
- Naiden, A.; Paunescu, V.; Kim, G.; Jeon, B.; Leordeanu, M. Shift r-cnn: Deep monocular 3d object detection with closed-form geometric constraints. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 61–65. [Google Scholar]
- Bao, W.; Xu, B.; Chen, Z. Monofenet: Monocular 3d object detection with feature enhancement networks. IEEE Trans. Image Process. 2019, 29, 2753–2765. [Google Scholar] [CrossRef]
- Ku, J.; Pon, A.D.; Waslander, S.L. Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11867–11876. [Google Scholar]
- Simonelli, A.; Bulo, S.R.; Porzi, L.; Ricci, E.; Kontschieder, P. Towards generalization across depth for monocular 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Cham, Switzerland, 2020; pp. 767–782. [Google Scholar]
- Vianney, J.M.U.; Aich, S.; Liu, B. Refinedmpl: Refined monocular pseudolidar for 3d object detection in autonomous driving. arXiv 2019, arXiv:1911.09712. [Google Scholar]
- Chen, Y.; Tai, L.; Sun, K.; Li, M. Monopair: Monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12093–12102. [Google Scholar]
- Li, P.; Zhao, H.; Liu, P.; Cao, F. Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 644–660. [Google Scholar]
- Ma, X.; Liu, S.; Xia, Z.; Zhang, H.; Zeng, X.; Ouyang, W. Rethinking pseudo-lidar representation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. Springer: Cham, Switzerland, 2020; pp. 311–327. [Google Scholar]
- Zhou, D.; Song, X.; Dai, Y.; Yin, J.; Lu, F.; Liao, M.; Fang, J.; Zhang, L. Iafa: Instance-aware feature aggregation for 3d object detection from a single image. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
- Brazil, G.; Pons-Moll, G.; Liu, X.; Schiele, B. Kinematic 3d object detection in monocular video. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIII 16. Springer: Cham, Switzerland, 2020; pp. 135–152. [Google Scholar]
- Li, P.; Zhao, H. Monocular 3d detection with geometric constraint embedding and semi-supervised training. IEEE Robot. Autom. Lett. 2021, 6, 5565–5572. [Google Scholar] [CrossRef]
- Wang, L.; Du, L.; Ye, X.; Fu, Y.; Guo, G.; Xue, X.; Feng, J.; Zhang, L. Depth-conditioned dynamic message propagation for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 454–463. [Google Scholar]
- Chen, H.; Huang, Y.; Tian, W.; Gao, Z.; Xiong, L. Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10379–10388. [Google Scholar]
- Kumar, A.; Brazil, G.; Liu, X. Groomed-nms: Grouped mathematically differentiable nms for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8973–8983. [Google Scholar]
- Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving into localization errors for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4721–4730. [Google Scholar]
- Reading, C.; Harakeh, A.; Chae, J.; Waslander, S.L. Categorical depth distribution network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8555–8564. [Google Scholar]
- Zhang, Y.; Lu, J.; Zhou, J. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3289–3298. [Google Scholar]
- Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T.K. Geometry-based distance decomposition for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15172–15181. [Google Scholar]
- Wang, T.; Zhu, X.; Pang, J.; Lin, D. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 913–922. [Google Scholar]
- Zhou, Y.; He, Y.; Zhu, H.; Wang, C.; Li, H.; Jiang, Q. Monocular 3d object detection: An extrinsic parameter free approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7556–7566. [Google Scholar]
- Lu, Y.; Ma, X.; Yang, L.; Zhang, T.; Liu, Y.; Chu, Q.; Yan, J.; Ouyang, W. Geometry uncertainty projection network for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3111–3121. [Google Scholar]
- Wang, T.; Zhu, X.; Pang, J.; Lin, D. Probabilistic and geometric depth: Detecting objects in perspective. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; pp. 1475–1485. [Google Scholar]
- He, C.; Huang, J.; Hua, X.S.; Zhang, L. Aug3d-rpn: Improving monocular 3d object detection by synthetic images with virtual depth. arXiv 2021, arXiv:2107.13269. [Google Scholar]
- Park, D.; Ambrus, R.; Guizilini, V.; Li, J.; Gaidon, A. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3142–3152. [Google Scholar]
- Wang, L.; Zhang, L.; Zhu, Y.; Zhang, Z.; He, T.; Li, M.; Xue, X. Progressive coordinate transforms for monocular 3d object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 13364–13377. [Google Scholar]
- Liu, Z.; Zhou, D.; Lu, F.; Fang, J.; Zhang, L. Autoshape: Real-time shape-aware monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15641–15650. [Google Scholar]
- Liu, C.; Gu, S.; Van Gool, L.; Timofte, R. Deep line encoding for monocular 3d object detection and depth prediction. In Proceedings of the 32nd British Machine Vision Conference (BMVC 2021), Virtual, 22–25 November 2021; BMVA Press: Durham, UK, 2021; p. 354. [Google Scholar]
- Liu, X.; Xue, N.; Wu, T. Learning auxiliary monocular contexts helps monocular 3d object detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1810–1818. [Google Scholar] [CrossRef]
- Huang, K.C.; Wu, T.H.; Su, H.T.; Hsu, W.H. Monodtr: Monocular 3d object detection with depth-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4012–4021. [Google Scholar]
- Zhang, R.; Qiu, H.; Wang, T.; Guo, Z.; Cui, Z.; Qiao, Y.; Li, H.; Gao, P. Monodetr: Depth-guided transformer for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 9155–9166. [Google Scholar]
- Lian, Q.; Li, P.; Chen, X. Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1070–1079. [Google Scholar]
- Gu, J.; Wu, B.; Fan, L.; Huang, J.; Cao, S.; Xiang, Z.; Hua, X.S. Homography loss for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1080–1089. [Google Scholar]
- Li, Z.; Qu, Z.; Zhou, Y.; Liu, J.; Wang, H.; Jiang, L. Diversity matters: Fully exploiting depth clues for reliable monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2791–2800. [Google Scholar]
- Yang, L.; Zhang, X.; Li, J.; Wang, L.; Zhu, M.; Zhang, C.; Liu, H. Mix-teaching: A simple, unified and effective semi-supervised learning framework for monocular 3d object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6832–6844. [Google Scholar] [CrossRef]
- Li, Y.; Chen, Y.; He, J.; Zhang, Z. Densely constrained depth estimator for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 718–734. [Google Scholar]
- Kumar, A.; Brazil, G.; Corona, E.; Parchami, A.; Liu, X. Deviant: Depth equivariant network for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 664–683. [Google Scholar]
- Brazil, G.; Kumar, A.; Straub, J.; Ravi, N.; Johnson, J.; Gkioxari, G. Omni3d: A large benchmark and model for 3d object detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13154–13164. [Google Scholar]
- Zhou, Y.; Liu, Q.; Zhu, H.; Li, Y.; Chang, S.; Guo, M. Mogde: Boosting mobile monocular 3d object detection with ground depth estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 2033–2045. [Google Scholar]
- Wu, Z.; Wu, Y.; Pu, J.; Li, X.; Wang, X. Attention-based depth distillation with 3d-aware positional encoding for monocular 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2892–2900. [Google Scholar] [CrossRef]
- Hong, Y.; Dai, H.; Ding, Y. Cross-modality knowledge distillation network for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 87–104. [Google Scholar]
- Wu, Z.; Gan, Y.; Wang, L.; Chen, G.; Pu, J. Monopgc: Monocular 3d object detection with pixel geometry contexts. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 4842–4849. [Google Scholar]
- Zhou, Y.; Zhu, H.; Liu, Q.; Chang, S.; Guo, M. Monoatt: Online monocular 3d object detection with adaptive token transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17493–17503. [Google Scholar]
- Min, Z.; Zhuang, B.; Schulter, S.; Liu, B.; Dunn, E.; Chandraker, M. Neurocs: Neural nocs supervision for monocular 3d object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21404–21414. [Google Scholar]
- Zhang, W.; Liu, D.; Ma, C.; Cai, W. Alleviating foreground sparsity for semi-supervised monocular 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 7542–7552. [Google Scholar]
- Jia, J.; Li, Z.; Shi, Y. Monouni: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. Adv. Neural Inf. Process. Syst. 2023, 36, 11703–11715. [Google Scholar]
- Vu, K.D.; Tran, T.T.; Nguyen, D.D. MonoDSSMs: Efficient Monocular 3D Object Detection with Depth-Aware State Space Models. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3883–3900. [Google Scholar]
- Jiang, X.; Jin, S.; Zhang, X.; Shao, L.; Lu, S. MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders. arXiv 2024, arXiv:2405.07696. [Google Scholar]
- Ranasinghe, Y.; Hegde, D.; Patel, V.M. Monodiff: Monocular 3d object detection and pose estimation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10659–10670. [Google Scholar]
- Gao, Y.; Wang, P.; Li, X.; Sun, M.; Di, R.; Li, L.; Hong, W. MonoDFNet: Monocular 3D Object Detection with Depth Fusion and Adaptive Optimization. Sensors 2025, 25, 760. [Google Scholar] [CrossRef]
- Zhang, J.; Li, J.; Lin, X.; Zhang, W.; Tan, X.; Han, J.; Ding, E.; Wang, J.; Li, G. Decoupled pseudo-labeling for semi-supervised monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16923–16932. [Google Scholar]
- Shi, P.; Dong, X.; Ge, R.; Liu, Z.; Yang, A. Dp-M3D: Monocular 3D object detection algorithm with depth perception capability. Knowl.-Based Syst. 2025, 318, 113539. [Google Scholar] [CrossRef]
- Kim, J.; Moon, S.; Nah, S.; Shim, D.H. MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model. arXiv 2025, arXiv:2502.00315. [Google Scholar]
- Gao, R.; Kim, J.; Phuong, M.C.; Cho, K. Pseudo-LiDAR with Two-Dimensional Instance for Monocular Three-Dimensional Object Tracking. IEEE Access 2025, 13, 45771–45783. [Google Scholar] [CrossRef]
- Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3d object proposals for accurate object class detection. Adv. Neural Inf. Process. Syst. 2015, 28, 424–432. [Google Scholar]
- Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
- Qin, Z.; Wang, J.; Lu, Y. Triangulation learning network: From monocular to stereo 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7615–7623. [Google Scholar]
- You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv 2019, arXiv:1906.06310. [Google Scholar]
- Königshof, H.; Salscheider, N.O.; Stiller, C. Realtime 3d object detection for automated driving using stereo vision and semantic information. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 1405–1410. [Google Scholar]
- Li, C.; Ku, J.; Waslander, S.L. Confidence guided stereo 3D object detection with split depth estimation. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 5776–5783. [Google Scholar]
- Xu, Z.; Zhang, W.; Ye, X.; Tan, X.; Yang, W.; Wen, S.; Ding, E.; Meng, A.; Huang, L. Zoomnet: Part-aware adaptive zooming neural network for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12557–12564. [Google Scholar] [CrossRef]
- Sun, J.; Chen, L.; Xie, Y.; Zhang, S.; Jiang, Q.; Zhou, X.; Bao, H. Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10548–10557. [Google Scholar]
- Qian, R.; Garg, D.; Wang, Y.; You, Y.; Belongie, S.; Hariharan, B.; Campbell, M.; Weinberger, K.Q.; Chao, W.L. End-to-end pseudo-lidar for image-based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5881–5890. [Google Scholar]
- Garg, D.; Wang, Y.; Hariharan, B.; Campbell, M.; Weinberger, K.Q.; Chao, W.L. Wasserstein distances for stereo disparity estimation. Adv. Neural Inf. Process. Syst. 2020, 33, 22517–22529. [Google Scholar]
- Pon, A.D.; Ku, J.; Li, C.; Waslander, S.L. Object-centric stereo matching for 3d object detection. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8383–8389. [Google Scholar]
- Li, P.; Su, S.; Zhao, H. Rts3d: Real-time stereo 3d detection from 4d feature-consistency embedding space for autonomous driving. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1930–1939. [Google Scholar] [CrossRef]
- Königshof, H.; Stiller, C. Learning-based shape estimation with grid map patches for realtime 3D object detection for automated driving. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6. [Google Scholar]
- Peng, X.; Zhu, X.; Wang, T.; Ma, Y. Side: Center-based stereo 3d detector with structure-aware instance depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 119–128. [Google Scholar]
- Shi, Y.; Guo, Y.; Mi, Z.; Li, X. Stereo CenterNet-based 3D object detection for autonomous driving. Neurocomputing 2022, 471, 219–229. [Google Scholar] [CrossRef]
- Gao, A.; Pang, Y.; Nie, J.; Shao, Z.; Cao, J.; Guo, Y.; Li, X. ESGN: Efficient stereo geometry network for fast 3D object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 34, 2000–2009. [Google Scholar] [CrossRef]
- Chen, Y.N.; Dai, H.; Ding, Y. Pseudo-stereo for monocular 3d object detection in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 887–897. [Google Scholar]
- Chen, Y.; Huang, S.; Liu, S.; Yu, B.; Jia, J. Dsgn++: Exploiting visual-spatial relation for stereo-based 3d detectors. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4416–4429. [Google Scholar] [CrossRef]
- Peng, L.; Wu, X.; Yang, Z.; Liu, H.; Cai, D. Did-m3d: Decoupling instance depth for monocular 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 71–88. [Google Scholar]
- Chen, J.; Wang, Q.; Peng, W.; Xu, H.; Li, X.; Xu, W. Disparity-based multiscale fusion network for transportation detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18855–18863. [Google Scholar] [CrossRef]
- Shi, Y. Svdm: Single-view diffusion model for pseudo-stereo 3d object detection. arXiv 2023, arXiv:2307.02270. [Google Scholar]
- Choi, H.M.; Kang, H.; Hyun, Y. Multi-view reprojection architecture for orientation estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2357–2366. [Google Scholar]
- Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
- Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1477–1485. [Google Scholar] [CrossRef]
- Rukhovich, D.; Vorontsova, A.; Konushin, A. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2397–2406. [Google Scholar]
- Wang, Z.; Min, C.; Ge, Z.; Li, Y.; Li, Z.; Yang, H.; Huang, D. Sts: Surround-view temporal stereo for multi-view 3d detection. arXiv 2022, arXiv:2208.10145. [Google Scholar]
- Zhang, Y.; Zhu, Z.; Zheng, W.; Huang, J.; Huang, G.; Zhou, J.; Lu, J. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv 2022, arXiv:2205.09743. [Google Scholar]
- Jiang, Y.; Zhang, L.; Miao, Z.; Zhu, X.; Gao, J.; Hu, W.; Jiang, Y.G. Polarformer: Multi-camera 3d object detection with polar transformer. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1042–1050. [Google Scholar] [CrossRef]
- Wang, S.; Jiang, X.; Li, Y. Focal-petr: Embracing foreground for efficient multi-camera 3d object detection. IEEE Trans. Intell. Veh. 2023, 9, 1481–1489. [Google Scholar] [CrossRef]
- Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Bevdistill: Cross-modal bev distillation for multi-view 3d object detection. arXiv 2022, arXiv:2211.09386. [Google Scholar]
- Park, J.; Xu, C.; Yang, S.; Keutzer, K.; Kitani, K.; Tomizuka, M.; Zhan, W. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv 2022, arXiv:2210.02443. [Google Scholar] [CrossRef]
- Wang, S.; Liu, Y.; Wang, T.; Li, Y.; Zhang, X. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3621–3631. [Google Scholar]
- Yu, Z.; Liu, Q.; Wang, W.; Zhang, L.; Zhao, X. PolarBEVDet: Exploring Polar Representation for Multi-View 3D Object Detection in Bird’s-Eye-View. arXiv 2024, arXiv:2408.16200. [Google Scholar]
- Wirges, S.; Fischer, T.; Stiller, C.; Frias, J.B. Object detection and classification in occupancy grid maps using deep convolutional networks. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3530–3535. [Google Scholar]
- Zhou, J.; Tan, X.; Shao, Z.; Ma, L. FVNet: 3D front-view proposal generation for real-time object detection from point clouds. In Proceedings of the 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Huaqiao, China, 19–21 October 2019; pp. 1–8. [Google Scholar]
- Zhang, Y.; Xiang, Z.; Qiao, C.; Chen, S. Accurate and Real-Time Object Detection Based on Bird’s Eye View on 3D Point Clouds. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 214–221. [Google Scholar]
- Yang, B.; Liang, M.; Urtasun, R. Hdnet: Exploiting hd maps for 3d object detection. In Proceedings of the Conference on Robot Learning, Zürich, Switzerland, 29–31 October 2018; pp. 146–155. [Google Scholar]
- Barrera, A.; Beltran, J.; Guindel, C.; Iglesias, J.A.; Garcia, F. Birdnet+: Two-stage 3d object detection in lidar through a sparsity-invariant bird’s eye view. IEEE Access 2021, 9, 160299–160316. [Google Scholar] [CrossRef]
- Koh, J.; Lee, J.; Lee, Y.; Kim, J.; Choi, J.W. Mgtanet: Encoding sequential lidar points using long short-term motion-guided temporal attention for 3d object detection. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1179–1187. [Google Scholar] [CrossRef]
- Li, Z.; Guo, J.; Cao, T.; Liu, B.; Yang, W. Gpa-3d: Geometry-aware prototype alignment for unsupervised domain adaptive 3d object detection from point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6394–6403. [Google Scholar]
- Wang, D.Z.; Posner, I. Voting for voting in online point cloud object detection. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; Volume 1, pp. 10–15. [Google Scholar]
- He, C.; Li, R.; Li, S.; Zhang, L. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8417–8427. [Google Scholar]
- Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437. [Google Scholar]
- Shi, G.; Li, R.; Ma, C. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 35–52. [Google Scholar]
- Sun, P.; Tan, M.; Wang, W.; Liu, C.; Xia, F.; Leng, Z.; Anguelov, D. Swformer: Sparse window transformer for 3d object detection in point clouds. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 426–442. [Google Scholar]
- Fei, H.; Zhao, J.; Zhang, Z.; Wang, H.; Huang, X. PV-GNN: Point-Voxel 3D Object Detection based on Graph Neural Network. Res. Sq. 2024. [Google Scholar] [CrossRef]
- Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Ipod: Intensive point-based object detector for point cloud. arXiv 2018, arXiv:1812.05276. [Google Scholar] [CrossRef]
- Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
- Zarzar, J.; Giancola, S.; Ghanem, B. PointRGCN: Graph convolution networks for 3D vehicles detection refinement. arXiv 2019, arXiv:1911.12236. [Google Scholar] [CrossRef]
- Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
- Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
- Liu, Z.; Huang, T.; Li, B.; Chen, X.; Wang, X.; Bai, X. EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 8324–8341. [Google Scholar] [CrossRef]
- Chen, C.; Chen, Z.; Zhang, J.; Tao, D. Sasa: Semantics-augmented set abstraction for point-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 221–229. [Google Scholar] [CrossRef]
- Tang, Q.; Bai, X.; Guo, J.; Pan, B.; Jiang, W. DFAF3D: A dual-feature-aware anchor-free single-stage 3D detector for point clouds. Image Vis. Comput. 2023, 129, 104594. [Google Scholar] [CrossRef]
- Xia, Q.; Ye, W.; Wu, H.; Zhao, S.; Xing, L.; Huang, X.; Deng, J.; Li, X.; Wen, C.; Wang, C. Hinted: Hard instance enhanced detector with mixed-density feature fusion for sparsely-supervised 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15321–15330. [Google Scholar]
- Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3d deep learning. Adv. Neural Inf. Process. Syst. 2019, 32, 965–975. [Google Scholar]
- He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882. [Google Scholar]
- Qian, R.; Lai, X.; Li, X. BADet: Boundary-aware 3D object detection from point clouds. Pattern Recognit. 2022, 125, 108524. [Google Scholar] [CrossRef]
- Mao, J.; Niu, M.; Bai, H.; Liang, X.; Xu, H.; Xu, C. Pyramid r-cnn: Towards better performance and adaptability for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2723–2732. [Google Scholar]
- He, Y.; Xia, G.; Luo, Y.; Su, L.; Zhang, Z.; Li, W.; Wang, P. DVFENet: Dual-branch voxel feature extraction network for 3D object detection. Neurocomputing 2021, 459, 201–211. [Google Scholar] [CrossRef]
- Yang, Z.; Jiang, L.; Sun, Y.; Schiele, B.; Jia, J. A unified query-based paradigm for point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8541–8551. [Google Scholar]
- Yang, H.; Wang, W.; Chen, M.; Lin, B.; He, T.; Chen, H.; He, X.; Ouyang, W. Pvt-ssd: Single-stage 3d object detector with point-voxel transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13476–13487. [Google Scholar]
- Koo, I.; Lee, I.; Kim, S.H.; Kim, H.S.; Jeon, W.J.; Kim, C. Pg-rcnn: Semantic surface point generation for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18142–18151. [Google Scholar]
- Wang, Z.; Li, Y.L.; Chen, X.; Zhao, H.; Wang, S. Uni3detr: Unified 3d detection transformer. Adv. Neural Inf. Process. Syst. 2023, 36, 39876–39896. [Google Scholar]
- Zhang, X.; Wang, L.; Chen, J.; Fang, C.; Yang, G.; Wang, Y.; Yang, L.; Song, Z.; Liu, L.; Zhang, X.; et al. Dual radar: A multi-modal dataset with dual 4d radar for autonomous driving. Sci. Data 2025, 12, 439. [Google Scholar] [CrossRef] [PubMed]
- Cheng, J.H.; Kuan, S.Y.; Liu, H.I.; Latapie, H.; Liu, G.; Hwang, J.N. Centerradarnet: Joint 3d object detection and tracking framework using 4d fmcw radar. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 998–1004. [Google Scholar]
- Shin, K.; Kwon, Y.P.; Tomizuka, M. Roarnet: A robust 3d object detection based on region approximation refinement. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2510–2515. [Google Scholar]
- Simon, M.; Amende, K.; Kraus, A.; Honer, J.; Sämann, T.; Kaulbersch, H.; Milz, S.; Gross, H.M. Complexer-yolo: Real-time 3d object detection and tracking on semantic point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 3047–3054. [Google Scholar]
- Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3d detection. Adv. Neural Inf. Process. Syst. 2021, 34, 16494–16507. [Google Scholar]
- Wang, C.; Ma, C.; Zhu, M.; Yang, X. Pointaugmenting: Cross-modal augmentation for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11794–11803. [Google Scholar]
- Li, J.; Chen, L.; Li, Z. Height-Adaptive Deformable Multi-Modal Fusion for 3D Object Detection. IEEE Access 2025, 13, 52385–52396. [Google Scholar] [CrossRef]
- Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253. [Google Scholar]
- Sindagi, V.A.; Zhou, Y.; Tuzel, O. Mvx-net: Multimodal voxelnet for 3d object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12460–12467. [Google Scholar] [CrossRef]
- Wang, J.; Zhu, M.; Sun, D.; Wang, B.; Gao, W.; Wei, H. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection. IEEE Access 2019, 7, 90801–90814. [Google Scholar] [CrossRef]
- Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16. Springer: Cham, Switzerland, 2020; pp. 720–736. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
- Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191. [Google Scholar]
- Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.G. Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21643–21652. [Google Scholar]
- Zhang, Y.; Chen, J.; Huang, D. Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 908–917. [Google Scholar]
- Li, X.; Shi, B.; Hou, Y.; Wu, X.; Ma, T.; Li, Y.; He, L. Homogeneous multi-modal feature fusion and interaction for 3D object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 691–707. [Google Scholar]
- Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y.; et al. Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17524–17534. [Google Scholar]
- Ren, B.; Yin, J. SDVRF: Sparse-to-Dense Voxel Region Fusion for Multi-modal 3D Object Detection. arXiv 2023, arXiv:2304.08304. [Google Scholar]
- Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 22014–22024. [Google Scholar]
- Yin, Z.; Sun, H.; Liu, N.; Zhou, H.; Shen, J. Fgfusion: Fine-grained lidar-camera fusion for 3d object detection. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; Springer: Singapore, 2023; pp. 505–517. [Google Scholar]
- Huang, L.; Li, Z.; Sima, C.; Wang, W.; Wang, J.; Qiao, Y.; Li, H. Leveraging vision-centric multi-modal expertise for 3d object detection. Adv. Neural Inf. Process. Syst. 2023, 36, 38504–38519. [Google Scholar]
- Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6792–6802. [Google Scholar]
- Viadero-Monasterio, F.; Alonso-Rentería, L.; Pérez-Oria, J.; Viadero-Rueda, F. Radar-based pedestrian and vehicle detection and identification for driving assistance. Vehicles 2024, 6, 1185–1199. [Google Scholar] [CrossRef]
Sensor | Range | Accuracy | Cost | Comput. Cost | Size | Depth | Colour | Affected by Illumination | Affected by Weather |
---|---|---|---|---|---|---|---|---|---|
Monocular Camera | Medium | Medium | Low | High | Small | No | Yes | Yes | Yes |
Stereo Camera | Medium | Medium | Medium | High | Medium | Yes | Yes | Yes | Yes |
Infrared Camera | Medium | Medium | Low | Medium | Small | No | No | No | Yes |
Sonar/Ultrasonic | Low | Low | Low | Low | Small | Yes | No | No | No |
Radar | High | Medium | Medium | Medium | Medium | Yes | No | No | No |
LiDAR | High | High | High | Medium | Large | Yes | No | No | Yes |
Dataset | Year | # Cameras | # LiDARs | # Scenes | # Classes | Locations | Night | Rain | Annotated 3D BBoxes | Annotated Frames |
---|---|---|---|---|---|---|---|---|---|---|
KITTI [28] | 2012 | 2 | 1 | 22 | 3 | Germany | No | No | 80k | 15k |
ApolloScape [38] | 2018 | 2 | 2 | 73 | 27 | China | Yes | No | 70k | 80k |
nuScenes [36] | 2019 | 6 | 1 | 1000 | 23 | USA/Singapore | Yes | Yes | 1.4M | 40k |
Argoverse [39] | 2019 | 9 | 2 | 113 | 15 | USA | Yes | Yes | 993k | 22k |
Waymo Open [37] | 2019 | 5 | 5 | 1150 | 4 | USA | Yes | Yes | 12M | 230k |
Lyft Level 5 [40] | 2019 | 7 | 3 | 366 | 9 | USA | No | No | 1.3M | 46k |
H3D [41] | 2019 | 3 | 1 | 160 | 8 | USA | No | No | 1.1M | 27k |
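
These benchmarks score detections by matching predicted boxes against the annotated 3D bounding boxes listed above: KITTI and Waymo Open match on 3D intersection-over-union (IoU), while nuScenes matches on centre distance. As a minimal, hedged sketch of the underlying overlap computation, the Python snippet below computes 3D IoU for axis-aligned boxes parameterised as (x, y, z, w, l, h); the official protocols additionally handle yaw-rotated boxes and aggregate matches into average precision, so this is illustrative rather than any benchmark's exact implementation.

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU for axis-aligned boxes given as (x, y, z, w, l, h),
    with (x, y, z) the box centre and (w, l, h) its extents.
    Illustrative only: benchmark protocols score yaw-rotated boxes."""
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Per-axis overlap; clipped to zero when the boxes are disjoint.
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = overlap.prod()
    union = a[3:].prod() + b[3:].prod() - inter
    return inter / union

# Two 1 m^3 boxes whose centres are 0.5 m apart along x -> IoU = 1/3.
print(iou_3d_axis_aligned((0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1, 1, 1)))
```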
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).