Review

A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends

1 Computer School, Beijing Information Science and Technology University, Beijing 100101, China
2 School of Software, Northwestern Polytechnical University, Xi’an 710129, China
3 Yangtze River Delta Research Institute, Northwestern Polytechnical University, Taicang 215400, China
4 Department of Electrical and Electronic Engineering at City, University of London, London EC1V 0HB, UK
* Author to whom correspondence should be addressed.
Drones 2024, 8(8), 412; https://doi.org/10.3390/drones8080412
Submission received: 11 June 2024 / Revised: 7 August 2024 / Accepted: 20 August 2024 / Published: 22 August 2024

Abstract

In recent years, with the continuous development of autonomous driving technology, 3D object detection has become a key focus in research on perception systems for autonomous driving. As the most crucial component of these systems, 3D object detection has gained significant attention. Researchers increasingly favor the Transformer deep learning architecture for its powerful long-term modeling ability and excellent feature fusion capability, and a large number of excellent Transformer-based 3D object detection methods have emerged. This article categorizes these methods according to their data sources. First, we analyze the different input data sources and list standard datasets and evaluation metrics. Second, we introduce methods based on different input data and summarize the performance of representative methods on different datasets. Finally, we summarize the limitations of current research, discuss future directions, and provide some innovative perspectives.

1. Introduction

Autonomous driving (AD), as a revolutionary technology in the field of transportation, is rapidly reshaping our perception of mobility. The emergence of this technology is attributed to the rapid advancement of deep learning, particularly the widespread adoption of Convolutional Neural Networks (CNNs), which has significantly enhanced accuracy, real-time processing, and scalability in 2D object detection [1]. As autonomous driving has evolved, the demands placed on perception systems have also grown. With the requirement for more comprehensive and accurate perception, 3D object detection has become one of the main focuses among researchers [2].
Three-dimensional object detection involves classification and regression: classification determines an object’s category, and regression precisely locates its position in 3D space through bounding boxes. This process provides crucial information, such as category, size, and orientation, which is vital for perception systems in autonomous driving. However, processing point clouds or multiple RGB images with CNNs is challenging. CNNs introduce computational redundancy through potentially unnecessary calculations. Additionally, CNNs have a limited receptive field that focuses only on local context and may lack access to global features, making it difficult to capture broader information [3]. To address these challenges, researchers are turning to the Transformer [4] architecture, initially popularized in natural language processing, which offers novel perspectives for solving perception challenges in autonomous driving. Transformers excel at modeling long-term dependencies, leveraging contextual information, and integrating local and global features, which sets them apart from traditional CNNs. The successful application of the DETR [5] model has further sparked interest in using the Transformer architecture in 3D object detection.
In the earlier stages of research, attempts were made to extend successful 2D object detection techniques to the realm of 3D object detection, particularly by employing methods such as voxelization on LiDAR point cloud data. However, the inherent sparsity of point cloud data presented significant challenges, ultimately resulting in suboptimal algorithm performance. A notable breakthrough occurred in 2020 with the introduction of MLCVNet [6], a model built upon the foundation of VoteNet [7]. MLCVNet innovatively incorporated a Self-attention mechanism, enabling the comprehensive extraction of features. The success of this approach on various datasets demonstrates the effectiveness of Self-attention mechanisms in capturing useful features from sparse point clouds and provides a new direction for 3D object detection.
Despite these advancements, the cost associated with LiDAR technology prompted strategic shifts in the industry, exemplified by companies like Tesla transitioning towards image-based 3D object detection. BEVFormer [8], proposed by Li et al. from the Shanghai AI Lab, emerged as a noteworthy solution in this context. Leveraging the structural properties of Transformers, BEVFormer constructs a high-dimensional feature space tailored specifically for 3D object detection tasks using pure vision-based data. This innovative approach has demonstrated exceptional performance, particularly showcased on benchmark datasets such as nuScenes [9]. The strategic utilization of Transformer structures in BEVFormer represents a paradigm shift in addressing the challenges posed by sparse LiDAR data, offering a promising alternative in the field of 3D object detection.
Several review articles [10,11,12,13,14] have meticulously examined and discussed Transformer-based object detection methods, contributing significantly to the field. These works have provided valuable insights, categorizing methods based on technical routes like image-based or point cloud-based approaches. However, there is a gap in the literature regarding a comprehensive analysis that intertwines the technical nuances of Transformer architectures with their practical applications in diverse autonomous driving scenarios. Our survey seeks to fill this gap by offering a more encompassing perspective that not only dissects the technicalities of various Transformer-based 3D object detection methods but also contextualizes them within the broader framework of autonomous driving technology. Building on the foundational work of our predecessors, we aim to bridge the divide between theoretical constructs and real-world applicability, thereby providing a holistic view of the current state and future potential of Transformer-based 3D object detection.
In this article, the main contributions of our work are:
  • Comprehensive Overview of Input Data and Methodologies: We present an in-depth analysis of input data characteristics, standard datasets, and commonly used evaluation metrics in Transformer-based 3D object detection. This includes a detailed outline of fundamental methodologies and the core mechanisms of current Transformer models, emphasizing how these models are uniquely suited for 3D object detection in autonomous driving. Our analysis delves into the intricate details of Transformer architectures, revealing their foundational principles and operational efficiencies.
  • Novel Taxonomy for Transformer-based Approaches: We introduce an innovative classification method that categorizes Transformer-based 3D object detection methodologies according to their data source types. This approach allows for a nuanced understanding of each method’s motivations, technical intricacies, and advantages. We provide a critical assessment of selected methodologies, comparing their performance on standard datasets to offer a comprehensive evaluation. This classification offers a new perspective in the field, facilitating a clearer understanding of the characteristics and shortcomings of methods based on different data.
  • Future Directions and Practical Implications: We identify and discuss the current challenges and limitations in Transformer-based 3D object detection. Building on this, we envision potential future research directions and highlight areas ripe for innovation. Furthermore, we explore practical applications of these technologies, considering their impact on the future development of autonomous driving systems. This foresight not only guides future research efforts but also bridges the gap between academic research and practical, real-world applications.
Figure 1 shows an explanation of the structure of this article. The structure and organization of this article are as follows. Firstly, in Section 2, we introduce the data sources, datasets and evaluation metrics of 3D object detection. We also provide a detailed explanation of the Transformer structure. Then, we review and analyze 3D object detection methods based on images, point clouds, and multi-modal in Section 3. Finally, in Section 4, we elaborate the challenges and development trends in current Transformer-based 3D object detection research and provide some innovative perspectives.

2. Background

2.1. Data Sources

Three-dimensional object detection methods for autonomous driving rely on multiple sensor data to achieve high-precision environment perception and object detection. Common sources of data include monocular cameras, stereo cameras, LiDAR, radar, etc. Table 1 provides a detailed description of the advantages and disadvantages of some common sensors, as well as their applications in autonomous vehicles. Monocular cameras offer cost-effectiveness and portability, but their limited depth perception necessitates additional techniques for accurate spatial understanding. Stereo cameras provide precise 3D information by leveraging multiple lenses for depth perception but are complex and expensive to design and calibrate. LiDAR excels in delivering high-precision 3D data, but its deployment cost is relatively high [15]. Radar, known for adaptability and long-range object detection, has limited spatial resolution compared to LiDAR [16].
The outputs of these sensors, listed in Table 1, also deserve attention. The images captured by cameras have rich color and texture attributes, while point clouds captured by sensors such as LiDAR carry depth and structural information. Consequently, single-modal 3D object detection methods have some limitations: camera-only approaches may lack crucial depth information, and point cloud-based methods may lack image texture. In autonomous driving, a common approach, such as Huawei’s autonomous driving solution GOD Network, involves sensor fusion, combining the strengths of various sensors to achieve comprehensive and robust environmental perception and navigation capabilities. Multi-modal 3D object detection methods, through feature fusion, achieve superior performance by leveraging features from various sensors [17]. Compared to single-modal approaches, multi-modal 3D object detection methods fully exploit the strengths of multiple modalities by combining point cloud depth information and image texture details to enhance 3D object detection.
Table 1. Several common sensors used in 3D object detection for autonomous driving [18,19].
Sensors | Advantages | Disadvantages | Applications in Autonomous Vehicles | Output
Monocular Camera | Lower cost and lightweight; easy integration; low power consumption; strong adaptability | Limited depth perception; ambiguity in scene structure; vulnerability to lighting conditions; challenges in 3D scene understanding | Traffic sign and signal recognition; lane detection; environmental monitoring; light condition analysis; reading road conditions | RGB images
Stereo Camera | Lower cost; depth perception; better depth resolution; reduced sensitivity to lighting conditions | Limited field of view; calibration difficulty; power consumption; limited mobility | Depth perception and obstacle detection; mapping and localization; cross-traffic alert | RGB images, Point clouds
LiDAR | High-resolution spatial information; precise distance measurement; independent of lighting conditions; applicable in various environments | High cost; limited perception range; susceptible to harsh weather conditions | Essential for detailed environment mapping and classification; particularly useful for complex navigation tasks in urban environments | Point clouds
Radar | Adaptability; long-range perception; lower power consumption | Limited spatial resolution; inability to provide high-precision maps | Primarily used for detecting vehicles and large objects at a distance; particularly useful for adaptive cruise control and collision avoidance | Point clouds

2.2. Transformers and Attention Mechanism

Transformer is a neural network structure with an encoder and a decoder as its main components, which has previously been widely used as the main neural structure for natural language processing [20,21,22]. Given the unique ability of its structure to model long-term dependencies, it has been successfully adapted to the field of computer vision for autonomous driving, robotics, visual computing, intelligent monitoring and industrial inspection. The standard Transformer encoder part generally consists of six main components [3], as shown in Figure 2: (1) Input Embedding; (2) Positional encoding; (3) Self-attention mechanism; (4) Normalization; (5) Feed-forward; and (6) Skip-connection structure. As for the Transformer decoder, it is typically designed to mirror the Transformer encoder. The decoder part differs slightly from the encoder in that it may accept potential input features from the encoder.
Firstly, in the input embedding part, the original input features are mapped into high-dimensional features by a Multi-Layer Perceptron (MLP) or another feature extraction backbone network (such as VoVNet [23] or PointNet [24]) to facilitate subsequent learning operations. Secondly, the positional encoding part is one of the reasons why the Transformer structure can obtain a larger receptive field than traditional CNNs: different blocks are encoded so that effective global image information can be captured, making positional encoding critical for vision Transformers. Thirdly, there is the core part of the Transformer, the Self-attention module. A sine- or cosine-based positional encoding is typically added to the embedded feature map X, and three learnable weight matrices $W_Q$, $W_K$, $W_V$ are then used to project that feature map into three different feature spaces. In this way, the Query, Key and Value matrices can be expressed as in Equation (1):
$\mathrm{Query} = XW_Q, \quad \mathrm{Key} = XW_K, \quad \mathrm{Value} = XW_V$
Based on the Query, Key and Value matrices, an attention map can be formulated as in Equation (2):
$\mathrm{Attention\ map} = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{C_K}}\right)$
where Q, K, and V denote the Query, Key, and Value matrices, respectively. The N × N attention map measures the similarity between each pair of points. The similarity matrix is then multiplied by the Value matrix to generate a feature map F of the same size as X. Each feature vector in F is obtained by computing a weighted sum of all the input features and is therefore connected to all of them. When the input is global, the Transformer can easily learn global features through this process, gaining a very large receptive field, which is extremely important for the 3D object detection task. After normalization, the features pass through a feed-forward layer to refine the feature representation. Finally, inspired by the residual connection idea of ResNet [25], a skip connection is added between the input and output of the Self-attention module to enhance learning. For 3D object detection, the Transformer structure has the following advantages compared to traditional CNNs [20] (a minimal code sketch of the attention computation follows the list):
  • Positional Encoding: Transformer introduces positional encoding to handle the position information of elements in a sequence, which is crucial for spatial relationships in 3D object detection. Positional encoding aids the model in better understanding the relative positions of target objects, thereby improving detection accuracy.
  • Query-Value Association: In the Self-attention, computing the association between Query and Key enables Transformer to accurately capture the correlations between different elements in the input sequence. In the context of 3D object detection, this is crucial for determining the features and positions of target objects.
  • Multi-Head Structure: Transformer’s multi-head structure enables the model to simultaneously focus on different aspects of the input sequence. This is beneficial for handling diverse information in 3D scenes, such as color, shape and size, contributing to improved overall detection performance.
  • Global Relationship Modeling: The Self-attention mechanism in the Transformer allows the model to attend globally to different parts of the input sequence rather than being limited to local regions. In 3D object detection, this mechanism helps the model better understand the entire scene, including the distribution and relative positions of objects.
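To make the attention computation in Equations (1) and (2) concrete, the following minimal, single-head Self-attention sketch in PyTorch mirrors the projection, scaled dot-product, softmax, and skip connection described above. It is an illustrative simplification (no multi-head split, normalization, or feed-forward layer), and all names are ours rather than from any specific detector.

```python
import torch
import torch.nn as nn


class SelfAttentionSketch(nn.Module):
    """Single-head Self-attention following Equations (1) and (2); illustrative only."""

    def __init__(self, dim: int):
        super().__init__()
        # Three learnable projection matrices W_Q, W_K, W_V from Equation (1).
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) embedded features (positional encoding already added).
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # Equation (2): N x N attention map over all token pairs.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Weighted sum of Values, plus the skip connection described above.
        return attn @ v + x


# Example: 2 samples, 1024 tokens (e.g., voxel or image patch embeddings), 64 channels.
features = torch.randn(2, 1024, 64)
out = SelfAttentionSketch(dim=64)(features)   # shape: (2, 1024, 64)
```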
Researchers have fully acknowledged the distinctive advantages of Transformer architectures in the realm of computer vision. References [26,27,28] have already applied the Transformer to video processing and image restoration and achieved good results. However, there are still many challenges in 3D object detection. One is the huge computational burden: the Self-attention mechanism requires a similarity calculation for every pair of tokens, which makes training difficult. Another challenge is multi-modal feature fusion for 3D object detection: relying solely on the Self-attention mechanism to extract features from a single input sequence is far from enough, since 3D object detection requires sufficient extraction of feature information across multiple modalities. Cross-attention and Deformable-attention have emerged as prominent choices for addressing these challenges, effectively performing feature fusion and reducing computational complexity in 3D object detection.

2.2.1. Cross-Attention

As shown in Figure 3, Cross-attention is an attention mechanism that mixes two different embedding sequences in the Transformer structure: one sequence acts as the query input and defines the sequence length of the output [4], while the other sequence provides the keys and values. The two embedded input sequences must therefore have the same feature dimensionality. At the same time, the two sequences can come from completely different modalities (e.g., one from point cloud data and the other from RGB image data).
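As a minimal illustration of this definition, the sketch below (our own simplified PyTorch code, not taken from any cited method) draws the Query from one embedding sequence and the Keys/Values from another; the two sequences may come from different modalities as long as they share the embedding dimension.

```python
import torch
import torch.nn as nn


class CrossAttentionSketch(nn.Module):
    """Single-head Cross-attention: queries from one sequence, keys/values from another."""

    def __init__(self, dim: int):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        # query_seq:   (B, Nq, C), e.g. learnable BEV/object queries or image tokens.
        # context_seq: (B, Nk, C), e.g. point cloud tokens; only C must match query_seq.
        q = self.W_q(query_seq)
        k, v = self.W_k(context_seq), self.W_v(context_seq)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, Nq, Nk)
        return attn @ v   # output length follows the query sequence


# Example: 200 queries attend to 4000 context tokens from a different modality.
queries, context = torch.randn(2, 200, 128), torch.randn(2, 4000, 128)
fused = CrossAttentionSketch(dim=128)(queries, context)   # shape: (2, 200, 128)
```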
Many recent advances in computer vision rely on the Self-attention mechanism, like ViT [29] and Swin Transformer [30], which serve as enhancements to backbone feature extractors. However, deploying Transformer architectures in large-scale resource-constrained embedded systems poses challenges and makes the incremental gains of Self-attention over well-supported CNNs potentially hard to justify. The Voxel Set Transformer [31] addresses this by incorporating query in Cross-attention as a hyperparameter. This modification significantly reduces the number of operations while improving model performance, enhancing the model’s inference capability for real-time object detection tasks. Unlike the Self-Attention mechanism, which processes a single input sequence, the Cross-attention mechanism handles a broader range of input sources. This ability often allows it to leverage features from cross-modal input data better, improving the accuracy of 3D object detection, especially in complex tasks.
On the other hand, there are well-established applications of Cross-attention. One of the pioneering studies applying Cross-attention to computer vision is DETR [5]. One of its most innovative parts is a Cross-attention decoder based on a fixed number of slots called object queries. Unlike the original Transformer, in which each query is fed into the decoder one after another, these queries are fed into the DETR decoder in parallel. The content of the queries is also learned and does not have to be specified prior to training, except for their number. These queries can be thought of as blank, pre-assigned templates to hold the object detection results, and the Cross-attention decoder is responsible for filling in the blanks. At the same time, DETR inspired the idea of using the Cross-attention decoder for view transformation, which led to a large amount of later work on BEV-based 3D object detection. The input views are fed into a feature encoder (Self-attention- or CNN-based), and the encoded features are used as keys and values. The queries in the target view format can be learned, which greatly helps in constructing the BEV feature space and offers the possibility of subsequently fusing multi-modal data.
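The decoding scheme described above can be sketched with standard PyTorch modules (assuming a recent PyTorch version): a set of learnable object queries is decoded in parallel against the encoded features, which act as keys and values. This is a schematic sketch of the idea, not the DETR reference implementation; the query count and dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

# Schematic DETR-style decoding with learnable object queries (illustrative values).
num_queries, dim = 100, 256
object_queries = nn.Embedding(num_queries, dim)              # blank, learnable "slots"
decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

encoded_features = torch.randn(2, 1500, dim)                 # encoder output: keys/values
tgt = object_queries.weight.unsqueeze(0).expand(2, -1, -1)   # all queries fed in parallel
decoded = decoder(tgt, encoded_features)                     # (2, 100, 256): one slot per detection
# Small prediction heads (e.g., nn.Linear) would then regress a box and class per slot.
```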

2.2.2. Deformable-Attention

The idea of Deformable-attention comes from Deformable DETR [32]. DETR eliminates the traces of manual design in the object detection task but suffers from slow convergence; moreover, the cost of Self-attention prevents the use of high-resolution feature maps, which leads to poor performance in small object detection. In contrast, Deformable DETR samples only a small number of keys near each reference point to calculate attention, greatly reducing computational complexity while allowing the use of multi-scale features to improve detection accuracy.
The following are the general steps for deformable attention:
  • Offset Learning: The model learns the attention shifts generated at each position. These offset vectors represent the direction and degree of attention adjustment at the current position.
  • Offset-based sampling: Using the learned offset, samples are taken from the original feature map to obtain sample points at different positions.
  • Attention weight calculation: Using sampled points, calculate attention weights. Usually, interpolation and other techniques are used to dynamically adjust the attention weights at the position.
  • Weighted feature summary: Using the calculated attention weights, the original features are weighted and summarized to obtain the final representation of attention features.
Deformable-attention combines Deformable CNN and Self-attention, focusing on a small number of points (e.g., three points are used in Deformable DETR) near the reference point to calculate similarity. Unlike the global, dense attention map computation of Self-attention, the locations of the sampled points in Deformable-attention are learnable, creating a local, sparse, and efficient attention mechanism. This unique advantage is highly valuable in 3D object detection. Follow-up works such as BEVFormer consider this attention mechanism a crucial component, opening up the possibility of deploying Transformer-based object detection models on embedded devices.
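The following single-head, single-scale sketch illustrates the four steps listed above: offsets are predicted from each query, features are bilinearly sampled at the offset locations, attention weights are computed over the sampled points only, and a weighted sum produces the output. It is a simplified, hypothetical implementation for illustration, not the multi-head, multi-scale operator used in Deformable DETR or BEVFormer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAttentionSketch(nn.Module):
    """Single-head, single-scale Deformable-attention (illustrative simplification)."""

    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.offset_proj = nn.Linear(dim, n_points * 2)  # step 1: 2D offsets per query
        self.weight_proj = nn.Linear(dim, n_points)      # step 3: one weight per sampled point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.n_points = n_points

    def forward(self, query, ref_points, feat_map):
        # query:      (B, Nq, C) query features
        # ref_points: (B, Nq, 2) normalized reference locations in [0, 1]
        # feat_map:   (B, C, H, W) value feature map
        B, Nq, C = query.shape
        value = self.value_proj(feat_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        # 1) Offset learning around each reference point.
        offsets = self.offset_proj(query).reshape(B, Nq, self.n_points, 2)
        # 2) Offset-based sampling locations (scaled to stay local), mapped to [-1, 1].
        locs = (ref_points.unsqueeze(2) + 0.1 * offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, locs, align_corners=False)   # (B, C, Nq, P)
        # 3) Attention weights over the sampled points only (sparse, not dense).
        weights = torch.softmax(self.weight_proj(query), dim=-1)    # (B, Nq, P)
        # 4) Weighted summary of the sampled features.
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)  # (B, Nq, C)
        return self.out_proj(out)


# Example: 50 queries each sample 4 points from a 64-channel, 32x32 feature map.
attn = DeformableAttentionSketch(dim=64)
out = attn(torch.randn(2, 50, 64), torch.rand(2, 50, 2), torch.randn(2, 64, 32, 32))
```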

2.3. Datasets

The availability of large datasets is critical to the success of data-driven deep learning techniques. At the same time, well-suited datasets can be effectively used to evaluate the performance of deep learning methods. Table 2 lists some common 3D object detection datasets for autonomous driving and the following section describes the datasets currently used in 3D object detection tasks.
KITTI [15]: The KITTI dataset, created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago in the USA, is one of the most widely used international benchmarks for assessing computer vision methods in autonomous driving scenarios. It is instrumental in evaluating the performance of computer vision technologies, including stereo matching, optical flow, visual odometry, 3D object detection, and 3D object tracking, in in-vehicle environments. The dataset comprises real image data captured in urban, rural, and highway scenes, featuring up to 15 vehicles and 30 pedestrians per image, along with varying levels of occlusion and truncation. In total, the dataset includes 389 pairs of stereo images and optical flow maps, a 39.2 km visual odometry sequence, and over 200k 3D-labeled objects, sampled and synchronized at 10 Hz. The original dataset categories encompass roads, cities, dwellings, and people. For 3D object detection, the labels are further divided into car, van, truck, pedestrian, cyclist, tram, and miscellaneous categories.
ApolloScape [33]: The ApolloScape dataset, published by Baidu Research in 2019, contains 103 outdoor scenes and 26 object classes, with a total of 143,906 images and their corresponding point cloud data, though without 360° acquisition. The sensor configuration includes two VUX-1HA laser scanners and six VMX-CS6 cameras that generate high-density, high-precision point cloud data. The dataset was collected in complex traffic scenes in China and is suitable for tasks such as lane detection, localization, trajectory prediction, object detection, object tracking, binocular vision, and scene recognition. Similar to the KITTI dataset, it is divided into Easy, Moderate, and Hard subsets.
H3D [34]: Honda provides a point cloud object detection dataset for autonomous driving scenarios. The data are sourced from the HDD dataset, a large-scale naturalistic driving dataset collected in the San Francisco Bay Area. H3D encompasses a complete 360-degree LiDAR dataset (dense point clouds from a Velodyne-64) with 1,071,302 3D bounding box labels. The dataset is time-series-informed, manually annotated at 2 Hz, and linearly propagated to 10 Hz.
Cityscapes [35]: Cityscapes is a large database focused on semantic understanding of urban street scenes. It provides semantic, instance, and dense pixel annotations for 30 classes grouped into 8 categories (flat, human, vehicle, construction, object, nature, sky, and void). The dataset consists of approximately 5000 finely annotated images and 20,000 coarsely annotated images. Data were captured in 50 cities over several months, during daylight and good weather conditions. It was originally recorded as video, so frames were manually selected to have the following features: a large number of dynamic objects, varying scene layouts, and varying backgrounds.
nuScenes [9]: The nuScenes dataset comprises 1000 scenes, each lasting 20 s and featuring various scenarios. Within each scene, there are 40 keyframes, equivalent to 2 keyframes per second; the remaining frames are sweeps. The keyframes are manually annotated with bounding boxes, providing annotations for size, range, category, visibility, and other attributes. A teaser version of the dataset containing 100 scenes was released first, while the full version with 1000 scenes was released in 2019.
Waymo Open [36]: Waymo, the self-driving company owned by Alphabet, introduced its open data project on 21 August 2019. The Waymo dataset includes 3000 driving sessions totaling 16.7 h, averaging about 20 s per session. It encompasses 600,000 frames, 25 million 3D bounding boxes, 22 million 2D bounding boxes, and a variety of autonomous driving scenarios.
ONCE [37]: The ONCE (One Million Scenes) dataset, released by Huawei in 2021, is a 3D object detection dataset focused on autonomous driving scenarios. The dataset contains 1 million LiDAR scenes and 7 million corresponding camera images from 144 h of driving records, which is 20 times longer than other 3D autonomous driving datasets such as nuScenes and Waymo, and it spans 200 square kilometers of driving area across different regions, times, and weather conditions. The ONCE dataset contains 15,000 fully annotated scenes with five categories (car, bus, truck, pedestrian, and cyclist), collected in a diverse set of environments, including day/night, sunny/rainy, and urban/suburban.
PandaSet [38]: This autonomous driving dataset was acquired in San Francisco and comprises 48,000 camera images and 16,000 LiDAR scans. It includes more than 100 scenes, each lasting 8 s, with 28 annotation classes and 37 semantic segmentation labels. The dataset is a collaboration between industry and academia, focusing on object detection in autonomous driving scenes.

2.4. Evaluation Metrics

In order to conduct a quantitative performance analysis of 3D object detection methods, researchers have proposed a variety of evaluation metrics, covering both computational and spatial complexity. As with 2D object detection, average precision constitutes the main evaluation metric used in 3D object detection. Individual datasets, such as KITTI and nuScenes, have partially modified the original definition, resulting in dataset-specific evaluation metrics. Here, we first review the original average precision (AP) metric. Datasets typically use the following two metrics to indicate the performance of an algorithm, namely Precision and Recall, calculated as shown in Equations (3) and (4):
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
where TP is the number of correctly identified positive samples, FP is the number of negative samples incorrectly identified as positive, and FN is the number of positive samples incorrectly identified as negative. Other evaluation metrics commonly used by researchers today are the Intersection over Union (IoU), AP, and the mean Average Precision over all categories (mAP). The most commonly used one, i.e., the IoU between the ground truth box A and the estimated 3D bounding box B, is shown in Equation (5):
$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
In addition to this, some large datasets have specific metrics. In the KITTI dataset [15], the evaluation metrics include the average precision of 2D and 3D object detection and the Average Orientation Similarity (AOS), which evaluates the prediction of the object orientation angle. The AP calculation for 3D objects in KITTI is based on projecting the 3D detection box onto image coordinates and calculating the AP value from the IoU. In general, for vehicles, a predicted box is considered accurate if the overlap between the 3D detection box and the ground-truth 3D box exceeds 0.7, while for the pedestrian and cyclist categories the IoU threshold is set to 0.5.
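As a worked illustration of Equations (3)-(5) and the KITTI-style IoU thresholds, the sketch below computes the IoU of axis-aligned 3D boxes and derives precision and recall from a greedy matching at a single threshold. Real benchmarks use rotated 3D boxes and ranked, score-based matching (e.g., the official KITTI devkit), so this is only a simplified, hypothetical example.

```python
import numpy as np


def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (x_min, y_min, z_min, x_max, y_max, z_max).
    Real benchmarks such as KITTI use rotated boxes; the axis-aligned form is for illustration."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)            # Equation (5)


def precision_recall(detections, ground_truths, iou_thresh=0.7):
    """Greedy one-to-one matching at a single IoU threshold (0.7 cars, 0.5 pedestrians/cyclists)."""
    matched, tp = set(), 0
    for det in detections:
        ious = [iou_3d_axis_aligned(det, gt) for gt in ground_truths]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh and best not in matched:
            tp += 1
            matched.add(best)
    fp = len(detections) - tp
    fn = len(ground_truths) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0    # Equation (3)
    recall = tp / (tp + fn) if tp + fn else 0.0       # Equation (4)
    return precision, recall
```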
In the nuScenes dataset [9], the detection task requires 3D boxes for 10 object classes, including cars, trucks, buses, pedestrians, and bicycles. Three evaluation metrics are proposed: the mean Average Precision (mAP), as shown in Equation (6); the True Positive metrics (mTP); and the nuScenes Detection Score (NDS). AP is defined by thresholding the 2D center distance on the ground plane rather than the IoU. In addition, nuScenes calculates AP as the normalized area under the precision-recall curve, considering only points where recall and precision exceed 10%. Finally, the mean Average Precision (mAP) is computed over the matching thresholds $\mathbb{D} = \{0.5, 1, 2, 4\}$ m and the set of classes $\mathbb{C}$, as shown in Equation (6):
$\mathrm{mAP} = \frac{1}{|\mathbb{C}|\,|\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} \mathrm{AP}_{c,d}$
In addition to AP, a set of TP metrics is defined for each prediction box matched to a ground-truth box. All TP metrics are calculated using a center-distance threshold of 2 m during matching and are designed to be positive scalars. Matching and scoring are done independently for each category. Each metric is the cumulative mean over all cases where a recall level of 10% or more is achieved; if the recall level for a given category does not reach 10%, all TP errors for that category are set to 1. For each TP metric, a mean TP metric (mTP) is also calculated over all classes. The nuScenes Detection Score (NDS) was introduced because a thresholded mAP alone does not capture all aspects of the nuScenes detection task, such as velocity and attribute estimation; nuScenes therefore proposes combining the different error types into a single scalar score, as shown in Equation (7):
$\mathrm{NDS} = \frac{1}{10}\left[5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \left(1 - \min(1, \mathrm{mTP})\right)\right]$
where mAP is the mean Average Precision and $\mathbb{TP}$ is the set of the five mean True Positive metrics: (1) Average Translation Error (ATE); (2) Average Scale Error (ASE); (3) Average Orientation Error (AOE); (4) Average Velocity Error (AVE); and (5) Average Attribute Error (AAE). The ATE is the Euclidean center distance in 2D (in meters). The ASE is 1 minus the 3D IoU after aligning orientation and translation (1 − IoU). The AOE is the smallest yaw angle difference between prediction and ground truth (in radians); all angles are measured on a full 360° period, except for barriers, which are measured on a 180° period. The AVE is the absolute velocity error, i.e., the L2 norm of the 2D velocity difference (m/s). The AAE is defined as 1 minus the attribute classification accuracy (1 − acc). For each TP metric, the mean TP metric (mTP) is computed over all classes, where $\mathbb{C}$ denotes the set of object classes, as shown in Equation (8):
$\mathrm{mTP} = \frac{1}{|\mathbb{C}|} \sum_{c \in \mathbb{C}} \mathrm{TP}_c$
In summary, the NDS is a composite metric that combines the attributes of predicted object location, size, orientation, and velocity. Half of the NDS is based on detection performance (mAP), while the other half quantifies the quality of the detections in terms of the position, size, orientation, attributes, and velocity of the 3D candidate boxes. Since mAVE, mAOE, and mATE can exceed 1, each metric is bounded between 0 and 1 in Equation (7) through the min(1, ·) term.
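As a quick numerical illustration of Equation (7), the following small Python function combines a given mAP with the five mean TP errors. It is a sketch of the formula itself, not the official nuScenes devkit implementation, and the example values are invented.

```python
def nuscenes_nds(mAP: float, mtp_errors: dict) -> float:
    """Composite NDS from Equation (7): half detection quality (mAP), half TP-error quality.

    mtp_errors holds the five mean TP errors, e.g.
    {"mATE": ..., "mASE": ..., "mAOE": ..., "mAVE": ..., "mAAE": ...}.
    Each error is clipped to [0, 1] before being turned into a quality term (1 - error).
    """
    tp_quality = sum(1.0 - min(1.0, err) for err in mtp_errors.values())
    return (5.0 * mAP + tp_quality) / 10.0


# Invented example values for illustration only:
print(nuscenes_nds(0.40, {"mATE": 0.60, "mASE": 0.30, "mAOE": 0.45,
                          "mAVE": 1.20, "mAAE": 0.20}))
```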

3. 3D Object Detection Methods

3.1. Search Strategy

To ensure a thorough review of the relevant literature, we conducted searches in the Google Scholar, IEEE Xplore, and Web of Science databases, using keywords such as “autonomous driving”, “3D object detection”, and “Transformer”. The search was limited to the last three years to ensure that the latest research advances are covered, and we included only peer-reviewed journal articles and conference papers written in English. The selection process was divided into three stages. First, the titles and abstracts of all retrieved articles were screened to exclude obviously irrelevant studies, articles whose full text was unavailable, and those with language problems. Second, the full texts of the remaining articles were reviewed to further determine whether they met the inclusion and exclusion criteria. Finally, the articles meeting the criteria were selected. Through this process, we included a total of 96 articles. These steps ensure that our review is systematic and comprehensive. Figure 4 details how we screened the methods in this review.
The methods for 3D object detection can be broadly categorized into two schemes [39]. The first approach involves transforming a 3D stereo image into a 2D image, where 2D object detection methods play a pivotal role in achieving the task. Notable examples include techniques like VoleFCN [40] and MV3D [41]. The second scheme entails converting a monocular or multi-view image into a 3D spatial representation through inter-camera calibration. This approach encompasses methodologies like BEVDepth [42] and BEVDet [43]. Recently, innovative methods based on Transformer have attracted widespread attention. We categorize these innovative methods into three categories: image-based, point cloud-based, and multi-modal-based. The classification methods and the interrelationships between corresponding techniques are briefly outlined in Figure 5. For more detailed insights, Table 3 provides concise explanations of key innovation points, advantages, and method categories for some representative articles. Further details on specific methods will be presented in Section 3.2, Section 3.3 and Section 3.4. Methods based on Transformer not only improve the performance of 3D object detection, but also provide new possibilities for cross-modal information fusion.

3.2. Image-Based 3D Object Detection

Image-based 3D object detection methods can be mainly divided into three categories: one based on monocular images, one based on stereo images, and one based on multi-view images. As shown in Figure 6, the method based on monocular, stereo, and multi-view images takes one, one pair of left and right, and one set of images as inputs, respectively, and then regresses the 3D bounding box of the object through the neural network. However, stereo-based methods require more effective capture of local features in the image, and Transformer has yet to be widely applied in this direction. Therefore, this article categorizes image-based methods into two categories: monocular based and multi-view based. In addition, due to the lack of depth information, image-based methods often adopt Bird’s Eye View (BEV) perception for 3D object detection [12]. Image-based BEV perception involves using cameras to capture images for perceiving and comprehending the surrounding environment of a vehicle. This approach is a prevalent perception method in domains like autonomous driving, offering advantages such as a broader perspective, enhanced recognition capabilities, lower costs, and wider applicability. We provide a detailed introduction to these methods below.

3.2.1. Monocular 3D Object Detection

In the early stages, monocular 3D detection essentially combined 2D object detection with pose or orientation estimation. Researchers viewed monocular 3D object detection as a prediction process that included 2D object detection along with other crucial parameters, particularly depth information. These approaches were typically grouped into three primary categories. The first category involved using pre-trained monocular depth estimation models, as demonstrated by techniques like Pseudo-LiDAR [52] and CADDN [53]. The second category utilized LiDAR data as a guiding source, with the model learning the 3D structural characteristics of objects; notable examples include MonoPSR [54] and MonoRUn [55]. The third category focused on direct regression, using geometric information as a guide; approaches in this category include MonoPair [56] and MonoFlex [57]. Nowadays, Transformer-based monocular 3D object detection still relies heavily on prior information, such as depth and height information. In addition, the Transformer is also used to enhance feature representation.
Enhanced feature representation. As a pioneering method, PYVA [58] used Cross-attention for monocular BEV perception. The method devised a Cross-attention decoder to facilitate view transformation, enabling image features to be lifted into the BEV space.
Chitta et al. proposed NEAT [59], an end-to-end 3D object detection network tailored for autonomous driving. This method leverages Transformers to enhance features within the image feature space before transitioning them into the BEV domain. It uses an iterative attention mechanism based on MLPs in the encoder segment and imitates the structure of Cross-attention in the neural attention domain. By utilizing MLPs, attention maps are generated for a given output location (x, y), matching the spatial dimensions of the input feature image. The attention graph is employed to perform dot product operations on the original image features, producing the BEV features corresponding to a given output location. The output from the NEAT module is mapped onto the BEV feature space by traversing all possible BEV grid locations.
Can et al. introduced the STSU [60] approach, building on the DETR architecture. It utilizes sparse queries for object detection, enabling the simultaneous detection of dynamic objects and static road layouts. Two sets of queries are employed within the Transformer section: one for lanes and another for objects. Similarly, two branches are tasked with processing the Transformer output. One branch predicts the centrelines and correlates with active centrelines, while the other one predicts dynamic objects. This work extends the scope to extract BEV coordinate representations of local road networks from a single in-vehicle camera image, facilitating the detection of dynamic objects in the BEV plane.
Depth information. Monocular 3D object detection is commonly initiated by pinpointing the 2D centroid coordinates of the object in the image. Subsequently, CNNs are used to extract local features around the 2D centroid for predicting 3D attributes. MonoDETR [44] diverges from this approach. It incorporates a distinctive depth-guided framework, leveraging a pre-predicted foreground depth map as a guiding signal, and the Transformer guides each object in adaptively extracting its region of interest, aiding subsequent 3D feature extraction. This methodology extracts features from the global receptive field for each query, affording a superior global understanding of the scene space and achieving top-tier detection accuracy on both the KITTI and nuScenes datasets. Huang et al. observe that certain existing methods that utilize depth information run the risk of learning 3D object detection from inaccurate depth maps, in addition to the added computational burden. Their proposed MonoDTR [61] introduces a depth-aware feature enhancement module and depth positional encoding. This approach employs auxiliary depth supervision to acquire depth-aware features, mitigating reliance on potentially inaccurate a priori depth information. The Cross-attention module effectively models the relationship between contextual features and depth-aware features, resulting in enhanced performance.
Height information. Wu et al. concentrate on addressing the height issue within the BEV space and propose HeightFormer [45]. Instead of pursuing more precise depth learning in image space, the method refines the height layer by layer using a self-recursive approach, mitigating potentially misleading cues from the background through query masks. This method achieves state-of-the-art performance on the nuScenes dataset. Alternatively, the true height can be obtained directly from the annotated 3D bounding boxes without additional depth data or sensors. Overall, this approach does not rely on additional modal data and exhibits strong applicability for 3D detection based on monocular images.

3.2.2. Multi-View Based 3D Object Detection

Multi-view image-based 3D object detection relies heavily on BEV perception as its primary paradigm. Unlike monocular methods, multi-view methods often take six images around the vehicle as input to the model. Constructing BEV views from surround-view images has become a prominent research direction in autonomous driving. Initially, Philion et al. introduced LSS [62] as a first attempt to model BEV features from surround-view images, achieved by estimating pixel depths and projecting them into the BEV feature space using intrinsic and extrinsic camera parameters. Object detection in the BEV feature space then became relatively straightforward, establishing a pivotal concept for BEV perception. The subsequent work BEVDet [43], proposed by Huang et al., followed a similar approach but relied solely on the perspective transformation from LSS when constructing BEV features, leading to higher computational complexity. Transformer-based methods, in contrast, can effectively lift features from the 2D image plane into 3D space through mechanisms such as positional encoding. In addition, the Transformer can effectively perform temporal and spatial modeling, greatly improving model performance. Table 4 illustrates the comparison between LSS-based and Transformer-based methods.
In 3D object detection, a dense feature representation may not be advantageous as the output typically involves a small number of candidate frames [60]. This realization led to sparse BEV feature representation, made possible by incorporating the Transformer. Wang et al. presented DETR3D [46], utilizing the Transformer for feature extraction within the BEV feature space. The decoder component of the network was based on the Deformable DETR, leveraging 3D space points to interact with 2D image features. This integration allowed for constructing a BEV space with sparse feature representation, effectively reducing the computational complexity of operations.
Position encoding. Recognizing that DETR3D primarily learns 2D image features and may not capture global features comprehensively, Liu et al. introduced PETR [47]. This approach involves adding 3D position encoding to the encoder part and integrating it with 2D image features. The subsequent interaction with the 3D query for features exploits global features, simplifying the network model while enhancing both performance and inference speed. PETRv2 [64] extends the approach by incorporating temporal information. The resulting spatial-temporal feature fusion framework achieves optimal detection and semantic segmentation performance.
Temporal modeling. Li et al. proposed BEVFormer [8], also an extension of the DETR3D framework. This model introduces a temporal Self-attention module and a spatial Cross-attention module, facilitating interactions between 3D features in the BEV space and both 3D and image features in these modules. The outcome is a BEV space enriched with comprehensive spatio-temporal features. In addition, Qin et al. presented UniFormer (now known as UniFusion) [65], which differs from BEVFormer’s temporal fusion approach. UniFormer directly transforms historical frames into the current frame and extracts BEV features concurrently with all historical images. This unified fusion of multiple frames in time and multiple views in space avoids the wastage of features.
BEVFormer v2 [63] redesigned the encoder segment of the temporal fusion module to concatenate multi-frame historical BEV features, aligned to the current moment, along the channel dimension, amplifying the feature representation in the temporal domain.
Qi et al. observe that BEVFormer only performs scene-level feature alignment when incorporating historical BEV information, potentially overlooking fast-moving foreground objects. To address this, their Object-Centric BEVFormer (OCBEV) [66] aligns features and integrates objects in both the temporal and spatial dimensions. Additionally, to tackle the uniform height sampling issue in BEVFormer, it devises an adaptive height range for the 3D reference points and introduces a heatmap after the encoder for supervision. It integrates high-confidence queries into the decoder for position encoding, effectively providing object location information. The method demonstrated its effectiveness on the nuScenes benchmark, achieving comparable performance with only half the training iterations, i.e., faster convergence.
Wang et al. proposed StreamPETR [67], utilizing the object query as the temporal delivery medium. The timing propagation is executed through the attention mechanism between the object queries of the anterior and posterior frames. This method emphasizes feature similarity between objects and exhibits robustness to the object’s location prior. StreamPETR achieved a milestone by surpassing CenterPoint [68] without relying on future frames. Moreover, due to the efficiency of object query temporal transmission, StreamPETR introduces minimal extra computation compared to the single-frame PETR [47], making it one of the fastest available BEV perception methods.
DA-BEV [69] further builds upon BEVFormer by incorporating two key modules: the Depth-Aware Spatial Cross-attention (DA-SCA) module and Depth-wise Contrastive Learning (DCL). The DA-SCA module addresses the issue of neglecting depth information in spatial Cross-attention by introducing depth encoding in the query and value. Additionally, DCL enhances the depth-awareness of BEV features by sampling positive and negative BEV features. Experimental results demonstrate that the DA-BEV method achieves state-of-the-art detection performance on the nuScenes dataset and addresses the issue of repeated predictions observed in BEVFormer.
Some studies replace the Cartesian representation of spatial objects with a polar coordinate system to improve detection performance. PolarDETR [70] employs a temporal feature fusion approach akin to BEVFormer; however, in PolarDETR, the object query representing the object is fused instead of the BEV query. Additionally, the BEV features and the features related to object location are transformed from the Cartesian coordinate system to a polar coordinate system, improving the model’s training efficiency. PolarFormer [71] designed a Cross-attention-based polar detection head capable of handling irregular polar meshes without being constrained by the shape of the input structure. Furthermore, a multi-scale polar representation learning strategy was implemented to address unconstrained object scale variations along the polar distance dimension. The model optimally leverages the polar representation derived from the rasterization of corresponding image observations by utilizing a sequence-to-sequence approach under geometric constraints. Extensive experiments on the nuScenes dataset demonstrated that PolarFormer outperforms state-of-the-art 3D object detection methods and exhibits competitive performance in BEV semantic segmentation tasks.
With the increasing prevalence of autonomous driving and unmanned aerial vehicles, BEV perception based on pure vision has become a prominent area of current research. Transformers have proven to be effective for feature extraction and enabled finer and higher-resolution BEV feature spaces through their integration for processing and fusing image features. The advent of Transformers has opened up new avenues for image-based 3D object detection.

3.3. Point Cloud-Based 3D Object Detection

Point cloud-based 3D object detection methods fall into three categories: voxel-based, point-based, and hybrid feature-fusion-based. We adopt this partitioning but divide the methods here into point-based and voxel-based, and defer the feature-fusion methods to the multi-modal-based methods discussed later. Point-based approaches use the collected point cloud data directly as the network input, while voxel-based methods represent points through a voxel grid, as shown in Figure 7. In the early stages of point cloud-based object detection, researchers often utilized PointNet++ [72], proposed by Qi et al., and the Voxel Feature Encoding (VFE) method in VoxelNet [73] as backbone networks for feature extraction from raw point clouds. Methods built upon these backbone networks achieved significant success in real-time processing and accuracy. For example, Lang et al. introduced PointPillars [74], implemented on TensorRT, achieving an impressive inference speed of 42.4 Hz. Shi et al. also presented PointRCNN [75], which for a period demonstrated exceptionally high performance on the KITTI dataset. However, in 2020, Carion et al. introduced DETR [5], the first Transformer-based 2D object detection model, proposing a combination of Transformer and CNN to replace non-maximum suppression (NMS). Subsequently, Transformer-related approaches have been integrated into the point cloud-based 3D object detection domain.

3.3.1. Point-Based

Many point cloud-based methods use VoteNet as the baseline model and introduce Transformer for indoor 3D object detection. However, the ideas of these methods can still be applied to 3D object detection for autonomous driving.
Building on the VoteNet [7] proposed by Qi et al., Xie et al. were the first to introduce the Transformer’s Self-attention mechanism into 3D object detection for indoor scenes, proposing the Multi-Level Context VoteNet (MLCVNet) [6] to improve detection performance by encoding contextual information. In MLCVNet, each patch and voting cluster is treated as a token in the Transformer; the Self-attention mechanism is then used to enhance the corresponding feature representations by capturing the relationships within the point patches and voting clusters. Meanwhile, Chen et al. proposed PQ-Transformer [76], also based on VoteNet, for detecting 3D objects; it uses a Transformer decoder to enhance the feature representation of 3D bounding boxes.
Both methods employ extensive manual design in group clustering, where features of candidate objects are obtained by learning from points within the corresponding local regions. However, Liu et al. argue that point grouping operations within a limited region hinder the performance of 3D object detection [77]. Therefore, Liu et al. proposed a Group-free framework. The core idea is that the features of a candidate object should come from all points in a given scene rather than a subset of the point cloud. After obtaining the candidate objects, their approach first uses a Self-attention module to capture contextual information between the candidate objects. A Cross-attention module is then devised, utilizing information from all points to refine the features of the object.
Also inspired by DETR in 2D object detection, Misra et al. first proposed a Transformer-based end-to-end 3D object detection network called 3DETR [48], which views 3D object detection as a Set-to-Set problem. Drawing on ideas from DETR and VoteNet, 3DETR is designed as a general encoder and decoder. In the encoder part, the sampled points extracted by the MLP and the corresponding features are fed directly into a Transformer block for feature optimization. In the decoder section, these features are transformed into a collection of candidate object features by a parallel Transformer decoder. These candidate object features are ultimately used to predict 3D bounding boxes.
Similarly, He et al. view 3D object detection as a Set-to-Set problem. Their Voxel Set Attention (VSA) module can model long-term dependencies from arbitrarily sized sets of points, bypassing the limitations of current group-based and convolution-based attention modules [31]. Based on the Set-Transformer [78], the VSA module effectively reduces computational complexity by introducing the Induced Set Attention Blocks (ISAB) of the Set-Transformer, using their designed latent codes (whose number is a hyperparameter) as the query in Cross-attention. Moreover, the VSA module can serve as a backbone network and replace the backbone in other models.
In addition to the previously discussed approaches for indoor scenes, Sheng et al. introduced an object detection framework called CT3D [79], tailored for enhancing 3D object detection in outdoor point clouds. This framework is anchored on the Channel-wise Transformer and operates in a two-stage fashion. Initially, the Channel-wise Transformer receives input from a Region Proposal Network. Subsequently, the Transformer network is divided into two key sub-modules: proposal-to-point encoding and channel decoding modules. The encoding module takes candidate frames and their corresponding 3D points as input and employs a Self-attention mechanism-based module to extract optimized point features. The channel decoding module refines the features obtained from the encoder module into a global representation through channel re-weighting. Ultimately, a feed-forward network is deployed for detection. These enhancements enabled CT3D to achieve an impressive AP of 81.77 in the moderately challenging automotive category of the KITTI test set.
While in the previously discussed methods the Transformer is mainly used to enhance point cloud features, Pan et al. proposed Pointformer [80] as a backbone network specifically designed for point clouds. Specifically, a Local Transformer module models the interactions between points in a local region to learn object-level, context-dependent regional features, while a Global Transformer learns context-aware representations at the scene level. In addition, a Local-Global Transformer captures the dependencies between multi-scale representations via Cross-attention. By replacing the backbone network in PointRCNN with Pointformer, the authors also achieved some improvement in performance on the KITTI dataset.

3.3.2. Voxel-Based

Voxel-based methods draw more on the experience of image object detection. Liu et al. introduced TANet [81] to address the robustness challenge in 3D object detection. The Triple Attention (TA) module is a key component, jointly considering channel-wise, point-wise, and voxel-wise feature expressions: channel-wise attention evaluates the importance of individual channels within each voxel, point-wise attention assesses the significance of specific points within a voxel, and voxel-wise attention identifies crucial meshes among all voxel meshes. TANet enhances crucial object information while suppressing unstable point cloud elements, contributing to overall detection robustness.
Fan et al. identified that the down-sampling operation used during feature extraction in previous 3D object detection approaches can lead to significant information loss and hurt detection performance. With the Transformer, however, down-sampling becomes unnecessary, and the attention mechanism proves highly effective for unstructured data such as point clouds. Building upon this insight, Fan et al. proposed SST [82]. The method treats columnar voxels of the point cloud as tokens fed into the Transformer, groups neighboring tokens into local windows, and applies Self-attention to the tokens within each group, yielding rich local features by aggregating information within these groups. SST achieves state-of-the-art performance on the Waymo dataset.
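As a rough illustration of this grouping idea (not SST's actual implementation), the sketch below partitions sparse voxel tokens into BEV windows and applies Self-attention only within each window; the window size, coordinate layout, and class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class WindowedSelfAttention(nn.Module):
    """Sketch: partition sparse voxel tokens into BEV windows and attend within each window,
    so attention cost scales with window size rather than the total number of voxels."""
    def __init__(self, dim=128, num_heads=4, window_size=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window_size = window_size

    def forward(self, tokens, coords):
        # tokens: (N, dim) features of non-empty voxels; coords: (N, 2) integer BEV indices
        win = coords // self.window_size
        key = win[:, 0] * 1_000_000 + win[:, 1]     # one id per window (assumes < 1e6 columns)
        out = tokens.clone()
        for w in torch.unique(key):
            idx = (key == w).nonzero(as_tuple=True)[0]
            group = tokens[idx].unsqueeze(0)         # (1, n_w, dim) tokens of one window
            refined, _ = self.attn(group, group, group)
            out[idx] = refined.squeeze(0)
        return out                                   # (N, dim) locally refined voxel features
```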
The query-based Transformer has shown excellent power for constructing long-range attention in computer vision tasks. However, because point cloud data are so large, this approach has rarely been explored in LiDAR-based 3D object detection. Zhou et al. proposed CenterFormer [49], a center-based Transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder, and the features of these center candidates are then embedded into the Transformer as queries. A deformable Cross-attention fusion method is designed to further aggregate features from multiple frames. Compared with the earlier CNN-only CenterPoint [68], CenterFormer achieves state-of-the-art performance on the Waymo Open Dataset test set.
Pointformer is a backbone network designed for raw point cloud input, whereas Mao et al. proposed VoTr [83] (Voxel Transformer), the first voxel-based Transformer backbone network for voxel-based 3D object detectors. They argue that previous voxel-based 3D convolutional backbone networks do not capture enough contextual information and that a Self-attention mechanism can model long-range voxel-to-voxel relationships. Specifically, two Transformer-based modules are proposed: a sparse voxel module and a submanifold voxel module, which handle empty and non-empty voxels under the sparse representation, respectively. Within these modules, local attention and dilated attention are designed to fuse local and long-range features, expanding the receptive field while reducing computation. In addition, a fast hash-based lookup table, Fast Voxel Query, is constructed to quickly retrieve the attending voxels for the multi-head attention structure and keep inference fast. The authors replaced the backbone networks of SECOND [84] and PV-RCNN [85] with VoTr and validated the method on the Waymo Open Dataset and the KITTI dataset.
Methods based on raw point clouds usually handle large-scale point cloud data through point sets, while voxel-based methods usually take voxel blocks as tokens. Both families rely on the Transformer as the main feature extractor, but with a difference in emphasis: raw-point-based methods mainly leverage the Transformer to optimize computation over unordered points, whereas voxel-based methods operate on sparse feature representations and depend on the Transformer's long-range modeling capability.

3.4. Multi-Modal-Based 3D Object Detection

Camera and LiDAR are the two most common sensors for 3D object detection in autonomous driving. The camera provides color information from which rich semantic features can be extracted, while the LiDAR sensor provides rich 3D structural information. Researchers have therefore fused the features of these two modalities to achieve accurate 3D object detection. The traditional fusion directions, LiDAR-to-Camera and Camera-to-LiDAR, have distinct advantages and drawbacks: LiDAR-to-Camera projection may sacrifice LiDAR's geometric structure information, while Camera-to-LiDAR projection may lose the camera's semantic information, making it less suitable for tasks such as semantic segmentation. Multi-modal 3D object detection methods therefore integrate features from point clouds and RGB images and use the fused features for detection. Such methods commonly start from two directions: integrating image features into point cloud-based methods and incorporating the geometric features of point clouds into BEV-based frameworks (Feature fusion and Alignment map below). In addition, some methods explicitly account for sensor characteristics (Sensor related); these methods can handle complex weather conditions and long sensing ranges, which is crucial for the practical application of autonomous driving. Other methods build on earlier multi-view 3D object detection approaches and introduce point cloud features on top of them. A chronological overview of the multi-modal-based 3D detection methods is shown in Figure 8.
Feature fusion. The difference between Transformer-based multi-modal 3D object detection methods and previous methods lies in how features are fused: Transformer-based methods let point cloud features and image features interact through Cross-attention. Many multi-modal 3D object detection methods exploit this fully, though with slight differences in how the fusion is performed. Based on DETR, Bai et al. introduced TransFusion [86], a Transformer-based 3D object detection model that seamlessly merges information from LiDAR and cameras. TransFusion employs an attention mechanism to dynamically fuse image features, circumventing the rigid association between LiDAR points and image pixels that would otherwise be fixed by the calibration matrix.
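A schematic of this kind of Cross-attention fusion might look as follows: object queries derived from the LiDAR branch attend to flattened image features while keeping a residual path back to the LiDAR evidence. All names and dimensions here are chosen for illustration and are not taken from TransFusion.

```python
import torch
import torch.nn as nn

class LidarCameraCrossAttention(nn.Module):
    """Sketch: object queries from the LiDAR branch attend to image features, letting the
    network learn soft LiDAR-image associations instead of relying only on the hard
    projection given by the calibration matrix."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, object_queries, image_feats):
        # object_queries: (B, Q, dim) from the LiDAR branch; image_feats: (B, H*W, dim)
        fused, _ = self.cross_attn(object_queries, image_feats, image_feats)
        return self.norm(object_queries + fused)   # residual connection keeps the LiDAR evidence
```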
Li et al. argue that fusing camera and LiDAR features, as opposed to raw points, leads to improved performance [87]. They emphasize the importance of feature alignment and address the data augmentation challenges inherent in multi-modal fusion. Their solution, DeepFusion, incorporates the LearnableAlign and InverseAug modules to tackle these issues: LearnableAlign employs a Cross-attention mechanism to dynamically capture correlations between LiDAR and camera features, while InverseAug reverses geometry-related augmentations so that accurate cross-modal alignment is preserved. The approach achieves state-of-the-art performance on the Waymo Open Dataset and consistently improves all single-modal detection baselines, highlighting the general applicability of DeepFusion to various 3D object detection frameworks.
Similarly, CAT-Det [88] aims to enhance the fusion of LiDAR point clouds and RGB image features, addressing deficiencies in multi-modal fusion and the need for effective multi-modal data augmentation. The method introduces a point cloud branch and an image branch to extract features from each modality, and a cross-modal Transformer branch then amalgamates the features from these two branches. CAT-Det achieved 67.05 mAP on the KITTI test set, a significant advance over LiDAR-only models.
Alignment map. In addition to utilizing Cross-attention for feature interaction, some researchers model the mapping relationship between images and point clouds with a learnable alignment map. Chen et al. propose AutoAlign [50], a Transformer-based automatic feature fusion strategy. They employ a learnable alignment map to model the mapping between images and point clouds, rather than establishing a deterministic correspondence through the camera projection matrix. The alignment map allows the model to align non-homogeneous features automatically in a dynamic, data-driven manner. A Cross-attention Feature Alignment module is designed to adaptively aggregate pixel-level image features for each voxel, and a Self-supervised cross-modal feature interaction module is introduced to preserve semantic consistency during alignment by guiding feature aggregation with instance-level features. When integrated into the PointPillars model, the AutoAlign module yielded modest yet tangible improvements, as validated on the KITTI dataset.
AutoAlign suffers from the high computational cost introduced by its global attention mechanism. Building on AutoAlign, the authors therefore propose AutoAlign V2 [89], which replaces global attention with a deformable variant of the Cross-attention Feature Alignment module that attends to a small set of sparse, learnable sampling points for cross-modal relation modeling. This greatly reduces the number of sampling candidates and dynamically determines the key regions in the image plane for each voxel query feature, accelerating cross-modal feature aggregation. In addition, the authors devise a cross-modal data augmentation method called depth-aware GT-AUG, which forgoes complex point cloud filtering or fine-grained mask annotation in the image domain and instead introduces depth information from the 3D object annotations. In the overall structure, the voxelized point cloud feature maps are transformed into BEV space for prediction, yielding a 1.5% NDS improvement over the previous version on the nuScenes dataset.
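The sparse-sampling idea can be sketched as follows (a single-head, single-level simplification under assumed tensor shapes, not the AutoAlign V2 code): each voxel query predicts a few 2D offsets around its projected image location, bilinearly samples the image feature map there, and combines the samples with learned weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseDeformableSampling(nn.Module):
    """Sketch of deformable-style cross-modal aggregation: a handful of learned sampling
    points per voxel query replaces dense global attention over the whole image."""
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.offset = nn.Linear(dim, num_points * 2)   # (dx, dy) per sampling point
        self.weight = nn.Linear(dim, num_points)       # one scalar weight per sampling point
        self.num_points = num_points

    def forward(self, query, ref_xy, image_feats):
        # query: (B, Q, dim); ref_xy: (B, Q, 2) projected locations in [-1, 1]; image_feats: (B, dim, H, W)
        B, Q, _ = query.shape
        offsets = self.offset(query).view(B, Q, self.num_points, 2).tanh() * 0.1
        weights = self.weight(query).softmax(dim=-1)                      # (B, Q, P)
        locs = (ref_xy.unsqueeze(2) + offsets).clamp(-1, 1)               # (B, Q, P, 2)
        sampled = F.grid_sample(image_feats, locs, align_corners=False)   # (B, dim, Q, P)
        sampled = sampled.permute(0, 2, 3, 1)                             # (B, Q, P, dim)
        return (weights.unsqueeze(-1) * sampled).sum(dim=2)               # (B, Q, dim) fused feature
```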
Sensor related. Beyond feature fusion, some researchers have provided feasible designs at the sensor level. Chen et al. introduce FUTR3D [90], a BEV feature fusion framework derived from DETR3D. It accommodates data from almost any sensor configuration, requiring only a distinct backbone network for each sensor's data. The 3D reference points are obtained through object queries, following the principles of DETR3D. This framework provides a straightforward and convenient solution for BEV-based feature fusion.
A multi-sensor approach can exploit multi-modal data effectively, but detection performance tends to degrade severely when a sensor fails. Ge et al. recognized the severity of multi-sensor failure and proposed MetaBEV [91], a framework for robust autonomous driving perception. The framework first processes signals from multiple sensors through modality-specific encoders, then initializes a dense set of BEV queries that are fed iteratively into a BEV-Evolving decoder, which selectively aggregates deep features from LiDAR, cameras, or both modalities. The updated BEV features can then be used for downstream tasks such as 3D object detection. Extensive experiments on the nuScenes dataset show that MetaBEV performs better than previous methods with both modalities.
Wang et al. argue that current 3D perception studies follow a modality-specific paradigm, incurring additional computational overhead and inefficient collaboration between different sensor data. They therefore propose UniTR [92], an efficient multi-modal backbone for outdoor 3D perception that handles multiple modalities through unified modeling and shared parameters. Unlike previous work, UniTR introduces a modality-agnostic Transformer encoder that processes sensor data with differing views, enabling parallel modality-wise representation learning and automatic cross-modal interaction without an additional fusion step. It establishes new state-of-the-art performance on the nuScenes benchmark, with +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation at lower inference latency.
Position encoding. On the basis of PETR, Yan et al. proposed a robust 3D detector named Cross Modal Transformer (CMT) [51] for end-to-end multi-modal 3D object detection. Instead of an explicit view transformation, the method takes image and point cloud tokens as input and directly outputs 3D bounding boxes. CMT lets the multi-modal features interact via position-guided queries. The core design of CMT is fairly simple, yet its performance is impressive: it achieved 74.1% NDS (state-of-the-art for a single model) on the nuScenes test set while maintaining fast inference, and it remains robust even in the absence of LiDAR.
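A minimal sketch of such position-guided fusion is shown below, assuming both token sets carry 3D coordinates (back-projected for image tokens, voxel centers for point cloud tokens); the module layout and names are illustrative and are not the CMT implementation.

```python
import torch
import torch.nn as nn

class CoordinateEncodedFusion(nn.Module):
    """Sketch of position-guided multi-modal fusion: image and point cloud tokens each receive
    an encoding of their 3D coordinates, are concatenated into one token set, and object
    queries decode boxes from it directly, with no explicit view transformation."""
    def __init__(self, dim=256, num_queries=900, num_heads=8):
        super().__init__()
        self.img_pos = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pts_pos = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, img_tokens, img_xyz, pts_tokens, pts_xyz):
        # img_tokens: (B, Ni, dim) with back-projected coords img_xyz: (B, Ni, 3)
        # pts_tokens: (B, Np, dim) with voxel-center coords pts_xyz: (B, Np, 3)
        tokens = torch.cat([img_tokens + self.img_pos(img_xyz),
                            pts_tokens + self.pts_pos(pts_xyz)], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.decoder(q, tokens)    # (B, num_queries, dim) -> box/class prediction heads
```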
Spatio-temporal fusion. Based on BEVFusion [93], Hu et al. designed a new feature fusion method called FusionFormer [94]. FusionFormer fuses LiDAR and image features sequentially via Deformable-attention and generates fused BEV features. Because it can sample from 2D image features and 3D voxel features simultaneously, it adapts flexibly to different modal inputs, so multi-modal features can be fed in their original form without the information loss incurred by first converting them to BEV features. During fusion encoding, the point cloud features serve as a depth reference for the perspective transformation of the image features, while the dense semantic features of the image complement the sparsity of the point cloud features, generating more accurate and dense fused BEV features. Notably, the multi-modal fusion encoder adopts a residual structure, which keeps the model robust when point cloud or image features are missing. In addition, FusionFormer supports temporal fusion of historical BEV features.
Figure 8. Chronological overview of the Multi-modal-based 3D object detection methods [50,51,86,87,88,89,90,91,92,94].
Previous multi-modal methods tend toward object-level feature fusion; however, compared with feature-level fusion, the gains achieved through object-level fusion are relatively small. Across these multi-modal approaches, the prevalent technique is to use Cross-attention to fuse features from different modalities: the Transformer is pivotal in facilitating feature interaction between multi-modal data, serving as a bridge for effective fusion. Table 5 summarizes the performance of selected methods on the nuScenes dataset, sorted by mAP in descending order.
As Table 5 shows, multi-modal methods clearly perform better on the dataset than single-modal methods; fusing multiple modalities can greatly improve the performance of network models. Research on multi-modal 3D object detection is therefore bound to become a hot topic in the future.
In addition, LiDAR-camera multi-modal methods are still the mainstream of current research. However, imaging radar is expected to replace LiDAR because of its weather robustness, long range, and low cost. Early works [95] have started to explore Radar-Camera fusion. RCBEVDet [96] encodes the aligned radar point cloud into radar BEV features and then improves detection performance by fusing image and radar BEV features through a cross-attention multi-layer fusion module; the method is shown to be robust to sensor failures. Radar-Camera fusion therefore remains a promising direction for future multi-modal approaches.

4. Challenges and Trends

4.1. Current Research Limitations

Transformer-based methods for 3D object detection are currently undergoing rapid advancements. The widespread integration of attention mechanisms has led to the development of numerous outstanding methods, consistently pushing the boundaries of accuracy in 3D object detection tasks across major datasets. However, it is crucial to recognize that ongoing research in this field still grapples with challenges, including Transformer structures, sensor limitations and inference performance.

4.1.1. Transformer Structure

A Transformer generally requires its input sequences to have equal length. Many Transformer-based point cloud 3D object detection methods therefore drop or pad points during the initial sampling stage and then feed the resulting sequence into the Transformer for the attention operation. This sampling discards many features, and the original feature expression of the point set is often distorted by the addition or deletion, so the algorithm's performance does not improve as expected. The second issue is the computational load of the Transformer: its computational and memory complexity are both O(N²), where N is the sequence length. Training a Transformer model is therefore demanding, especially for 3D object detection, whether the input is point cloud data or RGB images; the computing cost is substantial, and whether a model can be trained at all depends on the available computing resources. Some works have studied the computational complexity of the Transformer [97,98] and optimize the attention mechanism from different perspectives, but efficient attention for Transformer-based 3D object detection still requires further research.
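To make the quadratic cost concrete, the short calculation below estimates the memory of a single dense attention matrix for a few illustrative token counts, assuming fp32 storage and one attention head.

```python
def attention_matrix_gib(num_tokens: int, bytes_per_element: int = 4) -> float:
    """Memory of one dense N x N attention matrix in GiB (single head, single layer)."""
    return num_tokens ** 2 * bytes_per_element / (1024 ** 3)

# e.g. a few thousand image tokens vs. tens of thousands of non-empty voxels
for n in (4_000, 20_000, 100_000):
    print(f"N = {n}: {attention_matrix_gib(n):.2f} GiB per attention map")
# prints roughly 0.06, 1.49, and 37.25 GiB respectively
```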

4.1.2. Sensor Limitations

In 3D object detection for autonomous driving, sensor selection and performance have a crucial impact on the effectiveness and reliability of the system. As shown in Table 1, different sensors have different advantages and disadvantages. Point cloud-based 3D object detection methods face a limited detection range, leading to suboptimal performance for distant objects; although Transformers can improve the feature extraction of point sets, they struggle to capture spatial information about objects far from the sensor. Moreover, LiDAR performance degrades in extreme weather conditions. Concerns also persist for image-based detection methods, where factors such as lighting conditions and adverse weather can substantially impact image quality, and comprehensively retaining image feature information remains a challenge at the current stage. Although the Transformer can provide an effective feature space representation for image-based methods, such methods are still limited by the lack of depth information. Table 5 shows that, although the performance of image-based methods has been approaching that of multi-modal methods, a significant gap remains. How to comprehensively weigh the advantages and disadvantages of each sensor remains a focus of future research.

4.1.3. Inference Performance

In practical applications of 3D object detection, the model's inference capability on embedded devices is a crucial evaluation metric. Most algorithm models prioritize improving recognition accuracy on datasets, sometimes at the expense of inference performance after deployment, and the pursuit of higher accuracy often introduces additional inference overhead [99]. Even a brief delay of 400 ms in real-world scenarios could lead to an accident. Computational resources on the target device are often limited, making it challenging to replicate the inference speed achieved in the laboratory. Industry therefore places a pronounced focus on this aspect, emphasizing the importance of deployment and code optimization for overall model inference performance.
Moreover, in autonomous driving, edge devices play a significant role in real-time inference [100]. Edge devices can process data close to the data source, significantly reducing data transmission latency [101], which is crucial for real-time applications such as autonomous driving: low latency ensures that the system can respond quickly to external changes, improving safety and reliability. To allow edge devices to handle computationally heavy workloads, hardware acceleration for edge devices has received wide attention [102]; dedicated hardware, including the Graphics Processing Unit (GPU), Tensor Processing Unit (TPU), and Neural Processing Unit (NPU), speeds up the inference process. To adapt to the resource constraints of edge devices, researchers have also proposed a variety of model compression and optimization techniques, including pruning, quantization, and knowledge distillation [103]. These techniques significantly reduce the computation and storage requirements of a model so that complex deep learning models can run efficiently on edge devices.
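As a minimal illustration of one such technique, the sketch below applies PyTorch post-training dynamic quantization to a toy detection head; the layer sizes are placeholders, and whether this transfers to a full 3D detector depends on which operators the deployment runtime supports.

```python
import torch
import torch.nn as nn

# Toy detection head standing in for part of a 3D detector (illustrative only).
head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 7 + 10),   # e.g. 7 box parameters + 10 class scores
)

# Post-training dynamic quantization: weights stored in int8, activations quantized on the fly.
quantized_head = torch.ao.quantization.quantize_dynamic(
    head, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(head(x).shape, quantized_head(x).shape)   # identical output shapes, smaller weights
```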

4.2. Future Research Trends

This subsection compiles research hotspots that may accelerate discussions in the future. Currently, the following research directions have demonstrated significant potential, prompting an increasing number of researchers to explore these fields. In the near future, these research directions are anticipated to become hot topics in the field of 3D object detection.

4.2.1. Collaborative Perception

Collaborative perception refers to the concept of sharing information and cooperation among multiple sensing systems or entities to enhance the effectiveness of environmental perception [104]. In autonomous driving, collaborative perception typically involves multiple vehicles or sensors working together to improve the perception and understanding of the surrounding environment. This collaborative effort aims to overcome limitations that a single perception system might face, such as occlusion of distant objects or sparse data in certain areas [105]. By sharing perception data, the system can gain a more comprehensive understanding of the road conditions, improving detection and prediction capabilities for obstacles, traffic situations, and other critical information. The concept of connected vehicles in intelligent transportation gained traction before 2020, and subsequent advancements in computer vision and robotics have propelled cooperative perception to the forefront. In 2022, this approach gained increased attention, expanding beyond datasets and collaborative strategies to encompass a wider range of related issues, scenarios, and tasks. Cooperative perception utilizes multiple agents with limited perception capabilities to exchange information based on specific group cooperation strategies and defined communication parameters. This collaborative approach leads to more comprehensive and effective 3D object perception, addressing challenges posed by occlusion. As a result, cooperative perception holds significant promise for future development in the field.
Cooperative perception needs a unique feature cooperation method to deal with the heterogeneous and cross-modal features provided by multiple agents in multi-agent systems. Xu et al. proposed V2X-ViT [106] with the help of Transformer. Specifically, a holistic attention module is constructed, which can effectively fuse the information of vehicles and infrastructure. The model consists of an alternating layer of heterogeneous multi-agent Self-attention and multi-scale window Self-attention, which can capture the interaction among agents and the spatial relationship of each agent. In addition, these key modules are designed in a unified Transformer architecture to cope with common V2X challenges, and the algorithm can also show robust performance in harsh and noisy environments.
Zhang et al. [107] regard V2X perception as heterogeneous multi-agent cooperative sensing and use a specially designed Transformer based on multi-modal information to realize multi-agent cooperative feature fusion on dynamic heterogeneous relation graphs. Specifically, the complementary information of RGB images and LiDAR point clouds is used to reduce the errors of LiDAR point clouds, thereby improving detection performance.
Baidu introduced UniBEV, a vehicle-road integrated BEV solution based on collaborative perception. UniBEV integrates information from multiple cameras and sensors on the vehicle side using a Transformer model, and additionally incorporates information from multiple intersections and roadside sensors within the same model. Notably, UniBEV is positioned as the industry's first end-to-end perception solution that integrates vehicle and roadside information for enhanced perception capability.

4.2.2. Occupancy Prediction

Occupancy Networks [108] are a deep learning-based 3D shape modeling method, primarily employed to infer a 3D geometric representation of objects from input data such as point clouds or depth images. The core objective is to learn the probability distribution of the occupancy status of every 3D point in space, indicating whether that point is occupied. Through training, Occupancy Networks capture a high-dimensional representation of the input data and effectively reconstruct the spatial occupancy of objects, thereby modeling object shape. Occupancy prediction in autonomous driving stems from Tesla's improved scheme for addressing problems that previously arose in BEV perception: compared with earlier schemes, the representation of objects in BEV space changes, occlusion is handled from a vision-only perspective, and the network can run at more than 100 FPS.
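The occupancy function itself can be sketched as a small MLP that maps a 3D coordinate, conditioned on a latent scene feature, to an occupancy probability; this generic sketch illustrates the representation only and is not taken from any of the cited networks.

```python
import torch
import torch.nn as nn

class OccupancyFunction(nn.Module):
    """Implicit occupancy: f(x, y, z | scene code) -> probability that the point is occupied."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz, scene_code):
        # xyz: (B, N, 3) query points; scene_code: (B, feat_dim) latent scene feature
        code = scene_code.unsqueeze(1).expand(-1, xyz.size(1), -1)
        return torch.sigmoid(self.mlp(torch.cat([xyz, code], dim=-1)))  # (B, N, 1) in [0, 1]
```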
Following the practice of the Occupancy Network, Li et al. proposed VoxFormer [109], a framework for predicting a 3D semantic grid and semantic completion from a monocular image sequence. A Transformer learns 3D voxel features from 2D images, empty voxels are then distinguished according to depth information, and finally the representation of the 3D scene is obtained by querying the image features. Its geometric completion performance greatly surpasses prior advanced methods and has become an important basis for subsequent occupancy prediction. TPVFormer [110], proposed by Huang et al., also builds on the Occupancy Network idea and creates a new feature space that uses three orthogonal planes to express features and encode more detail; the overall scheme resembles the feature fusion method in BEVFormer. It is the first vision-only solution for LiDAR segmentation and can also reconstruct the basic structure of 3D space from RGB images, with performance comparable to LiDAR-based solutions, showing great potential.
Zhang et al. proposed OccFormer [111], a dual-path Transformer network that processes 3D volumes efficiently for semantic occupancy prediction. It encodes the camera-generated 3D voxels dynamically and efficiently by splitting the heavy 3D processing into local and global Transformer pathways along the horizontal plane. The occupancy decoder adapts Mask2Former [112] to 3D semantic occupancy by introducing preserve-pooling and class-guided sampling, which significantly alleviates sparsity and class imbalance. The method exceeds TPVFormer by 1.4% mIoU and yields more complete and realistic 3D semantic occupancy predictions.
In the context of occupancy prediction, the autonomous driving team of Shanghai Artificial Intelligence Laboratory, known as OpenDriveLab, organized the 3D Occupancy Prediction Challenge. Multiple university research teams were invited to participate in addressing this challenge. The collaborative efforts of these researchers led to significant enhancements in the performance of the baseline algorithm. The improvements achieved during this challenge serve as a foundational advancement and contribute to the ongoing development of occupancy prediction methodologies.

4.2.3. Large Model

A large model usually refers to a deep learning model with a very large number of parameters and high computational resource requirements. In future development trends, exploring large models holds a prominent position. Computer vision has traditionally relied heavily on annotated data and supervised learning, which limits model scale because of the expense of annotation; this stands in contrast to the continuous advances in natural language processing. However, recent breakthroughs, such as the Midjourney and Stable Diffusion [113] image generation models, signify a shift in the field. Efficient Self-supervised tasks, in which generative models learn to predict future scenes without relying on annotation information, are gaining consensus.
During the CVPR 2023 workshop, Tesla and Wayve highlighted their latest ventures into generating large models for continuous video scene generation in autonomous driving. Wayve named their model GAIA-1, recently releasing it, while Tesla referred to their attempt as the World Model. Leveraging extensive live video data collected by autonomous vehicles, these generative models predict future scenes and compare them with real-time data. This enables a loss function for training the model without relying on annotation information. Researchers, including Wang et al., have made preliminary attempts in this direction, exemplified by the DriveDreamer [114] model based on Transformers. This model constructs representations of complex environments using the diffusion model, showcasing its significant impact on developing 3D object detection tasks.
Large models not only excel in generating future scenarios but also demonstrate remarkable performance in specific downstream tasks. In recent international top algorithm competitions such as nuScenes LiDAR semantic segmentation, KITTI instance segmentation, and Talk2Car, the UDeep series methods have consistently claimed top positions. In nuScenes LiDAR semantic segmentation in particular, the UDeep series introduced a point cloud feature fusion module based on language models.
This innovative approach transforms the point cloud perception problem into a natural language sequence recognition problem. Each point in the point cloud is modeled as a sentence, and the entire point cloud is then transformed into a sequence comprising several sentences. Leveraging the potent modeling capabilities of language models, the UDeep series effectively integrates original point cloud information with prior semantic clues, achieving top-ranking results on the nuScenes leaderboard. This success underscores the substantial potential of large models in addressing specific tasks with exceptional performance.
Also, at CVPR 2023, the best paper award went to Planning-oriented Autonomous Driving [115]. The paper integrates detection, tracking, mapping, trajectory prediction, occupancy grid prediction, and planning into an end-to-end network framework based on the Transformer architecture. It connects each task at the feature level through tokens, aligned with the stages of perception, prediction, and decision-making, with the overarching objective of comprehensively enhancing algorithmic performance in autonomous driving systems. This approach signifies a shift away from piecemeal improvements and reflects the academic community's growing emphasis on practical, effective end-to-end systems in discussions about the future.

4.3. Discussion

In this subsection, some innovative ideas for Transformer-based 3D object detection methods are provided in conjunction with the above.

4.3.1. Feature Space

Transformer-based 3D object detection methods typically use BEV as the feature space for subsequent detection and other tasks. Although BEV-based methods can effectively understand complex objects in a scene, they also overlook height information in the real world. Incorporating additional height information into BEV perception is therefore recognized as a promising direction for future research. Analogous to integrating depth information to enhance vision-based solutions, the inclusion of height information aims to improve the existing methodology.
This proposal is supported by recent innovations such as TPVFormer, which utilizes three orthogonal planes to express spatial features in the BEV space. This addresses the computational overhead associated with voxelization and considers height information within the BEV space. The success of such approaches suggests that incorporating vertical information can contribute to a more comprehensive representation of the environment and the objects within it.
By extending the multi-modal BEV perception paradigm to include additional height cues, researchers may achieve more accurate and robust 3D object detection. This could lead to better performance in scenarios where height plays a crucial role, such as identifying objects with varying elevations, multi-level structures, or partially occluded areas. Overall, integrating height information into BEV perception is promising for advancing state-of-the-art 3D object detection methodologies.

4.3.2. Feature Fusion

Feature fusion refers to the combination of features from different levels or sources to generate more informative and expressive feature representations. In deep learning, feature fusion is often used to improve model performance and enhance learning ability for complex patterns and abstract features. The current landscape of multi-modal 3D object detection methods predominantly relies on feature-level fusion, where data from different sensors is fused after independent feature extraction. While Cross-attention in Transformers is commonly used for this fusion, it introduces notable computational overhead. Introducing Deformable-attention during feature fusion can be a promising approach to alleviate this computational burden. Previous experiments indicate that a straightforward and efficient Deformable-attention mechanism might offer unexpected benefits, highlighting its potential to enhance feature fusion in multi-modal 3D object detection.
Furthermore, there is an untapped potential for incorporating temporal information into the detection process. BEV methods based on panoramic images inherently contain valuable time series data. In contrast, 3D object detection based on point cloud data has seen limited exploration in leveraging temporal information. Integrating temporal information into 3D scenes can significantly improve detection performance, providing a richer understanding of the dynamic nature of objects in the environment over time. This avenue of research could lead to more robust and accurate multi-modal 3D object detection methods.

4.3.3. Feature Representation

In 3D space, there are two types of representations, explicit and implicit, depending on whether the representation visually displays the 3D geometric structure. Explicit representation describes the surface or volume of a 3D object with regular or irregular discrete elements. Implicit representation parameterizes a 3D object or scene with a continuous function that maps each point in space to an attribute value, such as occupancy probability, signed distance, or color; this function is typically approximated by a neural network. 3D object detection is developing toward implicit representation. Tesla's Occupancy Network introduces a grid-based approach that segments the world into a grid and discerns occupied and free cells, turning the task into semantic segmentation and simplifying the representation of 3D space by predicting occupancy probabilities.
In addition to the Occupancy Network paradigm, there is growing emphasis on other implicit 3D representations. The three primary forms of implicit 3D expression are the occupancy function (as used in occupancy prediction), the signed distance function, and the Neural Radiance Field. Moreover, 3D Gaussian Splatting [116] provides an excellent way to combine explicit and implicit expression. This hybrid approach leverages the strengths of both representations, allowing more comprehensive and flexible modeling of complex 3D scenes and objects.
In particular, with the popularity of NeRF [117] in 3D reconstruction, this paper suggests that combining NeRF with the Transformer deserves more attempts. NeRF is a deep learning-based method for 3D scene modeling whose objective is to learn attributes such as color and density for each point in a scene from input data such as images, thereby generating high-quality 3D models. Existing works such as NeRF-Det [118] and NeRF-RPN [119] convert multi-view images into feature volumes containing color and density through NeRF and then use 3D CNNs for feature extraction, achieving promising results in indoor 3D object detection.
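For reference, a bare-bones NeRF-style field can be written as an MLP that maps a 3D position and viewing direction to density and color; positional encoding and volume rendering are omitted, and the structure below is only indicative of the representation such detection works sample from.

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Minimal NeRF-like field: (position, view direction) -> (density, RGB color)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.density = nn.Linear(hidden, 1)
        self.color = nn.Sequential(nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                                   nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, xyz, view_dir):
        # xyz, view_dir: (N, 3); returns sigma: (N, 1) >= 0 and rgb: (N, 3) in [0, 1]
        h = self.trunk(xyz)
        sigma = torch.relu(self.density(h))
        rgb = self.color(torch.cat([h, view_dir], dim=-1))
        return sigma, rgb
```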

5. Conclusions

In this paper, we comprehensively review and analyze various aspects of Transformer-based 3D object detection for autonomous driving. We begin by examining the data sources for 3D object detection, providing an overview of available datasets and a detailed exposition of the evaluation metrics employed in some of them. Subsequently, we propose a classification scheme for 3D object detection methods based on the data source, distinguishing between image-based, point cloud-based, and multi-modal-based approaches, including a comparative performance analysis of selected methods on the datasets. Finally, we summarize the current state of Transformer-based 3D object detection and offer insights into potential research directions in this field, discussing innovative ideas that could be combined with existing methodologies. Notably, the application of the Transformer shows significant promise in enhancing the fusion of multi-modal data in 3D object detection. In particular, constructing feature spaces in a top-down manner with the Transformer warrants further exploration and attention.

Author Contributions

Methodology, M.Z.; Resources, C.T.; Writing—Original Draft Preparation, Y.G.; Writing—Review and Editing, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Qiyuan Innovation Foundation (S20210201067) and sub-themes (No. 9072323404).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. 2D Object Detection with Transformers: A Review. arXiv 2023, arXiv:2306.04670. [Google Scholar]
  2. Zhong, J.; Liu, Z.; Chen, X. Transformer-based models and hardware acceleration analysis in autonomous driving: A survey. arXiv 2023, arXiv:2304.10891. [Google Scholar]
  3. Lu, D.; Xie, Q.; Wei, M.; Gao, K.; Xu, L.; Li, J. Transformers in 3d point clouds: A survey. arXiv 2022, arXiv:2205.07417. [Google Scholar]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  5. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  6. Xie, Q.; Lai, Y.K.; Wu, J.; Wang, Z.; Zhang, Y.; Xu, K.; Wang, J. Mlcvnet: Multi-level context votenet for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10447–10456. [Google Scholar]
  7. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9277–9286. [Google Scholar]
  8. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–18. [Google Scholar]
  9. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  10. Wu, J.; Yin, D.; Chen, J.; Wu, Y.; Si, H.; Lin, K. A survey on monocular 3D object detection algorithms based on deep learning. J. Phys. Conf. Ser. 2020, 1518, 012049. [Google Scholar] [CrossRef]
  11. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  12. Ma, Y.; Wang, T.; Bai, X.; Yang, H.; Hou, Y.; Wang, Y.; Qiao, Y.; Yang, R.; Manocha, D.; Zhu, X. Vision-centric bev perception: A survey. arXiv 2022, arXiv:2208.02797. [Google Scholar]
  13. Ma, X.; Ouyang, W.; Simonelli, A.; Ricci, E. 3D object detection from images for autonomous driving: A survey. arXiv 2022, arXiv:2202.02980. [Google Scholar] [CrossRef] [PubMed]
  14. Kim, S.H.; Hwang, Y. A survey on deep learning based methods and datasets for monocular 3D object detection. Electronics 2021, 10, 517. [Google Scholar] [CrossRef]
  15. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  16. Barnes, D.; Gadd, M.; Murcutt, P.; Newman, P.; Posner, I. The Oxford Radar RobotCar Dataset: A Radar Extension to the Oxford RobotCar Dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–21 August 2020. [Google Scholar]
  17. Alaba, S.Y.; Gurbuz, A.C.; Ball, J.E. Emerging Trends in Autonomous Vehicle Perception: Multimodal Fusion for 3D Object Detection. World Electr. Veh. J. 2024, 15, 20. [Google Scholar] [CrossRef]
  18. Oliveira, M.; Cerqueira, R.; Pinto, J.R.; Fonseca, J.; Teixeira, L.F. Multimodal PointPillars for Efficient Object Detection in Autonomous Vehicles. IEEE Trans. Intell. Veh. 2024, 1–11. [Google Scholar] [CrossRef]
  19. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12878–12895. [Google Scholar] [CrossRef] [PubMed]
  20. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  21. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1489–1500. [Google Scholar] [CrossRef] [PubMed]
  22. Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image de-raining transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12978–12995. [Google Scholar] [CrossRef]
  23. Lee, Y.; Hwang, J.W.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–19 June 2019. [Google Scholar]
  24. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  26. Huang, K.; Tian, C.; Su, J.; Lin, J.C.W. Transformer-based cross reference network for video salient object detection. Pattern Recognit. Lett. 2022, 160, 122–127. [Google Scholar] [CrossRef]
  27. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  28. Jain, J.; Li, J.; Chiu, M.T.; Hassani, A.; Orlov, N.; Shi, H. Oneformer: One transformer to rule universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2989–2998. [Google Scholar]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  31. He, C.; Li, R.; Li, S.; Zhang, L. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8417–8427. [Google Scholar]
  32. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  33. Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 954–960. [Google Scholar]
  34. Patil, A.; Malla, S.; Gang, H.; Chen, Y.T. The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9552–9557. [Google Scholar]
  35. Gählert, N.; Jourdan, N.; Cordts, M.; Franke, U.; Denzler, J. Cityscapes 3d: Dataset and benchmark for 9 dof vehicle detection. arXiv 2020, arXiv:2006.07864. [Google Scholar]
  36. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
  37. Mao, J.; Niu, M.; Jiang, C.; Liang, H.; Chen, J.; Liang, X.; Li, Y.; Ye, C.; Zhang, W.; Li, Z.; et al. One million scenes for autonomous driving: Once dataset. arXiv 2021, arXiv:2106.11037. [Google Scholar]
  38. Xiao, P.; Shao, Z.; Hao, S.; Zhang, Z.; Chai, X.; Jiao, J.; Li, Z.; Wu, J.; Sun, K.; Jiang, K.; et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3095–3101. [Google Scholar]
  39. Wang, L.; Zhang, X.; Song, Z.; Bi, J.; Zhang, G.; Wei, H.; Tang, L.; Yang, L.; Li, J.; Jia, C.; et al. Multi-modal 3D Object Detection in Autonomous Driving: A Survey and Taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798. [Google Scholar] [CrossRef]
  40. Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3d lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
  41. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  42. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection. arXiv 2022, arXiv:2206.10092. [Google Scholar] [CrossRef]
  43. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  44. Zhang, R.; Qiu, H.; Wang, T.; Guo, Z.; Xu, X.; Qiao, Y.; Gao, P.; Li, H. MonoDETR: Depth-guided transformer for monocular 3D object detection. arXiv 2022, arXiv:2203.13310. [Google Scholar]
  45. Wu, Y.; Li, R.; Qin, Z.; Zhao, X.; Li, X. HeightFormer: Explicit Height Modeling without Extra Data for Camera-only 3D Object Detection in Bird’s Eye View. arXiv 2023, arXiv:2307.13510. [Google Scholar]
  46. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 September 2022; pp. 180–191. [Google Scholar]
  47. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. Petr: Position embedding transformation for multi-view 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 531–548. [Google Scholar]
  48. Misra, I.; Girdhar, R.; Joulin, A. An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2906–2917. [Google Scholar]
  49. Zhou, Z.; Zhao, X.; Wang, Y.; Wang, P.; Foroosh, H. Centerformer: Center-based transformer for 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 496–513. [Google Scholar]
  50. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F.; Zhou, B.; Zhao, H. AutoAlign: Pixel-instance feature aggregation for multi-modal 3D object detection. arXiv 2022, arXiv:2201.06493. [Google Scholar]
  51. Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross Modal Transformer via Coordinates Encoding for 3D Object Dectection. arXiv 2023, arXiv:2301.01283. [Google Scholar]
  52. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 8445–8453. [Google Scholar]
  53. Reading, C.; Harakeh, A.; Chae, J.; Waslander, S.L. Categorical depth distribution network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8555–8564. [Google Scholar]
  54. Ding, M.; Huo, Y.; Yi, H.; Wang, Z.; Shi, J.; Lu, Z.; Luo, P. Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 1000–1001. [Google Scholar]
  55. Chen, H.; Huang, Y.; Tian, W.; Gao, Z.; Xiong, L. Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10379–10388. [Google Scholar]
  56. Chen, Y.; Tai, L.; Sun, K.; Li, M. Monopair: Monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12093–12102. [Google Scholar]
  57. Zhang, Y.; Lu, J.; Zhou, J. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3289–3298. [Google Scholar]
  58. Yang, W.; Li, Q.; Liu, W.; Yu, Y.; Ma, Y.; He, S.; Pan, J. Projecting your view attentively: Monocular road scene layout estimation via cross-view transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15536–15545. [Google Scholar]
  59. Chitta, K.; Prakash, A.; Geiger, A. Neat: Neural attention fields for end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15793–15803. [Google Scholar]
  60. Can, Y.B.; Liniger, A.; Paudel, D.P.; Van Gool, L. Structured bird’s-eye-view traffic scene understanding from onboard images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15661–15670. [Google Scholar]
  61. Huang, K.C.; Wu, T.H.; Su, H.T.; Hsu, W.H. Monodtr: Monocular 3d object detection with depth-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4012–4021. [Google Scholar]
  62. Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 194–210. [Google Scholar]
  63. Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.; Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17830–17839. [Google Scholar]
  64. Liu, Y.; Yan, J.; Jia, F.; Li, S.; Gao, A.; Wang, T.; Zhang, X.; Sun, J. Petrv2: A unified framework for 3d perception from multi-camera images. arXiv 2022, arXiv:2206.01256. [Google Scholar]
  65. Qin, Z.; Chen, J.; Chen, C.; Chen, X.; Li, X. UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird’s-Eye-View. arXiv 2022, arXiv:2207.08536. [Google Scholar]
  66. Qi, Z.; Wang, J.; Wu, X.; Zhao, H. OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection. arXiv 2023, arXiv:2306.01738. [Google Scholar]
  67. Wang, S.; Liu, Y.; Wang, T.; Li, Y.; Zhang, X. Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection. arXiv 2023, arXiv:2303.11926. [Google Scholar]
  68. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  69. Zhang, H.; Li, H.; Liao, X.; Li, F.; Liu, S.; Ni, L.M.; Zhang, L. DA-BEV: Depth Aware BEV Transformer for 3D Object Detection. arXiv 2023, arXiv:2302.13002. [Google Scholar]
  70. Chen, S.; Wang, X.; Cheng, T.; Zhang, Q.; Huang, C.; Liu, W. Polar parametrization for vision-based surround-view 3d detection. arXiv 2022, arXiv:2206.10965. [Google Scholar]
  71. Jiang, Y.; Zhang, L.; Miao, Z.; Zhu, X.; Gao, J.; Hu, W.; Jiang, Y.G. Polarformer: Multi-camera 3d object detection with polar transformer. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1042–1050. [Google Scholar] [CrossRef]
  72. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  73. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  74. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  75. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  76. Chen, X.; Zhao, H.; Zhou, G.; Zhang, Y.Q. Pq-transformer: Jointly parsing 3d objects and layouts from point clouds. IEEE Robot. Autom. Lett. 2022, 7, 2519–2526. [Google Scholar] [CrossRef]
  77. Liu, Z.; Zhang, Z.; Cao, Y.; Hu, H.; Tong, X. Group-free 3d object detection via transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2949–2958. [Google Scholar]
  78. Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; Teh, Y.W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3744–3753. [Google Scholar]
  79. Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J. Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2743–2752. [Google Scholar]
  80. Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3d object detection with pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7463–7472. [Google Scholar]
  81. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. Tanet: Robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11677–11684. [Google Scholar]
  82. Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.X.; Zhao, H.; Wang, F.; Wang, N.; Zhang, Z. Embracing single stride 3d object detector with sparse transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8458–8468. [Google Scholar]
  83. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3164–3173. [Google Scholar]
  84. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  85. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10529–10538. [Google Scholar]
  86. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
  87. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191. [Google Scholar]
  88. Zhang, Y.; Chen, J.; Huang, D. Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 908–917. [Google Scholar]
  89. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection. arXiv 2022, arXiv:2207.10316. [Google Scholar]
  90. Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. Futr3d: A unified sensor fusion framework for 3d detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 172–181. [Google Scholar]
  91. Ge, C.; Chen, J.; Xie, E.; Wang, Z.; Hong, L.; Lu, H.; Li, Z.; Luo, P. MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation. arXiv 2023, arXiv:2304.09801. [Google Scholar]
  92. Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation. arXiv 2023, arXiv:2308.07732. [Google Scholar]
  93. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2774–2781. [Google Scholar]
  94. Hu, C.; Zheng, H.; Li, K.; Xu, J.; Mao, W.; Luo, M.; Wang, L.; Chen, M.; Liu, K.; Zhao, Y.; et al. FusionFormer: A Multi-sensory Fusion in Bird’s-Eye-View and Temporal Consistent Transformer for 3D Objection. arXiv 2023, arXiv:2309.05257. [Google Scholar]
  95. Nabati, R.; Qi, H. CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection. arXiv 2020, arXiv:2011.04841. [Google Scholar]
  96. Lin, Z.; Liu, Z.; Xia, Z.; Wang, X.; Wang, Y.; Qi, S.; Dong, Y.; Dong, N.; Zhang, L.; Zhu, C. RCBEVDet: Radar-camera Fusion in Bird’s Eye View for 3D Object Detection. arXiv 2024, arXiv:2403.16440. [Google Scholar]
  97. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3531–3539. [Google Scholar]
  98. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  99. Syu, J.H.; Lin, J.C.W.; Srivastava, G.; Yu, K. A comprehensive survey on artificial intelligence empowered edge computing on consumer electronics. IEEE Trans. Consum. Electron. 2023, 69, 1023–1034. [Google Scholar] [CrossRef]
  100. Liu, S.; Liu, L.; Tang, J.; Yu, B.; Wang, Y.; Shi, W. Edge computing for autonomous driving: Opportunities and challenges. Proc. IEEE 2019, 107, 1697–1716. [Google Scholar] [CrossRef]
  101. Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A survey on mobile edge computing: The communication perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
  102. Lu, A.; Lee, J.; Kim, T.H.; Karim, M.A.U.; Park, R.S.; Simka, H.; Yu, S. High-speed emerging memories for AI hardware accelerators. Nat. Rev. Electr. Eng. 2024, 1, 24–34. [Google Scholar] [CrossRef]
  103. Deng, L.; Li, G.; Han, S.; Shi, L.; Xie, Y. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 2020, 108, 485–532. [Google Scholar] [CrossRef]
  104. Han, Y.; Zhang, H.; Li, H.; Jin, Y.; Lang, C.; Li, Y. Collaborative perception in autonomous driving: Methods, datasets, and challenges. IEEE Intell. Transp. Syst. Mag. 2023, 15, 131–151. [Google Scholar] [CrossRef]
  105. Malik, S.; Khan, M.J.; Khan, M.A.; El-Sayed, H. Collaborative Perception—The Missing Piece in Realizing Fully Autonomous Driving. Sensors 2023, 23, 7854. [Google Scholar] [CrossRef]
  106. Xu, R.; Xiang, H.; Tu, Z.; Xia, X.; Yang, M.H.; Ma, J. V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 107–124. [Google Scholar]
  107. Zhang, H.; Luo, G.; Cao, Y.; Jin, Y.; Li, Y. Multi-modal virtual-real fusion based transformer for collaborative perception. In Proceedings of the 2022 IEEE 13th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), Beijing, China, 4–6 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  108. Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4460–4470. [Google Scholar]
  109. Li, Y.; Yu, Z.; Choy, C.; Xiao, C.; Alvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9087–9098. [Google Scholar]
  110. Huang, Y.; Zheng, W.; Zhang, Y.; Zhou, J.; Lu, J. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9223–9232. [Google Scholar]
  111. Zhang, Y.; Zhu, Z.; Du, D. OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction. arXiv 2023, arXiv:2304.05316. [Google Scholar]
  112. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  113. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  114. Wang, X.; Zhu, Z.; Huang, G.; Chen, X.; Lu, J. DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving. arXiv 2023, arXiv:2309.09777. [Google Scholar]
  115. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17853–17862. [Google Scholar]
  116. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph. 2023, 42. [Google Scholar] [CrossRef]
  117. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  118. Xu, C.; Wu, B.; Hou, J.; Tsai, S.; Li, R.; Wang, J.; Zhan, W.; He, Z.; Vajda, P.; Keutzer, K.; et al. NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection. arXiv 2023, arXiv:2307.14620. [Google Scholar]
  119. Hu, B.; Huang, J.; Liu, Y.; Tai, Y.W.; Tang, C.K. NeRF-RPN: A general framework for object detection in NeRFs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23528–23538. [Google Scholar]
Figure 1. Overview of the structure of this article.
Figure 2. Illustration of the Transformer encoder architecture [3].
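For reference, the scaled dot-product attention at the core of the encoder in Figure 2 is computed as in [4]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the keys. Multi-head attention runs $h$ such operations in parallel on learned linear projections of $Q$, $K$, and $V$ and concatenates the results.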
Figure 3. Comparison between self-attention and cross-attention. In self-attention, Q, K, and V all come from the same sequence and are fed into multi-head attention; in cross-attention, Q comes from a different sequence than K and V [4].
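To make the distinction in Figure 3 concrete, the following is a minimal PyTorch sketch; the tensor shapes and variable names are illustrative only and are not taken from any detector discussed in this survey.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)   # one sequence (e.g., a set of object queries)
y = torch.randn(2, 50, embed_dim)   # a different sequence (e.g., flattened image features)

# Self-attention: Q, K, and V all come from the same sequence x.
self_out, _ = attn(query=x, key=x, value=x)

# Cross-attention: Q comes from x, while K and V come from the other sequence y.
cross_out, _ = attn(query=x, key=y, value=y)

print(self_out.shape, cross_out.shape)  # both torch.Size([2, 10, 256])
```

In Transformer-based detectors, the cross-attention pattern is what lets object or BEV queries attend to image or point cloud features.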
Figure 4. Diagram of related Transformer-based 3D object detection methods.
Figure 5. Overview of the Transformer-based 3D object detection methods.
Figure 6. Comparison of the methods based on monocular, stereo, and multi-view images.
Figure 7. A brief comparison between point-based and voxel-based methods. The first row shows a point-based method, whose input is the raw point cloud. The second row shows a voxel-based method, which first voxelizes the raw point cloud. Point cloud voxelization converts the point cloud into a discrete voxel grid representation; subsequent steps such as feature extraction and detection then operate on this voxel grid.
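As a concrete illustration of the voxelization step described in the caption of Figure 7, the NumPy sketch below assigns raw LiDAR points to a discrete voxel grid; the voxel size and point cloud range are illustrative assumptions, not values taken from any specific method.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.2), pc_range=(-50, -50, -3, 50, 50, 1)):
    """Map each LiDAR point (x, y, z, ...) to integer voxel coordinates.

    Returns the voxel index of every kept point, the kept points themselves,
    and the overall grid shape. voxel_size and pc_range are illustrative only.
    """
    points = np.asarray(points, dtype=np.float32)
    low, high = np.array(pc_range[:3]), np.array(pc_range[3:])
    # Discard points outside the detection range.
    mask = np.all((points[:, :3] >= low) & (points[:, :3] < high), axis=1)
    kept = points[mask]
    # Convert metric coordinates to discrete voxel indices.
    coords = np.floor((kept[:, :3] - low) / np.array(voxel_size)).astype(np.int64)
    grid_shape = np.ceil((high - low) / np.array(voxel_size)).astype(np.int64)
    return coords, kept, grid_shape

# Example: 1000 random points with (x, y, z, intensity).
pts = np.random.uniform(low=(-50, -50, -3, 0), high=(50, 50, 1, 1), size=(1000, 4))
coords, kept, grid = voxelize(pts)
print(grid)  # [500 500  20] for the assumed range and voxel size
```

Voxel-based detectors then extract features from the points grouped in each occupied voxel, e.g., with sparse convolutions [84] or voxel Transformers [83], whereas point-based detectors consume the raw points directly.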
Table 2. Datasets for 3D object detection in autonomous driving. “-” means not mentioned.

| Datasets | Sensors | LiDAR Scans | Images | 3D Boxes | Locations | Year |
|---|---|---|---|---|---|---|
| KITTI [15] | 1 LiDAR, 2 Cameras | 15 k | 15 k | 80 k | Germany | 2012 |
| ApolloScape [33] | 2 LiDARs, 2 Cameras | 20 k | 144 k | 475 k | China | 2019 |
| H3D [34] | 1 LiDAR, 2 Cameras | 27 k | 83 k | 1.1 M | USA | 2019 |
| Cityscapes [35] | 0 LiDARs, 2 Cameras | 0 | 5 k | - | Germany | 2020 |
| nuScenes [9] | 1 LiDAR, 6 Cameras | 400 k | 1.4 M | 1.4 M | SG, USA | 2020 |
| Waymo Open [36] | 5 LiDARs, 5 Cameras | 230 k | 1 M | 12 M | USA | 2020 |
| ONCE [37] | 1 LiDAR, 7 Cameras | 1 M | 7 M | 417 k | China | 2021 |
| PandaSet [38] | 2 LiDARs, 6 Cameras | 8.2 k | 49 k | 1.3 M | USA | 2021 |
Table 3. Representative Transformer-based 3D object detection methods.

| Methods | Category | Key Innovation Points | Strengths |
|---|---|---|---|
| MonoDETR [44] | Monocular | A pre-predicted foreground depth map serves as a guidance signal; the Transformer guides each object to adaptively extract its region of interest and supports subsequent 3D detection feature extraction. | Extracts features with a global receptive field. |
| HeightFormer [45] | Image-based | Height information is learned through self-recursive, layer-by-layer refinement, and query masks are used to avoid background misdirection. | Height ground truth can be obtained directly from annotated 3D bounding boxes, without additional depth data or sensors. |
| DETR3D [46] | Multi-view | A Transformer extracts features in the BEV feature space; following the idea of Deformable DETR, 3D reference points interact with the features of the 2D images. | Constructs a sparse BEV feature space while reducing computational complexity. |
| PETR [47] | Multi-view | Building on DETR3D, 3D position encodings are added to the 2D image features, which then interact directly with the 3D query features. | Enhances network performance while reducing inference overhead. |
| BEVFormer [8] | Multi-view | A Transformer extracts features from multi-view images to build the BEV feature space top-down, and the BEV representation is refined through temporal modeling. | Network performance is greatly improved by integrating temporal features. |
| 3DETR [48] | Point Cloud-based | Adds non-parametric queries and Fourier positional embeddings to the original Transformer. | The network is simple, easy to implement, and performs well. |
| CenterFormer [49] | Point Cloud-based | Selects center candidates from a center heatmap on top of a standard voxel-based point cloud encoder, then uses the center candidates’ features as queries embedded in the Transformer. | Reduces the convergence difficulty and computational complexity of the Transformer structure. |
| AutoAlign [50] | Multi-modal-based | A learnable alignment map models the mapping between images and point clouds, enabling the model to automatically align heterogeneous features in a dynamic, data-driven manner. | Fully exploits the feature relationship between point clouds and images. |
| Cross Modal Transformer [51] | Multi-modal-based | Implicitly encodes 3D positions into multi-modal features, avoiding the bias introduced by explicit cross-view feature alignment. | Strong robustness: in the absence of LiDAR, performance remains comparable to that of vision-based methods. |
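To illustrate the 3D-to-2D interaction summarized for DETR3D [46] in Table 3, the sketch below projects 3D reference points into a single camera view and bilinearly samples image features. The single-view simplification, the tensor names, and the assumption that the projection matrix maps 3D points directly to feature-map pixel coordinates are ours, not details of the original implementation.

```python
import torch
import torch.nn.functional as F

def sample_image_features(ref_points_3d, feat_map, lidar2img):
    """Project 3D reference points into one camera and sample 2D features.

    ref_points_3d: (N, 3) reference points in the ego/LiDAR frame.
    feat_map:      (1, C, H, W) image feature map from a 2D backbone.
    lidar2img:     (4, 4) matrix assumed to map 3D points to feature-map pixels.
    """
    n = ref_points_3d.shape[0]
    pts_h = torch.cat([ref_points_3d, ref_points_3d.new_ones(n, 1)], dim=1)  # homogeneous (N, 4)
    cam = pts_h @ lidar2img.t()                                              # (N, 4)
    depth = cam[:, 2:3].clamp(min=1e-5)   # avoid division by ~0 for points near/behind the camera
    uv = cam[:, :2] / depth               # (N, 2) pixel coordinates
    h, w = feat_map.shape[-2:]
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat_map, grid.view(1, n, 1, 2), align_corners=True)  # (1, C, N, 1)
    return sampled.squeeze(-1).squeeze(0).t()                                     # (N, C)

# Illustrative usage with random data.
feats = sample_image_features(torch.randn(900, 3), torch.randn(1, 256, 32, 88), torch.eye(4))
print(feats.shape)  # torch.Size([900, 256])
```

In the actual detector, this sampling is repeated across camera views, feature scales, and decoder layers, and the sampled features are used to refine the object queries.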
Table 4. A brief comparison between LSS and Transformer-based methods.

| Category | Paradigm | Advantages | Disadvantages | Representative Methods |
|---|---|---|---|---|
| LSS [62] | Bottom-up | Explicit depth estimation; the first end-to-end trainable approach that addresses multi-sensor fusion | Relies heavily on the accuracy of depth information; the outer-product operation is time-consuming | BEVDet [43], BEVDepth [42], … |
| Transformer-based | Top-down | Models spatial features implicitly; extensible; supports temporal fusion | Hard to train | BEVFormer [8], BEVFormer v2 [63], … |
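The outer-product cost mentioned for LSS-style methods in Table 4 comes from the lift step, sketched below with illustrative shapes: each pixel's feature vector is weighted by its predicted categorical depth distribution, producing a frustum of features that is later splatted onto the BEV grid.

```python
import torch

# Illustrative shapes: C feature channels, D depth bins, H x W feature map.
C, D, H, W = 64, 40, 16, 44
feat = torch.randn(C, H, W)          # per-pixel image features
depth_logits = torch.randn(D, H, W)  # per-pixel depth distribution predicted by the network
depth_prob = depth_logits.softmax(dim=0)

# Outer product over the channel and depth axes: (D, 1, H, W) * (1, C, H, W) -> (D, C, H, W).
frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(0)
print(frustum.shape)  # torch.Size([40, 64, 16, 44])
```

This dense D × C × H × W tensor must be built for every camera, which is why the operation is memory- and time-hungry; Transformer-based top-down methods avoid it by letting learnable BEV queries attend to the image features directly.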
Table 5. Detection performance of some methods on the nuScenes test set.

| Model | Publisher | Modalities | mAP | NDS |
|---|---|---|---|---|
| FusionFormer [94] | arXiv | Camera, LiDAR | 0.726 | 0.751 |
| UniTR [92] | arXiv | Camera, LiDAR | 0.709 | 0.745 |
| CMT [51] | ICCV (2023) | Camera, LiDAR | 0.704 | 0.730 |
| FUTR3D [90] | arXiv | Camera, LiDAR | 0.694 | 0.721 |
| TransFusion [86] | CVPR (2022) | Camera, LiDAR | 0.689 | 0.717 |
| AutoAlignV2 [89] | ECCV (2022) | Camera, LiDAR | 0.684 | 0.724 |
| MetaBEV [91] | arXiv | Camera, LiDAR | 0.680 | 0.715 |
| AutoAlign [50] | IJCAI (2022) | Camera, LiDAR | 0.658 | 0.709 |
| StreamPETR [67] | ICCV (2023) | Camera | 0.620 | 0.676 |
| BEVFormer v2 [63] | CVPR (2023) | Camera | 0.556 | 0.634 |
| DA-BEV [69] | arXiv | Camera | 0.515 | 0.600 |
| PETRv2 [64] | ECCV (2022) | Camera | 0.490 | 0.582 |
| BEVFormer [8] | ECCV (2022) | Camera | 0.481 | 0.569 |
| MonoDETR [44] | arXiv | Camera | 0.435 | 0.524 |
| PETR [47] | ECCV (2022) | Camera | 0.434 | 0.481 |
| PolarDETR [70] | arXiv | Camera | 0.431 | 0.493 |
| HeightFormer [45] | arXiv | Camera | 0.429 | 0.532 |
| BEVDepth [42] | arXiv | Camera | 0.418 | 0.538 |
| OCBEV [66] | arXiv | Camera | 0.417 | 0.532 |
| PolarFormer [71] | arXiv | Camera | 0.415 | 0.470 |
| DETR3D [46] | PMLR (2022) | Camera | 0.412 | 0.479 |
| BEVDet [43] | arXiv | Camera | 0.397 | 0.477 |
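For convenience when reading Table 5, recall that the nuScenes detection score (NDS) combines mAP with five true-positive error metrics (mean translation, scale, orientation, velocity, and attribute errors), as defined in [9]:

$$\mathrm{NDS} = \frac{1}{10}\left[5\cdot\mathrm{mAP} + \sum_{\mathrm{mTP}\in\mathbb{TP}}\big(1 - \min(1, \mathrm{mTP})\big)\right],$$

so a higher NDS reflects both better classification/localization (mAP) and lower box-quality errors.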