A Review of 3D Object Detection for Autonomous Driving of Electric Vehicles

Abstract: In recent years, electric vehicles have achieved rapid development. Intelligence is one of the important trends promoting this development, and the autonomous driving system is therefore becoming one of the core systems of electric vehicles. Considering that environmental perception is the basis of intelligent planning and safe decision-making for intelligent vehicles, this paper presents a survey of existing perception methods for vehicles, especially 3D object detection, which guarantees the reliability and safety of vehicles. In this review, we first introduce the role of the perception module in an autonomous driving system and its relationship with other modules. Then, we classify and analyze the corresponding perception methods according to the sensors they rely on. Finally, we compare the performance of the surveyed works on public datasets and discuss possible future research directions.


Introduction
In recent years, electric vehicles (EVs) have gained increasing favor and attention. Environmental protection and fiscal return are two advantages of EVs. On one hand, the driving process of EVs does not pollute or damage the environment. On the other hand, the overall cost of EVs is lower than that of conventional vehicles over the same mileage. With the strengthening of EV supporting infrastructure, EV technology has been steadily developed and improved. Throughout this process, safety, comfort, energy conservation, and environmental protection remain the enduring themes of vehicle development, while electrification, intelligence, and renewable energy are effective measures to achieve these aims. The development of intelligent electric vehicles can improve the safety, comfort, and economy of vehicles. Furthermore, autonomous driving capability helps relieve urban traffic congestion and integrates naturally with the intelligent transportation environments of future cities. An autonomous driving system consists of perception, planning, decision, and control subsystems, as illustrated in Figure 1. The perception subsystem is the basis for the others. It takes data captured by different sensors as input to obtain the vehicle's position and pose, as well as the size and orientation of surrounding objects. Autonomous driving vehicles [1][2][3] are often equipped with a variety of sensors, including LiDARs, cameras, millimeter-wave radars, GPS, and so on, as illustrated in Figure 2.
A perception subsystem needs to be accurate and robust to ensure safe driving. It is composed of several important modules, such as object detection, tracking, and Simultaneous Localization and Mapping (SLAM). Object detection is a fundamental capability that aims to find all objects of interest in the captured data, such as images or point clouds, and determine their locations and categories. Images are captured by cameras and provide rich texture information. Cameras are cheap but cannot obtain accurate depth information, and they are sensitive to changes in illumination and weather, such as low luminosity at night, extreme brightness disparity when entering or leaving tunnels, and rain or snow. Point clouds are captured by LiDARs and provide accurate 3D spatial information. They are robust to weather and extreme lighting conditions, but are sparse and non-uniform in spatial distribution. In addition, LiDARs are expensive sensors. Therefore, considering the complementary characteristics of point clouds and images, cameras and LiDARs are both used as indispensable sensors to ensure the driving safety of intelligent vehicles. Notably, a failure to detect objects might lead to safety-related incidents; for example, a missed detection of a leading vehicle may result in a traffic accident that threatens human lives [4]. To avoid collisions with surrounding vehicles and pedestrians, object detection is thus an essential technique for analyzing perceived images and point clouds, identifying and localizing objects. The general framework is illustrated in Figure 3. With the development of deep learning, 2D object detection has become an extensively researched topic in computer vision, and CNN-based 2D object detectors [5][6][7][8] achieve excellent performance on public datasets [9][10][11]. However, 2D object detection only provides 2D bounding boxes and cannot provide the depth information of objects that is crucial for safe driving.
Compared with 2D object detection, 3D object detection provides more spatial information, such as location, orientation, and object size, which makes it more significant for autonomous driving. 3D detection needs to estimate more parameters for the oriented 3D boxes of objects, such as the 3D center coordinates, length, width, height, and yaw angle of a bounding box. In addition, 3D object detection still faces difficult problems, including complex interactions between objects, occlusion, changes in perspective and scale, and the limited information provided by 3D data.
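The 7-parameter box encoding described above can be made concrete with a short sketch. This is a generic illustration, not the convention of any particular detector or dataset; axis conventions and corner ordering vary (here z is assumed to point up):

```python
import numpy as np

def box3d_corners(x, y, z, l, w, h, yaw):
    """Return the 8 corners (8x3) of a 3D box from its 7-parameter encoding.

    (x, y, z) is the box center, (l, w, h) its size, and yaw the rotation
    around the vertical (z) axis.
    """
    # Corner offsets in the box's local frame, centered at the origin.
    xs = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * (l / 2)
    ys = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * (w / 2)
    zs = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * (h / 2)
    corners = np.stack([xs, ys, zs], axis=1)            # (8, 3)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about z
    return corners @ rot.T + np.array([x, y, z])

# A car-sized box 10 m ahead, axis-aligned (yaw = 0).
corners = box3d_corners(10.0, 2.0, 0.0, l=4.0, w=2.0, h=1.5, yaw=0.0)
```

Detectors regress exactly these seven numbers (often as offsets to an anchor), and evaluation then compares the resulting corner sets against the ground truth.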
In this paper, we present a review of 3D object detection methods to summarize the development and challenges of 3D object detection. We analyze the potential advantages and limitations of these methods. The existing 3D object detection methods are divided into image-based methods, point cloud-based methods, and multimodal fusion-based methods. A general framework of the existing object detection methods is shown in Figure 3. The categories and their limitations are briefly described in Table 1.

Table 1. Categories of point cloud processing for 3D object detection and their limitations.

Projection: project the point cloud onto 2D planes (e.g., front view or BEV) and predict 3D bounding boxes from the projected images. Limitation: there is information loss in the projection process.

Volumetric: voxelize the point cloud and generate representations with convolutional operations over the voxels to predict 3D bounding boxes. Limitation: expensive 3D convolutional operations increase inference time; the computation is heavy.

PointNet: apply the raw point cloud directly to predict 3D bounding boxes. Limitation: the large scale of point clouds increases running time, and it is difficult to generate region proposals.

Multi-sensor Fusion: fuse images and point clouds, which are robust and complement each other, to predict 3D bounding boxes. Limitation: fusion methods are computationally expensive and not yet mature.

Image-Based 3D Object Detection Methods
RGB-D images provide depth information and are used in some works. For example, Chen et al. [12] use the poses of 3D bounding boxes to establish an energy function and train a structured SVM to minimize it. In DSS [13], a multi-scale 3D RPN is used to propose objects on stereo images, which can detect objects of different sizes. Deng et al. [14] use a 2.5D method for object detection: they build a model to detect 2D objects and then lift the 2D detections to 3D space to realize 3D object detection. Due to the large computational cost of RGB-D images, monocular images are also used for 3D object detection.
In early work, Chen et al. propose Mono3D [15], which uses monocular images to generate 3D candidates and then scores each candidate through an energy model using semantics, context information, hand-designed shape features, and location priors, which are illustrated in Figure 4. Based on these candidates, Fast R-CNN is used to further refine the 3D bounding boxes by location regression. The network improves detection performance, but it depends on the object classes and needs a large number of candidates to achieve high recall, which increases computational cost. To overcome this limitation, Pham and Jeon propose the DeepStereoOP architecture [16], a class-independent algorithm that exploits not only RGB images but also depth information. Occlusion is a common phenomenon and a great challenge in driving environments. To alleviate this problem, Xiang et al. propose 3DVP [17], which introduces 3D voxel patterns and uses RGB values, 3D shape, and occlusion masks as appearance models. The 3D voxel patterns are illustrated in Figure 5. 3D detection is realized by minimizing the reprojection error between the 3D box projected onto the image plane and the 2D detection, which depends on the performance of the region proposal network (RPN). Compared with traditional region proposal methods, the RPN improves detection performance, but it cannot deal with object scale changes, occlusion, and truncation. Therefore, SubCNN [18] uses subcategory information to generate region proposals and object candidates, where subcategories are objects with similar characteristics or attributes, such as 2D appearance, 3D pose, or shape. A multi-scale image pyramid is applied as the backbone network to improve the detection of small objects. Although SubCNN improves robustness to occlusion and truncation, its detection performance depends on object categories.
Hu et al. [19] propose a multi-task framework to associate detections of moving objects over time and estimate 3D bounding box information from sequential images. They leverage 3D box depth-ordering matching for robust instance association and use 3D trajectory prediction to identify occluded vehicles. Benefiting from multi-task learning, Center3D [20], an extension of CenterNet [21], is proposed to efficiently estimate the 3D location and depth of objects using only monocular images. In recent years, 3D object detection from a 2D perspective has attracted the attention of many researchers. Lahoud and Ghanem [22] propose a 2D-driven 3D object detection method to reduce the search space of 3D objects. They apply manual features to train a multi-layer perceptron to predict 3D boxes. Later, they extend the work [23] with a multimodal region proposal network that generates region proposals by extending 2D boxes to 3D boxes. MonoDIS [24] leverages a novel disentangling transformation for the 2D and 3D detection losses and a self-supervised confidence score for 3D bounding boxes.
Considering that depth information is helpful for 3D detection, pseudo-LiDAR has been proposed based on stereo or monocular images [25,26]. A depth map is first predicted and then back-projected to generate a 3D point cloud in the LiDAR coordinate system. Wang et al. propose converting image-based depth maps to a pseudo-LiDAR representation, shown in Figure 6. Pseudo-LiDAR++ [27] is a 3D detection architecture based on pseudo-LiDAR; a depth-propagation algorithm is proposed to diffuse a few exact measurements across the entire initial depth map. The architecture does not rely on an expensive LiDAR and performs almost on par with a 64-beam LiDAR system, but the depth map prediction is time-consuming. Moreover, pseudo-LiDAR representations suffer from a long-tail problem caused by inaccurate depth estimation around object boundaries. Images can provide rich color and texture information. However, accurate depth information, which is necessary for estimating object size and location, especially in environments with occlusion and weak illumination, cannot be obtained from images directly, and estimating it from images is computationally expensive and inaccurate.
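The back-projection step at the heart of pseudo-LiDAR can be sketched with the standard pinhole model; the intrinsic values below are hypothetical, and a real pipeline would also transform the resulting points from the camera frame into the LiDAR frame:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map (H x W, metres) into a 3D point
    cloud in the camera frame using the pinhole model:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3)

# Toy example: a flat 2x2 depth map at 10 m with hypothetical intrinsics.
pts = depth_to_pseudo_lidar(np.full((2, 2), 10.0), fx=700.0, fy=700.0, cx=0.5, cy=0.5)
```

Because every pixel depends on the estimated Z, any depth error is magnified into 3D position error, which is exactly the long-tail artifact around object boundaries noted above.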

Point Cloud-Based 3D Object Detection Methods
LiDAR sensors use laser beams to measure the distances of obstacles in the environment and output a set of 3D points. Compared with image-based methods, point clouds provide reliable depth information, which can be used to locate objects accurately. Unlike the structured information contained in images, LiDAR point clouds are unordered, sparse, and limited in information. Most LiDAR-based object detection methods apply a two-stage strategy to detect objects, which is illustrated in Figure 7. To utilize the point cloud effectively, point cloud-based methods include 2D, 3D, and segmentation-based processing solutions. 2D processing means that the point cloud is transformed into 2D planes. This kind of method does not process 3D point cloud data directly, but first projects the point cloud to specific perspectives, such as the front view and bird's eye view (BEV); the projected images are shown in Figure 8. The pixels of the projected images are filled with the density, average intensity, and height of each grid cell as the RGB values, and the images are then input into an off-the-shelf 2D convolutional network. The 3D boxes are predicted from the convolutional features [28,29]. LMNet [30] projects the point cloud into the front view and uses an FCN structure with dilated convolution [31] for single-stage object detection. The method achieves real-time detection but low accuracy. Generally, BEV is adopted in AVs because objects barely overlap in BEV. In BirdNet [32], the point cloud is projected into a BEV image with three channels (height, intensity, and density, respectively). Then, Faster R-CNN [33] is used to detect 2D oriented bounding boxes of objects. Finally, oriented 3D bounding boxes are obtained offline by combining the 2D boxes with ground estimation. The method has low efficiency.
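A minimal sketch of such a BEV rasterization follows; the ranges, resolution, and channel normalization are illustrative choices, not taken from any particular paper (cells with no points keep value 0, and a real pipeline would also clip heights to a range of interest):

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(0, 70), y_range=(-35, 35), res=0.1):
    """Rasterize a LiDAR point cloud (N x 4: x, y, z, intensity) into a
    3-channel BEV image whose channels are, per cell:
    max height, mean intensity, and log-normalized point density."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((3, h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)

    # Keep only points inside the chosen BEV window.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[m]
    rows = ((pts[:, 0] - x_range[0]) / res).astype(int)
    cols = ((pts[:, 1] - y_range[0]) / res).astype(int)

    for r, c, p in zip(rows, cols, pts):
        bev[0, r, c] = max(bev[0, r, c], p[2])   # max height per cell
        bev[1, r, c] += p[3]                      # intensity sum (mean below)
        count[r, c] += 1
    nz = count > 0
    bev[1][nz] /= count[nz]                       # mean intensity
    bev[2] = np.log1p(count) / np.log(64)         # log-normalized density
    return bev
```

The resulting (3, H, W) tensor can be fed directly into an off-the-shelf 2D detector, which is precisely why this family of methods is fast but loses fine vertical structure.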
Based on this framework, BirdNet+ [34] utilizes ad hoc regression branches to eliminate the need for a postprocessing stage. RT3D [35] also uses a 2D object detection method to achieve 3D detection: the point cloud is projected into BEV (with the maximum, average, and minimum height as channels) and then R-FCN [36] is used to detect objects. The method improves efficiency but has low accuracy due to the loss of height information. Inspired by YOLO [37,38], Complex-YOLO [39] projects the point cloud into BEV and then uses a single-stage strategy to estimate 3D bounding boxes, which significantly improves detection efficiency. Yang et al. [40] propose a single-stage detector that performs 2D convolution on the BEV. To avoid the expensive computation of 3D CNNs, PointPillars [29] transforms the point cloud into vertical columns (pillars) and uses 2D operations instead of 3D operations to detect objects, which improves computational efficiency.
Objects keep their real physical dimensions and are naturally separable in BEV, but the sparsity and variable point density of point clouds cause great difficulty in detecting distant or small objects. To handle this problem, an end-to-end multiview fusion method [41] is proposed to synergize the BEV and the perspective view, which can effectively use the complementary information from both. In this method, dynamic voxelization replaces hard voxelization (HV), which eliminates the need to pad voxels to a predefined size and reduces the extra space and compute overhead of HV.
Range view (RV) is also a popular view in autonomous driving [42,43], as shown in Figure 9. RangeRCNN [44] and RangeNet++ [45] use 2D CNNs to achieve accurate 3D object detection based on a range image representation, but they are subject to the problem of scale variation. RangeIoUDet [46] learns point-wise features from the range image and optimizes the point-wise features and 3D boxes by point-based IoU and box-based IoU supervision. Although these methods are efficient, there is an inevitable loss of spatial information. To improve detection performance, MVFuseNet [47] fuses RV and BEV for spatiotemporal feature learning from a temporal sequence of LiDAR data to jointly perform object detection and motion forecasting. 3D processing directly uses the raw point cloud as the network input to extract suitable point cloud features. For example, 3D FCN [48] and Vote3Deep [49] directly use 3D convolutional networks to detect 3D bounding boxes. However, point clouds are sparse and 3D CNNs are computationally expensive. Additionally, limited by the receptive field, traditional 3D convolutional networks cannot effectively learn local features at different scales. To learn more effective spatial geometric representations from point clouds, some dedicated network frameworks have been proposed, such as PointNet [50], PointNet++ [51], PointCNN [52], Dynamic Graph CNN [53], and Point-GNN [54]. The PointNets [50,51] can directly process LiDAR point clouds and extract point cloud features through a max-pooling symmetric function to solve the disorder problem of points. The network architecture of PointNet is illustrated in Figure 10. Thanks to these networks, the performance of 3D object detection is improved, but the computation of point-based methods is expensive, especially when large scenes are captured with a Velodyne LiDAR HDL-64E and there are more than 100K points in one scan.
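The core PointNet idea, a shared per-point transform followed by a symmetric max-pooling, can be shown in a few lines of numpy. This is a deliberately tiny sketch (one random linear layer instead of the paper's learned MLPs and T-Nets) whose only purpose is to demonstrate order invariance:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, weights, bias):
    """Apply the same small MLP to every point independently (N x d_in -> N x d_out)."""
    return np.maximum(points @ weights + bias, 0.0)  # ReLU

def pointnet_global_feature(points, weights, bias):
    """Per-point features followed by symmetric max-pooling, so the
    result is invariant to the ordering of the input points."""
    feats = shared_mlp(points, weights, bias)  # (N, d_out)
    return feats.max(axis=0)                   # (d_out,)

# Permutation-invariance check with random weights.
pts = rng.normal(size=(100, 3))
W, b = rng.normal(size=(3, 64)), np.zeros(64)
g1 = pointnet_global_feature(pts, W, b)
g2 = pointnet_global_feature(pts[rng.permutation(100)], W, b)  # shuffled input
```

Because max is symmetric, `g1` and `g2` are identical even though the points were fed in a different order, which is exactly how PointNet sidesteps the disorder problem without voxelizing.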
Therefore, some preprocessing operations need to be conducted, such as downsampling. Following these point cloud feature learning models, PointRCNN [55] constructs a PointNet++-based architecture to detect 3D objects, which is illustrated in Figure 11. Through a bottom-up 3D RPN, a subnetwork transforms the proposals into canonical coordinates to learn better local spatial features; by combining these with the global semantic features of each point, the accuracy of the detected bounding boxes is improved. Similarly, Yang et al. [56] add a proposal generation module based on spherical anchors, using PointNet++ as the backbone network to extract semantic context features for each point. In the second box-prediction stage, an IoU estimation branch is added for postprocessing, which further improves detection accuracy. Benefiting from multi-task learning, LiDARMTL [57] utilizes an encoder-decoder architecture to predict perception parameters for 3D object detection and road understanding, which can be leveraged for online localization. Although object localization accuracy is improved compared with previous methods, the computational burden is heavy due to the large scale of point clouds. To deal with this drawback, AFDet [58] adopts an anchor-free and Non-Maximum Suppression-free single-stage framework to detect objects, which is advantageous for embedded systems.
Segmentation-based methods divide the point cloud into segments with spatial relationships; they implement voxelization in 3D space and extract features in grouped voxels using 3D convolution [59]. The voxelization is shown in Figure 12. Building on the development of PointNet++, VoxelNet [60] was proposed and is widely used for 3D object detection [61]. The network first divides the 3D point cloud into a certain number of voxels. Then, after random sampling and normalization of the points, the local features of each non-empty voxel are extracted to obtain geometric space representations of the objects, and an RPN performs classification and regression of the 3D bounding boxes. Shi et al. [62] observe that the ground-truth 3D bounding box provides not only a segmentation mask but also the relative positions of its interior points, as shown in Figure 13. Therefore, they propose a detection method that conducts sparse convolution and Voxel Set Abstraction to learn features. The segmentation mask and position information generated in the first stage are used as features of the second stage, and the ROI proposals are then refined by sparse 3D convolution. PV-RCNN [63] deeply integrates a 3D voxel CNN and PointNet-based set abstraction to learn more discriminative point cloud features. The voxel CNN generates 3D proposals, and RoI-grid pooling is leveraged to abstract proposal-specific features from the keypoints to the RoI-grid points via keypoint set abstraction, which encodes rich context information. These methods improve detection performance, but the quality of the segmentation results affects the detection results.
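The voxel-grouping step these methods share can be sketched as follows. This is a simplified illustration (VoxelNet randomly samples points per voxel; the sketch simply truncates, and the voxel size is an arbitrary choice):

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_points=32):
    """Group a point cloud (N x 3) into voxels: compute each point's
    integer voxel index, then bucket points per non-empty voxel,
    capping each bucket at max_points."""
    idx = np.floor(points / np.asarray(voxel_size)).astype(int)
    voxels = {}
    for p, i in zip(points, map(tuple, idx)):
        bucket = voxels.setdefault(i, [])
        if len(bucket) < max_points:   # VoxelNet samples randomly;
            bucket.append(p)           # truncation keeps the sketch simple
    return {k: np.stack(v) for k, v in voxels.items()}

# Two nearby points share a voxel; the third lands in its own.
pts = np.array([[0.05, 0.05, 0.1], [0.15, 0.1, 0.2], [1.0, 1.0, 1.0]])
vox = voxelize(pts)
```

Each non-empty bucket would then be fed through a local feature encoder (a small PointNet in VoxelNet), yielding a sparse 4D tensor for the convolutional middle layers.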
With the help of point cloud data, the performance of 3D object detection is significantly improved. In general, the accuracy of 3D bounding boxes from image-based methods is much lower than that of point cloud-based methods. Currently, LiDAR point cloud-based 3D object detection has become a main trend, but point clouds cannot provide the texture information needed to efficiently discriminate object categories. Moreover, the density of points decreases as the distance between the object and the LiDAR increases, which degrades detector performance, while images can still capture faraway objects. Therefore, multi-sensor fusion-based methods are proposed to improve the overall performance.

Multi-Sensor Fusion-Based 3D Object Detection Methods
Considering the advantages and disadvantages of image-based and point cloud-based methods, some approaches try to fuse both modalities with different strategies. A common way to fuse LiDAR point clouds and images is to conduct a projection transformation of the point cloud and then integrate the multi-view projected planes with the image via different feature fusion schemes, as in MV3D [64] and AVOD [65]. There are three fusion schemes: early fusion, late fusion, and deep fusion, which are illustrated in Figure 14. MV3D aggregates features with a deep fusion scheme, in which feature maps interact hierarchically with each other. AVOD is the first approach to introduce early fusion: the features of each modality's proposals are merged, and an FC layer then outputs the category and 3D box coordinates for each proposal. These methods lose spatial information in the projection transformation process, and their detection performance on small targets is poor. In addition, ROI feature fusion only uses high-level features, and the sparsity of LiDAR point clouds limits fusion-based methods. To solve these problems, Liang et al. [66] propose a feature fusion method combining point-level and ROI-level features. Different from multi-view fusion, with the introduction of PointNets, PointNet-based fusion detection frameworks have been proposed. F-PointNet [67] uses the results of a mature 2D object detector to obtain the frustum spaces of objects. Then, it integrates PointNet++ to reduce the three-dimensional frustum spaces and regresses normalized coordinates. Finally, PointNet++ is used a second time to regress the relevant parameters of the objects' 3D bounding boxes. PointFusion [68] combines the advantages of AVOD and F-PointNet: it extracts features from RGB image blocks and the corresponding raw point cloud, respectively, then fuses these features to predict 3D bounding boxes with dense anchors.
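The frustum extraction that these pipelines start from can be sketched in a few lines: project each LiDAR point into the image with a pinhole model and keep those whose pixel falls inside the 2D detection box. The intrinsics and scene below are hypothetical, and a real system would first transform points from the LiDAR to the camera frame with the calibration matrices:

```python
import numpy as np

def frustum_points(points, box2d, fx, fy, cx, cy):
    """Keep the points that lie inside the viewing frustum of a 2D
    detection box (u_min, v_min, u_max, v_max). Points are assumed
    to be in the camera frame with z pointing forward."""
    u_min, v_min, u_max, v_max = box2d
    z = points[:, 2]
    with np.errstate(divide="ignore", invalid="ignore"):
        u = fx * points[:, 0] / z + cx
        v = fy * points[:, 1] / z + cy
    inside = (z > 0) & (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return points[inside]

# Toy scene: only the first point projects into the 2D box.
pts = np.array([[0.0, 0.0, 10.0], [5.0, 0.0, 10.0]])
kept = frustum_points(pts, box2d=(550, 550, 650, 650),
                      fx=700.0, fy=700.0, cx=600.0, cy=600.0)
```

The surviving points form a much smaller search volume, which is why a second-stage network (PointNet++ in F-PointNet) can afford to regress the 3D box directly on them.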
RoarNet [69] is similar to F-PointNet in applying a two-stage strategy for 3D object detection: 3D proposals are first generated based on the monocular image, and then RoarNet_3D directly processes the point cloud to estimate the parameters of the 3D bounding boxes. The method can deal with asynchronous LiDAR and camera data. However, the detection accuracy depends on the recall of the proposals, and objects missed in the first step cannot be recovered in the second.
Compared with projection transformation, some methods process the raw point cloud directly. Gong et al. [70] also use a frustum model to integrate visual information and 3D spatial information, combining visual and distance information in a probabilistic framework. The method addresses the sparsity and noise of LiDAR SLAM data, but it is not robust enough to dynamic objects. Similarly, SEG-VoxelNet [71] also uses mature 2D detection technology; the difference is that the framework uses a mature segmentation model to segment the image and integrates the resulting semantic features with point cloud features based on VoxelNet. The 3D detection results of these frameworks depend on mature 2D detection methods, and the feature fusion is insufficient. Therefore, Sindagi et al. [72] propose a multimodal information fusion method that combines early fusion of point features and late fusion of voxel features to fully integrate the LiDAR point cloud and image information.
To address the problem of information loss, 3D-CVF [73] combines camera and LiDAR features using a cross-view spatial feature fusion strategy. An auto-calibrated projection transforms the image features into a smooth spatial feature map with the highest correspondence to the LiDAR features in the BEV domain, and a gated feature fusion network is used to mix the features appropriately. Additionally, fusion methods based on the BEV or voxel format are not accurate enough. Thus, PI-RCNN [74] proposes a novel fusion method, the Point-based Attentive Cont-conv Fusion module, to fuse multi-sensor features directly on 3D points. Besides continuous convolution, Point-Pooling and Attentive Aggregation are used to fuse features expressively.
In the process of 3D object detection, the inconsistency between localization and classification confidence is a critical issue [75]. To solve this problem, EPNet [76] utilizes a consistency enforcing loss to increase the consistency of localization and classification. Moreover, the point features are enhanced with semantic image features in a point-wise manner without requiring image annotations.
Besides the fusion of camera and LiDAR, radar data are also used for 3D object detection [77,78]. CenterFusion [77] first associates radar detections with corresponding objects in 3D space; these radar detections are then mapped into the image plane to complement image features in a middle-fusion manner.

Datasets
A widely used dataset for 3D object detection in autonomous driving is KITTI [1], which provides RGB images, 3D Velodyne point clouds, and GPS coordinates. The data were collected by a car equipped with a 64-channel LiDAR, 4 cameras, and a combined GPS/IMU system. The dataset is composed of 20 scenes, including cities, residential areas, and roads, as shown in Figure 15. In particular, the 3D object detection benchmark of KITTI consists of 7481 training images and 7518 test images as well as the corresponding point clouds, comprising a total of 80,256 labeled objects. It also contains sensor calibration information and annotated 2D and 3D bounding boxes of objects of interest. The annotation of each object is classified as an "easy", "moderate", or "hard" case according to detection difficulty. The nuScenes dataset [79] is another dataset for autonomous driving, and its scale is larger than KITTI's. The dataset contains 700 scenes for training, 150 scenes for validation, and 150 scenes for testing. The data were collected by 6 cameras and a 32-beam LiDAR in Boston and Singapore. There are 23 classes with 3D annotations in a 360-degree field of view. For 3D object detection, some rare classes with few samples are removed and 10 classes are retained. There are 1000 driving scenes with dense traffic and highly challenging driving situations. Some images captured by the front camera are shown in Figure 16. Moreover, object annotations contain attributes such as visibility, activity, and pose. With the development of autonomous driving, datasets are evolving very rapidly, and many other datasets have also been established, such as the Waymo Open dataset [80], ApolloScape [81], H3D [82], and AIODrive [83].

Metrics
IoU: Intersection-over-Union (IoU) is a common evaluation index. IoU is the overlap ratio between predicted boxes and ground-truth boxes, defined as

IoU = \frac{area(B_{pred} \cap B_{gt})}{area(B_{pred} \cup B_{gt})},

where B_{pred} represents a predicted 3D box and B_{gt} is the corresponding ground-truth box. AP: Generally, the average precision (AP) is selected to evaluate the performance of an algorithm. The average precision is defined as

AP = \frac{1}{N} \sum_{i=1}^{N} p(r_i),

where p(r_i) represents the (interpolated) precision at recall level r_i, and N is set to 11 (the recall levels r_i \in \{0, 0.1, \ldots, 1\}).
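Both metrics can be computed in a few lines. The sketch below uses axis-aligned 2D boxes for IoU (the full 3D evaluation of rotated boxes additionally needs a polygon intersection in BEV and an overlap along the height axis) and the 11-point interpolated form of AP:

```python
import numpy as np

def axis_aligned_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    lo = np.maximum(box_a[:2], box_b[:2])
    hi = np.minimum(box_a[2:], box_b[2:])
    inter = np.prod(np.maximum(hi - lo, 0.0))
    union = (np.prod(box_a[2:] - box_a[:2]) +
             np.prod(box_b[2:] - box_b[:2]) - inter)
    return inter / union

def average_precision_11pt(precisions, recalls):
    """11-point interpolated AP: average, over recall levels
    r in {0, 0.1, ..., 1}, of the maximum precision achieved at
    recall >= r (0 if that recall level is never reached)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0

iou = axis_aligned_iou(np.array([0.0, 0.0, 2.0, 2.0]),
                       np.array([1.0, 1.0, 3.0, 3.0]))
ap = average_precision_11pt(np.array([1.0, 1.0, 0.5]),
                            np.array([0.2, 0.5, 1.0]))
```

A detection counts as a true positive only if its IoU with a ground-truth box exceeds a threshold (e.g., 0.7 for cars on KITTI), and AP is then computed over the resulting precision-recall curve per class and difficulty level.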
To compare the detection performance of methods on the nuScenes dataset, other metrics are also considered, including a True Positive (TP)'s average translation, scale, orientation, velocity, and attribute errors with respect to the ground truth, denoted ATE, ASE, AOE, AVE, and AAE, respectively. The final metric, the nuScenes detection score (NDS), is derived from a weighted sum of mAP and these errors, providing a comprehensive standard for comparing detection performance.
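The weighted combination can be sketched as follows; the formula is the one published for the nuScenes detection score, while the input numbers are purely hypothetical:

```python
def nuscenes_nds(map_score, tp_errors):
    """nuScenes detection score (NDS): half the weight goes to mAP and
    half is shared by the five TP error metrics (ATE, ASE, AOE, AVE,
    AAE), each converted to a score via 1 - min(1, err):
        NDS = (1/10) * (5 * mAP + sum_i (1 - min(1, err_i)))
    """
    assert len(tp_errors) == 5
    return 0.1 * (5.0 * map_score + sum(1.0 - min(1.0, e) for e in tp_errors))

# Hypothetical mAP and (ATE, ASE, AOE, AVE, AAE) values for illustration.
score = nuscenes_nds(0.5, [0.3, 0.2, 0.4, 0.6, 0.1])
```

Because each error is clamped to [0, 1] before conversion, a detector cannot compensate for poor localization by inflating a single well-behaved metric.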

Performance Comparison
We compare the detection results of the discussed methods on three difficulty levels of three categories (car, pedestrian, and cyclist). Table 2 shows the comparison results of state-of-the-art methods on the KITTI object detection test set, in which accuracy and runtime are presented. Table 3 shows the detection results of state-of-the-art methods on the nuScenes test set. Currently, safety is the main consideration, so final detection accuracy and inference efficiency are used to evaluate the existing methods. Through comparison and analysis, image-based methods demonstrate low performance on 3D detection metrics due to the absence of depth information. Point cloud-based methods achieve significant performance improvements on the task. Single-stage detection methods achieve fast inference, but their accuracy cannot satisfy the requirements of AVs; two-stage methods achieve high detection accuracy, but their efficiency needs to be improved. Additionally, among current 3D object detection methods for autonomous driving, most fusion-based detectors perform worse than point cloud-based ones due to the lack of mature and effective multi-sensor fusion strategies. Therefore, an effective and robust fusion strategy is urgent and meaningful.

Conclusions
We review the current mainstream 3D object detection techniques and analyze the advantages and disadvantages of using RGB images, LiDAR point clouds, and image-point cloud fusion for 3D object detection. The performance of these methods is compared on the public KITTI benchmark dataset. The direct use of LiDAR point clouds provides a simple and effective solution for 3D object detection. However, due to the sparsity of point clouds and their lack of color information, it is necessary to use multimodal information to overcome the problem of insufficient and incomplete single-modal information. Among the existing state-of-the-art methods, mature multimodal fusion detection frameworks are still lacking, so research on multimodal fusion-based detection methods is urgent and meaningful. By introducing visual information, feature representations can be enhanced to improve the capacity to recognize different objects. In addition, most current 3D object detection methods are based on a single frame, which is incomplete and insufficient under object occlusion. Therefore, it is meaningful to fuse temporal information: the uncertainty of the information can be reduced and the detection accuracy improved by fusing context information.

Data Availability Statement: Please refer to the public KITTI 3D object detection benchmark (accessed on 5 August 2020) at http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d and the public nuScenes detection leaderboard (accessed on 5 August 2020) at https://www.nuscenes.org/object-detection?externalData=all&mapData=all&modalities=Any.