FusionRCNN: LiDAR-Camera Fusion for Two-stage 3D Object Detection

Abstract—3D object detection with multiple sensors is essential for an accurate and reliable perception system in autonomous driving and robotics. Existing 3D detectors significantly improve accuracy by adopting a two-stage paradigm that relies merely on LiDAR point clouds for 3D proposal refinement. Though impressive, the sparsity of point clouds, especially for points far away, makes it difficult for the LiDAR-only refinement module to accurately recognize and locate objects. To address this problem, we propose a novel multi-modality two-stage approach named FusionRCNN, which effectively and efficiently fuses point clouds and camera images in the Regions of Interest (RoI). FusionRCNN adaptively integrates both sparse geometry information from LiDAR and dense texture information from camera in a unified attention mechanism. Specifically, it first utilizes RoIPooling to obtain an image set with a unified size and gets the point set by sampling raw points within proposals in the RoI extraction step; it then leverages intra-modality self-attention to enhance the domain-specific features, followed by a well-designed cross-attention to fuse the information from the two modalities. FusionRCNN is fundamentally plug-and-play and supports different one-stage methods with almost no architectural changes. Extensive experiments on the KITTI and Waymo benchmarks demonstrate that our method significantly boosts the performance of popular detectors. Remarkably, FusionRCNN improves the strong SECOND baseline by 6.14% mAP on Waymo, and outperforms competing two-stage approaches. Code will be released soon at https://github.com/xxlbigbrother/Fusion-RCNN.


I. INTRODUCTION
3D object detection is one of the fundamental tasks in autonomous driving and robotics; it aims to capture accurate 3D information with multiple sensors. Since LiDAR sensors enjoy the natural advantage of obtaining accurate depth and shape information, previous methods achieve competitive performance using only point clouds. Furthermore, some attempts significantly improve performance through a two-stage refinement module, which inspires researchers to explore more effective LiDAR-based two-stage detectors.
Two-stage methods can be divided into three main categories based on the representation of points, i.e., point-based, voxel-based and point-voxel-based. Point-based approaches [1]-[4] take sampled points as input and obtain point features for RoI refinement. Voxel-based methods [5], [6] rasterize point clouds into voxel grids and extract features from 3D CNNs for refinement. Point-voxel-based approaches [7], [8] combine the two types of feature learning schemes to improve detection performance. However, regardless of the representation, the sparsity and non-uniform distribution of point clouds make it difficult to distinguish and locate objects at far distances, leading to false or missed detections, as illustrated in Fig. 1. Things get considerably worse when proposals contain few (1-5) points, from which we can hardly obtain enough semantic information. Fortunately, the camera is complementary to LiDAR by providing dense texture information. How to design a two-stage LiDAR-Camera fusion paradigm that leverages their complementary strengths well is therefore of great importance.

1 Beijing Institute of Technology, Beijing, CN, {xxlbigbrother, dscdyc1010295799, dean.dinglihe, jwang123bit}@gmail.com, {ciom_xtf1, lijianan}@bit.edu.cn
In this work, we focus on fusing LiDAR point clouds and images in the refinement stage. Previous work [9] utilizes an image segmentation sub-network to extract image features and attaches them to the raw points. However, we find that such point-based fusion ignores the semantic density of image features and relies heavily on the image segmentation sub-network. In light of the above, this work presents a deep fusion method, dubbed FusionRCNN, which comprises three steps: i) extract RoI features from the points and images corresponding to the proposals of any one-stage detector; ii) fuse the features of these two modalities through well-designed intra-modality self-attention and inter-modality cross-attention, abandoning the heavy reliance on hard associations between points and images while keeping the semantic density of images; iii) feed the encoded fusion features into a transformer-based decoder to predict the refined 3D bounding boxes and confidence scores.
Our FusionRCNN is generic and can significantly boost detection performance. Extensive experiments on KITTI [10] and Waymo [11] demonstrate that FusionRCNN brings obvious performance gains over LiDAR-only methods, especially for difficult samples with sparse point clouds (Hard level on KITTI and 50m-Inf on Waymo). Remarkably, applying our two-stage refinement network to the SECOND [12] baseline improves detection performance by 11.88 mAP in the range ≥ 50m (46.93 → 58.81 mAP on Vehicle) on Waymo.
To sum up, this work makes the following contributions: • We propose a flexible and effective two-stage multi-modality 3D detector named FusionRCNN, which fuses images and point clouds in regions of interest and can boost existing one-stage detectors with minor changes. • We utilize a novel transformer-based mechanism to simultaneously achieve attentive fusion between the pixel set and point set, which is robust to calibration noise. • Our method achieves superior performance compared to two-stage approaches on KITTI and the Waymo Open Dataset, especially on difficult samples with sparse points.

II. RELATED WORKS

LiDAR-Based 3D Detection: Existing LiDAR-based 3D detection methods can be broadly grouped into three categories: voxel-based, point-based, and range-view. Voxel-based detectors voxelize the unstructured point clouds into a regular 2D/3D grid to which conventional CNNs can be easily applied. The pioneering work MV3D [13] projects the point clouds onto 2D bird's-eye-view grids and places many predefined 3D anchors to generate highly accurate 3D candidate boxes, motivating subsequent efficient bird's-eye-view representation methods. VoxelNet [14] applies a mini PointNet [15] for voxel feature extraction. SECOND [12] introduces 3D sparse convolution to accelerate 3D voxel processing. For point-based methods, PointNet and its variants [16] directly take the raw points as input and use symmetric operators to address the unorderedness of point clouds. PointRCNN [1] and STD [2] segment foreground points with PointNet and generate proposals. 3DSSD [17] proposes a new sampling strategy for efficient computation. Range-view detectors [18], [19] represent LiDAR point clouds as dense range images, whose pixels contain extra accurate depth information. Compared to the other representations, voxel-based detectors balance efficiency and performance, so we choose a voxel-based detector as the RPN in this paper.

LiDAR-Camera 3D Detection: Recently, LiDAR-Camera 3D detection has received increasing attention since the two types of sensors are complementary. LiDARs provide sparse point clouds containing accurate depth information, while cameras provide high-resolution images containing rich color and texture. MV3D [13] creates 3D object proposals from LiDAR BEV features and projects the proposals onto multi-view images to extract RoI features. F-PointNet [20] lifts image proposals into 3D frustums and achieves high performance. Point-level fusion methods decorate raw foreground LiDAR points and apply a common LiDAR-based detector on the decorated point clouds. Among these methods, PointPainting [21], PointAugmenting [22], MVP [23], FusionPainting [24] and AutoAlign, which have gained great success, perform input-level decoration, while DeepFusion [25] and Deep Continuous Fusion perform feature-level decoration. Recent works TransFusion [26] and FUTR3D [27] initialize object queries in 3D space and fuse image features on the proposals. To our knowledge, few works focus on two-stage fusion networks; in this paper we propose a novel framework which can be applied as a plug-and-play RCNN [28], [29] module to existing detectors and boost their performance significantly.

III. METHOD
Given M predicted proposals containing 3D bounding boxes B = {b_i}, i = 1, ..., M, where b_i = (x, y, z, l, h, w, θ) (box center position, size, and heading angle), and confidence scores S = {s_i}, i = 1, ..., M, from any one-stage detector, we aim to improve the detection results based on the point clouds P and camera images I as

(B_r, S_r) = R(B, S, P, I),

where B_r and S_r are the corrected bounding boxes and confidence scores, and R represents the proposed network. Fig. 2 shows the overall architecture of the proposed FusionRCNN. We adopt the RoI Feature Extractor (Sec. III-A) to extract RoI features from the points and images corresponding to B, then fuse the features of the two modalities through the Fusion Encoder (Sec. III-B). The encoded fusion features are further fed into the Decoder (Sec. III-C) to predict the refined 3D bounding boxes and confidence scores.
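The formulation above can be summarized as a function signature; this is purely illustrative (the names `Proposal` and `refine` are ours, and the body is an identity placeholder standing in for the network R):

```python
from typing import NamedTuple, Sequence


class Proposal(NamedTuple):
    # b_i = (x, y, z, l, h, w, theta) and its first-stage confidence s_i
    box: tuple
    score: float


def refine(proposals: Sequence[Proposal], points, images):
    """R(B, S, P, I) -> (B_r, S_r): the second stage corrects each
    proposal's box parameters and confidence using point clouds P and
    images I. Identity placeholder standing in for FusionRCNN itself."""
    return [p.box for p in proposals], [p.score for p in proposals]
```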

A. RoI Feature Extractor
Given the 3D bounding boxes B, point clouds P and camera images I, in order to capture sufficient structure and context information, we fix the center of each bounding box b_i while expanding its length, width and height by a ratio k, and feed the scaled RoI to the feature extractor. We adopt a two-branch architecture, where the point/image RoI features are extracted from the point clouds P and the images I individually.
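As a minimal sketch (not the authors' code), the RoI scaling step can be written as follows; the box layout (x, y, z, l, h, w, θ) follows Sec. III, and the default k = 2 matches the value reported in the implementation details:

```python
def expand_roi(box, k=2.0):
    """Expand a 3D proposal by ratio k while keeping its center and
    heading fixed, so the scaled RoI captures extra structural context.
    box: (x, y, z, l, h, w, theta)."""
    x, y, z, l, h, w, theta = box
    return (x, y, z, l * k, h * k, w * k, theta)
```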
For the point branch, points within the corresponding box b_i after expansion are sampled or padded to a unified number N. Inspired by the point embedding methods used in [4], [3], we enhance the point features by concatenating the distances to the eight corners and to the center of b_i as

f = L([Δp^1, ..., Δp^8, p − p_b, p_e]),

where Δp^j is the distance from point p to the j-th corner of box b_i, p_b is the center coordinate of the bounding box, p_e is extra LiDAR point information such as reflectivity, and L(·) is a linear projection layer that maps point features into an embedding with C channels. Formally, the point RoI features are F_i^P ∈ R^{N×C}.

For the image branch, the original multi-view images are converted into feature maps via ResNet [30] and FPN [31]. We project the expanded 3D bounding boxes onto the 2D feature map and crop the 2D features to obtain the image embedding corresponding to the RoI. Specifically, the eight 3D corners are projected onto the 2D feature map with the camera intrinsics and extrinsics, from which we compute the minimum circumscribed rectangle and perform RoI pooling to get the image feature F_i^I with a unified size S × S corresponding to b_i. Another linear layer finally projects F_i^I into the same dimension C as the point features. Formally, the image RoI features are F_i^I ∈ R^{S²×C}.
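The point-branch embedding can be sketched as below. For brevity this sketch assumes an axis-aligned box (the paper rotates the corners by the heading θ), assumes the box contains at least one point, and leaves the linear projection L(·) out of scope:

```python
import numpy as np


def point_roi_features(points, box, n_sample=256, rng=None):
    """Sample/pad the points in an (expanded) box to N rows and, per
    point, concatenate its offsets to the 8 box corners, its offset to
    the box center, and extra LiDAR attributes (e.g. reflectivity).
    A linear layer L(.) would then map each row to C channels (omitted).
    points: (P, 3+e) array with at least one row;
    box: (x, y, z, l, h, w), axis-aligned here for brevity."""
    rng = rng or np.random.default_rng(0)
    center, dims = np.asarray(box[:3]), np.asarray(box[3:6])
    # sample without replacement if enough points, otherwise pad by repetition
    replace = len(points) < n_sample
    idx = rng.choice(len(points), n_sample, replace=replace)
    p = np.asarray(points)[idx]
    xyz, extra = p[:, :3], p[:, 3:]
    # eight corners: center +/- half of (l, w, h) along x, y, z
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    corners = center + 0.5 * signs * dims[[0, 2, 1]]  # (l, h, w) -> x, y, z halves
    corner_off = (xyz[:, None, :] - corners[None]).reshape(n_sample, -1)  # (N, 24)
    center_off = xyz - center                                             # (N, 3)
    return np.concatenate([corner_off, center_off, extra], axis=1)        # (N, 27+e)
```

The output row width is 24 + 3 + e, where e is the number of extra per-point attributes.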

B. Fusion Encoder
Based on the above RoI Feature Extractor, we obtain the per-point features and the per-pixel image features (the pixel footprint varies, since we fix an S × S pooling size while the projected proposal sizes differ) inside the RoI. Instead of fusing features by painting image features onto points like previous methods [21], [22], which exploit the direct correspondence between points and image pixels but neglect the fact that a local region of pixels can contribute to one point and vice versa, we leverage self-attention and cross-attention to achieve set-to-set fusion. Specifically, to align the point and image features with each other and better model their inner relationships, we first feed each into its own multi-head self-attention layer.
For the embedded point features F^P, we have

F^P_self = LN(F^P + Attention(W^Q_P F^P, W^K_P F^P, W^V_P F^P)),

where W^Q_P, W^K_P, W^V_P are linear projections and LN(·) represents a LayerNorm layer. Attention(·) denotes multi-head attention, in which the result of the h-th head is obtained as

Attention_h(Q, K, V) = softmax(Q K^T / √d) V,

where d is the feature dimension. Correspondingly, the image features are fed into another multi-head self-attention layer to enhance the context information as

F^I_self = LN(F^I + Attention(W^Q_I F^I, W^K_I F^I, W^V_I F^I)).

Then, we fuse the information of the two domains at the feature level through cross-attention as

F^{PI}_cross = LN(F^P_self + Attention(W^Q F^P_self, W^K F^I_self, W^V F^I_self)).

Note that the cross-attention is not strictly necessary: the point and image branches can work independently, which increases the flexibility of our model and allows us to train the network in a decoupled fashion.

Finally, F^{PI}_cross is fed into an FFN with two linear layers.
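A single-head numpy sketch of one encoding layer, following the attention flow above; as a simplification (our assumption, not the full layer) the projection matrices, LayerNorm and FFN are omitted so the set-to-set structure stays visible:

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    # scaled dot-product attention, one head: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def fusion_encoder_layer(f_p, f_i):
    """One simplified encoding layer: intra-modality self-attention on
    the point set (N, C) and image set (S*S, C), then cross-attention
    where the self-attended points query the self-attended image tokens.
    Residual connections kept; W^Q/W^K/W^V, LayerNorm and FFN omitted."""
    f_p_self = f_p + attention(f_p, f_p, f_p)   # point self-attention
    f_i_self = f_i + attention(f_i, f_i, f_i)   # image self-attention
    # set-to-set fusion: every point can attend to every RoI pixel
    return f_p_self + attention(f_p_self, f_i_self, f_i_self)


# shapes follow the paper: N sampled points, S x S pooled image tokens
f_p = np.random.default_rng(0).random((256, 32))
f_i = np.random.default_rng(1).random((49, 32))
out = fusion_encoder_layer(f_p, f_i)
```

In the actual model several such layers are stacked, with the fused point features and the image features passed on to the next layer.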
In each encoding layer, we adopt a novel fusion strategy to promote the complementarity of the two modalities. The rich semantic information of the image is integrated into the point features. Correspondingly, the object structure information extracted from the point branch can also guide the aggregation of image features to reduce the impact of occlusion and similar situations. In our fusion encoder, we stack several encoding layers to ensure thorough feature fusion.

C. Decoder
The encoded fusion features are fed into the decoding layers to obtain the features of the final box. We initialize a learnable query embedding E with d channels as the query, while the encoded features serve as keys and values:

E_dec = Attention(W^Q E, W^K F^{PI}, W^V F^{PI}),

where F^{PI} is the output fusion feature from the fusion encoding layers. The decoder module is likewise composed of several decoding layers.
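A minimal sketch of one decoding step under the same simplifications as before (single head, no projection matrices, prediction heads for box residuals and confidence omitted):

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def decode(query, f_pi):
    """One simplified decoding layer: a learnable query embedding E
    (here a (1, d) array) attends to the fused encoder output F^PI,
    which supplies the keys and values. The resulting (1, d) feature
    would feed the box-refinement and confidence heads."""
    d = query.shape[-1]
    return softmax(query @ f_pi.T / np.sqrt(d)) @ f_pi
```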

D. Training Losses
We train our model with an end-to-end strategy. The overall loss is the sum of the RPN loss and the second-stage network loss. The RPN loss adopts the loss of the original network (SECOND [12]), and the newly introduced second-stage loss includes a confidence loss L_conf and a regression loss L_reg:

L = L_conf + L_reg.

We employ the binary cross-entropy loss as L_conf to guide the prediction of positive and negative samples:

L_conf = −[y log(s) + (1 − y) log(1 − s)],

where s is the predicted confidence score. The division of positive and negative samples is based on IoU:

y = 1 if IoU > t, otherwise y = 0,

where t is an IoU threshold. For positive samples, the regression loss is composed of the smooth-L1 loss over all parameters of the bounding box:

L_reg = smooth-L1(p − p̂),

where p and p̂ represent the parameters of the predictions and the aligned ground-truth boxes, respectively.
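A runnable sketch of the second-stage loss described above; the IoU threshold value t = 0.55 is an assumption for illustration (the paper leaves t symbolic), and regression is computed on positive proposals only:

```python
import numpy as np


def bce(s, y, eps=1e-7):
    # binary cross-entropy for the confidence branch
    s = np.clip(s, eps, 1 - eps)
    return float(-(y * np.log(s) + (1 - y) * np.log(1 - s)).mean())


def smooth_l1(x, beta=1.0):
    # smooth-L1: quadratic near zero, linear beyond beta
    a = np.abs(x)
    return float(np.where(a < beta, 0.5 * a * a / beta, a - 0.5 * beta).mean())


def second_stage_loss(scores, ious, residuals, t=0.55):
    """L = L_conf + L_reg with positives/negatives split by IoU > t.
    scores: (M,) predicted confidences; ious: (M,) IoU with matched GT;
    residuals: (M, 7) differences between predicted and GT box params."""
    labels = (ious > t).astype(float)
    l_conf = bce(scores, labels)
    pos = labels > 0
    l_reg = smooth_l1(residuals[pos]) if pos.any() else 0.0
    return l_conf + l_reg
```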

IV. EXPERIMENTS
We evaluate FusionRCNN on both KITTI [10], [36] and Waymo Open Dataset [11], and conduct extensive ablation studies to validate our design choices.

A. Implementation Details
Model setup. We implement our network with the open-sourced OpenPCDet [37]. We employ SECOND [12] as the RPN and follow the settings in OpenPCDet. For the RoI head, we adopt ResNet50 pretrained on ImageNet [38] as the image backbone and keep its weights frozen during training to save time; the highest-resolution output of the FPN is selected as the feature map. For each RoI, the expansion ratio k is 2, we sample 256 points, and the corresponding projected image region is converted to 7×7 resolution by RoIPooling. In addition, the number of encoding layers is set to 3 and the number of decoding layers to 1 to balance performance and efficiency.
Training details. The network is trained end-to-end on 8 Tesla V100 GPUs. On the Waymo Open Dataset, we apply the Adam optimizer with a cycle decay strategy and a learning rate of 0.0008. Following CT3D [4], we train the model for 80 epochs. On KITTI, we apply the same training strategy and train for 100 epochs with a learning rate of 0.003. Moreover, we apply several kinds of data augmentation, i.e., flipping, rotation and scaling, supporting both images and point clouds.

B. Results on Waymo
Data and metrics. The Waymo Open Dataset is a large-scale outdoor public dataset for autonomous driving research, which contains RGB images from five high-resolution cameras and 3D point clouds from five LiDAR sensors. The whole dataset consists of 798 scenes (20s fragments) for training, 202 scenes for validation and 150 for testing. Metrics are reported based on the distance from 3D objects to the sensor, i.e., 0-30m, 30-50m and >50m. These metrics are further divided into two difficulty levels: LEVEL 1 for 3D boxes with more than 5 LiDAR points and LEVEL 2 for boxes with at least 1 LiDAR point. Remarkably, the cameras in Waymo only cover around 250 degrees horizontally rather than the full 360 degrees; our framework adapts to this situation. All models are trained on 20% of the Waymo training data.
Main results. We first evaluate the performance of FusionRCNN on the large public Waymo Open Dataset. Tab. I reports the results of vehicle detection with 3D and BEV AP on the validation sequences. Note that with the strong SECOND [12] baseline, FusionRCNN outperforms all previous methods on both LEVEL 1 and LEVEL 2, leading PV-RCNN [7] by 8.61% mAP and Voxel-RCNN [6] by 3.32% mAP on LEVEL 1.
FusionRCNN achieves 78.91% on the commonly used LEVEL 1 3D mAP evaluation metric, surpassing the previous state-of-the-art method CT3D [4] by a significant margin (2.61% mAP). We ascribe this performance gain to our novel two-stage deep fusion design, which effectively integrates geometry information from LiDAR and dense texture information from camera, helping refine bounding box parameters and confidence scores accurately. Additionally, we show multi-class detection results for Vehicle, Pedestrian, and Cyclist in Tab. II. After adopting FusionRCNN, the baseline models SECOND and CenterPoint [39] are significantly improved on small objects, i.e., +10.55% mAP on Cyclist for SECOND and +6.43% on Pedestrian for CenterPoint. Tab. III shows that our method surpasses other single-frame methods under the stricter evaluation standard (IoU threshold of 0.8), which suggests that our method localizes objects excellently when rich structure and texture information is available.
Visualization. Experiments on Waymo show that our method has excellent performance in long-range detection. As CT3D uses the same one-stage detector as its RPN, we show a qualitative comparison between FusionRCNN and CT3D, which merely uses point clouds in the refinement stage, in Fig. 3.

C. Results on KITTI
Data and metrics. The KITTI dataset has been widely used in 3D detection tasks since its release. It contains multiple types of sensors such as stereo cameras and a 64-beam Velodyne LiDAR. There are 7,481 training samples, commonly divided into 3,712 samples for training and 3,769 samples for validation, plus 7,518 samples for testing. We conduct experiments on the commonly used Car category, whose detection IoU threshold is 0.7. We report results for three difficulty levels (Easy, Moderate and Hard) defined according to object size, occlusion state and truncation level.
Main results. To further verify our framework, we conduct experiments on the KITTI validation set and compare with previous state-of-the-art methods. Tab. IV shows our method improves the one-stage method SECOND at all three difficulty levels by a significant margin (+1.29% for Easy, +7.02% for Moderate and +2.1% for Hard) and is highly competitive with all LiDAR-based and LiDAR-Camera methods. Our FusionRCNN achieves better performance than the two-stage fusion competitor PI-RCNN [9], bringing a 7.11% improvement in Moderate mAP. Furthermore, we compare FusionRCNN with the released methods PV-RCNN [7] and CT3D [4], since they share the same RPN. FusionRCNN performs better than PV-RCNN at all difficulty levels, and compared with the state-of-the-art method CT3D, our method has better performance overall, leading CT3D by 0.36% on the Easy level and 0.33% on the Hard level with a comparable result on Moderate. Remarkably, FusionRCNN achieves an AP of 79.32% (Hard) and outperforms state-of-the-art 3D detectors. Compared with point-based two-stage methods, our novel two-stage fusion framework captures structural and contextual information more effectively.

D. Ablation Studies
Effect of LiDAR-Camera fusion. We investigate the effect of introducing texture information from camera images. We switch FusionRCNN to a LiDAR-only method named FusionRCNN-L by disabling the image branch in the RoI Feature Extractor and the cross-attention module in the Fusion Encoder, then run inference with the same settings. As shown in Tab. V, FusionRCNN-L achieves 90.25% mAP in Vehicle BEV detection and surpasses most of the methods in Tab. I. By adopting LiDAR-Camera fusion, FusionRCNN gains a further boost, especially for long-range detection (50m-Inf).
Different RPN backbones. We plug FusionRCNN into popular single-stage detectors, i.e., SECOND, PointPillar and CenterPoint, to verify its generality. Tab. VI shows our method improves all three baseline models with significant boosts: +6.14%, +2.7% and +5.55% 3D mAP on LEVEL 1. These benefits stem from our novel LiDAR-Camera fusion mechanism, which leverages structure and semantic information from LiDAR and camera images.
RoI Feature Extractor. Our RoI feature extractor contains a point branch and an image branch. Previous works [3], [4], [7] have shown that raw points carry more accurate structure information that benefits the extraction of local bounding box context, so we mainly conduct an ablation study on the image branch. Some parameters may affect the quality of image feature extraction and in turn detection performance. We test different output sizes S of the RoI image features in Tab. VII and find that these settings have little impact on the image extraction branch. One possible explanation is that LiDAR and image features fuse dynamically in our fusion encoding layers, and the image features contribute to category classification through high-level contextual information.

V. CONCLUSION
In this work, we propose a novel two-stage multi-modality 3D detector named FusionRCNN, which successfully integrates LiDAR point cloud and camera image information in the regions of interest. FusionRCNN leverages a well-designed attention mechanism to achieve set-to-set fusion, and thus becomes more robust to LiDAR-Camera calibration noise. We show that FusionRCNN outperforms state-of-the-art two-stage 3D detectors on both the Waymo Open Dataset and the KITTI dataset, is plug-and-play, and has enormous potential to boost all existing one-stage 3D detectors.

Fig. 1. Comparison of our method with previous LiDAR-based two-stage methods. When objects comprise sparse point clouds, LiDAR-based methods fail to correctly determine the category and give less confident scores, while our method effectively combines point cloud structure with dense image information to solve such problems.

Fig. 2. Overall architecture of FusionRCNN. Given 3D proposals, LiDAR and image features are extracted separately through the RoI feature extractor. Then, the features are fed into K fusion encoding layers, each comprising self-attention and cross-attention modules. Finally, point features fused with image information are fed into a decoder to predict the refined 3D bounding boxes and confidence scores.

Fig. 3. Qualitative comparison between a LiDAR-based two-stage detector (CT3D) and our FusionRCNN on the Waymo Open Dataset. Green boxes and blue boxes are ground truth and predictions, respectively. Three proposal vehicles in red circles are zoomed in and visualized on 2D images and 3D point clouds. Our FusionRCNN works better than CT3D, which uses only LiDAR input, in long-range detection.
TABLE V. Ablation at different distances on the Waymo validation set (Vehicle BEV AP).

Method        | Overall | 0-30m | 30-50m | 50m-Inf | Latency (ms)
FusionRCNN-L  | 90.25   | 96.58 | 89.24  | 80.61   | 125
FusionRCNN    | 91.94   | 97.12 | 91.22  | 85.22   | 185

TABLE VI. Ablations on different one-stage detectors on the Waymo validation set.



TABLE VII. Ablation on output size of RoI image features.