Cascaded Cross-Modality Fusion Network for 3D Object Detection

We focus on LIDAR-RGB fusion-based 3D object detection in this paper. This task remains challenging in two respects: (1) the difference in data formats and sensor positions leads to misalignment between the semantic features of images and the geometric features of point clouds; (2) optimizing the traditional IoU loss is not equivalent to minimizing the bounding box regression error, resulting in biased back-propagation for non-overlapping cases. In this work, we propose a cascaded cross-modality fusion network (CCFNet), which includes a cascaded multi-scale fusion (CMF) module and a novel center 3D IoU loss, to resolve these two issues. Our CMF module reinforces the discriminative representation of objects by reasoning about the relation between the LIDAR geometric features and the RGB semantic features of an object across the two modalities. Specifically, CMF is inserted in a cascaded way between the RGB and LIDAR streams; it selects salient points and transmits multi-scale point cloud features to each stage of the RGB stream. Moreover, our center 3D IoU loss incorporates the distance between anchor centers to avoid the oversimplified optimization of non-overlapping bounding boxes. Extensive experiments on the KITTI benchmark demonstrate that our proposed approach outperforms the compared methods.


Introduction
Among the various tasks of scene understanding, object detection is crucial for autonomous driving [1], robotics, and augmented reality. Deep learning-based 2D object detection, which aims to predict the position and category of targets in given images, has made unprecedented achievements in recent years [2]. RGB images provide fine-grained contextual information but lack accurate depth information, which leaves 2D object detection predictions subject to spatial ambiguity [3]. Recently, extensive research has focused on 3D object detection to estimate the accurate 3D location of the target, benefiting from available point cloud sources.
LIDAR provides spatial and geometric descriptions of the 3D environment in which targets exist, but the point cloud lacks the texture and color information of RGB images. Therefore, LIDAR-RGB fusion-based 3D object detection exploits the two sensors to compensate for each other's weaknesses and capture more discriminative features of objects. However, two distinct modalities with different data formats and properties pose challenges for this task. RGB images have an ordered grid structure that has been studied extensively, while the point cloud has an unordered and sparse structure. Moreover, the problem of how to correlate the semantic features of images with the geometric features of point clouds is central to the fusion process. It is commonly understood that the semantic and contextual information of images is extracted in high-level features, while shape and texture information resides in low-level features [4]. These changing encoding characteristics give each stage of LIDAR-RGB fusion a specific demand and cooperation manner. For instance, in the low-level image features, apparent texture and accurate shape information can be matched more easily to the geometric appearance of objects; in the deeper layers, the semantic and contextual image features need implicit category-wise geometric information. To solve these problems, existing works mainly depend on cross-modality feature alignment to fuse the RGB and LIDAR features.
According to the way multi-modality sensor data are fused, we classify previous works into three categories: (1) early fusion-based methods, (2) late fusion-based methods, and (3) deep fusion-based methods. In detail, early fusion-based methods usually utilize a separate perception algorithm to process the multi-modality raw sensor data. However, they require precise alignment of the data: if the raw sensor data are not well aligned in the early stage, feature dislocation leads to heavy performance degradation. Relying on the coordinate relation between the two sensors, PointPainting [5] and PI-RCNN [6] project image semantic segmentation results into the point cloud space via a projection matrix. Although this early fusion process enables the network to handle the aligned two-modality information as a whole without modality-specific adjustment, early-stage fusion also conveys the noise of one modality to the other. This noise is unavoidably aligned and combined with the discriminative features of objects, significantly damaging their prominence.
Late fusion-based methods fuse the processed features only at the decision level, where the spatial and modal difference between the point cloud and the image is greatly reduced. MV3D [7], AVOD [3], and CLOCs [8] extract point cloud and image features through independent modules and fuse them at the decision-making layer. However, fusion at the decision-making layer has little effect on raw data fusion, and the confidence scores of the proposals generated by the two modules are unrelated. Among the deep fusion-based methods, 3D-CVF [9] and MMF [10] adopt separate feature extractors for LIDAR and image and fuse the two modalities hierarchically and semantically, finally realizing the semantic fusion of multi-scale information. However, these methods struggle to resolve the differences between data formats and sensor positions. Moreover, 3D-CVF lacks continuous feature fusion during feature extraction, which results in insufficient fusion, while MMF only utilizes the sparse depth map projected from the point cloud, so the point cloud data have a weaker influence on anchor generation.
To address these challenges, we observe that it is hard to align the two-modality features in a single step. As mentioned above, features with different characteristics always need corresponding features from the other modality, but this demand is unknown to hand-crafted fusion designs. Moreover, the encoding processes of RGB and LIDAR have changing demands for specific features, e.g., the contextual features of images lie in the deeper layers while low-level features lie in early layers, and which layers are suitable for feature alignment changes during optimization. Therefore, it is more reasonable to build a dynamic multi-modal fusion method. In this paper, we propose a cascaded cross-modality fusion network (CCFNet) for LIDAR-RGB fusion-based 3D object detection to address the above challenges. CCFNet establishes a dynamic alignment manner by letting each stage choose specific salient features from previous stages. It mainly consists of a cascaded multi-scale fusion (CMF) module and a novel center 3D IoU loss.
To build a dynamically aligned network, we insert the cascaded multi-scale fusion (CMF) module between each stage of the LIDAR and RGB streams. CMF collects point cloud features from adjacent stages and aligns them with image features. By applying CMF in a cascaded way, the alignment in each stage can adaptively select specific point cloud features from previous stages to meet its demand. Besides, as pointed out in [11], the traditional IoU loss has a plateau that makes it infeasible to optimize in the case of non-overlapping bounding boxes, a problem that is much more severe in 3D. These non-overlapping anchors are still useful for giving aligned RoIs a rough location sensitivity, i.e., guiding the RPN to generate anchors close to the true bounding boxes. In this work, we advocate a novel center 3D IoU loss to exploit this benefit of non-overlapping bounding boxes. By introducing the distance between the anchor center and the ground truth center, our center 3D IoU loss gains the ability to suppress unreasonable anchors.
Our contributions can be summarized as follows: (1) We propose a novel cascaded approach to fuse and align LIDAR-RGB information. Our approach introduces multiple residual operations that back-propagate the gradient of alignment guidance to earlier parts of the encoder to select informative point cloud features.
(2) In order to make use of non-overlapping bounding boxes, we propose a novel center 3D IoU loss to allow the model to be sensitive to the location of generated anchors.
(3) Our approach achieves better performance on the KITTI benchmark and performs favorably against the compared methods.

RGB-Based 3D Object Detection
Recently, the performance of RGB-based 3D object detection has improved significantly thanks to mature deep learning methods such as Faster R-CNN [12], SSD [13], and YOLO [14]. Song et al. [15] propose a 3D Region Proposal Network based on Faster R-CNN, which takes the 3D volume scene of an RGB-D image as input and outputs the bounding boxes of 3D objects. Gupta et al. [16] change the network input to 2.5D (extracting a suitable representation from RGB-D), which speeds up the object detection algorithm. Tekin et al. [17], inspired by [12,14,18], propose a new CNN that predicts the locations of the projected points of the target's 3D bounding box in the 2D image domain, and then uses a camera pose estimation algorithm to predict the object's 6D pose. Moreover, [13] proposes synthesizing data and decomposing the model's pose space to increase the detection rate, but if the target is occluded, the accuracy of the 3D bounding box is greatly reduced. Furthermore, ref. [19] proposes the DenseFusion network architecture, which fuses the depth of each pixel with the image information to infer the local fine appearance and geometric spatial information of the target, handling heavy occlusion. Finally, DenseFusion integrates an iterative fine-tuning module into the neural network to improve real-time processing speed.

LIDAR-Based 3D Object Detection
LIDAR-based 3D object detection algorithms can be divided into: (1) pseudo-image-based methods, (2) PointNet-based methods, and (3) voxel-based methods. First, the pseudo-image-based methods leverage the expertise of 2D image understanding by projecting point cloud data to specific views, such as a bird's-eye view or front view; these methods include VeloFCN [20], MV3D-LIDAR [7], PIXOR [21], PointRCNN [22], etc. However, the point cloud is sparse, which easily leads to loss of information during projection [23]. Besides, Charles et al. [24] propose PointNet, which processes point cloud data directly. It uses the max-pooling symmetric function to extract point cloud features, solving the problem of the unordered nature of the point cloud. In this method, a small neural network is trained to ensure invariance of the laser point cloud under rotation or translation. However, PointNet simply connects all points and considers only global features and single-point features, without local information, which leads to poor results on multi-class problems with multiple instances. To solve these problems, Charles et al. [25] then propose an improved network, PointNet++, which obtains deep semantic features of the target by cascading sampling, grouping, and feature extraction layers. In PointNet++, two strategies, multi-scale grouping and multi-resolution grouping, are used to ensure more accurate target feature extraction. PV-RCNN [26] proposes a novel 3D object detection framework that deeply integrates a 3D voxel convolutional neural network and PointNet-based set abstraction. SA-SSD [27] is a point-based method that improves accuracy by deeply mining the geometric information of three-dimensional objects.
Moreover, 3DSSD [28] designs a new set abstraction module and discards the feature propagation module to reduce inference time and training memory. Furthermore, [29] proposes VoxelNet, which divides the 3D point cloud into a certain number of voxels. VoxelNet first performs random sampling and normalization of the point cloud, extracts features for each non-empty voxel to obtain a geometric space representation of the target, and finally uses an RPN for classification and position regression. Voxel-FPN [30], SECOND [31], PointPillars [32] and Part-A² [33] are all voxel-based methods of the same type.
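The order-invariance that PointNet's symmetric max pooling provides can be illustrated with a short sketch (a toy example with a random linear map standing in for the shared per-point MLP, not the authors' implementation):

```python
import numpy as np

def pointnet_global_feature(points, weight):
    """Toy PointNet: a shared per-point linear map followed by
    channel-wise max pooling, which is invariant to point order."""
    per_point = points @ weight          # (N, C): shared "MLP" on each point
    return per_point.max(axis=0)         # (C,): symmetric aggregation

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))           # 64 points with xyz coordinates
w = rng.normal(size=(3, 16))             # shared weights

feat = pointnet_global_feature(pts, w)
feat_shuffled = pointnet_global_feature(rng.permutation(pts), w)
assert np.allclose(feat, feat_shuffled)  # shuffling the points changes nothing
```

Because the aggregation is a symmetric function, any permutation of the input rows yields the same global feature, which is exactly why the projection of the point cloud to an ordered grid is unnecessary in point-based methods.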

LIDAR-RGB Fusion-Based 3D Object Detection
LIDAR-RGB fusion-based 3D object detection algorithms mainly include MV3D [7], AVOD [3], 3D-CVF [9], MMF [10], etc., which are more robust in practical applications. MV3D [7] is the pioneer in fusing LIDAR point cloud data and RGB image information. As a perceptual fusion framework, MV3D uses the front view (FV) and bird's-eye view (BEV) of the point cloud to represent the 3D point cloud information and merges them with RGB images to predict oriented 3D bounding boxes. To address the slow recognition speed of MV3D [7], [3] proposes the AVOD fusion algorithm. AVOD first simplifies the input of MV3D, then improves the 3D RPN (region proposal network) architecture, and finally adds geometric constraints on the 3D bounding box. While improving recognition accuracy, AVOD also significantly improves the recognition rate. Moreover, 3D-CVF generates dense RGB voxel features and uses an adaptive gated fusion network to align the RGB image with the LIDAR point cloud; cross-modal fusion is finally achieved through accurate multi-modal data positioning.

Our Approach
In this paper, we propose a cascaded cross-modality fusion network (CCFNet) for LIDAR-RGB fusion-based 3D object detection. As shown in Figure 1, the features of LIDAR and RGB images are extracted by two separate streams. We use ResNet50 and four set abstraction modules of PointNet++ as the feature extractors of RGB images and LIDAR, respectively. Between each stage of the two streams, we insert our cascaded multi-scale fusion (CMF) module to connect and fuse the image features and LIDAR features that share the same downsampling ratio. Finally, the outputs of the two streams are concatenated and sent to the detection head. We also describe the components of the total training loss, including our novel center 3D IoU loss and the spatial setting of anchors and targets.

Cascaded Multi-Scale Fusion Module
We tackle the optimization of corresponding LIDAR-RGB feature fusion of each stage by building a cascaded structure where the image features of each stage could access the point cloud information from previous stages. In this way, the image features could dynamically select the suitable and multi-scale point cloud features from different stages and accordingly optimize the efficiency of LIDAR-RGB feature fusion.
Let i ∈ {1, 2, 3, 4} denote the stage of the LIDAR and RGB streams, and let A_i ∈ R^{H×W×C_a} and B_i ∈ R^{N×C_b} be the RGB and LIDAR feature maps at the i-th stage, where H, W, C_a denote the height, width and channel dimension of the image feature, and N, C_b are the number of points and channel dimension of the LIDAR feature, respectively. Our CMF has two main procedures, i.e., multi-scale fusion and LIDAR-RGB projecting. Since the CMF modules in different stages differ only slightly, we first introduce the general mechanism of the CMF module and then describe the actual implementations of the CMF modules in different stages. As shown in Figure 2, the CMF module has two inputs, i.e., the LIDAR feature B_{i−1} from the previous stage, which has N_{i−1} points, and B_i from the current stage. We first select the points having salient features in B_{i−1} and then fuse them with B_i. In detail, we first use the max() function over channels to highlight the category characteristic of each point feature and then select the N_i points with the largest values from the point set of B_{i−1}.
According to the indices of the selected points, the selected point features B^s_{i−1} ∈ R^{N_i×C_{b_{i−1}}} can easily be gathered. After one Conv1d, B^s_{i−1} is activated by the global vector obtained from B_i by global average pooling, which finally generates B^gap_{i−1}. Then, B^cat_i ∈ R^{2N_i×C_{b_i}} is obtained by concatenating B_i and B^gap_{i−1} along the N dimension. We name this process multi-scale fusion, which can be represented as follows:

B^s_{i−1} = φ(B_{i−1}), B^gap_{i−1} = Conv1d(B^s_{i−1}) ⊙ Tile(GAP(B_i)), B^cat_i = Concat(B_i, B^gap_{i−1}),

where φ() denotes the step of choosing the N_i points with salient features from B_{i−1}, GAP() is global average pooling, ⊙ is element-wise multiplication, and Tile() tiles the global vector along the N dimension to generate a tensor with the same resolution as B_i. Besides, we project the point cloud features B^cat_i to image space through the LIDAR-RGB projecting procedure. Specifically, we utilize the principle of spatial perspective to project LIDAR points onto the image. Each position (x_a, y_a, z_a) of a point a belonging to the point set of B^cat_i should be scaled by the image size, since the coordinates along the X and Y axes have been normalized to [−1, 1]. The corresponding position (x_m, y_m) of point a in the image, where the superscript m denotes the image domain, is calculated by:

x_m = (x_a + 1)/2 · W, y_m = (y_a + 1)/2 · H.

Then, B_i is transmitted to the next stage as one of the inputs of the next CMF module. For the CMF module in the first stage, we directly project B_1 to the image space of A_1. For the CMF module in the second stage, we follow the full process described above. However, in the third and fourth stages, the standard CMF process would lead to biased saturation of some points, since the features inherited from previous stages contain numerous repeated points. These would account for most of the point candidates with salient features if the selecting process were not regulated, and this phenomenon becomes more severe as the stage goes deeper. Therefore, we design a regulation algorithm to avoid this oversaturation problem, as shown in Algorithm 1.
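The multi-scale fusion step can be sketched in NumPy as follows. This is an illustrative simplification under our own assumptions: a plain matrix multiply stands in for the learned Conv1d, and mean pooling implements the global average pooling; the actual module is a trained network.

```python
import numpy as np

def cmf_fuse(B_prev, B_cur, conv_w):
    """Simplified CMF multi-scale fusion (illustrative sketch).

    B_prev: (N_prev, C) LIDAR point features from the previous stage.
    B_cur:  (N_cur, C)  LIDAR point features from the current stage.
    conv_w: (C, C)      weights standing in for the Conv1d.
    Returns the concatenated features with shape (2 * N_cur, C).
    """
    n_cur = B_cur.shape[0]
    # phi(): rank points of the previous stage by their channel-wise
    # max response and keep the n_cur most salient ones.
    saliency = B_prev.max(axis=1)
    idx = np.argsort(-saliency)[:n_cur]
    B_sel = B_prev[idx] @ conv_w                   # "Conv1d" on selected points
    # Activate by the global average-pooled vector of the current stage,
    # tiled along the point dimension (GAP + Tile in the text).
    gap = B_cur.mean(axis=0, keepdims=True)        # (1, C) global vector
    B_gap = B_sel * np.tile(gap, (n_cur, 1))       # element-wise activation
    return np.concatenate([B_cur, B_gap], axis=0)  # Concat along N

rng = np.random.default_rng(0)
B1 = rng.normal(size=(128, 32))   # previous stage holds more points
B2 = rng.normal(size=(64, 32))    # current stage
out = cmf_fuse(B1, B2, rng.normal(size=(32, 32)))
assert out.shape == (128, 32)     # 2 * N_cur points, same channel width
```

The current-stage features pass through unchanged in the first half of the output, while the second half carries the selected and reweighted multi-scale features from the previous stage.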
Moreover, the whole procedure of CMF in the third and fourth stages is shown in Figure 2.
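Under the convention stated above, where point coordinates along X and Y are normalized to [−1, 1], the LIDAR-RGB projecting step reduces to a rescaling into pixel coordinates. A minimal sketch (the exact calibration-based projection in the paper may differ in detail):

```python
import numpy as np

def project_points_to_image(xy_norm, img_w, img_h):
    """Map (x, y) coordinates normalized to [-1, 1] onto pixel positions
    of a W x H image. Assumes the normalization convention in the text."""
    xm = (xy_norm[:, 0] + 1.0) * 0.5 * img_w
    ym = (xy_norm[:, 1] + 1.0) * 0.5 * img_h
    return np.stack([xm, ym], axis=1)

corners = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])
px = project_points_to_image(corners, 1280, 384)
# (-1,-1) maps to the top-left corner, (1,1) to the bottom-right,
# and (0,0) to the image center.
```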

Center 3D IoU Loss
The 3D object detector generates abundant anchors to predict the true bounding boxes, far more than the number of ground truth boxes. However, due to the definition of IoU and the corresponding traditional IoU loss, only anchors that overlap with the ground truth contribute to the optimization, while anchors with no overlap are simply penalized and contribute nothing. As shown in Figure 3a, when the anchor and bounding box do not overlap, the IoU score is 0 and no training gradient back-propagates. Positive anchors account for only a small part of the generated anchors, resulting in limited benefit relative to the heavy computational cost. Yet these non-overlapping anchors are still meaningful: even without overlap, anchors near the ground truth bounding boxes are more useful than those far away. Non-overlapping anchors can provide a constant signal for the RPN to build position sensitivity, enabling it to generate anchors close to the true bounding boxes. Moreover, this distance information can correct the inconsistency between the loss and the quality of the obtained bounding boxes. As shown in Figure 3b, two cases can have the same IoU score while the left case is clearly better than the right one, i.e., two axes are aligned in the left case but only one in the right. Inspired by this observation, we introduce our center 3D IoU loss to address the aforementioned limitations. We define a 3D anchor as (x, y, z, h, w, d, θ), where (x, y, z) is the 3D center coordinate of the anchor or bounding box, h, w, d are its height, width and depth, and θ is the rotation angle around the Z axis.
To calculate the IoU score of a predicted anchor and a ground truth bounding box, denoted by the superscripts p and gt, we first obtain the whole 3D volumes of the predicted anchor V_p and the ground truth bounding box V_gt, their overlapped volume V_overlap, and the volume of the smallest enclosing convex region V_scr. The 3D IoU can then be formulated as:

IoU_3D = V_overlap / (V_p + V_gt − V_overlap).

To correct the inconsistency between the loss and the quality of the obtained bounding boxes illustrated in Figure 3, we add the 3D distance between the centers of the anchor and the ground truth bounding box:

D = ρ²(c_p, c_gt) / c²,

where ρ(c_p, c_gt) is the Euclidean distance between the two centers and c is the diagonal length of the smallest enclosing region. Our center 3D IoU loss consists of these two parts and is formulated as follows:

L_center3DIoU = 1 − IoU_3D + D.

This center 3D IoU loss optimizes two metrics: the overlapped volume and the distance from the center of the anchor to that of the ground truth bounding box. It helps the anchor generator become sensitive to the position of the ground truth bounding box, since even without overlap, an anchor near the ground truth still reduces the loss. The NMS process then retains more accurate bounding boxes. The quantitative results and analysis in Section 4.4 indicate the effectiveness of our center 3D IoU loss in improving 3D detection performance.
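For the axis-aligned case (dropping the rotation θ, which the full loss also handles), the loss can be sketched as below. This is our own simplified reading, with the center-distance term normalized by the squared diagonal of the smallest enclosing region; the box layout (x, y, z, then the three sizes) is an assumption for illustration.

```python
import numpy as np

def center_3d_iou_loss(box_p, box_gt):
    """Sketch of a DIoU-style 3D loss for axis-aligned boxes given as
    (x, y, z, sx, sy, sz): loss = 1 - IoU + d_center^2 / c^2, where c is
    the diagonal of the smallest region enclosing both boxes."""
    def corners(b):
        center, size = np.asarray(b[:3]), np.asarray(b[3:6])
        return center - size / 2, center + size / 2
    lo_p, hi_p = corners(box_p)
    lo_g, hi_g = corners(box_gt)
    # Overlapped volume (clipped to zero for disjoint boxes).
    inter = np.prod(np.clip(np.minimum(hi_p, hi_g) - np.maximum(lo_p, lo_g), 0, None))
    union = np.prod(hi_p - lo_p) + np.prod(hi_g - lo_g) - inter
    iou = inter / union
    # Squared center distance, normalized by the enclosing-region diagonal.
    center_dist2 = np.sum((np.asarray(box_p[:3]) - np.asarray(box_gt[:3])) ** 2)
    encl_diag2 = np.sum((np.maximum(hi_p, hi_g) - np.minimum(lo_p, lo_g)) ** 2)
    return 1.0 - iou + center_dist2 / encl_diag2
```

Note that two disjoint boxes still yield a finite, distance-dependent loss, so the gradient keeps pushing non-overlapping anchors toward the ground truth instead of vanishing at the IoU plateau.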

3D Region Proposal Network
We use a multibox SSD-like RPN [31] as the detection head of CCFNet. The input of the 3D RPN is the feature map generated by combining the LIDAR and image features. Specifically, the RPN consists of three stages, each composed of several convolutional layers followed by a downsampling convolutional layer. The outputs of the three stages are upsampled to a fixed size and concatenated into one feature map. Finally, the concatenated feature map is sent to three 1 × 1 convolutions for classification, box regression, and orientation estimation.

KITTI:
The KITTI dataset [34] records scenes in both image and point cloud format. It contains 7481 training pairs and 7518 test pairs across three categories: car, cyclist, and pedestrian. According to target size and distance, samples are divided into easy, moderate, and hard difficulty levels. Since the ground truth of the test set is not available, we follow the protocol of [7,29] and split the training samples into a new train set and a new validation set. Our ablation study is conducted on the validation set, and the final comparison with other work is conducted on the test set via server submission. KITTI uses the PR (precision-recall) curve and AP value to judge the accuracy of a detection model: by setting different thresholds, different recall and precision values are obtained to draw the PR curve. The 3D IoU evaluation thresholds depend on the target category; the official default thresholds for car, cyclist, and pedestrian are 0.7, 0.5, and 0.5, respectively.
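The AP computation from the PR curve can be sketched with the classic 11-point interpolation used by KITTI's original protocol (the current KITTI evaluation samples 40 recall positions instead, but the idea is the same):

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP: average the best precision achievable
    at or beyond each of 11 evenly spaced recall levels."""
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recalls >= t
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

# A perfect detector keeps precision 1 at every recall level, so AP = 1.
r = np.linspace(0.0, 1.0, 50)
p = np.ones_like(r)
assert abs(average_precision_11pt(r, p) - 1.0) < 1e-9
```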

Implementation Detail
Considering that the points of distant objects in the point cloud are too sparse to produce effective features, we select the point cloud in the range [−40, 40] × [0, 70] × [−1, 3] along the X, Y, Z axes. We sample 16,384 points from the raw point cloud as the input of the point cloud feature extraction network and use set abstraction layers to downsample the input LIDAR points to sizes of 4096, 1024, 256, and 64. The subsequent feature propagation layers were originally used for point cloud semantic segmentation and 3D proposal generation, so we keep them in our network to maintain semantic information. We set the RGB image input resolution to 1280 × 384 × 3 and set up the up-sampling and concatenation modules to fuse the image features of each stage. The Adam optimizer is adopted to optimize the network, with an initial learning rate of 0.002, weight decay of 0.001, 50 training epochs, and a batch size of 12. Our approach is implemented on a single RTX 2080Ti with 11 GB memory.
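The range cropping and fixed-size sampling above can be sketched as follows (an illustrative sketch; the duplication policy for scenes with too few in-range points is our own assumption):

```python
import numpy as np

def crop_and_sample(points, n_sample=16384, seed=0):
    """Keep points inside [-40, 40] x [0, 70] x [-1, 3] along X, Y, Z,
    then randomly sample a fixed number of them. If fewer points survive
    the crop than requested, sample with replacement (assumed policy)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = (x >= -40) & (x <= 40) & (y >= 0) & (y <= 70) & (z >= -1) & (z <= 3)
    kept = points[mask]
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(kept), size=n_sample, replace=len(kept) < n_sample)
    return kept[idx]
```

The crop discards distant, sparse returns before feature extraction, and the fixed sample size gives the set abstraction layers a constant-shaped input.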

Anchors and Targets
In this paper, we use the default anchor settings for the three categories as in SECOND [31]. We assign a one-hot vector for anchor classification, a 7-dimensional vector for box regression, and a one-hot vector for direction classification. For box regression, the following localization residuals are used for training:

Δx = (x^gt − x^an)/d^an, Δy = (y^gt − y^an)/d^an, Δz = (z^gt − z^an)/h^an,
Δw = log(w^gt/w^an), Δl = log(l^gt/l^an), Δh = log(h^gt/h^an), Δθ = θ^gt − θ^an,

where x, y and z are the 3D center coordinates; w, l, and h are the width, length and height of the 3D bounding box, respectively; θ is the yaw rotation around the up-axis (z axis); the superscripts gt and an denote the ground truth and anchor, respectively; and d^an = sqrt((l^an)² + (w^an)²) is the diagonal of the bottom of the anchor box. Our total loss is composed of the above losses and our center 3D IoU loss.
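The residual encoding above can be written out directly (a sketch of the SECOND-style targets; vectors are ordered (x, y, z, w, l, h, θ)):

```python
import numpy as np

def encode_box_residuals(gt, anchor):
    """SECOND-style localization residuals for a (x, y, z, w, l, h, theta)
    ground truth box and anchor; returns the 7 regression targets."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)        # diagonal of the anchor's bottom
    return np.array([
        (xg - xa) / da,                    # center offsets, normalized
        (yg - ya) / da,
        (zg - za) / ha,
        np.log(wg / wa),                   # log-scale size ratios
        np.log(lg / la),
        np.log(hg / ha),
        tg - ta,                           # yaw difference
    ])

# A ground truth identical to the anchor yields all-zero targets.
res = encode_box_residuals([0, 0, 0, 1, 1, 1, 0], [0, 0, 0, 1, 1, 1, 0])
assert np.allclose(res, 0.0)
```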

Ablation Study
In this section, we validate the effectiveness of the proposed CMF module and center 3D IoU loss. First, we insert our CMF module in different stages under 8 settings, as shown in Table 1. Comparing groups (1) to (5), it is obvious that cross-modality fusion performs better at deeper stages. Especially for hard-level detection, the 3.11% improvement from no fusion to fusion in the fourth stage is a large margin. We believe that reasoning between the semantic information of images and the high-level geometric information of LIDAR, which can be aligned on easy-level instances, further enhances the discriminative representation of instances at far distances. Moreover, comparing groups (6)-(8), we notice that adding more connections between the stages of the image and LIDAR streams provides further improvement, indicating that the cascaded and dense shortcuts between stages convey adjustable attention for choosing the most suitable corresponding features in the fusion process. We also conduct extensive experiments on the hyper-parameter settings of Algorithm 1, as shown in Table 2. The ablation of the center 3D IoU loss is reported in Table 3, where it is obvious that our center 3D IoU loss brings a significant improvement.

Evaluation on the KITTI Validation Set
We show the results evaluated on the KITTI validation set in Tables 4 and 5 for the convenience of comparison with future work. Our network performs better than the methods in both tables at all difficulty levels. We show some detection results in Figure 4. To facilitate visualization, we use the 8 corner points of the 3D bounding box to visualize the 3D prediction; the red 3D bounding boxes denote the ground truth and the green ones are our predictions. Overall, this evaluation shows that our proposed network provides high-precision results.

Evaluation on the KITTI Test Set
In Tables 6 and 7, we compare our approach with previously proposed methods on the BEV and 3D detection tasks. The input data types are divided into two categories: LIDAR and LIDAR + Image. We observe that the proposed CCFNet achieves better performance than the compared LIDAR-RGB fusion-based and LIDAR-based methods. In terms of 3D moderate-level mAP, CCFNet outperforms VoxelNet by 13.16%, PointRCNN by 2.86%, SECOND by 4.96%, MV3D by 16.27%, AVOD by 12.84%, AVOD-FPN by 6.74%, F-PointNet by 8.23%, MMF by 1.19%, and CLOCs by 0.17%. Moreover, Table 7 shows the performance of our approach on 3D BEV object localization. In terms of BEV moderate-level mAP, CCFNet outperforms VoxelNet by 9.36%, PointRCNN by 13.2%, SECOND by 14.96%, MV3D by 11.72%, AVOD by 3.18%, AVOD-FPN by 4.83%, F-PointNet by 4.62%, MMF by 0.41%, and CLOCs_SecCas by 0.39%. The results on the KITTI leaderboard are available at: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d.

Conclusions
In this paper, we propose a cascaded cross-modality fusion network (CCFNet) to improve the LIDAR-RGB feature fusion process. We treat the challenge of multi-sensor fusion as a dynamic matching problem, which meets the need of adaptively choosing suitable point cloud features from previous stages. Technically, on the one hand, our CMF module selects the points with salient features, projects them to image space, and transmits them to the next stage. We insert the CMF module into each stage of the LIDAR and RGB streams to build dynamic attention for selecting the corresponding point cloud features from different stages. On the other hand, a novel center 3D IoU loss is proposed to make use of the non-overlapping bounding boxes, which account for most of the generated anchor candidates. It enables our network to generate anchors close to the ground truth bounding boxes instead of unreasonable ones. Extensive experiments have demonstrated the effectiveness of the CMF module and the center 3D IoU loss, and we achieve better performance than the compared methods on the KITTI benchmark. Our approach still has limitations: the proposed CMF modules unavoidably increase the computational cost, which is vital for real-world applications. Inspired by PVCNN [36], we will improve the point cloud feature extraction module in future work.