A Two-Phase Cross-Modality Fusion Network for Robust 3D Object Detection

A two-phase cross-modality fusion detector is proposed in this study for robust and high-precision 3D object detection with RGB images and LiDAR point clouds. First, a two-stream fusion network is built into the framework of Faster RCNN to perform accurate and robust 2D detection. The visible stream takes the RGB images as inputs, while the intensity stream is fed with the intensity maps which are generated by projecting the reflection intensity of point clouds to the front view. A multi-layer feature-level fusion scheme is designed to merge multi-modal features across multiple layers in order to enhance the expressiveness and robustness of the produced features upon which region proposals are generated. Second, a decision-level fusion is implemented by projecting 2D proposals to the space of the point cloud to generate 3D frustums, on the basis of which the second-phase 3D detector is built to accomplish instance segmentation and 3D-box regression on the filtered point cloud. The results on the KITTI benchmark show that features extracted from RGB images and intensity maps complement each other, and our proposed detector achieves state-of-the-art performance on 3D object detection with a substantially lower running time as compared to available competitors.


Introduction
As a crucial task in various engineering applications including autonomous driving, safety management, et cetera, high-precision object detection has drawn a great deal of attention in recent years. Two sources of inputs are commonly used in object detection: RGB images and LiDAR point clouds.
A large number of deep learning-based models such as the series of Faster RCNN, SSD, YOLO [1][2][3], and a lot more custom versions of them have been developed for 2D object detection with RGB images. Despite tremendous progresses made in the past few years, vision-based 2D detectors still have major limitations, especially when they are developed for applications such as autonomous driving, where failures of the detector can have disastrous consequences [4]. The vulnerability to environmental interference as well as the lack of depth information are major drawbacks and inherent deficiencies of vision-based 2D detectors, which can hardly be remedied without employing different modalities of data [5][6][7].
In terms of 3D object detection with point clouds, a breakthrough was made with the introduction of Pointnet [8], which enables direct 3D object detection on unordered raw point clouds without prior knowledge.

The main contributions of this work are summarized as follows:

1. We propose a cascading 3D detector that exploits multi-modal information at both the feature-fusion and decision-making levels.

2. At the decision-level, we design a two-phase detector in which the second-phase 3D detection is assisted by the first-phase 2D detection: 2D detection results are transformed into 3D frustums that filter the point cloud, improving both the accuracy and the real-time performance of 3D detection.

3. At the feature-level, we design a two-stream fusion network to merge cross-modality features extracted from RGB images and intensity maps, so as to produce more expressive and robust features for high-precision 2D detection. The validity of the proposed feature fusion scheme is examined and strongly supported by the experimental results and by visualizing features at multiple network stages.

2D Object Detection with Images
Object detection has been a fundamental task in the field of computer vision for decades. Motivated by the success of Alexnet [17] on image classification, a great number of 2D detectors based on deep CNNs have been developed in the past few years. Depending on whether region proposals are involved, most detectors can be classified into two categories: (1) two-stage detectors, which are proposal-based and anchor-based, including Faster RCNN, R-FCN, et cetera [1,18], and use a region proposal network (RPN) to generate proposals for detection; (2) one-stage detectors, which are proposal-free, including YOLO, SSD, et cetera [2,3], and regard object detection as a regression problem solved with end-to-end convolutional networks.
To improve the performance of vision-based 2D detectors, extensive efforts have been made from different perspectives. As the most representative work in exploring the effectiveness of multi-layer fusion, the feature pyramid network (FPN) [19] produces more robust and expressive features that comprise high-level semantics and low-level spatial details by introducing a top-down structure with lateral connections from the bottom-up feature extraction backbone. R-FCN [18] introduces a position-sensitive score map to resolve the conflict between translation invariance in classification and translation variance in object detection, for more accurate and faster detection. Cascade RCNN [20] implements multi-stage detection with progressively higher Intersection over Union (IoU) thresholds to balance detection performance and proposal quality.

3D Object Detection with Point Clouds
Depending on how the point cloud is represented, studies on 3D object detection fall into three categories: (1) bird's-eye view (BEV) and front-view (FV) based methods [21][22][23][24][25]. MV3D [23] is the first study to project the point cloud onto the ground plane to create the BEV, while Li et al. [25] convert the point clouds into front-view 2D maps. On the basis of the projected 2D maps, a 2D-CNN is applied to generate region proposals, as done in vision-based 2D detectors. (2) Voxel-based methods [26][27][28][29][30], which transform the point clouds into voxels by dividing them into three-dimensional voxel grids with spatial dependencies; three-dimensional CNNs are then applied to generate region proposals. (3) Raw point cloud-based methods, whose milestone works are Pointnet [8] and its improved version Pointnet++ [31]. These methods operate directly on the point clouds to extract local features layer by layer at different scales and obtain deep features through a multilayer network. A number of variants of Pointnet [8] have since been proposed to further improve the performance of point cloud-based 3D detection from different perspectives. A-CNN [32] proposes an annular convolution to capture the local neighborhood geometry of each point, adapting to the variability and scalability of point clouds and improving the performance of the point-based model. GAPnet [33] learns local geometric representations of each point and its neighborhood by implementing graph attention mechanisms to enhance network robustness. RS-Pointnet [34] learns features and shapes from geometric topology constraints among points to obtain contextual shape-aware information for object detection and segmentation.
It is worth mentioning that a number of knowledge-based methods [35,36] have also been proposed to operate directly on raw point clouds. Traditionally, since 3D point cloud learning is conducted based on scenario-specific knowledge bases and ontologies built manually, the performance and generalization ability of such approaches are not satisfactory when they are applied to object detection in highly dynamic and diverse scenes. However, with self-learning mechanisms introduced, recent approaches [37,38] have shown promising improvements since ontologies are updated continuously and automatically.
As pointed out in previous studies [15,39], voxel-based and point-based representations each have their benefits and drawbacks. Voxel-based methods generate high-quality proposals but suffer from information loss: a high-resolution representation is required to preserve fine details for high-quality feature extraction, yet the computational overhead and memory usage increase cubically with the voxel resolution. Point-based methods are flexible in receptive fields, but structuring the sparse point cloud is inefficient. PV-RCNN [15] proposes the PVConv, which disentangles the voxel-based branch and the point-based branch: the former utilizes a voxel CNN to aggregate neighboring points in low-resolution voxel grids, while the latter extracts features from each point at high resolution. PV-RCNN [15] first uses a 3D voxel CNN with multi-scale layers to generate proposals, while its voxel set abstraction module extracts the features of key points selected by farthest point sampling (FPS) from the points in voxels. The approaches proposed in these studies attempt to improve the efficiency and precision of 3D object detection by integrating the advantages of voxel-based and point-based methods, which inspires the design of our second-phase 3D detector.

Object Detection Based on Multi-Modal Fusion
To compensate for the limitations of a single sensor or a single modality in complex and dynamic conditions, multi-sensor fusion has become a research focus in the field of object detection. There are two main research foci: fusion of cross-spectral images for 2D detection, and fusion of visible images and point clouds for 3D detection.
For robust all-day 2D detection, numerous studies have proposed fusing cross-spectral images, mainly from the visible and infrared spectrums [40][41][42][43]. Although the modality of the inputs differs from that of our work, mechanisms such as attention-based adaptive weighting, the framework of multiple streams of feature extraction, et cetera, are all applicable approaches in designing multi-modal fusion networks.
For 3D detection, although some studies such as Stereo-RCNN [44] focused on exploring fusion structures to process stereo images, most research focuses on fusing visible images and point clouds [9,12,14,22,23,45].
Similar to studies in the field of point cloud-based 3D detection, in order to establish a correspondence between multi-modal inputs so that a fusion operation is applicable, voxelization and projection of the LiDAR point cloud are often considered an indispensable step in the design of many previously proposed fusion methods. MV3D [23] implements the fusion by projecting 3D object proposals generated from the BEV feature maps onto the feature maps extracted from the front view and RGB images, respectively. The region-wise features obtained are then fused via ROI pooling for final detection. AVOD [22] improves MV3D [23] by eliminating the LiDAR FV branch and introducing a decision-level fusion on top of the existing feature-level fusion. Wang et al. [46] fuse the BEV LiDAR data and the front-view image at the feature-level with non-homogeneous pooling, followed by a one-stage RPN.
Attempts have also been made to implement the multi-modal fusion at the intermediate feature level by projecting image semantics into the space of the point cloud to enhance the point features. Liang et al. [21] propose projecting the multi-scale intermediate image features into the BEV with continuous convolutions to generate dense BEV feature maps upon which 3D detection is performed. 3D-CVF [5] designs a cross-view feature fusion strategy which transforms the 2D image features to a smooth spatial feature map corresponding to the LiDAR features in the BEV view.
EPNet [6] introduces a novel grid generator to establish point-wise correspondence between LiDAR and image data without having to generate BEV data.
To implement an adaptively weighted fusion of multi-modal information at the intermediate feature levels, so that the detector is robust to environmental changes, weighting mechanisms that determine the contribution of features from each modality have been studied. Kim et al. [47] employ a gated information fusion network to adjust the weights of the features from each modality so that degraded features are suppressed. 3D-CVF [5] develops a gated feature fusion network based on spatial-wise attention to generate the fused feature. EPNet [6] adaptively estimates the importance of image features utilizing the LiDAR features, and weights the semantic image features before they are employed to enhance the corresponding LiDAR point features at multiple network layers.
Multi-modal fusion schemes that are built upon PointNet [8,31] which operate directly on point cloud have also been proposed. PointFusion [14] fuses the image's features, point-wise and global features of the point cloud, and utilizes two different fusion networks to predict object classes and regress bounding boxes. F-Pointnet [12] implements the multi-modal fusion in a cascaded manner in which different modalities of inputs are used in different phases.

Overview
We propose a novel two-phase 3D object detector in which cross-modality fusion of RGB images and point clouds is implemented at both the feature-level and the decision-level. As illustrated in Figure 1, our two-phase 3D detector comprises phase-1, a two-stream fusion RCNN which merges RGB images and intensity maps at the feature-level and produces 2D proposals used to generate 3D frustums in the space of the point cloud, and phase-2, a PVConvNet-based 3D detector which performs 3D instance segmentation and box regression on the point clouds within the 3D frustums.
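For illustration, the projection of 2D proposals into 3D frustums at the decision level can be sketched as follows. This is a minimal numpy sketch assuming the points are already expressed in the camera coordinate frame; the function name, the (xmin, ymin, xmax, ymax) box format and the 3 × 4 projection matrix `P` are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def filter_points_in_frustum(points, box2d, P):
    """Keep LiDAR points whose camera-plane projection falls inside a 2D box.

    points : (N, 3) array of XYZ in the camera coordinate frame
    box2d  : (xmin, ymin, xmax, ymax) proposal from the 2D detector
    P      : (3, 4) camera projection matrix
    Returns the subset of `points` lying inside the 3D frustum.
    """
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4)
    proj = homo @ P.T                                          # (N, 3)
    in_front = proj[:, 2] > 0           # discard points behind the camera
    u = proj[:, 0] / proj[:, 2]         # perspective divide
    v = proj[:, 1] / proj[:, 2]
    xmin, ymin, xmax, ymax = box2d
    mask = in_front & (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return points[mask]
```

Only the points surviving this filter are passed to the second-phase 3D detector, which is what reduces its computational load.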

Two-Stream Fusion RCNN
Instead of concatenating the RGB image and the intensity map to obtain an RGB-I representation of the scene, we argue that a feature-level fusion would contribute to merging more expressive and useful information and therefore producing more informative and robust representations of the perceived scene.
As shown in Figure 2, we built two streams of feature extraction with ResNet101 [48] to extract features from the RGB images and the intensity maps, respectively. Features extracted at the same stage are concatenated to produce fused features. Moreover, the fusion process is implemented based on the FPN structure [19] so that multiple stages of fused features are combined in order to preserve both low-level details and high-level semantics.
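Conceptually, the per-stage fusion amounts to a channel-wise concatenation of same-resolution feature maps from the two streams. A minimal numpy sketch follows; the function name and the list-of-stages interface are hypothetical simplifications:

```python
import numpy as np

def fuse_streams(rgb_feats, intensity_feats):
    """Concatenate RGB-stream and intensity-stream features stage by stage.

    Both arguments are lists of (C, H, W) feature maps, one per backbone
    stage; maps at the same stage share spatial size, so fusion is a
    simple channel-wise concatenation.
    """
    assert len(rgb_feats) == len(intensity_feats)
    fused = []
    for fv, fi in zip(rgb_feats, intensity_feats):
        assert fv.shape[1:] == fi.shape[1:]             # same H, W per stage
        fused.append(np.concatenate([fv, fi], axis=0))  # stack along channels
    return fused
```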
Specifically, the extracted multi-scale features are input into the modified RPN to generate proposals. The modified RPN network consists of a 3 × 3 convolution layer followed by ReLU activation and two sibling fully-connected layers to classify objects and regress anchor boxes. Proposals are generated by sliding anchors on multiple scales of features and are then concatenated as the outputs. Together with proposals, the fused multi-scale features are fed into the PyramidRoI pooling layer to fuse the semantic information at different levels and scales. The fused multi-scale semantics are then fed into the top model of the first cascaded head to predict the class of objects and regress their bounding boxes.
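The anchor-sliding step of the RPN described above can be illustrated with the following numpy sketch, which generates anchor boxes at every cell of each pyramid level and concatenates them; the scales, ratios and strides shown are placeholders, not the values used in this work:

```python
import numpy as np

def generate_anchors(feat_shapes, strides, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Slide anchor boxes over every cell of each feature map and concatenate.

    feat_shapes : list of (H, W) per pyramid level
    strides     : list of pixel strides per level
    Returns an (M, 4) array of (xmin, ymin, xmax, ymax) anchors.
    """
    all_anchors = []
    for (h, w), stride in zip(feat_shapes, strides):
        # Centre of each feature-map cell in image coordinates.
        cx, cy = np.meshgrid((np.arange(w) + 0.5) * stride,
                             (np.arange(h) + 0.5) * stride)
        centres = np.stack([cx.ravel(), cy.ravel()], axis=1)
        for s in scales:
            for r in ratios:
                aw, ah = s * np.sqrt(r), s / np.sqrt(r)  # width/height per ratio
                boxes = np.hstack([centres - [aw / 2, ah / 2],
                                   centres + [aw / 2, ah / 2]])
                all_anchors.append(boxes)
    return np.concatenate(all_anchors, axis=0)
```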
Inspired by Cascade RCNN [20], we design a cascade detector head to further improve the detection performance, especially on small targets. Each detector head comprises two convolutional layers followed by ReLU activation and two fully-connected layers for classification and regression. In the sequence of detector heads, the predicted boxes of the previous stage are filtered with non-maximum suppression (NMS) and then serve as proposals for the following stage. The IoU threshold is increased with the depth of the detector head so that the network tends to focus more on small-scale targets.
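A simplified sketch of this cascading scheme, with NMS applied between stages at rising IoU thresholds, is given below; the `refine` callable stands in for a trained detector head, and the threshold schedule is illustrative only:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (xmin, ymin, xmax, ymax)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh):
    """Standard greedy non-maximum suppression; returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep

def cascade_heads(boxes, scores, refine, iou_threshs=(0.5, 0.6, 0.7)):
    """Pass proposals through a sequence of heads with rising NMS IoU thresholds.

    `refine` maps (boxes, scores) to refined (boxes, scores) and stands in
    for one trained detector head.
    """
    for t in iou_threshs:
        keep = nms(boxes, scores, t)
        boxes, scores = refine(boxes[keep], scores[keep])
    return boxes, scores
```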
Besides, we explore the effect of weighted fusion based on attention and present the results in Section 4.3. The attention module is implemented based on CBAM [49], which consists of a channel-wise and a spatial-wise attention module, and is incorporated into the backbone network of each stream. The channel-wise attention module employs the global max-pooling and global average pooling to process each scale of feature maps, which are then fed into a shared Multi-layer Perceptron (MLP) followed by a sigmoid module to generate channel-wise attention values. The spatial-wise attention module employs the same pooling operation as in the channel-wise attention module to process the feature maps, upon which convolutional operations and sigmoid activations are applied to produce spatial-wise attention values.
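The channel-wise branch of this attention module can be sketched as follows (a numpy illustration with the shared MLP weights passed in by the caller; the spatial-wise branch is analogous and omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM-style channel attention on a (C, H, W) feature map.

    Global max- and average-pooled channel descriptors are passed through a
    shared two-layer MLP (weights w1: (C, C//r), w2: (C//r, C)); the summed
    outputs go through a sigmoid to give per-channel weights.
    """
    avg = feat.mean(axis=(1, 2))                 # (C,) average descriptor
    mx = feat.max(axis=(1, 2))                   # (C,) max descriptor
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2   # shared MLP with ReLU
    weights = sigmoid(mlp(avg) + mlp(mx))        # (C,) attention values
    return feat * weights[:, None, None]         # reweight each channel
```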

PVConvNet-Based Object Detection
With the assistance of 2D detection, we implement a PVConvNet-based 3D detector to process the point clouds of interest within frustums, which are essentially 3D transformations of the 2D bounding boxes. PV-CNN [15] combines the advantages of Pointnet [8,31] and voxel models [26,27], improving both the accuracy of localizing objects in the point cloud and the efficiency of scene identification. We adopt the PVConvNet to complete the detection task on the filtered point clouds, including point-voxel convolution, 3D instance segmentation and 3D box estimation.

Point-Voxel Convolution
The point-voxel convolution contains two branches as shown in Figure 1. One is the voxel-based branch with good data locality and regularity, and the other is the point-based branch. The voxel-based branch transforms the points into low-resolution voxel grids and aggregates the neighboring points with voxel-based convolutions, and then it converts voxels back to points by devoxelization. The point-based branch extracts the features for each individual point.
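A minimal numpy sketch of the two branches is given below: the voxel branch averages point features into a low-resolution grid and gathers them back by nearest-voxel lookup, and the fused output adds a point-wise branch supplied by the caller. The grid size and the additive fusion are our simplifications, not the exact PVConv design:

```python
import numpy as np

def voxelize_devoxelize(points, feats, grid=8):
    """Low-resolution voxel branch of a point-voxel convolution (sketch).

    points : (N, 3) coordinates, assumed normalized to [0, 1)
    feats  : (N, C) per-point features
    Features are averaged into a grid**3 voxel volume (voxelize) and then
    gathered back to the points by nearest-voxel lookup (devoxelize).
    """
    idx = np.clip((points * grid).astype(int), 0, grid - 1)   # voxel per point
    flat = idx[:, 0] * grid * grid + idx[:, 1] * grid + idx[:, 2]
    C = feats.shape[1]
    sums = np.zeros((grid ** 3, C)); counts = np.zeros(grid ** 3)
    np.add.at(sums, flat, feats)                              # scatter-add
    np.add.at(counts, flat, 1)
    voxel_feats = sums / np.maximum(counts, 1)[:, None]       # mean per voxel
    return voxel_feats[flat]                                  # devoxelize

def point_voxel_fuse(points, feats, point_mlp):
    """Fuse the voxel branch with a fine-grained point branch by addition."""
    return voxelize_devoxelize(points, feats) + point_mlp(feats)
```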

3D Detection
With the fused features obtained from the voxel-based and point-based branches, we implement 3D instance segmentation and 3D box estimation as done in F-Pointnet [12] to produce the final output. Similar to 2D instance segmentation, which is a binary classification of each pixel, 3D instance segmentation classifies the point cloud and predicts the confidence that a point is part of an object of interest. In our implementation, we encode the object category from the two-stream fusion RCNN into a one-hot class feature vector and concatenate it with the point cloud features learned by the 3D detection model. Having obtained the segmentation results, we convert the point cloud to the local coordinate system and utilize PointNet [8] to perform more accurate box estimation.
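The one-hot concatenation step can be sketched as follows; the number of classes and the function name are illustrative:

```python
import numpy as np

def append_class_vector(point_feats, class_id, num_classes=3):
    """Tile a one-hot class vector from the 2D detector onto every point.

    point_feats : (N, C) features from the point-voxel branches
    class_id    : category predicted by the two-stream fusion RCNN
    Returns (N, C + num_classes) features used for 3D instance segmentation.
    """
    one_hot = np.zeros(num_classes)
    one_hot[class_id] = 1.0
    tiled = np.tile(one_hot, (point_feats.shape[0], 1))   # same vector per point
    return np.concatenate([point_feats, tiled], axis=1)
```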

Experimental Setups
The KITTI vision benchmark suite [16] is used to evaluate our proposal. As done in F-Pointnet [12], we divided a total of 7481 images and corresponding point clouds into two subsets of roughly the same size, as the training and testing datasets, respectively. All objects were subcategorized into "easy", "moderate" and "hard" according to the heights of the 2D bounding boxes and the levels of occlusion and truncation. The intensity values of the point cloud were extracted, transformed and projected onto the front-view plane in the coordinate system of the camera to generate intensity maps. The kitti-object-eval-python script is used to calculate the AP (average precision), which serves as the metric to measure the detection performance of our work and comparable detectors.
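The intensity-map generation can be sketched as a simple projection of per-point reflectance onto the image plane; the 3 × 4 projection matrix and the nearest-pixel assignment below are simplifications of the actual calibration pipeline:

```python
import numpy as np

def build_intensity_map(points, intensities, P, img_shape):
    """Project LiDAR reflection intensities onto the camera front-view plane.

    points      : (N, 3) XYZ already in the camera coordinate frame
    intensities : (N,) reflectance values in [0, 1]
    P           : (3, 4) camera projection matrix
    img_shape   : (H, W) of the output intensity map
    """
    H, W = img_shape
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    proj = homo @ P.T
    valid = proj[:, 2] > 0                        # keep points in front of camera
    u = (proj[valid, 0] / proj[valid, 2]).astype(int)
    v = (proj[valid, 1] / proj[valid, 2]).astype(int)
    imap = np.zeros((H, W))
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    imap[v[inside], u[inside]] = intensities[valid][inside]
    return imap
```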

Implementation Details
For the two-stream fusion RCNN, the ResNet101 [48] model pre-trained on ImageNet [50] was used to initialize the backbone network of feature extraction. It was then trained on the KITTI training set using SGD [51] with a weight decay of 0.0005, a momentum of 0.9 and a batch size of 1 on a 4 Titan XP GPU server (Nvidia, Santa Clara, CA, U.S.A.). The learning rate was set to 1 × 10−3 for the first 10 epochs and decreased to 1 × 10−4 for the last 4 epochs. Other implementation details were the same as in the original Faster RCNN [1]. For the PVConvNet, we used the training data prepared in F-Pointnet [12], which contained over 7000 pairs of color images and frames of filtered point cloud data. The 3D detector was trained using the Adam optimizer [52] with a learning rate of 1 × 10−3 on a 4 Titan XP GPU server, with a batch size of 32 for 200 epochs.

Cross-Modality Fusion
To verify the effectiveness of feature-level fusion of RGB images and intensity maps in enhancing the expressiveness of the merged features, we compared the performance of our two-stream fusion RCNN (VI-fusion) and the baseline Faster RCNN [1]. The results presented in Table 1 show significant improvements in detection performance in all categories of objects at all levels of difficulty.
In Figure 3, we visualize the feature maps generated at stage 1, stage 4 and stage 8 in both network streams to further investigate why merging features from RGB images and intensity maps contributes to producing more informative and robust features. The V-labeled rows present feature maps of the RGB stream, while the I-labeled rows present those of the intensity stream.
Observations from the visualized feature maps are three-fold. First, as shown in Figure 3a,b, while the RGB features at stage 4 seem to have lost most visual details, the intensity features at stage 4 outline objects rather clearly, meaning that fine visual details are still preserved in the intensity stream. As a result, merging RGB and intensity features at the same stage is beneficial, since they not only represent entirely different sets of physical information, but also contain different levels of semantics and visual details of the same scene. Second, as shown in the 4th column of Figure 3b, the area of the car is attended to in the RGB feature map, while the area of the cyclist is treated as less relevant to the detection task. In contrast, a better overall representation of all objects is obtained in the intensity feature map, since the region of interest encompasses both objects without the most conspicuous one being overwhelmingly dominant. Third, although the intensity feature map preserves much less visual detail of smaller targets, such as pedestrians and cyclists, due to the sparsity of point clouds, Figure 3c clearly shows that the intensity feature provides a more proper description of the area that the detector should attend to, as opposed to the biased attention caused by RGB features.
It was also discovered in our experiments that the intensity of the point cloud is subject to changes in the materials and the micro-structures of the objects. Features extracted from the intensity maps are therefore less robust when the objects of interest are highly diverse in terms of these two factors.

Cascade Detector Head and Attention-Based Weighted Fusion
To examine the effectiveness of the cascade network head, we increased the number of head stages gradually from 1 to 3 and evaluated the performance of 3D detection. As shown in Table 2, Model-v2, which was equipped with 2 heads, outperformed Model-v1 and Model-v3, which were equipped with 1 and 3 network heads, respectively. Model-v1-att indicates that the attention module was implemented with 1 detector head, and Model-v2-att that it was implemented with 2 detector heads. It was observed that adding a second stage of network heads helps to significantly improve the overall performance of the detector, especially on small-scale targets. However, a deeper cascade structure does not lead to further improvement in detection performance. As a result, we adopted the two-stage cascade design in our 2D detector.
As for the detectors with attention modules attached, a performance degradation, rather than an improvement, was observed. The results show that, although weighted fusion is intuitively believed to be beneficial, designing the weighting mechanism is challenging. Attention mechanisms do not necessarily enhance the fused feature, since it is a non-trivial task to devise a strategy capable of adaptively adjusting the contribution of multi-modal features whose qualities are subject to numerous complex environmental factors.

Comparison with Other Methods
The detection performance of our proposal and 9 available competitors is compared, and the results are given in Tables 3 and 4; the top-performing networks are marked in bold. Table 4 reports the real-time performance of different models on the KITTI dataset.

Model                      Input Data                Latency
MMLab-PointRCNN [53]       Point                     112.4 ms/frame
MV3D [23]                  Point + RGB               360 ms/frame
AVOD [22]                  Point + RGB               80 ms/frame
PointGNN [54]              Point                     80 ms/frame
Frustum ConvNet [55]       Point + RGB               470 ms/frame
F-Pointnet++ [12]          Point + RGB               97.3 ms/frame
Range RCNN [56]            Range                     60 ms/frame
PV-RCNN [57]               Point                     80 ms/frame
STD [58]                   Point                     80 ms/frame
Ours (v2)                  Point + RGB + Intensity   59.6 ms/frame

Figure 4 shows some visualized results of 3D detection. As shown in Tables 3 and 4, our work proves to be a state-of-the-art detector, as it achieves the best detection performance in 1 subcategory and approaches the best in the other 8, at the cost of the least computation time. Range RCNN [56], PV-RCNN [57], and STD [58] each lead in the detection of 1 or 2 subcategories of cars and cyclists; however, they perform much worse than ours in all subcategories of pedestrians. Our proposal outperforms F-Pointnet++ [12] in the detection of all subcategories of cars and of "easy" pedestrians, while rivaling its performance in all other subcategories of objects.
The results suggest that 3D detection methods which use the point cloud as the only input perform well in detecting objects such as cars and cyclists, which have regular structures and robust geometric features, while performing poorly in detecting pedestrians, whose appearances and geometries are far more diverse, due to the absence of abundant texture and semantic features from images.

Conclusions
For robust and efficient 3D object detection, we propose a novel two-phase fusion network which exploits cross-modality information from RGB images and LiDAR point clouds at both the feature-level and the decision-level. The comparison between our proposal and the baseline Faster RCNN strongly supports the assumption that cross-modality fusion at the feature-level effectively enhances the expressiveness and robustness of the fused features and consequently improves detection performance on all subcategories of objects. We investigated the underlying causes by visualizing feature maps at multiple stages from both modalities. It was discovered that the intensity features still preserve fine visual details which are hardly observable in the corresponding RGB features at the same network stage. Moreover, it is shown that, at least in some cases, intensity features help to refine or adjust the area that the network attends to, so that a more proper overall representation of all objects of interest is obtainable. Compared to available state-of-the-art competitors, our proposal achieves either the best or near-best detection accuracy in multiple categories of objects while offering significantly better real-time performance. Future studies will investigate more robust 2D representations of point clouds to further improve the performance of the first-phase 2D detection.

Conflicts of Interest:
The authors declare no conflict of interest.