Sensors
  • Article
  • Open Access

23 October 2020

A Two-Phase Cross-Modality Fusion Network for Robust 3D Object Detection

1 School of Automotive Engineering, Wuhan University of Technology, Wuhan 430070, China
2 Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan 430070, China
3 Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan 430070, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Remote Sensors

Abstract

A two-phase cross-modality fusion detector is proposed in this study for robust and high-precision 3D object detection with RGB images and LiDAR point clouds. First, a two-stream fusion network is built into the framework of Faster RCNN to perform accurate and robust 2D detection. The visible stream takes the RGB images as inputs, while the intensity stream is fed with the intensity maps which are generated by projecting the reflection intensity of point clouds to the front view. A multi-layer feature-level fusion scheme is designed to merge multi-modal features across multiple layers in order to enhance the expressiveness and robustness of the produced features upon which region proposals are generated. Second, a decision-level fusion is implemented by projecting 2D proposals to the space of the point cloud to generate 3D frustums, on the basis of which the second-phase 3D detector is built to accomplish instance segmentation and 3D-box regression on the filtered point cloud. The results on the KITTI benchmark show that features extracted from RGB images and intensity maps complement each other, and our proposed detector achieves state-of-the-art performance on 3D object detection with a substantially lower running time as compared to available competitors.

1. Introduction

As a crucial task in various engineering applications, including autonomous driving and safety management, high-precision object detection has drawn a great deal of attention in recent years. Two sources of input are commonly used in object detection: RGB images and LiDAR point clouds.
A large number of deep learning-based models, such as Faster RCNN, SSD, and YOLO [1,2,3], along with their many customized variants, have been developed for 2D object detection with RGB images. Despite tremendous progress made in the past few years, vision-based 2D detectors still have major limitations, especially when they are developed for applications such as autonomous driving, where failures of the detector can have disastrous consequences [4]. The vulnerability to environmental interference and the lack of depth information are major drawbacks and inherent deficiencies of vision-based 2D detectors, which can hardly be remedied without employing different modalities of data [5,6,7].
In terms of 3D object detection with point clouds, a breakthrough was made with the introduction of Pointnet [8], which enables direct 3D detection on unordered raw point clouds without prior knowledge. However, the sparseness of the point cloud and the high computational overhead impact both the detection accuracy and the real-time performance of point cloud-based 3D detection.
Cross-modality fusion, which combines the aforementioned two modalities of information for more precise and robust object detection, has therefore become a research focus.
Most schemes of cross-modality fusion are implemented at three different stages or levels. Raw data-level fusion [9,10], an early-stage fusion, superimposes data from multiple sensors, producing a significantly larger amount of input data upon which the extracted features are not guaranteed to be more expressive. Decision-level fusion [11,12,13], a late-stage fusion, is hindered by critical issues such as difficulties in obtaining prior knowledge, loss of information, et cetera. Intermediate feature-level fusion [11,14], which merges the features extracted from each modality of data to produce more robust and informative fused features, is intuitively considered a more effective way of exploiting useful multi-modal information. However, fusing multi-modal features remains a non-trivial task, mainly for two reasons: (1) the features extracted from the image and the point cloud differ greatly in many respects, including the point of view, the density, the level of semantics and spatial details, et cetera. As a result, it is challenging to build a good correspondence between multi-modal features; (2) each modality of input is vulnerable to changes in different influencing factors in the surrounding environment. It is quite difficult, yet essential, to design a fusion strategy that is capable of combining meaningful multi-modal features while eliminating mutual interference under ever-changing environmental conditions.
Therefore, we propose to develop an effective and efficient fusion scheme that implements cross-modality fusion at both the decision level and the feature level, so as to avoid information loss and the need to devise a complicated yet possibly ineffective fusion strategy. For decision-level fusion, inspired by Frustum Pointnet [12], which brings forward a method that uses images to assist point cloud-based object detection, we propose a novel two-phase cascaded detector with the first phase being a 2D detector that produces proposals. The 2D proposals are used to filter the LiDAR point cloud, so that only the points that reside within the regions of interest are fed into the second-phase 3D detector. Since the performance of the two-phase cascaded detector is bounded by each stage, it is essential to further enhance the detection performance of the first-phase 2D detector. Furthermore, with only RGB images used for 2D detection in the first phase, the complementarity among different sensors is not fully exploited. Therefore, we propose implementing a feature-level fusion in the first phase and building the 2D detector in the framework of a two-stream Faster R-CNN [1] to extract and fuse features from RGB images and intensity maps, which are generated by projecting the reflection intensity values of LiDAR point clouds to the front view plane. The aim of this design is to produce more robust and expressive features upon which more accurate object classification and bounding box regression can be achieved. The intensity map is chosen as the complementary source of input because, as pointed out in MCF3D [9], the reflection intensity value represents the materials and depths of objects to a certain extent, and is immune to changes in weather and lighting conditions. However, instead of concatenating RGB images and intensity maps to produce fused raw data as is done in MCF3D [9], we believe that a feature-level fusion is more effective in preserving useful information while eliminating cross-modality interference. A cascade detector head is then implemented for classification and bounding box regression, aiming to improve the detection performance on small targets. Finally, the 2D proposals generated by the first-phase detector are transformed into the point cloud space so as to acquire the corresponding regions of interest in the point cloud. The filtered point cloud is classified and segmented with the Point-Voxel Convolution Network (PVConvNet) [15] to produce 3D detection results. Experimental results on the KITTI benchmark dataset [16] show that our proposed detector achieves state-of-the-art performance when compared to available competitors.
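To make the decision-level step concrete, the sketch below shows how a 2D proposal can be used to carve a frustum out of the point cloud: LiDAR points are projected onto the image plane with the calibration matrices, and only points that land inside the 2D box (and lie in front of the camera) are kept for the second-phase 3D detector. The function and variable names are illustrative, the calibration matrices follow the usual KITTI devkit conventions, and this is a simplified sketch rather than the exact implementation used in this work.

```python
import numpy as np


def frustum_filter(points_lidar, box_2d, P, Tr_velo_to_cam, R0_rect):
    """Keep only the LiDAR points whose image projection falls inside a 2D proposal.

    points_lidar: (N, 3) x, y, z in the LiDAR frame.
    box_2d:       (xmin, ymin, xmax, ymax) proposal from the first-phase 2D detector.
    P, Tr_velo_to_cam, R0_rect: KITTI-style calibration matrices (3x4, 3x4, 3x3).
    """
    # Homogeneous LiDAR coordinates -> rectified camera coordinates.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])  # (N, 4)
    pts_cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)                          # (3, N)

    # Project onto the image plane and normalize by depth.
    pts_img = P @ np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])      # (3, N)
    u = pts_img[0] / pts_img[2]
    v = pts_img[1] / pts_img[2]

    xmin, ymin, xmax, ymax = box_2d
    in_front = pts_cam[2] > 0                      # discard points behind the camera
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return points_lidar[in_front & in_box]
```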
Main contributions of our work are summarized as follows:
  • We propose a cascading 3D detector that exploits multi-modal information at both the feature fusion and decision making levels.
  • At the decision-level, we design a two-phase detector in which the second-phase 3D detection gets assistance from the first-phase 2D detection in a way that 2D detection results are transformed into 3D frustums to filter the point cloud, in order to improve both the detection accuracy and real-time performance of 3D detection.
  • At the feature-level, we design a two-stream fusion network to merge cross-modality features extracted from RGB images and intensity maps, so as to produce more expressive and robust features for high-precision 2D detection. The validity of the proposed feature fusion scheme is examined and strongly supported by the experimental results and through visualizing features at multiple network stages.

3. Methods

3.1. Overview

We propose a novel two-phase 3D object detector in which cross-modality fusion of RGB images and point clouds is implemented at both the feature level and the decision level. As illustrated in Figure 1, our two-phase 3D detector consists of phase 1, a two-stream fusion RCNN which merges RGB images and intensity maps at the feature level and produces 2D proposals to generate 3D frustums in the space of the point cloud, and phase 2, a PVConvNet-based 3D detector which performs 3D instance segmentation and box regression on point clouds within the 3D frustums.
Figure 1. The diagram of the multi-phase fusion network.

3.2. Two-Stream Fusion RCNN

Instead of concatenating the RGB image and the intensity map to obtain an RGB-I representation of the scene, we argue that a feature-level fusion would contribute to merging more expressive and useful information and therefore producing more informative and robust representations of the perceived scene.
As shown in Figure 2, we built two streams of feature extraction with ResNet101 [48] to extract features from the RGB images and the intensity maps, respectively.
Figure 2. The two-stream fusion RCNN.
Features extracted at the same stage are concatenated to produce fused features. Moreover, the fusion process is implemented based on the FPN structure [19] so that multiple stages of fused features are combined in order to preserve both low-level details and high-level semantics.
Specifically, the extracted multi-scale features are fed into the modified RPN to generate proposals. The modified RPN consists of a 3 × 3 convolution layer followed by ReLU activation and two sibling fully-connected layers that classify objects and regress anchor boxes. Proposals are generated by sliding anchors over multiple scales of features and are then concatenated as the outputs. Together with the proposals, the fused multi-scale features are fed into the PyramidRoI pooling layer to fuse the semantic information at different levels and scales. The fused multi-scale semantics are then fed into the top model of the first cascaded head to predict the class of objects and regress their bounding boxes.
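As a rough illustration of this fusion-by-concatenation scheme, the sketch below pairs two ResNet101 streams, concatenates the feature maps of matching residual stages channel-wise, reduces them with 1 × 1 convolutions, and passes the result through an FPN to obtain the multi-scale fused features consumed by the RPN. It is a minimal PyTorch sketch under assumed layer widths and naming, not the exact network used in the paper.

```python
import torch
import torch.nn as nn
from collections import OrderedDict
from torchvision.models import resnet101
from torchvision.ops import FeaturePyramidNetwork


class TwoStreamFusionBackbone(nn.Module):
    """Minimal sketch of concatenation-based two-stream fusion (assumed layout)."""

    def __init__(self, out_channels=256):
        super().__init__()
        self.rgb = resnet101()        # visible stream
        self.intensity = resnet101()  # intensity-map stream
        stage_channels = [256, 512, 1024, 2048]            # ResNet C2-C5 widths
        # 1x1 convolutions that squeeze the concatenated (doubled) channels back down.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(2 * c, c, kernel_size=1) for c in stage_channels]
        )
        self.fpn = FeaturePyramidNetwork(stage_channels, out_channels)

    @staticmethod
    def _stages(backbone, x):
        # Return the C2-C5 feature maps of one ResNet stream.
        x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
        c2 = backbone.layer1(x)
        c3 = backbone.layer2(c2)
        c4 = backbone.layer3(c3)
        c5 = backbone.layer4(c4)
        return [c2, c3, c4, c5]

    def forward(self, rgb_img, intensity_map):
        fused = OrderedDict()
        for i, (fv, fi) in enumerate(zip(self._stages(self.rgb, rgb_img),
                                         self._stages(self.intensity, intensity_map))):
            # Concatenate same-stage features from both streams, then reduce channels.
            fused[f"p{i + 2}"] = self.reduce[i](torch.cat([fv, fi], dim=1))
        return self.fpn(fused)  # multi-scale fused features for the RPN
```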
Inspired by Cascade RCNN [20], we design a cascade detector head to further improve the detection performance, especially on small targets. Each detector head comprises two convolutional layers followed by ReLU activation and two fully-connected layers for classification and regression. In the sequence of detector heads, the predicted boxes of the previous stage are filtered with non-maximum suppression (NMS) and then serve as the proposals of the following stage. The IoU threshold is increased along with the depth of the detector head so that the network tends to focus more on small-scale targets.
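The stage-to-stage flow can be summarized as in the sketch below, which assumes each head is a callable returning refined boxes and scores; RoI pooling, box decoding, and the training-time re-matching at stricter IoU thresholds are only hinted at in comments, and the threshold values are illustrative rather than taken from the paper.

```python
from torchvision.ops import nms

# Illustrative IoU thresholds that grow with head depth (assumed values).
IOU_THRESHOLDS = [0.5, 0.6, 0.7]


def run_cascade(heads, features, proposals, score_thresh=0.05, nms_thresh=0.5):
    """Stage-to-stage proposal flow of a cascaded detector head (sketch only).

    `heads` is a list of callables, each mapping (features, boxes) to refined
    (boxes, scores); RoI pooling and box-delta decoding are assumed to happen
    inside each head and are not shown here.
    """
    boxes, scores = proposals, None
    for head, iou_t in zip(heads, IOU_THRESHOLDS):
        boxes, scores = head(features, boxes)        # classify + regress this stage
        keep = scores > score_thresh                 # drop low-confidence boxes
        boxes, scores = boxes[keep], scores[keep]
        keep = nms(boxes, scores, nms_thresh)        # suppress duplicate boxes
        boxes, scores = boxes[keep], scores[keep]    # survivors seed the next stage
        # During training, the surviving boxes would be re-matched to ground truth
        # at the stricter threshold iou_t before computing this stage's losses.
    return boxes, scores
```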
In addition, we explore the effect of attention-based weighted fusion and present the results in Section 4.3. The attention module is implemented based on CBAM [49], which consists of a channel-wise and a spatial-wise attention module, and is incorporated into the backbone network of each stream. The channel-wise attention module applies global max-pooling and global average pooling to each scale of feature maps, and the pooled descriptors are then fed into a shared multi-layer perceptron (MLP) followed by a sigmoid function to generate channel-wise attention values. The spatial-wise attention module applies the same max- and average-pooling operations to the feature maps, upon which convolutional operations and a sigmoid activation are applied to produce spatial-wise attention values.
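A compact sketch of such a CBAM-style block is given below; the reduction ratio and kernel size are assumed defaults from the CBAM paper, and the block is written as a standalone module rather than as it is wired into our backbone.

```python
import torch
import torch.nn as nn


class CBAMBlock(nn.Module):
    """Sketch of a CBAM-style attention block: channel attention, then spatial attention."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP applied to both the max-pooled and average-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel-wise attention from global max and average pooling.
        max_desc = self.mlp(torch.amax(x, dim=(2, 3)))
        avg_desc = self.mlp(torch.mean(x, dim=(2, 3)))
        channel_att = torch.sigmoid(max_desc + avg_desc).view(b, c, 1, 1)
        x = x * channel_att
        # Spatial attention from per-pixel max and mean across channels.
        spatial = torch.cat([torch.amax(x, dim=1, keepdim=True),
                             torch.mean(x, dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(spatial))
```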

3.3. PVConvNet-Based Object Detection

With the assistance of 2D detection, we implement a PVConvNet-based 3D detector to process the points of interest within frustums, which are essentially 3D transformations of the 2D bounding boxes. PV-CNN [15] combines the advantages of Pointnet [8,31] and voxel-based models [26,27], improving both the accuracy of localizing objects in the point cloud and the efficiency of scene recognition. We adopt the PVConvNet to complete the detection task on the filtered point clouds, including point-voxel convolution, 3D instance segmentation and 3D box estimation.

3.3.1. Point-Voxel Convolution

The point-voxel convolution contains two branches as shown in Figure 1. One is the voxel-based branch with good data locality and regularity, and the other is the point-based branch. The voxel-based branch transforms the points into low-resolution voxel grids and aggregates the neighboring points with voxel-based convolutions, and then it converts voxels back to points by devoxelization. The point-based branch extracts the features for each individual point.
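To make the two-branch idea concrete, here is a heavily simplified sketch: point features are scatter-averaged into a dense voxel grid, convolved with a 3D convolution, gathered back to the points, and added to a per-point MLP branch. For brevity it uses nearest-neighbour voxelization/devoxelization instead of the trilinear interpolation of the actual PVConv, and all names and the grid resolution are illustrative.

```python
import torch
import torch.nn as nn


class SimplePVConv(nn.Module):
    """Very simplified point-voxel convolution sketch (nearest-neighbour voxelization)."""

    def __init__(self, in_ch, out_ch, resolution=32):
        super().__init__()
        self.r = resolution
        self.voxel_conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(),
        )
        self.point_mlp = nn.Sequential(nn.Linear(in_ch, out_ch), nn.ReLU())

    def forward(self, coords, feats):
        # coords: (N, 3) normalized to [0, 1); feats: (N, C)
        n, c = feats.shape
        idx = (coords * self.r).long().clamp_(0, self.r - 1)   # voxel index per point
        flat = (idx[:, 0] * self.r + idx[:, 1]) * self.r + idx[:, 2]

        # Voxel branch: scatter-average point features into a dense grid, convolve,
        # then gather the convolved voxel features back to the points (devoxelization).
        grid = feats.new_zeros(self.r ** 3, c)
        count = feats.new_zeros(self.r ** 3, 1)
        grid.index_add_(0, flat, feats)
        count.index_add_(0, flat, feats.new_ones(n, 1))
        grid = (grid / count.clamp(min=1)).t().reshape(1, c, self.r, self.r, self.r)
        voxel_out = self.voxel_conv(grid).reshape(-1, self.r ** 3).t()[flat]

        # Point branch: per-point feature transform, fused with the voxel branch by addition.
        return voxel_out + self.point_mlp(feats)
```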

3.3.2. 3D Detection

With the fused features obtained from the voxel-based and point-based branches, we implement 3D instance segmentation and 3D box estimation as done in F-Pointnet [12] to produce the final output. Similar to 2D instance segmentation, which is a binary classification of each pixel, 3D instance segmentation classifies point clouds and predicts the confidence that a point is part of an object of interest. In our implementation, we encode the object category predicted by the two-stream fusion RCNN into a one-hot class feature vector and concatenate it with the point cloud features learned by the 3D detection model. Having obtained the segmentation results, we convert the point cloud to the local coordinate system and utilize PointNet [8] to perform more accurate box estimation.
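The sketch below illustrates how the one-hot class cue from the 2D stage can be concatenated with per-point features before scoring each point as foreground or background; the number of classes, feature width, and layer sizes are assumptions for illustration, not the exact head used here.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3  # car, pedestrian, cyclist on KITTI (assumed)


class InstanceSegHead(nn.Module):
    """Sketch of the segmentation step: per-point features are concatenated with a
    one-hot encoding of the class predicted by the 2D stage, then each point is
    scored as background or object of interest."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 2),                     # background / foreground logits
        )

    def forward(self, point_feats, class_id):
        # point_feats: (N, feat_dim) features from the point-voxel backbone
        # class_id: category predicted by the 2D detector for this frustum
        one_hot = torch.zeros(point_feats.size(0), NUM_CLASSES,
                              device=point_feats.device)
        one_hot[:, class_id] = 1.0
        logits = self.classifier(torch.cat([point_feats, one_hot], dim=1))
        return logits.softmax(dim=1)[:, 1]         # per-point foreground confidence
```

The points scored as foreground would then be shifted into a local coordinate system before the PointNet-based box estimation described above.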

4. Experiments

4.1. Experimental Setups

The KITTI vision benchmark suite [16] is used to evaluate our proposal. As done in F-Pointnet [12], we divided a total of 7481 images and the corresponding point clouds into two subsets of roughly equal size, used as the training and testing datasets, respectively. All objects were subcategorized into "easy", "moderate" and "hard" according to the heights of the 2D bounding boxes and the levels of occlusion and truncation. The intensity values of the point cloud were extracted, transformed and projected onto the front view plane in the coordinate system of the camera to generate intensity maps. The kitti-object-eval-python script is used to calculate the AP (average precision), which serves as the metric for measuring the detection performance of our work and the comparable detectors.
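As an illustration of the intensity-map generation step, the sketch below projects each LiDAR point into the camera image with KITTI-style calibration matrices and writes its reflectance value at the corresponding pixel. Function and variable names are hypothetical, and no densification or normalization of the resulting sparse map is shown.

```python
import numpy as np


def make_intensity_map(points, intensity, P, Tr_velo_to_cam, R0_rect, img_shape):
    """Rough sketch of generating a front-view intensity map from a LiDAR scan.

    points: (N, 3) LiDAR coordinates; intensity: (N,) reflectance values;
    P, Tr_velo_to_cam, R0_rect: calibration matrices; img_shape: (H, W).
    """
    h, w = img_shape
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)             # rectified camera frame, (3, N)
    img = P @ np.vstack([cam, np.ones((1, cam.shape[1]))])
    u = (img[0] / img[2]).astype(int)                      # pixel column
    v = (img[1] / img[2]).astype(int)                      # pixel row

    valid = (cam[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    intensity_map = np.zeros((h, w), dtype=np.float32)
    intensity_map[v[valid], u[valid]] = intensity[valid]   # sparse front-view map
    return intensity_map
```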

4.2. Implementation Details

For the two-stream fusion RCNN, the ResNet101 [48] model pre-trained on ImageNet [50] was used to initialize the backbone network for feature extraction. It was then trained on the KITTI training set using SGD [51] with a weight decay of 0.0005, a momentum of 0.9 and a batch size of 1 on a server with 4 Titan XP GPUs (Nvidia, Santa Clara, CA, USA). The learning rate was set to 1 × 10⁻³ for the first 10 epochs and was decreased to 1 × 10⁻⁴ for the last 4 epochs. Other implementation details were the same as in the original Faster RCNN [1]. For the PVConvNet, we used the training data prepared in F-Pointnet [12] to train the network, which contained over 7000 pairs of color images and frames of filtered point cloud data. The 3D detector was trained using the Adam optimizer [52] with a learning rate of 1 × 10⁻³ on the same 4-GPU server, with a batch size of 32 for 200 epochs.
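For reference, a minimal PyTorch sketch of the stated 2D-stage optimizer settings and learning-rate schedule is given below; a tiny stand-in module replaces the fusion RCNN so the snippet runs on its own, and the pass over the KITTI training data is elided.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)  # stand-in for the two-stream fusion RCNN (illustration only)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
# lr = 1e-3 for the first 10 epochs, then decayed to 1e-4 for the last 4 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

for epoch in range(14):
    # ... one pass over the KITTI training set would go here ...
    optimizer.step()                        # placeholder step for the elided epoch
    scheduler.step()
    print(epoch, scheduler.get_last_lr())   # shows the learning-rate drop after epoch 10
```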

4.3. Ablation Study

4.3.1. Cross-Modality Fusion

To verify the effectiveness of feature-level fusion of RGB images and intensity maps in enhancing the expressiveness of the merged features, we compared the performance of our two-stream fusion RCNN (VI-fusion) and the baseline Faster RCNN [1]. The results presented in Table 1 show significant improvements in detection performance in all categories of objects at all levels of difficulties.
Table 1. Comparison of the baseline Faster RCNN and the two-stream fusion RCNN average precision (AP).
In Figure 3, we visualize the feature maps generated at stage 1, stage 4 and stage 8 in both network streams to further investigate why merging features from RGB images and intensity maps contributes to producing more informative and robust features.
Figure 3. Visualized feature maps at multiple stages from the two modalities. (a), (b) and (c) each present a test scene. In each subfigure, the V-labeled row presents the input and the feature maps of the RGB stream, while the I-labeled row presents those of the intensity stream.
Observations from the visualized feature maps are three-fold. First, as shown in Figure 3a,b, while the RGB features at stage 4 seem to have lost most visual details, the intensity features at stage 4 outline objects rather clearly, meaning that fine visual details are still preserved in the intensity stream. As a result, merging RGB and intensity features at the same stage is beneficial, since they not only represent entirely different sets of physical information but also contain different levels of semantics and visual details of the same scene. Second, as shown in the 4th column of Figure 3b, the area of the car is attended to in the RGB feature map, while the area of the cyclist is treated as less relevant to the detection task. In contrast, a better overall representation of all objects is obtained in the intensity feature map, since the region of interest encompasses both objects without the most conspicuous object being overwhelmingly dominant. Third, although the intensity feature map preserves much less visual detail of smaller targets, such as pedestrians and cyclists, due to the sparsity of point clouds, Figure 3c clearly shows that the intensity feature provides a more proper description of the area that the detector should attend to, as opposed to the biased attention caused by RGB features.
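For reference, intermediate feature maps like those in Figure 3 can be captured with forward hooks; the sketch below averages each captured map over its channels to obtain a single displayable image per stage. The stage names and the channel-averaging choice are assumptions for illustration, not the exact visualization procedure used in this work.

```python
import torch
from torchvision.models import resnet101


def capture_stage_outputs(backbone, stage_names=("layer1", "layer3", "layer4")):
    """Register forward hooks on the named stages of one stream and collect their outputs."""
    captured = {}

    def make_hook(name):
        def hook(_module, _inputs, output):
            # Average over channels to obtain a single displayable map per stage.
            captured[name] = output.detach().mean(dim=1)
        return hook

    handles = [getattr(backbone, n).register_forward_hook(make_hook(n))
               for n in stage_names]
    return captured, handles


backbone = resnet101()                          # stands in for one stream of the detector
features, handles = capture_stage_outputs(backbone)
_ = backbone(torch.randn(1, 3, 224, 224))       # run one image through the stream
for name, fmap in features.items():
    print(name, tuple(fmap.shape))              # e.g. plot these maps with matplotlib
for h in handles:
    h.remove()
```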
It was also discovered in our experiments that the intensity of the point cloud is subject to changes in the materials and the micro-structures of the objects. Features extracted from the intensity maps are therefore less robust when the objects of interest are highly diverse in terms of these two factors.

4.3.2. Cascade Detector Head and Attention-Based Weighted Fusion

To examine the effectiveness of the cascade network head, we increased the number of cascaded heads gradually from 1 to 3 and evaluated the performance of 3D detection. As shown in Table 2, Model-v2, which was equipped with 2 heads, outperformed Model-v1 and Model-v3, which were equipped with 1 and 3 network heads, respectively.
Table 2. Comparison of different depth of cascaded detector heads (AP).
Model-v1-att indicates that the attention module was implemented with 1 detector head. Model-v2-att indicates that the attention module was implemented with 2 detector heads. It was observed that having a second stage of network heads helps to significantly improve the overall performance of the detector, especially on small-scale targets. However, a deeper cascade structure does not lead to further improvement in detection performance. As a result, we adopted the two-stage cascade design in our 2D detector.
As for the detectors with attention modules attached, a performance degradation rather than an improvement was observed. The results indicate that although weighted fusion is intuitively believed to be beneficial, designing the weighting mechanism is challenging. Attention mechanisms do not necessarily help to enhance the fused features, since it is a non-trivial task to devise a strategy that is capable of adaptively adjusting the contributions of multi-modal features whose qualities are subject to changes in numerous complex environmental factors.

4.4. Comparison with Other Methods

The detection performance of our proposal and 9 available competitors is compared, and the results are given in Table 3 and Table 4.
Table 3. The mean average precision (mAP) of different models on the KITTI dataset (AP).
Table 4. The real-time performance of different models on the KITTI dataset.
Figure 4 shows some visualized results of 3D detection.
Figure 4. Visualization of 3D Detection Results.
As shown in Table 3 and Table 4, our work proves to be a state-of-the-art detector, as it achieves the best detection performance in 1 subcategory and approaches the best in the other 8, while requiring the least computation time. RangeRCNN [56], Deformable PV-RCNN [57] and STD [58] each lead in the detection of 1 or 2 subcategories of cars and cyclists; however, they perform much worse than ours in all subcategories of pedestrians. Our proposal outperforms F-PointNet++ [12] in the detection of all subcategories of cars and of "easy" pedestrians, while rivaling its performance in all other subcategories of objects.
The results suggest that 3D detection methods which use the point cloud as the only input perform well in detecting objects such as cars and cyclists, which have regular structures and robust geometric features, while performing poorly in detecting pedestrians, whose appearances and geometries are far more diverse, due to the absence of the abundant texture and semantic features available from images.

5. Conclusions

For robust and efficient 3D object detection, we propose a novel two-phase fusion network which exploits cross-modality information from RGB images and LiDAR point clouds at both the feature level and the decision level. The comparison between our proposal and the baseline Faster RCNN strongly supports the assumption that cross-modality fusion at the feature level effectively enhances the expressiveness and robustness of the fused features and consequently improves the detection performance on all subcategories of objects. We investigated the underlying causes by visualizing feature maps at multiple stages from both modalities. It was discovered that the intensity features still preserve fine visual details that are hardly observable in the corresponding RGB features at the same network stage. Moreover, it is shown that, at least in some cases, intensity features help to refine or adjust the area that the network attends to, and therefore a more proper overall representation of all objects of interest is obtainable. Compared to available state-of-the-art competitors, our proposal achieves either the best or near-best detection accuracy in multiple categories of objects while offering significantly better real-time performance. Future studies will investigate more robust 2D representations of point clouds to further improve the performance of the first-phase 2D detection.

Author Contributions

Conceptualization, Y.J. and Z.Y.; data curation, Y.J.; formal analysis, Y.J. and Z.Y.; funding acquisition, Z.Y.; investigation, Y.J.; methodology, Y.J. and Z.Y.; project administration, Z.Y.; resources, Y.J.; software, Y.J.; supervision, Z.Y.; validation, Y.J.; visualization, Y.J.; writing—original draft, Y.J.; writing—review and editing, Y.J. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 51805388, in part by the Chinese National Key Research and Development Project under Grant No. 2018YFB0105203, in part by the Open Foundation of Foshan Xianhu Laboratory of the Advanced Energy Science and Technology Guangdong Laboratory under Grant No. XHD2020-003, and in part by the National innovation and entrepreneurship training program for college students under Grant No. S202010497248.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  2. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision(ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  4. Yuille, A.L.; Liu, C. Deep Nets: What have they ever done for Vision? arXiv 2018, arXiv:1805.04025. [Google Scholar]
  5. Yoo, J.H.; Kim, Y.; Kim, J.S.; Choi, J.W. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection. arXiv 2020, arXiv:2004.12636. [Google Scholar]
  6. Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection. arXiv 2020, arXiv:2007.08856. [Google Scholar]
  7. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 18–20 June 2019; pp. 7345–7353. [Google Scholar]
  8. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 652–660. [Google Scholar]
  9. Wang, J.; Zhu, M.; Sun, D.; Wang, B.; Gao, W.; Wei, H. MCF3D: Multi-Stage Complementary Fusion for Multi-Sensor 3D Object Detection. IEEE Access 2019, 7, 90801–90814. [Google Scholar] [CrossRef]
  10. Al-Osaimi, F.R.; Bennamoun, M.; Mian, A. Spatially optimized data-level fusion of texture and shape for face recognition. IEEE Trans Image Process 2011, 21, 859–872. [Google Scholar] [CrossRef]
  11. Gunatilaka, A.H.; Baertlein, B.A. Feature-level and decision-level fusion of noncoincidently sampled sensors for land mine detection. IEEE Trans Pattern Anal 2001, 23, 577–589. [Google Scholar] [CrossRef]
  12. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
  13. Oh, S.; Kang, H. Object detection and classification by decision-level fusion for intelligent vehicle systems. Sensors 2017, 17, 207. [Google Scholar] [CrossRef]
  14. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 244–253. [Google Scholar]
  15. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-Voxel CNN for efficient 3D deep learning. In Proceedings of the Advances in Neural Information Processing Systems 32 (NIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 965–975. [Google Scholar]
  16. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
  17. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  18. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in neural information processing systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 379–387. [Google Scholar]
  19. Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 2117–2125. [Google Scholar]
  20. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  21. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
  22. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  23. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 1907–1915. [Google Scholar]
  24. Yang, B.; Luo, W.; Urtasun, R. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7652–7660. [Google Scholar]
  25. Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3d lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
  26. Song, S.; Xiao, J. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 808–816. [Google Scholar]
  27. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  28. Xu, Y.; Ye, Z.; Yao, W.; Huang, R.; Tong, X.; Hoegner, L.; Stilla, U. Classification of LiDAR Point Clouds Using Supervoxel-Based Detrended Feature and Perception-Weighted Graphical Model. IEEE J. Stars 2019, 13, 72–88. [Google Scholar] [CrossRef]
  29. Zhao, H.; Xi, X.; Wang, C.; Pan, F. Ground Surface Recognition at Voxel Scale From Mobile Laser Scanning Data in Urban Environment. IEEE Geosci. Remote Sens. 2019, 17, 317–321. [Google Scholar] [CrossRef]
  30. Aijazi, A.K.; Checchin, P.; Trassoudaine, L. Segmentation based classification of 3D urban point clouds: A super-voxel based approach with evaluation. Remote Sens. Basel 2013, 5, 1624–1650. [Google Scholar] [CrossRef]
  31. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in neural information processing systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108. [Google Scholar]
  32. Komarichev, A.; Zhong, Z.; Hua, J. A-CNN: Annularly convolutional neural networks on point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 18–20 June 2019; pp. 7421–7430. [Google Scholar]
  33. Chen, C.; Fragonara, L.Z.; Tsourdos, A. GAPNet: Graph attention based point neural network for exploiting local feature of point cloud. arXiv 2019, arXiv:1905.08705. [Google Scholar]
  34. Liu, Y.; Fan, B.; Xiang, S.; Pan, C. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 18–20 June 2019; pp. 8895–8904. [Google Scholar]
  35. Xing, X.; Mostafavi, M.; Chavoshi, S.H. A knowledge base for automatic feature recognition from point clouds in an urban scene. ISPRS Int. J. Geo. Inf. 2018, 7, 28. [Google Scholar] [CrossRef]
  36. Rao, Y.; Lu, J.; Zhou, J. Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; pp. 5376–5385. [Google Scholar]
  37. Lin, Q.; Zhang, Y.; Yang, S.; Ma, S.; Zhang, T.; Xiao, Q. A self-learning and self-optimizing framework for the fault diagnosis knowledge base in a workshop. Robot Cim. Int. Manuf. 2020, 65, 101975. [Google Scholar] [CrossRef]
  38. Poux, F.; Ponciano, J. Self-Learning Ontology For Instance Segmentation Of 3d Indoor Point Cloud. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 309–316. [Google Scholar] [CrossRef]
  39. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; pp. 10529–10538. [Google Scholar]
  40. Konig, D.; Adam, M.; Jarvers, C.; Layher, G.; Neumann, H.; Teutsch, M. Fully convolutional region proposal networks for multispectral person detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 49–56. [Google Scholar]
  41. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inform Fusion 2019, 50, 148–157. [Google Scholar] [CrossRef]
  42. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recogn. 2019, 85, 161–171. [Google Scholar] [CrossRef]
  43. Zhang, Y.; Yin, Z.; Nie, L.; Huang, S. Attention Based Multi-Layer Fusion of Multispectral Images for Pedestrian Detection. IEEE Access 2020, 8, 165071–165084. [Google Scholar] [CrossRef]
  44. Li, P.; Chen, X.; Shen, S. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 18–20 June 2019; pp. 7644–7652. [Google Scholar]
  45. Rashed, H.; Ramzy, M.; Vaquero, V.; Sallab, A.E.; Sistu, G.; Yogamani, S. FuseMODNet: Real-Time Camera and LiDAR based Moving Object Detection for robust low-light Autonomous Driving. In Proceedings of the the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 29 October–1 November 2019. [Google Scholar]
  46. Wang, Z.; Zhan, W.; Tomizuka, M. Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1–6. [Google Scholar]
  47. Kim, J.; Koh, J.; Kim, Y.; Choi, J.; Hwang, Y.; Choi, J.W. Robust deep multi-modal learning based on gated information fusion network. In Proceedings of the Asian Conference on Computer Vision (ACCV), Perth, WA, Australia, 2–6 December 2018; pp. 90–106. [Google Scholar]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  49. Woo, S.; Park, J.; Lee, J.; So Kweon, I. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  50. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  51. Bottou, L. Neural networks: Tricks of the trade; Springer: Heidelberg/Berlin, Germany, 2012; Chapter 18. Stochastic Gradient Descent Tricks; pp. 421–436. [Google Scholar]
  52. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  53. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 18–20 June 2019; pp. 770–779. [Google Scholar]
  54. Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; pp. 1711–1719. [Google Scholar]
  55. Wang, Z.; Jia, K. Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. arXiv 2019, arXiv:1903.01864. [Google Scholar]
  56. Liang, Z.; Zhang, M.; Zhang, Z.; Zhao, X.; Pu, S. RangeRCNN: Towards Fast and Accurate 3D Object Detection with Range Image Representation. arXiv 2020, arXiv:2009.00206. [Google Scholar]
  57. Bhattacharyya, P.; Czarnecki, K. Deformable PV-RCNN: Improving 3D Object Detection with Learned Deformations. arXiv 2020, arXiv:2008.08766. [Google Scholar]
  58. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 18–20 June 2019; pp. 1951–1960. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
