# SECOND: Sparsely Embedded Convolutional Detection

## Abstract


## 1. Introduction

In this work, we present SECOND (**S**parsely **E**mbedded **CON**volutional **D**etection), which addresses these challenges in 3D convolution-based detection by maximizing the use of the rich 3D information present in point cloud data. This method incorporates several improvements to the existing convolutional network architecture. Spatially sparse convolutional networks are introduced for LiDAR-based detection and are used to extract information from the z-axis before the 3D data are downsampled to something akin to 2D image data. We also use a GPU (Graphics Processing Unit)-based rule generation algorithm for sparse convolution to increase the speed. In comparison to a dense convolution network, our sparse-convolution-based detector achieves a factor-of-4 speed enhancement during training on the KITTI dataset and a factor-of-3 improvement in inference speed. As a further test, we have designed a small model for real-time detection that runs in approximately 0.025 s on a GTX 1080 Ti GPU, with only a slight loss of performance.

The key contributions of our work are as follows:

- We apply sparse convolution in LiDAR-based object detection, thereby greatly increasing the speeds of training and inference.
- We propose an improved method of sparse convolution that allows it to run faster.
- We propose a novel angle loss regression approach that achieves better orientation regression performance than existing methods (a minimal sketch of the idea follows this list).
- We introduce a novel data augmentation method for LiDAR-only learning problems that greatly increases the convergence speed and performance.
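To make the angle loss contribution concrete, the following is a minimal PyTorch sketch of a sine-error angle regression loss of the kind described in Section 3.2.1. The function name and the use of `smooth_l1_loss` against a zero target are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def sine_error_angle_loss(theta_pred: torch.Tensor, theta_target: torch.Tensor) -> torch.Tensor:
    # Penalize sin(theta_pred - theta_target) with a smooth-L1 loss, so the loss
    # is periodic in the heading angle and vanishes when two boxes differ by pi
    # (the "flipped box" case), avoiding large gradients at the angle wrap-around.
    return F.smooth_l1_loss(torch.sin(theta_pred - theta_target),
                            torch.zeros_like(theta_pred))
```

Because this loss treats headings that differ by pi as equivalent, SECOND resolves the box direction with a separate direction classifier.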

## 2. Related Work

#### 2.1. Front-View- and Image-Based Methods

#### 2.2. Bird’s-Eye-View-Based Methods

#### 2.3. 3D-Based Methods

#### 2.4. Fusion-Based Methods

## 3. SECOND Detector

#### 3.1. Network Architecture

#### 3.1.1. Point Cloud Grouping

#### 3.1.2. Voxelwise Feature Extractor

#### 3.1.3. Sparse Convolutional Middle Extractor

#### Review of Sparse Convolutional Networks

#### Sparse Convolution Algorithm

The rule matrix, denoted **Rule**, specifies the input index i given the kernel offset k and the output index j. The inner sum in Equation (5) cannot be calculated via GEMM, so we need to gather the necessary input to construct the matrix, perform GEMM, and then scatter the data back. In practice, we can gather the data directly from the original sparse data by using a preconstructed input–output index rule matrix, which increases the speed. In detail, we construct a rule matrix table ${R}_{k,i,t}=R[k,i,t]$ with dimensions of $K\times {N}_{in}\times 2$, where K is the kernel size (expressed as a volume), ${N}_{in}$ is the number of input features and t is the input/output index. The elements $R[:,:,0]$ store the input indexes for gathering, and the elements $R[:,:,1]$ store the output indexes for scattering. The top part of Figure 2 shows our proposed algorithm.
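As a concrete illustration of the gather–GEMM–scatter procedure, here is a minimal NumPy sketch. The function name, the `rule_size` array (the number of valid index pairs per kernel offset) and the dense weight layout are illustrative assumptions, not the actual GPU implementation.

```python
import numpy as np

def gather_gemm_scatter(features, rule, rule_size, weights, n_out):
    """features:  (N_in, C_in) feature vectors of the active input voxels.
    rule:      (K, N_in, 2) rule matrix; rule[k, t, 0] is an input index and
               rule[k, t, 1] the matching output index for kernel offset k.
    rule_size: (K,) number of valid (input, output) pairs per kernel offset.
    weights:   (K, C_in, C_out) one filter matrix per spatial kernel offset.
    n_out:     number of active output sites.
    """
    output = np.zeros((n_out, weights.shape[2]), dtype=features.dtype)
    for k in range(weights.shape[0]):               # one GEMM per kernel offset
        n = int(rule_size[k])
        if n == 0:
            continue
        gathered = features[rule[k, :n, 0]]         # gather the needed inputs
        partial = gathered @ weights[k]             # GEMM
        np.add.at(output, rule[k, :n, 1], partial)  # scatter-add into outputs
    return output
```

In the actual detector, the gather, GEMM and scatter steps run on the GPU and operate only on the active voxels, which is the source of the speed advantage over dense convolution.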

#### Rule Generation Algorithm

Algorithm 1: 3D Rule Generation
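As a rough, sequential stand-in for the GPU rule generation of Algorithm 1, the sketch below builds the rule table for a stride-1 sparse 3D convolution. The padding convention, the dictionary used to deduplicate output coordinates, and all names are illustrative assumptions; the paper's algorithm performs the equivalent work in parallel on the GPU.

```python
import numpy as np
from itertools import product

def build_rules(in_coords, kernel_size=3):
    """in_coords: (N_in, 3) integer voxel coordinates of the active input sites.
    Returns (rule, rule_size, out_coords), where rule has shape (K, N_in, 2):
    rule[k, t, 0] is an input index and rule[k, t, 1] the matching output index
    for the k-th kernel offset.
    """
    offsets = list(product(range(kernel_size), repeat=3))   # K = kernel_size**3 offsets
    K, n_in = len(offsets), len(in_coords)
    rule = np.full((K, n_in, 2), -1, dtype=np.int64)
    rule_size = np.zeros(K, dtype=np.int64)
    out_index = {}                                           # output coordinate -> output index
    pad = (kernel_size - 1) // 2
    for k, off in enumerate(offsets):
        for i, c in enumerate(in_coords):
            # Output site that input i contributes to through offset `off`
            # (stride 1, "same" padding).
            out_c = tuple(int(c[d]) - off[d] + pad for d in range(3))
            j = out_index.setdefault(out_c, len(out_index))  # deduplicate output sites
            t = rule_size[k]
            rule[k, t, 0] = i                                # gather index
            rule[k, t, 1] = j                                # scatter index
            rule_size[k] += 1
    out_coords = np.array(list(out_index.keys()), dtype=np.int64)
    return rule, rule_size, out_coords
```

A nested Python loop like this is exactly the kind of rule-generation bottleneck that motivates the GPU-based algorithm proposed in the paper.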

#### Sparse Convolutional Middle Extractor

#### 3.1.4. Region Proposal Network

#### 3.1.5. Anchors and Targets

#### 3.2. Training and Inference

#### 3.2.1. Loss

#### Sine-Error Loss for Angle Regression

#### Focal Loss for Classification

#### Total Training Loss

#### 3.2.2. Data Augmentation

#### Sample Ground Truths from the Database

#### Object Noise

#### Global Rotation and Scaling

#### 3.2.3. Optimization

#### 3.2.4. Network Details

We use Conv2D($c_{out}$, **k**, **s**) and DeConv2D($c_{out}$, **k**, **s**) to denote (de)convolutional layers, where $c_{out}$ is the number of output channels, **k** is the kernel size and **s** is the stride. Because all layers have the same size across all dimensions, we use scalar values for **k** and **s**. All Conv2D layers have the same padding, and all DeConv2D layers have zero padding. In the first stage of our RPN, three Conv2D(128, 3, 1(2)) layers are applied. Then, five Conv2D(128, 3, 1(2)) layers and five Conv2D(256, 3, 1(2)) layers are applied in the second and third stages, respectively. In each stage, **s** = 2 only for the first convolutional layer; otherwise, **s** = 1. We apply a single DeConv2D(128, 3, **s**) layer for the last convolution in each stage, with **s** = 1, 2, and 4 for the three stages, sequentially. For pedestrian and cyclist detection, the only difference with respect to car detection is that the stride of the first convolutional layer in the RPN is 1 instead of 2.

## 4. Experiments

#### 4.1. Evaluation Using the KITTI Test Set

#### 4.2. Evaluation Using the KITTI Validation Set

#### 4.3. Analysis of the Detection Results

#### 4.3.1. Car Detection

#### 4.3.2. Pedestrian and Cyclist Detection

#### 4.4. Ablation Studies

#### 4.4.1. Sparse Convolution Performance

#### 4.4.2. Different Angle Encodings

#### 4.4.3. Sampling Ground Truths for Faster Convergence

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 379–387. [Google Scholar]
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. arXiv, 2017; arXiv:1703.06870. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. arXiv, 2017; arXiv:1711.07319. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156. [Google Scholar]
- Chen, X.; Kundu, K.; Zhu, Y.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals using stereo imagery for accurate object class detection. IEEE Trans. Pattern Anal. Mach. Intell. **2018**, 40, 1259–1272. [Google Scholar] [CrossRef] [PubMed]
- Kitti 3D Object Detection Benchmark Leader Board. Available online: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d (accessed on 28 April 2018).
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 3. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. arXiv, 2017; arXiv:1712.02294. [Google Scholar]
- Du, X.; Ang Jr, M.H.; Karaman, S.; Rus, D. A general pipeline for 3D detection of vehicles. arXiv, 2018; arXiv:1803.00387. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. arXiv, 2017; arXiv:1711.08488. [Google Scholar]
- Wang, D.Z.; Posner, I. Voting for Voting in Online Point Cloud Object Detection. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; Volume 1. [Google Scholar]
- Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1355–1361. [Google Scholar]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv, 2017; arXiv:1711.06396. [Google Scholar]
- Li, B. 3D fully convolutional network for vehicle detection in point cloud. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1513–1518. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Mousavian, A.; Anguelov, D.; Flynn, J.; Košecká, J. 3D bounding box estimation using deep learning and geometry. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5632–5640. [Google Scholar]
- Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3D lidar using fully convolutional network. arXiv, 2016; arXiv:1608.07916. [Google Scholar]
- Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-YOLO: Real-time 3D Object Detection on Point Clouds. arXiv, 2018; arXiv:1803.06199. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-Time 3D Object Detection From Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7652–7660. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 4. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114. [Google Scholar]
- Li, Y.; Bu, R.; Sun, M.; Chen, B. PointCNN. arXiv, 2018; arXiv:1801.07791. [Google Scholar]
- Graham, B. Spatially-sparse convolutional neural networks. arXiv, 2014; arXiv:1409.6070. [Google Scholar]
- Graham, B. Sparse 3D convolutional neural networks. arXiv, 2015; arXiv:1505.02890. [Google Scholar]
- Graham, B.; van der Maaten, L. Submanifold Sparse Convolutional Networks. arXiv, 2017; arXiv:1706.01307. [Google Scholar]
- Graham, B.; Engelcke, M.; van der Maaten, L. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In Proceedings of the IEEE Computer Vision and Pattern Recognition CVPR, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Song, S.; Xiao, J. Deep sliding shapes for amodal 3D object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 808–816. [Google Scholar]
- Vasudevan, A.; Anderson, A.; Gregg, D. Parallel multi channel convolution using general matrix multiplication. In Proceedings of the 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Seattle, WA, USA, 10–12 July 2017; pp. 19–24. [Google Scholar]
- SparseConvNet Project. Available online: https://github.com/facebookresearch/SparseConvNet (accessed on 28 April 2018).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Amsterdam, The Netherlands, 2016; pp. 21–37. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. arXiv, 2017; arXiv:1708.02002. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]

**Figure 1.** The structure of our proposed SECOND detector. The detector takes a raw point cloud as input, converts it to voxel features and coordinates, and applies two VFE (voxel feature encoding) layers and a linear layer. Then, a sparse CNN is applied. Finally, an RPN generates the detections.

**Figure 2.** The sparse convolution algorithm is shown above, and the GPU rule generation algorithm is shown below. ${N}_{in}$ denotes the number of input features, and ${N}_{out}$ denotes the number of output features. N is the number of gathered features. $Rule$ is the rule matrix, where $Rule[i,:,:]$ is the i-th rule, corresponding to the i-th kernel matrix in the convolution kernel. Boxes in any color other than white indicate points with sparse data; white boxes indicate empty points.

**Figure 3.** The structure of our proposed sparse middle feature extractor. The yellow boxes represent sparse convolution, the white boxes represent submanifold convolution, and the red box represents the sparse-to-dense layer. The upper part of the figure shows the spatial dimensions of the sparse data.

**Figure 4.** The detailed structure of the RPN. Blue boxes represent convolutional layers, purple boxes represent layers for concatenation, sky blue boxes represent stride-2 downsampling convolutional layers, and brown boxes represent transpose convolutional layers.

**Figure 5.** Results of 3D detection on the KITTI test set. For better visualization, the 3D boxes detected using LiDAR are projected onto images from the left camera.

**Figure 6.** Results of detection on the KITTI validation set. In each image, a green box indicates successful detection, a red box indicates detection with low accuracy, a gray box indicates a false negative, and a blue box indicates a false positive. The digit and letter beside each box represent the instance ID and the class, respectively, with “V” denoting a car, “P” denoting a pedestrian and “C” denoting a cyclist. In the point clouds, green boxes indicate ground truths, and blue boxes indicate detection results.

**Figure 7.** Sampling vs. nonsampling methods: 3D mAP evaluation on the KITTI validation set (Car class, moderate difficulty).

**Table 1.** Comparison of the execution speeds of various convolution implementations, in milliseconds. SparseConvNet [31] is the official implementation of submanifold convolution [27]. All benchmarks were run on a GTX 1080 Ti GPU with data from the KITTI dataset.

**Sparse Convolution (1 layer)**

| Channels | SECOND | SpConvNet [31] | Dense |
|---|---|---|---|
| 64 × 64 | 8.6 | 21.2 | 567 |
| 128 × 128 | 13.8 | 24.8 | 1250 |
| 256 × 256 | 25.3 | 37.4 | N/A |
| 512 × 512 | 58.7 | 86.0 | N/A |

**Submanifold Convolution (4 layers)**

| Channels | SECOND | SpConvNet [31] | Dense |
|---|---|---|---|
| 64 × 64 | 7.1 | 16.0 | N/A |
| 128 × 128 | 11.3 | 21.5 | N/A |
| 256 × 256 | 20.4 | 37.0 | N/A |
| 512 × 512 | 49.0 | 94.1 | N/A |

**Table 2.** 3D detection performance: Average precision (AP) (in %) for 3D boxes in the KITTI test set. In AVOD and AVOD-FPN [9], a custom 85/15 training/validation split and ground plane estimation are adopted to improve the results. For F-PointNet [11], a GTX 1080 GPU, which has 67% of the peak performance of a GTX 1080 Ti (used for our method) or a Titan Xp (used for AVOD), was used for inference. Bold numbers indicate the best result in each column.

| Method | Time (s) | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|
| MV3D [8] | 0.36 | 71.09 | 62.35 | 55.12 | N/A | N/A | N/A | N/A | N/A | N/A |
| MV3D (LiDAR) [8] | 0.24 | 66.77 | 52.73 | 51.31 | N/A | N/A | N/A | N/A | N/A | N/A |
| F-PointNet [11] | 0.17 | 81.20 | 70.39 | 62.19 | **51.21** | **44.89** | **40.23** | **71.96** | **56.77** | **50.39** |
| AVOD [9] | 0.08 | 73.59 | 65.78 | 58.38 | 38.28 | 31.51 | 26.98 | 60.11 | 44.90 | 38.80 |
| AVOD-FPN [9] | 0.1 | 81.94 | 71.88 | **66.38** | 46.35 | 39.00 | 36.58 | 59.97 | 46.12 | 42.36 |
| VoxelNet (LiDAR) [14] | 0.23 | 77.47 | 65.11 | 57.73 | 39.48 | 33.69 | 31.51 | 61.22 | 48.36 | 44.37 |
| SECOND | 0.05 | **83.13** | **73.66** | 66.20 | 51.07 | 42.56 | 37.29 | 70.51 | 53.85 | 46.90 |

**Table 3.** Bird’s eye view detection performance: Average precision (AP) (in %) for BEV boxes in the KITTI test set.

| Method | Time (s) | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|
| MV3D [8] | 0.36 | 86.02 | 76.90 | 68.49 | N/A | N/A | N/A | N/A | N/A | N/A |
| MV3D (LiDAR) [8] | 0.24 | 85.82 | 77.00 | 68.94 | N/A | N/A | N/A | N/A | N/A | N/A |
| F-PointNet [11] | 0.17 | 88.70 | 84.00 | 75.33 | 58.09 | 50.22 | 47.20 | 75.38 | 61.96 | 54.68 |
| AVOD [9] | 0.08 | 86.80 | 85.44 | 77.73 | 42.51 | 35.24 | 33.97 | 63.66 | 47.74 | 46.55 |
| AVOD-FPN [9] | 0.1 | 88.53 | 83.79 | 77.90 | 50.66 | 44.75 | 40.83 | 62.39 | 52.02 | 47.87 |
| VoxelNet (LiDAR) [14] | 0.23 | 89.35 | 79.26 | 77.39 | 46.13 | 40.74 | 38.11 | 66.70 | 54.76 | 50.55 |
| SECOND | 0.05 | 88.07 | 79.37 | 77.95 | 55.10 | 46.27 | 44.76 | 73.67 | 56.04 | 48.78 |

**Table 4.** 3D detection performance: Average precision (AP) (in %) for 3D boxes in the KITTI validation set.

| Method | Time (s) | Easy | Moderate | Hard |
|---|---|---|---|---|
| MV3D [8] | 0.36 | 71.29 | 62.68 | 56.56 |
| F-PointNet [11] | 0.17 | 83.76 | 70.92 | 63.65 |
| AVOD-FPN [9] | 0.1 | 84.41 | 74.44 | 68.65 |
| VoxelNet [14] | 0.23 | 81.97 | 65.46 | 62.85 |
| SECOND | 0.05 | 87.43 | 76.48 | 69.10 |
| SECOND (small) | 0.025 | 85.50 | 75.04 | 68.78 |

**Table 5.** Bird’s eye view detection performance: Average precision (AP) (in %) for BEV boxes in the KITTI validation set.

| Method | Time (s) | Easy | Moderate | Hard |
|---|---|---|---|---|
| MV3D [8] | 0.36 | 86.55 | 78.10 | 76.67 |
| F-PointNet [11] | 0.17 | 88.16 | 84.02 | 76.44 |
| VoxelNet [14] | 0.23 | 89.60 | 84.81 | 78.57 |
| SECOND | 0.05 | 89.96 | 87.07 | 79.66 |
| SECOND (small) | 0.025 | 89.79 | 86.20 | 79.55 |

**Table 6.** A comparison of the performance of different angle encoding methods on the KITTI validation set for the Car class.

| Method | Easy | Moderate | Hard |
|---|---|---|---|
| Vector [9] | 85.99 | 74.79 | 67.82 |
| SECOND | 87.43 | 76.48 | 69.10 |

