3D Object Detection Using Frustums and Attention Modules for Images and Point Clouds

Abstract: Three-dimensional (3D) object detection is essential in autonomous driving. A 3D Lidar sensor can capture three-dimensional objects, such as vehicles, cycles, pedestrians, and other objects on the road. Although Lidar can generate point clouds in 3D space, it still lacks the fine resolution of 2D information. Therefore, Lidar and camera fusion has gradually become a practical method for 3D object detection. Previous strategies focused on the extraction of voxel points and the fusion of feature maps. However, the biggest challenge is extracting enough edge information to detect small objects. To solve this problem, we found that attention modules are beneficial in detecting small objects. In this work, we combined Frustum ConvNet and attention modules for the fusion of images from a camera and point clouds from a Lidar. Multilayer Perceptron (MLP) and tanh activation functions were used in the attention modules. Furthermore, the attention modules were designed on PointNet to perform multilayer edge detection for 3D object detection. Compared with a previous well-known method, Frustum ConvNet, our method achieved competitive results, with an improvement of 0.27%, 0.43%, and 0.36% in Average Precision (AP) for 3D object detection in the easy, moderate, and hard cases, respectively, and an improvement of 0.21%, 0.27%, and 0.01% in AP for Bird's Eye View (BEV) object detection in the easy, moderate, and hard cases, respectively, on the KITTI detection benchmarks. Our method also obtained the best results in four categories in AP on the indoor SUN-RGBD dataset for 3D object detection.


Introduction
The detection of object instances in 3D sensory data has tremendous importance in many applications. Three-dimensional (3D) sensing can acquire more abundant and comprehensive environmental information. Therefore, it is widely used in robot navigation, autonomous driving, Augmented Reality (AR), and industrial inspection.
Point cloud and RGB image fusion can simultaneously extract 2D and 3D features by using a neural network, and objects can be detected with higher accuracy by considering both kinds of information. With the progress of point cloud processing [1,2], 3D object detection methods [3,4] can learn features directly from point clouds. For example, PointNets [3,4] are capable of classifying a whole point cloud or predicting a semantic class for each point in the point cloud. Three-dimensional (3D) point clouds are usually transformed into images or voxel grids [5] before being fed to PointNet [3,4], which shows good performance in 3D object detection. However, the weakness of PointNet [3,4] and PointNet++ [4] is that an oriented 3D bounding box cannot be estimated. A new frustum scheme was proposed by F-PointNets [2] and Frustum ConvNet [1], which use RGB-D data and multilayer 2D region proposals to help segment the point clouds from the 3D space; the global features are obtained by combining local features. F-PointNets use T-Net [2] to determine the position and direction of a 3D bounding box. The disadvantage of the frustum scheme is that objects with unclear boundaries and small-scale instances are difficult to detect.
To solve this problem, we would like to refer to the attention modules used in 2D object detection methods. Guo et al. [5] proposed a method based on a Gaussian Mixture Model (GMM), in which attention modules were improved by colour, intensity, and orientation feature maps. The attention modules focused on interesting areas to enhance the features of the edge information and small objects. Fan et al. [6] proposed a Region Proposal Network (RPN) with an attention module, enabling the detector to pay attention to objects with high resolution while perceiving the surroundings with low resolution. These works inspired us to use attention modules for object detection in 3D point clouds.
In our work, we developed the images and 3D point clouds fusion method to improve 3D object detection. A new attention module was designed with Frustum ConvNet [1] to enhance feature extraction and improve small object detection. We added attention modules to the input layer of Multilayer Perceptron (MLP) in PointNet [3,4]. We used the tanh activation function to extract and strengthen the attention of a small object, which can improve small object detection effectively.
In this paper, we propose a Frustum ConvNet with attention modules for 3D object detection, in which both images and point clouds are used. The contributions of this paper are as follows:

1. In the PointNet of Frustum ConvNet, we added the Convolutional Block Attention Module (CBAM) [7] at the hidden layer of the Multilayer Perceptron (MLP) to improve the detection accuracy. The CBAM attention module can improve the contrast between the object and the surrounding environment.

2. We propose an improved attention module by adding a Multilayer Perceptron (MLP) and using the tanh activation function. The tanh function is applied to the average-pooling and max-pooling features. Because the mean of the tanh activation function is 0, it can cope with cases where the feature values differ greatly. Finally, the feature information of the pooling layers is fused through the sigmoid function.

3. We evaluated our approach on the KITTI [8] object detection benchmark. Compared with the state-of-the-art method, our method achieved competitive results, with an improvement of 0.27%, 0.43%, and 0.36% in Average Precision (AP) for 3D object detection in the easy, moderate, and hard cases, respectively, and an improvement of 0.21%, 0.27%, and 0.01% in AP for Bird's Eye View (BEV) object detection in the easy, moderate, and hard cases, respectively, on the KITTI detection benchmarks. Our method also obtains the best results in four categories in AP on the indoor SUN-RGBD [9] dataset for 3D object detection.
The rest of this paper is organized as follows. Section 2 introduces the previous 3D object detection methods. Section 3 describes the architecture of Frustum ConvNet with Attention Module (FCAM). Section 4 presents the results of our experiments. We conclude in Section 5.

Related Works
This section briefly introduces previous 3D object detection methods and related attention works. We organize our review into three categories of technical approaches, namely 3D object detection from point clouds, attention modules in object detection, and activation functions in neural networks.

Three-Dimensional (3D) Object Detection from Point Clouds
Three-dimensional (3D) voxel patterns (3DVPs) [10] employ a set of Aggregate Channel Feature (ACF) [11] detectors to perform 2D detection and 3D pose estimation. The Multiview 3D Object Detection Network (MV3D) [12] proposed a sensory-fusion framework that takes both Lidar point clouds and RGB images as inputs and predicts oriented 3D bounding boxes. Different from MV3D [12], Li et al. [13] and Song et al. [14] converted the features in point clouds into a voxel grid to improve accuracy at the cost of a large amount of computation. VoxelNet [15] proposed a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. In this method, 3D object detection can operate directly on sparse 3D points and capture 3D shape information effectively.

Attention Module in Object Detection
Recently, some methods have been put forward to incorporate attention processing to improve the performance of CNNs in 2D-based large-scale classification tasks. Wang et al. [16] proposed a Residual Attention Network, which can incorporate a state-of-the-art feed-forward network architecture in an end-to-end training fashion. This network can extract a large amount of attention information without interruption. Hu et al. [17] introduced a Squeeze-and-Excitation module that adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. This method achieves an improvement with little cost in computation and speed. The Bottleneck Attention Module (BAM) [18] and Convolutional Block Attention Module (CBAM) [7] added spatial attention to increase accuracy. These attention models performed well for 2D object detection.

Activation Function in Neural Network
There is now a consensus that for deep networks, rectified linear units (ReLUs) are easier to train than logistic or tanh units, which were used for many years [19,20]. However, Le et al. [21] noticed that ReLUs seem inappropriate for RNNs because of the possibility that large output values may explode out of the bounded values. Ang-bo et al. [22] noticed that tanh alleviates the phenomenon of mean shift. Li et al. [23] noticed that the output of the tanh function can enhance the values activated by ReLU units. This inspired us to use the fusion activation function in the 3D object detection network.
Based on the Frustum architecture [1,2] and the attention module [7], we developed a new 3D object detection network by integrating the attention modules with Frustum ConvNet. Based on the advantages of the two functions, we fuse the ReLUs and tanh functions in the attention module to achieve higher accuracy. Our proposed method achieved competitive results in KITTI detection benchmarks.

Frustum ConvNet with Attention (FCAM)
The architecture of our 3D object detection network using frustums and attention modules is shown in Figure 1. This network connects discrete, disordered points from the frustums by using Fully Convolutional Networks (FCNs) [24], thus achieving 3D box-oriented estimation in a continuous 3D space. We first describe the structure of Frustum ConvNet in Section 3.1. Frustum ConvNet [1,2] uses PointNet [3,4] to extract and aggregate point-wise features as frustum-level feature vectors. Section 3.2 describes our improved attention modules, which add a Multilayer Perceptron (MLP) and use the tanh activation function in the PointNet architecture.

Frustum ConvNet
We designed the 3D object detector based on the framework of Frustum ConvNet, as shown in Figure 1a. First, frustum-level features are obtained from each frustum through PointNet and the attention modules. The PointNets are applied to each frustum, and the PointNets with shared weights form the parallel streams of Frustum ConvNet [1]. Each PointNet takes n points as the input; the centre coordinates of the frustum are subtracted from the coordinates of the 3D points to constitute its input vector, and the output is a d-dimensional feature vector. The L frustum-level vectors are re-formed into a 2D feature map F of size L × d, which is used as the input of a subsequent Fully Convolutional Network (FCN) [24] for 3D prediction. Finally, we use the detection head for classification and regression.
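The frustum-level feature extraction described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the layer sizes, random weights, and point counts are placeholders, and only the shared-MLP-plus-max-pooling aggregation and the resulting L × d feature map are faithful to the text.

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same per-point MLP (shared weights) to every point.
    points: (n, c_in) array of frustum points, centred on the frustum centroid."""
    h = points
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)  # linear layer + ReLU
    return h  # (n, d)

def frustum_feature(points, weights, biases):
    """PointNet-style aggregation: point-wise MLP followed by max pooling
    yields one d-dimensional feature vector per frustum."""
    return shared_mlp(points, weights, biases).max(axis=0)  # (d,)

rng = np.random.default_rng(0)
L, n, c_in, d = 4, 128, 3, 16  # L frustums with n points each (placeholder sizes)
weights = [rng.normal(size=(c_in, 32)) * 0.1, rng.normal(size=(32, d)) * 0.1]
biases = [np.zeros(32), np.zeros(d)]

# Stack the per-frustum vectors into the L x d feature map fed to the FCN.
feature_map = np.stack([
    frustum_feature(rng.normal(size=(n, c_in)), weights, biases)
    for _ in range(L)
])
print(feature_map.shape)  # (4, 16)
```

The max pooling makes the frustum feature invariant to the ordering of the points, which is why discrete, disordered points can be aggregated this way.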
In the PointNet, we apply our improved Convolutional Block Attention Modules (CBAMs) in the hidden layer of Multilayer Perceptron (MLP), as shown in Figure 1b. Output features of the CBAM attention module are multiplied with the input feature of the MLP to obtain the final fused features.

The Improved CBAM Attention Model for Point Cloud Detection
The original CBAM attention model is shown in Figure 2a, and our improved CBAM attention model is shown in Figure 2b. In this section, we briefly introduce the original CBAM and then explain the improvements of our proposed attention model. The original CBAM attention module consists of a channel attention block and a spatial attention block. In the channel attention block, the input feature F1 passes through max pooling and average pooling and then through a shared MLP with a reduction ratio r1 = 16. The two branches are summed and followed by a sigmoid activation to generate the final channel attention map. In the spatial attention block, channel-wise global max pooling and global average pooling are applied to the input feature F2, and the concatenated maps then pass through a convolution layer with a 7 × 7 kernel, which reduces the dimension to one channel. Finally, the spatial attention features are generated by the sigmoid function.
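The original CBAM described above can be sketched numerically as follows, using the stated settings (reduction ratio r1 = 16 and a 7 × 7 spatial kernel). All weights here are random placeholders; the sketch only illustrates the data flow, not trained behaviour.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """CBAM channel attention: spatial avg/max pooling, a shared MLP with
    reduction ratio r1 (W0: C x C/r1, W1: C/r1 x C), summed, then sigmoid."""
    avg = F.mean(axis=(1, 2))                     # (C,)
    mx = F.max(axis=(1, 2))                       # (C,)
    mlp = lambda v: np.maximum(v @ W0, 0.0) @ W1  # shared weights, ReLU hidden
    return sigmoid(mlp(avg) + mlp(mx))            # (C,)

def spatial_attention(F, kernel):
    """CBAM spatial attention: channel-wise avg/max maps, a 7x7 conv down to
    one channel, then sigmoid. kernel: (2, 7, 7)."""
    C, H, W = F.shape
    maps = np.stack([F.mean(axis=0), F.max(axis=0)])  # (2, H, W)
    pad = np.pad(maps, ((0, 0), (3, 3), (3, 3)))      # same-size output
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(pad[:, i:i + 7, j:j + 7] * kernel)
    return sigmoid(out)                               # (H, W)

rng = np.random.default_rng(0)
C, H, W, r1 = 32, 8, 8, 16
F1 = rng.normal(size=(C, H, W))
W0 = rng.normal(size=(C, C // r1)) * 0.1
W1 = rng.normal(size=(C // r1, C)) * 0.1
kernel = rng.normal(size=(2, 7, 7)) * 0.1

Mc = channel_attention(F1, W0, W1)
F2 = F1 * Mc[:, None, None]          # refine channels first
Ms = spatial_attention(F2, kernel)
F_out = F2 * Ms[None, :, :]          # then refine spatial locations
print(F_out.shape)  # (32, 8, 8)
```

Note the sequential arrangement: the channel-refined feature F2 is the input of the spatial attention block, matching the description above.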
The challenge of 3D object detection using PointNet [3,4] is the feature missing problem, especially when the characteristics are quite different. In the whole Frustum Con-vNet [1], when a 2D proposal is converted into a 1D vector, some feature information may be lost to some extent. To solve this problem, we designed a new kind of attention module to improve the detection capability of our proposed 3D object detector.
If an input falls into the region where x < 0, the gradient of a ReLU neuron becomes 0. This phenomenon is called the Dead ReLU problem [22], and it can prevent the regression of the model from converging. To solve the Dead ReLU problem of the ReLU function and enhance the feature extraction ability of the attention modules, the tanh function is used as an auxiliary optimization function in our new structure.
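A small numerical example illustrates the two gradient behaviours: the ReLU gradient is exactly 0 for every negative input (the Dead ReLU region), while the tanh gradient is nonzero everywhere but decays for large |x| (the saturation behaviour discussed below).

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])

# ReLU gradient: 0 for every negative input -- those neurons are "dead".
relu_grad = (x > 0).astype(float)

# tanh gradient: 1 - tanh(x)^2, nonzero everywhere but vanishing for |x| >> 1.
tanh_grad = 1.0 - np.tanh(x) ** 2

print(relu_grad)                  # [0. 0. 1. 1.]
print(np.round(tanh_grad, 3))     # [0.071 0.786 0.786 0.071]
```

This complementarity is the motivation for fusing the two activations: tanh supplies gradient where ReLU is dead (x < 0), and ReLU supplies gradient where tanh saturates (large positive x).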
Based on the FCN [24] and the fusion idea of U-Net [25], we added a parallel Multilayer Perceptron (MLP) architecture to the CBAM attention module and used the tanh activation function to enhance the contrast between the object and the background. A Multilayer Perceptron (MLP) can better fit nonlinear regions and performs well when dealing with deep networks and large amounts of information. To prevent overfitting due to too many parameters, we used different reduction ratios in the two MLPs to reduce the input and output channels; here, the reduction ratios are r1 = 16 and r2 = 32. Furthermore, the tanh function can alleviate the mean deviation problem of the ReLU function within [-1, 1], and the tanh function performs better when the feature values are quite different. However, the tanh function suffers from gradient disappearance outside of [-1, 1], as shown in Figure 3a. The gradient disappearance problem can be solved by the ReLU function when x > 0, as shown in Figure 3b. To exploit the advantages of the two activation functions in the attention modules, we fused the two average pooling vectors by element-wise summation in the final step, as shown in Figure 3b. Because the sigmoid function has a range of [0, 1], we used it as the output layer activation function to represent the prediction probability; it filters out the unimportant parts and retains the important feature information.

In our improved CBAM attention module, feature maps of channel attention are obtained by using max pooling and average pooling. The feature map then passes through two MLPs, each consisting of two fully connected layers, which use the ReLU activation function and the tanh activation function, respectively, as shown in the channel attention block of Figure 2b. The output feature value F3 can be computed by using Equations (1)-(4), where W0 ∈ R^((C/r1)×C), W1 ∈ R^(C×(C/r1)), W2 ∈ R^((C/r2)×C), and W3 ∈ R^(C×(C/r2)).
W0, W1, W2, and W3 are the weights of the four fully connected layers in the two MLP networks. Wc1 and Wc2 are the weights of F2^1 and F2^2, which are the element-wise summation features produced by the two different MLP networks. C is the number of channels, and R is the real number field. f^(7×7) represents a convolution operation with a filter size of 7 × 7.
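Our reading of this improved channel attention can be sketched as follows. This is a NumPy sketch based on the description above, not the authors' code: the exact form of the fusion weights Wc1 and Wc2 is an assumption (taken here as scalars set to 1), and all MLP weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def improved_channel_attention(F, W0, W1, W2, W3, wc1=1.0, wc2=1.0):
    """Improved CBAM channel attention (sketch): two parallel shared MLPs over
    the pooled vectors -- a ReLU MLP with reduction ratio r1 = 16 (W0, W1) and
    a tanh MLP with r2 = 32 (W2, W3) -- fused by element-wise summation and a
    sigmoid. wc1, wc2 are assumed scalar fusion weights."""
    avg = F.mean(axis=(1, 2))  # average-pooled channel vector
    mx = F.max(axis=(1, 2))    # max-pooled channel vector
    mlp_relu = lambda v: np.maximum(v @ W0, 0.0) @ W1
    mlp_tanh = lambda v: np.tanh(v @ W2) @ W3
    F2_1 = mlp_relu(avg) + mlp_relu(mx)  # ReLU-branch summation feature
    F2_2 = mlp_tanh(avg) + mlp_tanh(mx)  # tanh-branch summation feature
    return sigmoid(wc1 * F2_1 + wc2 * F2_2)

rng = np.random.default_rng(0)
C, r1, r2 = 64, 16, 32
F = rng.normal(size=(C, 8, 8))
W0 = rng.normal(size=(C, C // r1)) * 0.1   # R^((C/r1) x C) transposed for row vectors
W1 = rng.normal(size=(C // r1, C)) * 0.1
W2 = rng.normal(size=(C, C // r2)) * 0.1   # R^((C/r2) x C) transposed likewise
W3 = rng.normal(size=(C // r2, C)) * 0.1

Mc = improved_channel_attention(F, W0, W1, W2, W3)
print(Mc.shape)  # (64,)
```

The larger reduction ratio r2 = 32 in the tanh branch keeps the added parameter count small, which matches the roughly 11k-parameter difference between Frustum ConvNet and FCAM reported in Section 4.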

Experimental Results
We evaluated our 3D object detector on the KITTI benchmarks [8] for 3D car detection. We performed the experiments on a TITAN X GPU and developed the code with PyTorch version 1.1. Evaluating each image takes 0.005 s. Our experiment was based on Frustum ConvNet [1] and tested on the vehicles of the KITTI dataset [8]. We applied the attention module in PointNet and added a parallel Multilayer Perceptron (MLP) architecture with the tanh activation function to the CBAM attention module. The number of parameters in Frustum ConvNet [1] is 3,340,089, and the number of parameters in FCAM is 3,351,353. To reduce the model size of the network and prevent overfitting, we used two attention modules and increased the reduction ratio from 16 to 32. A larger reduction ratio reduces the parameter overhead and improves the speed of our method.
KITTI: The KITTI dataset [8] contains 7481 training pairs and 7518 testing pairs of RGB images and corresponding point clouds. Following an existing work [15], we split the original training set into new training and validation sets of 3712 and 3769 samples, respectively. The learning rate starts from 0.001 and decays by a factor of 10 every 22 epochs of the total 50 epochs.
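The learning rate schedule above can be written as a small helper. This is a sketch of our reading of the text, assuming a step decay applied at every 22nd epoch of the 50-epoch run:

```python
def learning_rate(epoch, base_lr=1e-3, step=22, gamma=0.1):
    """Step schedule: start at 1e-3 and decay by a factor of 10 every `step`
    epochs (epochs 0-21 -> 1e-3, 22-43 -> 1e-4, 44-49 -> 1e-5)."""
    return base_lr * gamma ** (epoch // step)

# Learning rates over the 50-epoch run described in the text.
schedule = [learning_rate(e) for e in range(50)]
print(schedule[0], schedule[22], schedule[44])
```

The same schedule can be reproduced in PyTorch with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=22, gamma=0.1)`.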
Metrics: We evaluated 3D object proposals using 3D box recall as the metric. For 3D localization, we projected the 3D boxes to the ground plane. We used 3D object detection AP to evaluate the accuracy of the 3D object detection. We used BEV object detection AP to evaluate the accuracy of the BEV object detection.
We will now explain the experimental ablation results. Tables 1 and 2 show the detection performance of the improved CBAM on the KITTI validation set for 3D object detection and Bird's Eye View (BEV) object detection, respectively. T denotes the running time to process each image. Compared with Frustum ConvNet [1] + CBAM [7], our results showed an improved AP of 0.09% in BEV object detection for the easy category. For 3D object detection, our results showed improved AP of 0.15% and 0.13% in the moderate and hard categories, respectively. Because the attention model combines two activation functions, our attention module slightly improves the accuracy, but the running time is longer than before. Figure 4 shows the convergence curves of Frustum, Frustum + CBAM, and Frustum + Improved CBAM; Frustum + Improved CBAM achieves the highest accuracy on average among these three methods.

Tables 3 and 4 show the detection performance of our proposed FCAM on the KITTI validation set for 3D object detection and Bird's Eye View (BEV) object detection, respectively. Compared with Frustum ConvNet [1], our results showed improved AP of 0.21%, 0.27%, and 0.01% in BEV object detection for the easy, moderate, and hard categories, respectively. For 3D object detection, our results showed improved accuracy of 0.27%, 0.43%, and 0.36% in the easy, moderate, and hard cases, respectively. However, our method did not achieve the best results in 3D hard detection. The reason is that the attention model targets objects with obvious image features; when the features are not obvious or the objects are occluded, the detection results can be affected. By improving the channel attention block, accuracy is improved at the cost of running time. However, our method is fast enough for real-time applications, being able to process 200 images per second.
We also evaluated our 3D object detector on the indoor SUN-RGBD [9] test set for 3D object detection. Table 5 shows the detection performance of our proposed FCAM on the SUN-RGBD test set. We tested 5198 images and compared the results with Frustum ConvNet [1]. Our method achieved competitive results, with an improvement of 0.46% in Average Precision (AP) for 3D object detection. Our method achieved the best AP results in four categories, namely bed (1.67%), chair (0.92%), dresser (1.41%), and sofa (0.3%). On average, our method shows the best results, as can be seen in the last column of Table 5.

Conclusions and Future Works
This paper proposed Frustum ConvNet with an improved CBAM attention model for 3D object detection. We proposed an improved attention module by adding a Multilayer Perceptron (MLP) and using the tanh activation function to improve the contrast between the object and the surrounding environment. We evaluated the proposed Frustum ConvNet with the attention model (FCAM) on the KITTI dataset and achieved competitive results compared with the state-of-the-art methods. This Frustum ConvNet with attention architecture can support applications such as autonomous driving and robotic object manipulation.
In the future, we plan to further improve the performance of our 3D object detector. Our proposed attention model does not perform well when the network architecture is relatively complex, and it is difficult for the attention model to focus on occluded objects in a complex environment. We plan to change the network architecture, reduce the number of parameters, and further improve the adaptability of the attention modules.