Sensors
  • Article
  • Open Access

20 June 2023

Boosting 3D Object Detection with Density-Aware Semantics-Augmented Set Abstraction

1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
3 China Automotive Innovation Corporation, Nanjing 210000, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Radar Sensors

Abstract

In recent years, point cloud-based 3D object detection has seen tremendous success. Previous point-based methods use Set Abstraction (SA) to sample key points and abstract their features, but they do not fully take density variation into consideration during point sampling and feature extraction. The SA module can be split into three parts: point sampling, grouping and feature extraction. Previous sampling methods focus on distances among points in Euclidean space or feature space and ignore point density, which makes them more likely to sample points in Ground Truth (GT) boxes that contain dense points. Furthermore, the feature extraction module takes only the relative coordinates and point features as input, whereas raw point coordinates can express more informative attributes, e.g., point density and direction angle. This paper therefore proposes Density-Aware Semantics-Augmented Set Abstraction (DSASA) to address these two issues: it takes a deep look at point density during sampling and enhances point features using only the raw point coordinates. We conduct experiments on the KITTI dataset and verify the superiority of DSASA.

1. Introduction

Due to its numerous applications in fields such as robotics, virtual reality and autonomous vehicles, 3D object detection has drawn significant attention. LiDAR sensors, which capture the environment surrounding the host vehicle, have been broadly employed in autonomous driving systems. Compared to cameras, LiDAR obtains precise 3D contours of objects, thereby enhancing the performance of 3D object detection.
Point cloud-based 3D object detection methods can be roughly classified into three categories, namely point-based, voxel-based and hybrid detectors. Voxel-based methods discretize points into regular grids and use sparse 3D convolution [,] to extract voxel features. However, voxel-based methods cannot avoid the quantization loss caused by voxelization. Point-based methods build on the pioneering PointNet series [,,], which operates directly on raw point clouds to obtain point-level features. Hybrid methods [,] fuse the two approaches, exploiting both the efficiency of voxel-based methods and the highly accurate contextual information extracted by point-based methods.
This paper focuses only on point-based methods, which do not introduce quantization loss. Multiple point-based methods [,,] use PointNet++ [] and its variants as their backbone, in which the SA module is the most important component. The SA module can be divided into three steps: sampling, grouping and feature extraction. Farthest Point Sampling (FPS) [] is commonly used in the sampling step, aiming to sample key points that are evenly distributed throughout the entire point cloud. Nevertheless, FPS only considers the distribution balance of the sampled points; it cannot ensure that the sampled points are related to the objects. Previous methods therefore tend to sample more foreground points to increase the recall rate [,,]. Despite the considerable success of these sampling techniques, they cannot avoid the sampling density imbalance among foreground objects: they may sample many points in objects with dense points and few points in objects with sparse points, which leads to sampling variance among foreground objects of various densities. It is worth noting that objects with dense points often carry sufficient contextual information, so it is unnecessary to sample more points in them than in distant or occluded objects, which may have far fewer points. This problem can be alleviated by combining distances in Euclidean space and feature space, but this paper argues that explicitly including the density variance is more straightforward. In this paper, we propose a new sampling strategy called Density-Semantics-Aware Farthest Point Sampling (DS-FPS), which takes both the point-level confidence score and the density into account.
In previous work, the SA module primarily focuses on the high-level features of points and only encodes the low-level relative coordinates. However, the raw coordinates of the points contain valuable information that expresses spatial position relations. Therefore, we propose the Raw Coordinate Enhancement (RCE) module to further capture the local context with a minimal increase in computing resources. To sum up, our contributions are as follows:
  • We propose the DSASA framework, which includes DS-FPS and the RCE module to balance the foreground point sampling and enhance the point features.
  • We conduct experiments to verify that DS-FPS alleviates the sampling imbalance and that the RCE module improves performance with a negligible increase in computing resources.
  • The evaluation conducted on the KITTI [] 3D benchmark shows that DSASA outperforms other single-stage point-based detectors under the same experimental environment in outdoor scenarios.

3. Methods

In this section, we first overview the vanilla SA module in Section 3.1. Then, we introduce the architecture of DSASA in Section 3.2 and describe DS-FPS and RCE in detail.

3.1. Preliminary

The vanilla SA module can be split into three parts: (i) sampling, (ii) grouping and (iii) feature extraction, which is shown in Figure 2.
Figure 2. Overview of the SA module. The SA module first down-samples the points; it then uses different query radii to group points, and the feature extraction module repeats an MLP (linear layer, BatchNorm (BN) layer and ReLU) n times to better abstract the features. A MaxPooling layer followed by the concatenation of the multi-scale point features yields the final sampled point features.

3.1.1. Sampling

FPS is the most commonly used sampling method; it guarantees that the sampled points are evenly distributed in 3D space. Many researchers refine the sampling strategy to obtain a more reasonable point distribution. FPS and its variants can be generalized as Algorithm 1. Previous sampling methods only vary in the Sample function, the dist array and the Update function. We take Distance-based FPS (D-FPS), namely vanilla FPS [], F-FPS [] and S-FPS [] as examples and compare them from these three perspectives.
D-FPS The Sample function in Point-RCNN [] and PointNet++ [] randomly selects a point in the point cloud, which in practice is often the first point stored in the data. The dist array is given by $dist_k = d_k$, where $d_k$ is the kth point's minimal Euclidean distance to the sampled set. The Update function can be denoted as Equation (1)
$d_j = \min\big(d_j,\ \lVert x_j - x_{k_i} \rVert_2\big), \quad j \in \{1, \dots, N\}$ (1)
where $k_i$ is the index of the point sampled in this iteration, $N$ is the total number of points in this iteration, and $\lVert \cdot \rVert_2$ denotes the Euclidean norm.
F-FPS The Sample function and dist array in F-FPS [] are the same as in D-FPS. Only the Update function changes, and it can be formulated as Equation (2)
$d_j = \min\big(d_j,\ \mu \lVert x_j - x_{k_i} \rVert_2 + \lVert f_j - f_{k_i} \rVert_2\big), \quad j \in \{1, \dots, N\}$ (2)
where $\mu$ is a balance factor that weighs the coordinate distance against the feature distance.
S-FPS SASA [] performs point segmentation to encourage the model to sample more foreground points. The first sampled point is determined by the confidence scores rather than by random sampling, so the Sample method in S-FPS can be expressed with the argmax function as illustrated in Equation (3).
$\mathrm{Sampling}(\mathrm{Input}) = \arg\max(S)$ (3)
where $S$ is the confidence score set of the input points. The dist array can be formulated as the confidence-weighted distance illustrated in Equation (4).
$dist_k = p_k^{\gamma} \cdot d_k$ (4)
where $p_k$ is the kth point's confidence score and $\gamma$ is the balance factor. The Update function is consistent with D-FPS.
Algorithm 1: Generalized Farthest Point Sampling
Input:
 (required):
 coordinates $X = \{x_1, \dots, x_N\} \in \mathbb{R}^{N \times 3}$
 (optional):
 features $F = \{f_1, \dots, f_N\} \in \mathbb{R}^{N \times d}$
 foreground scores $S = \{s_1, \dots, s_N\} \in \mathbb{R}^{N \times 1}$
Output:
 sampled key point set $K = \{k_1, \dots, k_M\} \in \mathbb{R}^{M \times 3}$
1: initialize an empty sampled point set $K$;
2: initialize a distance array $d$ of length $N$ with all $+\infty$;
3: initialize a visit array $v$ of length $N$ with all zeros;
4: for $i = 1$ to $M$ do
5:   if $i = 1$ then
6:     $k_i = \mathrm{Sample}(\mathrm{Input})$
7:   else
8:     $D = \{dist_k \mid v_k = 0\}$
9:     $k_i = \arg\max(D)$
10:  end if
11:  add $k_i$ to $K$, set $v_{k_i} = 1$
12:  for $j = 1$ to $N$ do
13:    $\mathrm{Update}(d_j)$
14:  end for
15: end for
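To make the three variants concrete, the following is a minimal NumPy sketch of Algorithm 1 (not the authors' released code); the function name and arguments are illustrative. Passing only coordinates yields D-FPS, adding features yields F-FPS-style distances, and adding foreground scores yields the S-FPS-style confidence weighting.

```python
import numpy as np

def generalized_fps(xyz, num_samples, feats=None, scores=None, mu=1.0, gamma=1.0):
    """Sketch of Algorithm 1. xyz: (N, 3); feats: (N, d) or None; scores: (N,) or None."""
    N = xyz.shape[0]
    dist = np.full(N, np.inf)            # distance array d, initialized to +inf
    visited = np.zeros(N, dtype=bool)    # visit array v
    sampled = []

    for i in range(num_samples):
        if i == 0:
            # Sample(): first point in the data for D-FPS/F-FPS, argmax score for S-FPS
            k = int(np.argmax(scores)) if scores is not None else 0
        else:
            # pick the unvisited point with the largest (confidence-weighted) distance
            d = dist.copy()
            if scores is not None:                      # S-FPS: dist_k = p_k^gamma * d_k
                d = (scores ** gamma) * d
            d[visited] = -np.inf
            k = int(np.argmax(d))
        sampled.append(k)
        visited[k] = True

        # Update(): shrink each point's distance to the newly sampled point
        d_new = np.linalg.norm(xyz - xyz[k], axis=1)
        if feats is not None:                           # F-FPS adds a feature-distance term
            d_new = mu * d_new + np.linalg.norm(feats - feats[k], axis=1)
        dist = np.minimum(dist, d_new)

    return np.array(sampled)
```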

3.1.2. Grouping

Due to the uneven distribution of the point cloud, researchers often use a ball query rather than K-Nearest Neighborhood (KNN) to group the neighboring points. To obtain multi-scale features, previous work [,,] uses different ball radii to group points and aggregates the features through concatenation. In the SASA [] source code and the MMDetection3D repository [], a dilated ball query is used to group the point features. The input of the grouping module includes the coordinates of the current-stage points $P = \{p_1, \dots, p_N\} \in \mathbb{R}^{N \times 3}$, the features of the current-stage points $F = \{f_1, \dots, f_N\} \in \mathbb{R}^{N \times d}$ and the coordinates of the sampled points $C = \{c_1, \dots, c_M\} \in \mathbb{R}^{M \times 3}$. We therefore compare the vanilla ball query and the dilated ball query.
Vanilla Ball Query The first step of the ball query is to determine the grouping points index of sampled points, which can be formulated as below.
$g\_idxs_k^i = \{\, j \mid \lVert p_j - c_i \rVert \le radius_k \,\} \in \mathbb{R}^{nsample}, \quad j = 1, \dots, N,\ i = 1, \dots, M,\ k = 1, \dots, K$ (5)
where $N$ and $M$ are the numbers of input points and sampled points of this SA module, $radius_k$ is the kth ball query radius, $c_i$ is the ith sampled point's coordinate, and $p_j$ is the jth input point's coordinate. $nsample$ denotes the number of neighboring points to be grouped. If the number of neighboring points is fewer than $nsample$, we pad by repeating the existing points; if it is more than $nsample$, we randomly sample $nsample$ points.
Dilated Ball Query The sole difference between the vanilla and the dilated ball query is the grouping range. In the dilated ball query, the grouping ranges do not intersect; the kth group's point indexes can be formulated as Equation (6)
$g\_idxs_k^i = \{\, j \mid radius_{k-1} \le \lVert p_j - c_i \rVert \le radius_k \,\} \in \mathbb{R}^{nsample}, \quad j = 1, \dots, N,\ i = 1, \dots, M,\ k = 1, \dots, K$ (6)
The variables are identical to those in Equation (5), and it is worth noting that $radius_0$ is set to zero by default. Once we obtain the kth group's point indexes, the following steps are the same for both ball query methods. We group the point features to obtain the ith ball group's feature at the kth level as demonstrated in Equation (7).
$g\_feature_k^i = \mathrm{Concat}\big(\{\, f_j \mid j \in g\_idxs_k^i \,\}\big) \in \mathbb{R}^{nsample \times d}$ (7)
where $d$ is the dimension of the input point features, and $\mathrm{Concat}$ denotes the concatenation operation. Afterwards, we use concatenation to obtain the multi-scale sampled point features as demonstrated in Equation (8).
$s_i = \mathrm{Concat}\big(\{\, g\_feature_k^i \mid k = 1, \dots, K \,\}\big) \in \mathbb{R}^{nsample \times (K \cdot d)}$ (8)
The sampled point features will be further fed to the feature extraction module.
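As a concrete illustration (not the reference implementation), the sketch below groups neighbor features with either a vanilla or a dilated ball query, pads or subsamples each group to nsample points, and concatenates the K scales as in Equations (7) and (8); the fallback to the nearest point when a ring is empty is our simplification.

```python
import numpy as np

def ball_query_group(points, feats, centers, radii, nsample, dilated=True):
    """points: (N, 3), feats: (N, d), centers: (M, 3), radii: list of K radii."""
    M, d = centers.shape[0], feats.shape[1]
    grouped = np.zeros((M, nsample, len(radii) * d), dtype=feats.dtype)

    for i, c in enumerate(centers):
        dist = np.linalg.norm(points - c, axis=1)       # distance to the ball center
        lower = 0.0
        for k, r in enumerate(radii):
            # vanilla: all points within radius_k; dilated: ring (radius_{k-1}, radius_k]
            mask = (dist <= r) if not dilated else ((dist > lower) & (dist <= r))
            idxs = np.flatnonzero(mask)
            if len(idxs) == 0:
                idxs = np.array([int(np.argmin(dist))])  # simplification: fall back to nearest point
            if len(idxs) < nsample:                      # pad by repeating existing points
                idxs = np.concatenate([idxs, np.random.choice(idxs, nsample - len(idxs))])
            elif len(idxs) > nsample:                    # randomly keep nsample points
                idxs = np.random.choice(idxs, nsample, replace=False)
            grouped[i, :, k * d:(k + 1) * d] = feats[idxs]
            lower = r
    return grouped    # (M, nsample, K * d): concatenated multi-scale features
```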

3.1.3. Feature Extraction

To further extract point features, it is common practice to use a Multi-Layer Perceptron (MLP) to capture more refined sampled point features and a pooling operation to aggregate them, which is formulated as below.
$s_i = \mathrm{Pooling}\big(\mathrm{MLP}(s_i)\big) \in \mathbb{R}^{D}$ (9)
where D is the dimension of the output point features.
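A minimal PyTorch sketch of this step, assuming grouped features of shape (M, nsample, in_dim); the class name and channel sizes are illustrative and not the paper's configuration.

```python
import torch
import torch.nn as nn

class SAFeatureExtractor(nn.Module):
    """Shared MLP (Linear + BN + ReLU, repeated) followed by max pooling over neighbors."""
    def __init__(self, in_dim, hidden_dims=(64, 128)):
        super().__init__()
        layers, last = [], in_dim
        for h in hidden_dims:
            layers += [nn.Linear(last, h), nn.BatchNorm1d(h), nn.ReLU()]
            last = h
        self.mlp = nn.Sequential(*layers)

    def forward(self, grouped):              # grouped: (M, nsample, in_dim)
        M, n, d = grouped.shape
        x = self.mlp(grouped.reshape(M * n, d)).reshape(M, n, -1)
        return x.max(dim=1).values           # (M, D): pooled sampled-point features
```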

3.2. Density-Aware Semantics-Augmented Set Abstraction

We make two main modifications to the vanilla SA module, namely DS-FPS and the RCE module, and we follow SASA in using the dilated ball query. The overall architecture is depicted in Figure 3. We present the two modules in Section 3.2.1 and Section 3.2.2.
Figure 3. The framework of our DSASA. We repeat the Density-Aware Semantics-Augmented Set Abstraction three times. The first SA module only uses D-FPS for sampling because the semantic features are inaccurate at this early stage. The black points are points that are not passed on to the next stage. The blue points are sampled by D-FPS, the red points are sampled by DS-FPS, and the green point in the vote layer is the GT center toward which the points sampled by DS-FPS need to shift. We feed the density and the point-level confidence score to DS-FPS to obtain a more balanced sampling distribution. The Relative Position in Query Ball (RPQB), the Relative Direction Angle (RDA) and the point density are the extra inputs to the feature extraction module, which is detailed in Section 3.2.2.

3.2.1. Density-Aware Semantic Farthest Point Sampling

To encode point density in FPS, two issues need to be handled: first, how to represent point density; second, how to explicitly add point density to FPS.
How to represent point density? As described in Section 2.3, KDE or a simple logarithm function can be used to represent the point density. We choose the latter for simplicity. Using the logarithm function, the point density can be represented as below.
$density = \begin{cases} \log\big(\sum_{i=1}^{K} count_i\big), & \text{if dilated} \\ \log(count_K), & \text{otherwise} \end{cases}$ (10)
where $\log$ is the base-10 logarithm, $count_i$ is the number of points in the ith query region, and $K$ is the number of query regions. The first case applies when the dilated ball query is used.
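A minimal sketch of Equation (10), assuming the per-region neighbor counts are already available; the clamp that avoids log10(0) is our addition.

```python
import numpy as np

def point_density(counts, dilated=True):
    """counts[i]: number of points in the i-th query region (ring for a dilated ball query)."""
    if dilated:
        total = np.sum(counts)      # sum over all K rings
    else:
        total = counts[-1]          # the largest ball already contains all neighbors
    return np.log10(max(total, 1))  # base-10 log; clamp added to avoid log10(0)
```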
How to add point density to FPS? As described in Section 3.1.1, sampling methods differ in the Sample function, the dist array and the Update function. We keep the Sample function and the Update function the same as in SASA []. For the dist array, we expect points with low density to be assigned larger distances, so we use the negative of the sigmoid function to encode the weight, reflecting the inverse relationship between density and distance.
$dist_k = p_k^{\gamma} \cdot d_k \cdot \big(1 - \mathrm{sigmoid}(density)\big)^{\lambda}$ (11)
where $\mathrm{sigmoid}$ is the Sigmoid function, and $density$ is defined in Equation (10). In Equation (11), we add one so that the density weight lies between 0 and 1. $\gamma$ and $\lambda$ are the balance factors that trade off the confidence weight and the density weight.
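The weighting in Equation (11) can be sketched as follows; the function and argument names are illustrative and not taken from the released code.

```python
import numpy as np

def ds_fps_dist(d, scores, density, gamma=1.0, lam=1.0):
    """Re-weight the FPS distance d_k with the confidence score p_k and a density weight."""
    density_weight = 1.0 - 1.0 / (1.0 + np.exp(-density))  # 1 - sigmoid: lower density -> larger weight
    return (scores ** gamma) * d * (density_weight ** lam)
```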

3.2.2. Raw Coordinate Enhancement

Many useful attributes can be derived from the raw point coordinates. In this section, we denote the ball center as $(x_1, y_1, z_1)$ and one of the neighboring points in the ball as $(x_2, y_2, z_2)$.
Relative position in the query ball Inspired by the notion of proposal ambiguity put forward in LiDAR-RCNN [], we posit that the relative position within the query ball is crucial to providing the model with additional information about the local context. As illustrated in Figure 4, it is imperative to guide the model to discern the grouping boundaries effectively, and normalizing the relative position by the query radii is an effective way to do so. Therefore, we encode the relative position within the query ball as described in Equations (12) and (13).
$r_{offset} = r_{out} - r_{in}$ (12)
$rel\_pos = \Big(\dfrac{x_2 - x_1 - r_{in}}{r_{offset}},\ \dfrac{y_2 - y_1 - r_{in}}{r_{offset}},\ \dfrac{z_2 - z_1 - r_{in}}{r_{offset}}\Big)$ (13)
where $r_{in}$ and $r_{out}$ are the smaller and the larger query radius in the dilated ball query, respectively.
Figure 4. The importance of the distance to the query boundary. The red point is the center of the query ball, the black points are the queried points, the red dotted line denotes the radius of the query ball, and the black dotted line denotes the circumference. (a) We use the fixed radius $r_0$ to query points, and the points are densely located in the ball. (b) We use a larger radius $r_1$ to query points, and the points are still mainly located within radius $r_0$. These are two different circumstances, but the vanilla SA module generates the same feature for both. So, we normalize the relative position by the radius, e.g., the relative coordinates (0.2, 0.25, 0.25) are converted to ((0.2 − 0)/(0.4 − 0), (0.25 − 0)/(0.4 − 0), (0.25 − 0)/(0.4 − 0)), that is (0.5, 0.625, 0.625).
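A small sketch of Equations (12) and (13) for a single neighbor point; the function name is ours, and the example reproduces the conversion given in the caption of Figure 4.

```python
import numpy as np

def relative_position_in_query_ball(center, neighbor, r_in, r_out):
    """Normalize the relative coordinates by the dilated query-ball radii (r_in is 0 for the innermost ball)."""
    r_offset = r_out - r_in
    rel = np.asarray(neighbor) - np.asarray(center)
    return (rel - r_in) / r_offset

# Worked example from Figure 4: relative coordinates (0.2, 0.25, 0.25), r_in = 0, r_out = 0.4
print(relative_position_in_query_ball((0, 0, 0), (0.2, 0.25, 0.25), 0.0, 0.4))  # [0.5 0.625 0.625]
```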
Relative direction angle The relative direction angle of the neighboring points can be encoded by the relative coordinates as below.
$dist_1 = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ (14)
$dist_2 = \sqrt{(y_2 - y_1)^2 + (z_2 - z_1)^2}$ (15)
$dist_3 = \sqrt{(z_2 - z_1)^2 + (x_2 - x_1)^2}$ (16)
$\theta_1 = \mathrm{atan2}(z_2 - z_1,\ dist_1)$ (17)
$\theta_2 = \mathrm{atan2}(x_2 - x_1,\ dist_2)$ (18)
$\theta_3 = \mathrm{atan2}(y_2 - y_1,\ dist_3)$ (19)
$dir_{rel} = \big(\sin\theta_1, \cos\theta_1, \sin\theta_2, \cos\theta_2, \sin\theta_3, \cos\theta_3\big)$ (20)
where $\mathrm{atan2}$ is the two-argument inverse tangent function. The definition of $\theta_1$ is depicted in Figure 5. Although the relative direction angles are implicitly contained in the relative coordinates, we argue that encoding them explicitly is more reasonable, as it helps the network focus on the directional relations.
Figure 5. Illustration of the relative direction angle. The origin is the center of the query ball. We use the $\mathrm{atan2}$ function to obtain $\theta_1$.
Density Inspired by PDV [], we encode the logarithm of the number of neighboring points as the density, which helps the network perceive the distribution of the neighborhood.
Combining the above three attributes, we add 10 channels (three for RPQB, six for RDA and one for density) to enhance the point features, as sketched below.
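The following is a minimal sketch of how the 10 extra channels could be assembled for one neighbor point, under the definitions above; the function and its arguments are illustrative and not taken from the released code.

```python
import numpy as np

def rce_channels(center, neighbor, r_in, r_out, neighbor_count):
    """10 extra RCE channels: 3 for RPQB, 6 for RDA (Equations (14)-(20)), 1 for density."""
    cx, cy, cz = center
    x, y, z = neighbor
    dx, dy, dz = x - cx, y - cy, z - cz

    # Relative Position in Query Ball (3 channels), Equations (12)-(13)
    rpqb = (np.array([dx, dy, dz]) - r_in) / (r_out - r_in)

    # Relative Direction Angle (6 channels): atan2 against the complementary plane distance
    t1 = np.arctan2(dz, np.hypot(dx, dy))
    t2 = np.arctan2(dx, np.hypot(dy, dz))
    t3 = np.arctan2(dy, np.hypot(dz, dx))
    rda = np.array([np.sin(t1), np.cos(t1), np.sin(t2), np.cos(t2), np.sin(t3), np.cos(t3)])

    # Density (1 channel): log10 of the neighbor count in the query ball
    density = np.array([np.log10(max(neighbor_count, 1))])

    return np.concatenate([rpqb, rda, density])   # shape (10,)
```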

4. Experiments

We call our model DSASA. DSASA is evaluated on the challenging 3D object detection benchmark of the KITTI dataset.

4.1. Datasets

The KITTI dataset is a widely used benchmark in 3D object detection. It contains 7481 LiDAR point clouds with finely calibrated 3D bounding boxes for training and 7518 samples for testing. Following SECOND [], we split the training samples into a training set with 3712 samples and a validation set with 3769 samples, and we use this partition to find the optimal hyper-parameters. To obtain the final results submitted to the KITTI test server, we follow PV-RCNN [], where 80% of the training samples are used for training and the remaining 20% for validation.

4.2. Implementation Details

Most of the architecture is the same as SASA []; we replace SASA with our proposed DSASA. It is worth noting that SASA [] is trained on four GPUs with a batch size of four per GPU. Due to limited training resources, we instead train with a batch size of eight on a single RTX 4090. The learning rate and other hyper-parameters are the same as in SASA. We set $\lambda$ in Equation (11) to 1.0; the reason for choosing 1.0 is detailed in Section 4.4.

4.3. Main Results

Table 1 presents the 3D object detection performance for the Car class on the KITTI test server. Due to limited GPU resources and the random test set partition, we cannot fully reproduce the results reported in SASA []. We therefore keep all other training configurations the same as SASA except for the batch size and the test set partition, take the model trained with a batch size of 8 on a single RTX 4090 as the baseline, and find that DSASA outperforms both this baseline and other single-stage point-based methods. The qualitative results are depicted in Figure 6.
Table 1. Results on the car class of the KITTI test set. The evaluation metric is the AP calculated on 40 recall points. The best results in each category are shown in bold.
Figure 6. Results of 3D car detection on the KITTI test set. The predictions are labeled by red bounding boxes.

4.4. Ablation Study

Car detection performance on the validation set As presented in Table 2, we choose established outdoor 3D detectors that already contain the SA module, 3DSSD and PointRCNN, as our baselines. We then independently incorporate SASA and our proposed DSASA into these baselines and evaluate their performance on the validation set. Under the same experimental conditions, our method outperforms the baselines, including those combined with SASA. Figure 7 showcases the qualitative results.
Table 2. Results on the car class of the KITTI validation set. The evaluation metric is the AP calculated on 40 recall points. The best performance is shown in bold.
Figure 7. Results of 3D car detection on the KITTI validation set. The GTs are annotated by green bounding boxes and the predictions are labeled by red bounding boxes.
Multi-class detection performance on the validation set As shown in Table 3, we use a similar experimental setup to that in Table 2, except that the models classify three classes. In the single-class detection model, we directly predict the dimensions of the instances. In the multi-class detection model, we modify the detection head to classify three classes and predict the dimension offsets between the predictions and the mean size of each class in the KITTI dataset. The qualitative results are depicted in Figure 8. As demonstrated in Table 3, DSASA achieves improved performance, especially in detecting small objects. This can be attributed to the fact that small objects typically exhibit lower density, and DSASA effectively addresses this by sampling more points within such objects.
Table 3. Results on the three classes of the KITTI validation set. The evaluation metric is the AP calculated on 40 recall points. The best performances are shown in bold.
Figure 8. Results of multi-class 3D detection on the KITTI validation set. The GTs are annotated by green bounding boxes and the predictions are labeled by red bounding boxes.
Effects of the density balance factor We compare DS-FPS with different balance factors $\lambda$ in Figure 9. Extremely small or large values degrade the final results, so we set $\lambda$ to 1.0.
Figure 9. Performance with different balance factors $\lambda$.
Effects of different attributes encoded in RCE We set DSASA without the RCE module as our baseline and compare the performance with various attributes in Table 4, where RPQB means Relative Position in Query Ball, RDA means Relative Direction Angle, Density means the point density and ADA means Absolute Direction Angle. We not only conduct experiments on the three attributes mentioned in Section 3.2.2 but also consider the Absolute Direction Angle (ADA), which can be formulated as below.
$abs\_dist_1 = \sqrt{x_2^2 + y_2^2}$
$abs\_dist_2 = \sqrt{y_2^2 + z_2^2}$
$abs\_dist_3 = \sqrt{z_2^2 + x_2^2}$
$abs\_\theta_1 = \mathrm{atan2}(z_2,\ abs\_dist_1)$
$abs\_\theta_2 = \mathrm{atan2}(x_2,\ abs\_dist_2)$
$abs\_\theta_3 = \mathrm{atan2}(y_2,\ abs\_dist_3)$
$dir_{abs} = \big(\sin(abs\_\theta_1), \cos(abs\_\theta_1), \sin(abs\_\theta_2), \cos(abs\_\theta_2), \sin(abs\_\theta_3), \cos(abs\_\theta_3)\big)$
Table 4. Performance with different attributes encoded in RCE. The best performances are shown in bold.
First, we add a single attribute to the RCE module to test its validity. Next, we combine three attributes to examine their collective effectiveness. We intentionally exclude the combination of ADA and RDA, since both pertain to direction angles and we aim to avoid redundant attributes. Simply adding a single attribute can boost the performance, and combining RPQB, RDA and Density boosts it the most. We think that the Absolute Direction Angle is similar among points in the same bounding box, so it is not as distinctive as the other three attributes.
Verifying the validity of RCE One might doubt whether the attributes in RCE are really useful or whether the boost comes from the additional learnable parameters introduced by the MLP. We therefore conduct experiments with two strategies. In the first, we insert a linear layer, BN layer and ReLU at the beginning of the feature extraction stage, which learns 10 channels (six for RDA, three for RPQB and one for Density) from the coordinates; we then concatenate the generated features with the input features to form features with C + 10 channels, where C is the number of input channels, and feed them into the following MLPs. The second strategy adds more learnable parameters to the model: it utilizes an MLP to convert the channels from C + 3 to C + 13 and sends the output features to the following modules. However, as shown in Table 5, neither strategy brings a larger boost than the RCE module, so we are convinced that the performance increase comes from the meticulous design rather than the extra learnable parameters. We denote the first strategy as Small MLP and the second as Large MLP.
Table 5. Comparison between the RCE module and the Small/Large MLP strategies. The best performances are shown in bold.
Sampling mean and variance The proposed DS-FPS aims to balance the sampling process among multiple instances, so we study the average number of points sampled in each GT box and the standard deviation (Std) of the per-GT foreground point counts. The mean and Std are calculated as below.
$\mathrm{Mean} = \dfrac{1}{N}\sum_{i=1}^{N} cnt_i$
$\mathrm{Std} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N} \big(cnt_i - \mathrm{Mean}\big)^2}$
where $N$ is the total GT number, and $cnt_i$ is the point number in the ith GT. As described in Table 6, DS-FPS samples more points than F-FPS and shows less variance than S-FPS.
Table 6. Mean and Std of different sampling methods.
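A minimal sketch of these statistics, assuming the per-GT sampled point counts have already been gathered; the function name is illustrative.

```python
import numpy as np

def sampling_mean_std(counts_per_gt):
    """counts_per_gt[i]: number of sampled points that fall inside the i-th GT box."""
    counts = np.asarray(counts_per_gt, dtype=float)
    mean = counts.mean()                           # average sampled points per GT
    std = np.sqrt(((counts - mean) ** 2).mean())   # standard deviation of per-GT counts
    return mean, std
```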
Computing burden introduced by DSASA As shown in Table 2, PointRCNN + DSASA takes 2 ms more than PointRCNN + SASA, and 3DSSD + DSASA takes 1 ms more than 3DSSD + SASA. We consider 1 ms and 2 ms negligible for detection, since the LiDAR frequency is often 20 Hz, so we verify that DSASA can boost the detection performance at little cost.

5. Conclusions

In this article, we propose DSASA. Previous SA modules either pay more attention to even point sampling or push the model to sample more foreground points. DSASA considers both the point density and the confidence scores, aiming to achieve a more balanced sampling process. In the second SA module, DS-FPS in DSASA samples 94% more foreground points than F-FPS, and the Std of the sampling process is reduced by 30% compared to S-FPS. Furthermore, the proposed RCE module in DSASA utilizes the raw coordinates to extract valuable information, resulting in improved performance with only a 1 ms increase in inference time.
However, the proposed DS-FPS is based on the FPS series of methods, which have a time complexity of $O(n^2)$ and are not efficient for large-scale point clouds. On the other hand, simply choosing the points with the top-K foreground scores provides faster processing but relies heavily on the foreground segmentation performance. In the future, it is worth studying how to strike a balance between performance and efficiency in sampling methods. Additionally, although the cascaded ball query expands the receptive field, its range is still limited, so using a Transformer to obtain a global receptive field may be a better choice. Finally, we only use a single dataset for verification, which makes the results less convincing; we will conduct experiments on more diverse datasets to demonstrate the feasibility of our method in future work.

Author Contributions

Investigation, T.Z. and X.Y.; methodology, T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, T.Z., J.W. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset can be obtained at https://www.cvlibs.net/datasets/kitti/, accessed on 1 September 2022.

Acknowledgments

We thank the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education in support of the software. In addition, we would like to thank all anonymous reviewers for their helpful suggestions in the improvement of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DSASA    Density-Aware Semantics-Augmented Set Abstraction
SA       Set Abstraction
FPS      Farthest Point Sampling
DS-FPS   Density-Semantics-Aware Farthest Point Sampling
S-FPS    Semantic-aware FPS
RCE      Raw Coordinate Enhancement
FP       Feature Propagation
D-FPS    Distance-Based FPS
F-FPS    Feature-Based FPS
GNN      Graph Neural Network
KDE      Kernel Density Estimation
KNN      K-Nearest Neighborhood
MLP      Multi-Layer Perceptron
RDA      Relative Direction Angle
ADA      Absolute Direction Angle
RPQB     Relative Position in Query Ball
BBox     Bounding Box
Std      Standard Deviation
GT       Ground Truth
BEV      Bird's-Eye View

References

  1. Graham, B.; Van der Maaten, L. Submanifold sparse convolutional networks. arXiv 2017, arXiv:1706.01307. [Google Scholar]
  2. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  3. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  4. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5099–5108. [Google Scholar]
  5. Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; Ghanem, B. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 2022, 35, 23192–23204. [Google Scholar]
  6. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  7. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  8. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  9. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
  10. Chen, C.; Chen, Z.; Zhang, J.; Tao, D. Sasa: Semantics-augmented set abstraction for point-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 221–229. [Google Scholar]
  11. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962. [Google Scholar]
  12. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  13. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  14. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  15. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3164–3173. [Google Scholar]
  16. He, C.; Li, R.; Li, S.; Zhang, L. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8417–8427. [Google Scholar]
  17. Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J. Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2743–2752. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  19. Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.X.; Zhao, H.; Wang, F.; Wang, N.; Zhang, Z. Embracing single stride 3d object detector with sparse transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8458–8468. [Google Scholar]
  20. Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1711–1719. [Google Scholar]
  21. Qian, R.; Lai, X.; Li, X. BADet: Boundary-aware 3D object detection from point clouds. Pattern Recognit. 2022, 125, 108524. [Google Scholar] [CrossRef]
  22. Guan, T.; Wang, J.; Lan, S.; Chandra, R.; Wu, Z.; Davis, L.; Manocha, D. M3detr: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 772–782. [Google Scholar]
  23. Hu, J.S.; Kuai, T.; Waslander, S.L. Point density-aware voxels for lidar 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8469–8478. [Google Scholar]
  24. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
  25. Rosenblatt, M. Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 1956, 27, 832–837. [Google Scholar] [CrossRef]
  26. Contributors, M. MMDetection3D: OpenMMLab Next-Generation Platform for General 3D Object Detection. 2020. Available online: https://github.com/open-mmlab/mmdetection3d (accessed on 1 September 2022).
  27. Li, Z.; Wang, F.; Wang, N. Lidar r-cnn: An efficient and universal 3d object detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7546–7555. [Google Scholar]
  28. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  29. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  30. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437. [Google Scholar]
  31. Wu, H.; Wen, C.; Shi, S.; Li, X.; Wang, C. Virtual Sparse Convolution for Multimodal 3D Object Detection. arXiv 2023, arXiv:2303.02314. [Google Scholar]
  32. He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882. [Google Scholar]
  33. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 4–7 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
