Applied Sciences
  • Article
  • Open Access

30 April 2024

Adaptive Scale and Correlative Attention PointPillars: An Efficient Real-Time 3D Point Cloud Object Detection Algorithm

School of Automobile, Chang’an University, Xi’an 710064, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning in Object Detection

Abstract

Recognizing 3D objects from point clouds is a crucial technology for autonomous vehicles. Nevertheless, LiDAR (Light Detection and Ranging) point clouds are generally sparse, and they provide limited contextual information, resulting in unsatisfactory recognition performance for distant or small objects. Consequently, this article proposes an object recognition algorithm named Adaptive Scale and Correlative Attention PointPillars (ASCA-PointPillars) to address this problem. Firstly, an innovative adaptive scale pillars (ASP) encoding method is proposed, which encodes point clouds using pillars of varying sizes. Secondly, ASCA-PointPillars introduces a feature enhancement mechanism called correlative point attention (CPA) to enhance the feature associations within each pillar. Additionally, a data augmentation algorithm called random sampling data augmentation (RS-Aug) is proposed to solve the class imbalance problem. The experimental results on the KITTI 3D object dataset demonstrate that the proposed ASCA-PointPillars algorithm significantly boosts the recognition performance and RS-Aug effectively enhances the training effects on an imbalanced dataset.

1. Introduction

LiDAR (Light Detection and Ranging) has gained widespread acceptance owing to its capability to capture three-dimensional information on objects regardless of lighting conditions. Consequently, recognizing 3D objects from LiDAR point clouds has attracted significant attention in the field of autonomous driving. Nevertheless, LiDAR point clouds possess unique challenges such as sparsity, disorderliness, and unstructured data, with the sparsity issue posing an even greater hurdle for distant and small objects. This has resulted in the recognition of distant and small objects becoming a noteworthy challenge. Currently, point-based methods [1,2,3,4,5,6,7] and grid-based methods [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22] are two popular categories of recognition algorithms for point clouds.
Point-based methods generally take original point clouds as an input and directly recognize objects from the point cloud. This category maximizes the retention of the original information from the point clouds. However, when dealing with large-scale point cloud data, such methods can potentially consume significant computational resources. The voxelization of point clouds serves as an effective solution to address this challenge.
Grid-based methods typically encode point clouds into voxels or pillars and then process them separately using different approaches. Voxel-based methods [8,9,10,11,12,13,14] usually convert the input point cloud data into a voxel space and then use 3D convolution or sparse convolution [9] to extract features from voxels and complete the recognition task. Pillars are a specialized type of voxel that do not take height information into account. Pillar-based methods [15,16,17,18,19,20,21,22] usually encode point clouds into 2D pillars before projecting the point clouds as a 2D pseudo image. Then, a 2D CNN can be employed to recognize it. However, encoding point clouds into voxels/pillars inevitably results in spatial information loss. Moreover, as point clouds’ locations become more distant and objects get smaller, the number of points within an individual voxel decreases, leading to even more severe information loss and recognition accuracy being affected as a consequence.
Currently, the prevalent methods typically adopt a single-scale voxel/pillar to encode point clouds, rendering them powerless in addressing the exacerbated issue of spatial information loss at long distances. But using multiple multi-scale pillars can help solve this difficulty. Consequently, this article introduces Adaptive Scale and Correlative Attention PointPillars (ASCA-PointPillars). The cornerstone of ASCA-PointPillars lies in the adaptive scale pillars (ASP) module and the correlative point attention (CPA) module. The ASP module employs various sizes of pillar to encode the point clouds based on the sparsity of the point clouds. The smaller the size of the pillar, the stronger the feature representation ability of each point, resulting in less spatial loss. Sparser point clouds in the distance are then encoded using smaller pillars to mitigate the spatial information loss of distant objects and enable the network to capture richer feature information for these objects. After encoding the point clouds into multi-scale pillars, the CPA module is utilized to enhance the feature association within the pillars, providing richer contextual features. The CPA module incorporates a self-attention mechanism that effectively establishes connections between pieces of contextual information [23]. Consequently, the CPA module not only strengthens the feature association within small pillars representing distant objects, but also enhances the feature correlation among point clouds representing small objects, ultimately improving the recognition performance for both distant and small objects.
Moreover, imbalanced numbers of training samples across categories generally lead to imbalanced recognition performance, and categories with significantly fewer training samples tend to have lower recognition accuracy. Data augmentation, which can alleviate this imbalance, is therefore critical for the training process. In most algorithms [1,2,3,4,5,6,7,10,11,12,13,14,15,16,18,19,20,21,22], ground truths (bounding boxes and the points inside them) are randomly inserted into existing samples to augment the training dataset [9]. However, this method may insert ground truths into inappropriate areas, creating incorrect contextual information during model training [24]. Semantic information about the current scene can help resolve this problem. Therefore, a random sampling data augmentation algorithm (RS-Aug) based on scene semantic information is proposed.
RS-Aug can identify reasonable areas for placing ground truths. It initially acquires semantic information regarding obstacles and road surface areas in point cloud scenes. Based on this semantic information, it then removes the areas on the road surface occupied by obstacles. The remaining road surface areas are therefore suitable spaces for placement. By identifying these reasonable areas, RS-Aug can strategically position the ground truths of less frequently occurring categories in appropriate areas, such as the road surface, effectively balancing the number of categories within the dataset.
The contributions of this paper are summarized as follows:
  • This study proposes an object recognition algorithm called ASCA-PointPillars. In the algorithm, we design an ASP module that innovatively utilizes multi-scale pillars to encode point clouds, effectively reducing the loss of spatial information. Subsequently, a CPA module is employed to establish contextual associations within a pillar’s point cloud, thereby enhancing the point cloud’s features.
  • A data augmentation algorithm called RS-Aug is proposed, leveraging semantic information to identify reasonable areas for placing ground truths, addressing the issue of imbalanced categories in datasets.
The first section of this paper is the introduction, which states the contributions of this article. The second section presents a literature review of related works. The third section describes the methodology, introducing the proposed algorithm in detail. The fourth section presents the experiments, evaluating the algorithm and analyzing the results. The fifth section concludes the paper, summarizing the proposed method and discussing future work.

3. Method

3.1. Overall Architecture of ASCA-PointPillars

Fixed-size pillars are adopted in most pillar-based methods [15,16,17,18,19,20]. However, fixed-size pillar sampling inevitably results in spatial information loss, which is exacerbated for distant objects. Therefore, this section proposes ASCA-PointPillars.
The overall architecture of ASCA-PointPillars is shown in Figure 1, and consists of three blocks called Feature Encoding Network (FEN), Detection Neck, and Detection Head. ASCA-PointPillars uses RPN [25] as its Detection Neck, as in PointPillars [15], and uses SSD [27] as its Detection Head.
Figure 1. The architecture of ASCA-PointPillars. Within this framework, FEN serves to generate a pseudo image from point clouds, while the Detection Neck is responsible for extracting and fusing the features of the pseudo image. Finally, the Detection Head produces classification results and regresses the bounding boxes, enabling accurate object detection.

3.2. Feature Encoding Network

The role of the FEN is to first encode the point clouds into multi-scale pillars and then convert them into a pseudo image whose features can be extracted by the 2D CNN backbone of the Detection Neck. Our work mainly focuses on the ASP and CPA modules in the FEN, as shown in Figure 2.
Figure 2. Feature Encoding Network of ASCA-PointPillars. This section begins with the utilization of ASP to encode point clouds into multi-scale pillars, generating tensor data. Subsequently, a simplified version of PointNet is employed to extract high-dimensional features. Then, CPA is applied to enhance these features. Finally, the enhanced features are projected as a pseudo image, facilitating accurate object detection.
The closer an object is, the larger the point cluster it forms. Consequently, larger pillars are used for closer objects to abstract larger-scale features, whereas smaller pillars are used for distant objects to provide a more focused abstraction and reduce the spatial information loss suffered by distant objects. The details of the ASP module are given below.
  • ASP module
In this module, the input point clouds are first divided into pillars of size $V_x \times V_y$ on the X-Y plane (the Z-axis is ignored). Here, $V_x$ and $V_y$ are the sizes of the pillars along the X- and Y-axes, respectively, and $V_y$ is fixed at 0.16 m. The $V_x$ of the pillars nearest to the LiDAR along the X-axis is 0.32 m, the largest among all pillars, and $V_x$ decreases adaptively with increasing distance along the X-axis. Therefore, the ASP module forms bigger pillars near the LiDAR and smaller pillars in distant areas. Figure 3 shows the schematic diagram of the ASP module.
Figure 3. Schematic diagram of the ASP module. The horizontal axis D represents the distance from a specific row of pillars to the first row of pillars (the row closest to the LiDAR), and the vertical axis $V_x$ represents the size of the pillars along the X-axis. The diagram indicates that $V_x$ decreases as the distance D increases.
In Figure 3, $d_{\min}$ is defined as the shortest distance along the X-axis between the pillars and the LiDAR, while $d_{\max}$ is the longest. $D_{\max}$ stands for the greatest distance between any two pillars along the X-axis, as determined by Equation (4). d refers to the distance from a specific row of pillars to the LiDAR, and D denotes the distance from a specific row of pillars to the first row of pillars, which is the row closest to the LiDAR. The relationship between d and D is given by Equation (3).
In this coordinate system, $V_x$ decreases as D increases. $V_{\max}$ is the value of $V_x$ for the first row of pillars. The value of $V_x$ for the nth pillar size is derived from Equation (1). Equation (2) gives the value of n, which is determined by the distance ratio a, and Equation (5) defines a as the ratio of D to $D_{\max}$. According to Equation (2), there can be at most K different pillar sizes.
$V_x = \left(\tfrac{1}{2}\right)^{n-1} V_{\max}$  (1)
$n = \begin{cases} 1, & a \in \left[0, \tfrac{1}{K}\right) \\ 2, & a \in \left[\tfrac{1}{K}, \tfrac{2}{K}\right) \\ \;\vdots & \\ K, & a \in \left[\tfrac{K-1}{K}, 1\right] \end{cases}, \quad K \in \mathbb{N}^{+}$  (2)
$D = d - d_{\min}$  (3)
$D_{\max} = d_{\max} - d_{\min}$  (4)
$a = \dfrac{D}{D_{\max}}$  (5)
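To make Equations (1)–(5) concrete, the following sketch computes the adaptive pillar size for a given row of pillars. It is a minimal illustration rather than the authors' implementation: the function name and the example value of K are assumptions, while $V_{\max}$ = 0.32 m follows the text above.

```python
def adaptive_pillar_size_x(d, d_min, d_max, v_max=0.32, K=2):
    """Return V_x for the row of pillars at distance d from the LiDAR (Equations (1)-(5)).

    d_min, d_max : nearest and farthest pillar-row distances along the X-axis.
    v_max        : V_x of the first (closest) row of pillars, 0.32 m in the text.
    K            : maximum number of distinct pillar sizes (an assumed example value).
    """
    D = d - d_min                      # Equation (3): distance to the first row
    D_max = d_max - d_min              # Equation (4): largest possible D
    a = D / D_max                      # Equation (5): distance ratio in [0, 1]
    n = min(int(a * K) + 1, K)         # Equation (2): size index n determined by the ratio a
    return (0.5 ** (n - 1)) * v_max    # Equation (1): V_x is halved each time n increases by one
```

With K = 2, for example, rows in the nearer half of the X-range receive 0.32 m pillars and rows in the farther half receive 0.16 m pillars.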
Upon encoding the point clouds into pillars, the data are converted into tensors of size (B, P, N). Here, B signifies the feature dimension of each point and is set to 9, comprising the coordinates x, y, z, the reflectivity r, the offsets $x_c, y_c, z_c$ to the arithmetic mean of all points inside the pillar, and the offsets $x_p, y_p$ to the center of the pillar's X-Y plane. P denotes the count of non-empty pillars in each sample, and N represents the number of points within each pillar. Subsequently, a simplified version of PointNet (a linear layer with 64 output channels followed by BatchNorm [28] and ReLU [29]) is employed to generate a high-dimensional feature tensor of size (C, P, N), followed by max pooling along the N dimension, yielding a tensor of size (C, P) [15]. This tensor, referred to as the feature matrix F, is subsequently input into the CPA module.
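The tensorization and simplified PointNet described above can be sketched as follows. The class name and tensor layout are illustrative assumptions; only the 9-dimensional point decoration, the Linear–BatchNorm–ReLU stack with 64 output channels, and the max pooling over the N dimension follow the text.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet: decorated points (B = 9) -> per-pillar features (C = 64)."""

    def __init__(self, in_channels=9, out_channels=64):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, pillars):
        # pillars: (P, N, B) -- P non-empty pillars, N points per pillar, B = 9 point features
        x = self.linear(pillars)                     # (P, N, C)
        x = self.relu(self.bn(x.permute(0, 2, 1)))   # BatchNorm over the channel dimension -> (P, C, N)
        f = x.max(dim=2).values                      # max pooling over the N dimension -> (P, C)
        return f.t()                                 # feature matrix F of size (C, P)
```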
  • CPA module
After converting the point clouds into tensors, the CPA module is employed to enhance the feature association among points within the pillar. Since the self-attention mechanism [23] can effectively establish contextual information association, it is introduced here. Ultimately, the features of points within the pillar are strengthened. Figure 4 illustrates the principle of the CPA module.
Figure 4. Schematic diagram of CPA module. It consists of a self-attention module and residual connection, which are used to establish feature associations among points within a pillar.
As shown in Figure 4, the input of the CPA module is the feature matrix F. According to [15], F is fed into Multilayer Perceptron (MLP) layers to obtain the matrices Q (query), K (key), and V (value), as shown in Equations (6)–(8). Then, Q, K, and V are substituted into Equation (9) to obtain the attention matrix A. In the following equations, $d_q$, $d_k$, and $d_v$ denote the dimensions of Q, K, and V, respectively.
$Q = \mathrm{MLP}_Q(F), \quad Q \in \mathbb{R}^{C \times d_q}$  (6)
$K = \mathrm{MLP}_K(F), \quad K \in \mathbb{R}^{C \times d_k}$  (7)
$V = \mathrm{MLP}_V(F), \quad V \in \mathbb{R}^{C \times d_v}$  (8)
$A = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \quad A \in \mathbb{R}^{C \times C}$  (9)
Then, MLP is used to restore A to the original dimension P and obtain the self-attention feature F′, as shown in Equation (10).
$F' = \mathrm{MLP}(A), \quad F' \in \mathbb{R}^{C \times P}$  (10)
Because self-attention increases the complexity of the model, it may not outperform Convolutional Neural Networks (CNNs) on small-scale datasets. To circumvent this issue, a residual connection is employed. This connection adds the original feature F to the self-attention feature F′, as shown in Equation (11). As a result, the correlation between points within local spaces is enhanced, improving the network's recognition capability.
$F'' = F + F', \quad F'' \in \mathbb{R}^{C \times P}$  (11)
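The following sketch mirrors Equations (6)–(11). The linear layers standing in for the MLPs, the dimensions d_qk and d_v, and the class and variable names are assumptions chosen for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class CPA(nn.Module):
    """Correlative point attention: self-attention over the pillar feature matrix F,
    followed by a residual connection (Equations (6)-(11))."""

    def __init__(self, num_pillars, d_qk=64, d_v=64):
        super().__init__()
        self.mlp_q = nn.Linear(num_pillars, d_qk)    # Equation (6): Q = MLP_Q(F)
        self.mlp_k = nn.Linear(num_pillars, d_qk)    # Equation (7): K = MLP_K(F)
        self.mlp_v = nn.Linear(num_pillars, d_v)     # Equation (8): V = MLP_V(F)
        self.mlp_out = nn.Linear(d_v, num_pillars)   # Equation (10): restore dimension P
        self.scale = d_qk ** 0.5

    def forward(self, f):
        # f: feature matrix F of size (C, P)
        q, k, v = self.mlp_q(f), self.mlp_k(f), self.mlp_v(f)
        attn = torch.softmax(q @ k.t() / self.scale, dim=-1)   # softmax(QK^T / sqrt(d_k))
        a = attn @ v                                           # Equation (9)
        f_prime = self.mlp_out(a)                              # Equation (10): F' of size (C, P)
        return f + f_prime                                     # Equation (11): residual connection
```

Note that this formulation requires a fixed P (e.g., the 12,000-pillar cap used in the experiments) so that the linear layers have a well-defined input size.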
Since this residual self-attention enhances the feature correlation within each pillar, it also improves robustness to sparse points.
As shown in Figure 2, the enhanced features obtained from the CPA module are redistributed to their original positions based on the pillar index, resulting in a 2D pseudo image [15]. As shown in Figure 1, the pseudo image is then input into the Detection Neck to further extract features, after which the Detection Head obtains the detection result.
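A minimal sketch of this scatter step is shown below, assuming a regular pillar grid for simplicity (the adaptive pillar sizes of ASP would additionally require a per-row index mapping); the function name and arguments are hypothetical.

```python
import torch

def scatter_to_pseudo_image(features, pillar_coords, grid_h, grid_w):
    """Redistribute the enhanced (C, P) pillar features to their original grid positions,
    producing a (C, H, W) pseudo image. pillar_coords is a LongTensor of (row, col) indices."""
    c = features.shape[0]
    canvas = torch.zeros(c, grid_h * grid_w, dtype=features.dtype)
    flat_idx = pillar_coords[:, 0] * grid_w + pillar_coords[:, 1]   # flatten (row, col) indices
    canvas[:, flat_idx] = features                                  # place each pillar's feature vector
    return canvas.view(c, grid_h, grid_w)
```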

3.3. Detection Neck

The Detection Neck adopts the classic RPN [25] structure. As shown in Figure 5, it includes three consecutive convolutional blocks (Block1, Block2, and Block3) to obtain feature maps with three different resolutions. Then, these three feature maps are up-sampled using three transposed convolutional blocks (DeBlock1, DeBlock2, and DeBlock3) to obtain three feature maps with the same resolution, and finally, these three feature maps are concatenated together to obtain the final feature map. The details of the blocks and deblocks are shown in Table 1. In Table 1, Conv2d represents a 2D convolutional layer, while DeConv2d represents a 2D transposed convolutional layer. The numbers enclosed in parentheses indicate the number of input channels, the number of output channels, and the size of the convolution kernels, stride, and padding, respectively. Furthermore, the output of the Detection Neck is fed into the Detection Head to produce the final recognition results.
Figure 5. Structure of RPN. Three consecutive convolutional blocks are employed to obtain feature maps of three different resolutions. Subsequently, these feature maps are up-sampled to match a common resolution, and then concatenated to obtain the final feature output.
Table 1. Details of blocks and deblocks.
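Since Table 1 is not reproduced here, the sketch below assumes the layer counts and channel widths of the standard PointPillars backbone (64/128/256 channels with strides 2/2/2, each up-sampled to 128 channels) purely to illustrate the Block/DeBlock structure; it is not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_layers, stride):
    """One Block: a strided Conv2d followed by (n_layers - 1) stride-1 Conv2d, each with BN and ReLU."""
    layers = [nn.Conv2d(c_in, c_out, 3, stride, 1), nn.BatchNorm2d(c_out), nn.ReLU()]
    for _ in range(n_layers - 1):
        layers += [nn.Conv2d(c_out, c_out, 3, 1, 1), nn.BatchNorm2d(c_out), nn.ReLU()]
    return nn.Sequential(*layers)

def deconv_block(c_in, c_out, stride):
    """One DeBlock: a transposed convolution that up-samples a feature map to the common resolution."""
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, stride, stride),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class RPNNeck(nn.Module):
    def __init__(self, c_in=64):
        super().__init__()
        # Three consecutive convolutional blocks producing feature maps at 1/2, 1/4, 1/8 resolution
        self.block1 = conv_block(c_in, 64, 4, 2)
        self.block2 = conv_block(64, 128, 6, 2)
        self.block3 = conv_block(128, 256, 6, 2)
        # Three transposed-convolution blocks restoring a common resolution
        self.deblock1 = deconv_block(64, 128, 1)
        self.deblock2 = deconv_block(128, 128, 2)
        self.deblock3 = deconv_block(256, 128, 4)

    def forward(self, pseudo_image):
        x1 = self.block1(pseudo_image)
        x2 = self.block2(x1)
        x3 = self.block3(x2)
        # Concatenate the three up-sampled maps into the final feature map (384 channels here)
        return torch.cat([self.deblock1(x1), self.deblock2(x2), self.deblock3(x3)], dim=1)
```

The exact kernel sizes, strides, and channel counts used in the paper are those listed in Table 1.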

3.4. Loss Function

In this paper, the same loss function as in [15] is adopted. Ground truth bounding boxes (GT boxes) and anchor boxes are represented as (x, y, z, w, l, h, θ), where x, y, and z are the center coordinates; w, l, and h are the width, length, and height, respectively; and θ is the heading angle. The offsets of each parameter between the GT boxes and anchor boxes are calculated as follows:
$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}}$
$\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}$
$\Delta \theta = \sin\!\left(\theta^{gt} - \theta^{a}\right)$  (12)
In Equation (12), the superscripts “gt” and “a” indicate the GT boxes and anchor boxes, respectively, and $d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$. Consequently, the localization loss is
$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)$  (13)
The classification loss adopts focal loss [30] and is calculated as below, where $p^{a}$ represents the predicted class score for an anchor box. Following [30], $\alpha_{a} = 0.25$ and $\gamma = 2$.
$L_{cls} = -\alpha_{a}\left(1 - p^{a}\right)^{\gamma} \log p^{a}$  (14)
Based on [9], a softmax classification loss is utilized as the direction loss ($L_{dir}$) to enable the network to learn object orientations. Consequently, the overall loss is shown in Equation (15):
$Loss = \frac{1}{N_{pos}}\left(\beta_{1} L_{loc} + \beta_{2} L_{cls} + \beta_{3} L_{dir}\right)$  (15)
In the equation above, $N_{pos}$ represents the number of positive anchor boxes; $\beta_{1}$, $\beta_{2}$, and $\beta_{3}$ are the weights of the three loss terms, with $\beta_{1} = 2$, $\beta_{2} = 1$, and $\beta_{3} = 0.2$ [9].
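A hedged sketch of Equations (12)–(15) is given below; the tensor layout, function names, and the use of cross-entropy for the direction term are assumptions, while the Smooth-L1 residuals, the focal-loss parameters ($\alpha_{a}$ = 0.25, $\gamma$ = 2), and the weights $\beta_{1}$ = 2, $\beta_{2}$ = 1, $\beta_{3}$ = 0.2 follow the text.

```python
import torch
import torch.nn.functional as F

def box_residuals(gt, anchor):
    """Regression targets between GT boxes and anchors, each of shape (..., 7) = (x, y, z, w, l, h, theta)."""
    xa, ya, za, wa, la, ha, ta = anchor.unbind(-1)
    xg, yg, zg, wg, lg, hg, tg = gt.unbind(-1)
    da = (wa ** 2 + la ** 2).sqrt()                            # anchor diagonal d^a
    return torch.stack([(xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
                        (wg / wa).log(), (lg / la).log(), (hg / ha).log(),
                        (tg - ta).sin()], dim=-1)              # Equation (12)

def detection_loss(pred_residual, gt_box, anchor, cls_score, dir_logit, dir_target, n_pos,
                   alpha=0.25, gamma=2.0, beta=(2.0, 1.0, 0.2)):
    # Localization loss: Smooth L1 over the seven box residuals (Equation (13))
    target = box_residuals(gt_box, anchor)
    loc = F.smooth_l1_loss(pred_residual, target, reduction="sum")
    # Classification loss: focal loss on the predicted class scores p^a (Equation (14))
    cls = (-alpha * (1 - cls_score) ** gamma * cls_score.clamp_min(1e-6).log()).sum()
    # Direction loss: softmax / cross-entropy over orientation bins, as in SECOND [9]
    direction = F.cross_entropy(dir_logit, dir_target, reduction="sum")
    # Overall loss (Equation (15))
    return (beta[0] * loc + beta[1] * cls + beta[2] * direction) / n_pos
```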

3.5. RS-Aug Algorithm

The widely used GT-Aug [9] overlooks the semantic information of the scene during random sampling and augmentation. Consequently, it might introduce ground truths into unreasonable areas, resulting in unreasonable scenes [24]. These unreasonable augmentations could mislead networks to learn incorrect information.
Here, we propose RS-Aug, which considers the semantic information of the scene in augmenting the samples, as depicted in Figure 6a.
Figure 6. Flow charts of RS-Aug (a) and GT-Aug (b). The bold boxes indicate the differences between the two algorithms, including “Points segmentation”, “Points clustering”, and “Determine the placement area” versus “Randomly select the placement area”.
As shown in Figure 6a, RS-Aug includes 6 steps.
  • Establish the database. Establish a database that includes all the ground truths (bounding boxes and the points inside them). For instance, in our following experiment, three categories of ground truths are included. They are vehicles, cyclists, and pedestrians.
  • Segment the ground and non-ground points. Input a training sample, then apply the RANSAC algorithm to segment the non-ground points and ground points and obtain ground fitting parameters.
  • Points clustering. Utilize the DBSCAN algorithm to cluster non-ground points, thereby obtaining clusters of non-ground points.
  • Determine the placement area. Obtain the semantic information of the current scene through Steps 2 and 3, classifying the point cloud into ground and non-ground points. Subsequently, apply the Minimum Bounding Rectangle algorithm to fit the bounding boxes of the non-ground point clusters, acquiring the position and size of the bounding boxes. Using the size and position information, exclude the regions occupied by the bounding boxes from the ground area identified in Step 2, leaving the remaining space as the designated area for placement.
  • Construct a new sample. Randomly select the ground truths from the database based on the proportions of different categories appearing in the training dataset. Then, randomly insert ground truths into the designated placement area.
  • Collision checking. Check whether the newly placed point cluster collides with the existing point clusters. If a collision is detected, repeat Step 5 and then perform the collision check again. If no collision is detected, the augmented sample is fed into the network for training, and the process ends.
By following the steps above, in a road environment, RS-Aug places ground truths on paved surfaces while avoiding existing objects. Conversely, in non-road environments, ground truths are placed on flat terrain while avoiding the occlusion of existing objects. Benefiting from these measures, RS-Aug ensures that the augmented samples are reasonable.
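To make Steps 2–4 and 6 concrete, the sketch below outlines a simplified version of the placement-area computation. It assumes a ground mask has already been obtained (e.g., from a RANSAC plane fit), uses scikit-learn's DBSCAN, and replaces the Minimum Bounding Rectangle with an axis-aligned footprint for brevity, so it is an illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def obstacle_footprints(points, ground_mask, eps=0.5, min_samples=10, margin=0.5):
    """Steps 2-4, simplified: cluster the non-ground points and return the obstacle
    footprints to be excluded from the ground placement area."""
    obstacles = points[~ground_mask, :2]                       # non-ground points, X-Y only
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(obstacles)
    footprints = []
    for lab in set(labels) - {-1}:                             # -1 marks DBSCAN noise points
        cluster = obstacles[labels == lab]
        x_min, y_min = cluster.min(axis=0) - margin
        x_max, y_max = cluster.max(axis=0) + margin
        footprints.append((x_min, y_min, x_max, y_max))
    return footprints

def is_valid_placement(box_xy, footprints):
    """Step 6, simplified: reject a candidate ground-truth position that overlaps an obstacle footprint."""
    x, y = box_xy
    return all(not (x0 <= x <= x1 and y0 <= y <= y1) for x0, y0, x1, y1 in footprints)
```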
There are differences between RS-Aug and GT-Aug. As shown in Figure 6a,b, GT-Aug lacks the operations of “Points segmentation”, “Points clustering”, and “Determine the placement area”, which correspond to Steps 2, 3, and 4 mentioned previously. Therefore, RS-Aug can selectively place ground truths into a scene based on the semantic information of the point cloud scene, whereas GT-Aug can only randomly place ground truths into a scene without being able to judge the reasonableness of the placement.
To demonstrate the effects of RS-Aug and the shortcomings of GT-Aug, a comparison is presented based on PointPillars. This comparison is conducted on a single sample from the KITTI 3D object dataset [31]. In this instance, six cyclists and pedestrians are added to the current scene. Figure 7a,b show the image and point cloud of the original scene, respectively. Figure 7c,d illustrate the augmented results. It can be seen that some pedestrians and cyclists (green and blue boxes) have been inserted into the scene. However, as shown in Figure 7c, GT-Aug places some pedestrians and cyclists outside the road (highlighted with black boxes), failing to accurately reflect actual driving scenarios. Consequently, this could lead to incorrect information being utilized during the training. In contrast, RS-Aug can strategically place ground truths onto the drivable road area of the scene while avoiding occlusion among objects. Thus, a reasonable augmentation to the training dataset is achieved.
Figure 7. Comparison of visualization results of GT-Aug and RS-Aug: (a) point cloud collection scenarios; (b) ground truth; (c) visualization result of GT-Aug; (d) visualization result of RS-Aug. Cars are represented by red bounding boxes, cyclists by blue bounding boxes, and pedestrians by green bounding boxes.

4. Experiment and Result Analysis

4.1. Experimental Setup

The experiment was initially conducted on the KITTI 3D object dataset: 7481 training samples were divided into a training set containing 3712 samples and a validation set containing 3769 samples. A deep learning server was set up for the experiment, the configuration of which is shown in Table 2.
Table 2. Deep learning server.
The training epoch was set to 160 with a batch size of 6. Adam (Adaptive Moment Estimation) was used as the optimizer, with an initial learning rate of 2 × 10−4 that decays by a factor of 0.8 every 15 epochs. Pass-through filtering is used to crop the region of interest, with the specific range (in meters) shown below:
$0 \le x \le 69.12, \quad -39.68 \le y \le 39.68, \quad -3 \le z \le 1$
In this approach, the maximum number of pillars (denoted as P) in each sample is set to 12,000, with each pillar containing a maximum of 64 points. If the number of pillars in a sample or the number of points in a pillar exceeds the preset limit, random sampling is applied; conversely, if the number is too small to form a tensor, zero padding is used. When matching anchors using the 2D Intersection over Union (IoU) metric, an anchor counts as a positive match if it has the highest IoU with a ground truth or if its IoU exceeds the positive threshold, whereas anchors whose IoU falls below the negative threshold count as negatives; the remaining anchors are ignored during loss calculation. Following the methodology of VoxelNet [8], this study sets the overlap threshold for the car category at 0.7 IoU across the easy, moderate, and hard recognition scenarios, and for cyclists and pedestrians at 0.5 IoU across all three scenarios.
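For concreteness, the sketch below applies the pass-through filter and the per-pillar random sampling / zero padding described above; the function names and the NumPy-based implementation are assumptions, not the authors' code.

```python
import numpy as np

def crop_roi(points):
    """Pass-through filter: keep points with 0 <= x <= 69.12, -39.68 <= y <= 39.68, -3 <= z <= 1 (meters)."""
    keep = ((points[:, 0] >= 0) & (points[:, 0] <= 69.12) &
            (np.abs(points[:, 1]) <= 39.68) &
            (points[:, 2] >= -3) & (points[:, 2] <= 1))
    return points[keep]

def cap_pillar(pillar_points, max_points=64, feat_dim=4):
    """Randomly sample a pillar that exceeds max_points; zero-pad one that falls short.
    feat_dim = 4 assumes raw (x, y, z, r) points; the 9-D decoration is applied afterwards."""
    n = len(pillar_points)
    if n > max_points:
        idx = np.random.choice(n, max_points, replace=False)
        return pillar_points[idx]
    padded = np.zeros((max_points, feat_dim), dtype=pillar_points.dtype)
    padded[:n] = pillar_points
    return padded
```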

4.2. Experimental Results

Average Precision (AP) was adopted as the evaluation metric for this study. As demonstrated in Table 3, ASCA-PointPillars outperforms other algorithms in terms of pedestrian recognition accuracy. This highlights the effectiveness of the ASP module and RS-Aug in improving the recognition performance of small objects. Although the recognition accuracy for cars and cyclists may not exceed that of other algorithms, ASCA-PointPillars achieves a higher Frames Per Second (FPS) than all other algorithms (except PointPillars), ensuring real-time recognition. Moreover, ASCA-PointPillars exhibits superior recognition accuracy across all three categories compared to PointPillars, demonstrating its ability to maintain high accuracy in real-time scenarios.
Table 3. Results on the KITTI 3D object dataset (%). NaN indicates that there are no relevant data in the KITTI test benchmark.
ASCA-PointPillars was also implemented on NVIDIA Xavier AGX, an edge computing device, and achieved a frame rate of 21.82 FPS. This demonstrates that the proposed algorithm can provide real-time recognition capabilities in vehicle terminals.
Table 4 shows the recognition results of PointPillars and the proposed ASCA-PointPillars on objects at two distance ranges in the KITTI dataset, using mean Average Precision (mAP) as the evaluation metric. As can be seen from the table, the recognition accuracy of both algorithms inevitably decreases as the distance increases. However, compared to PointPillars, the proposed ASCA-PointPillars exhibits a higher recognition accuracy for objects at longer distances. Specifically, it achieves accuracy improvements of 2.94% and 3.02% in the ranges of 0–40 m and 40–80 m, respectively. This demonstrates that the proposed ASP module can effectively enhance the recognition accuracy of objects at long distances.
Table 4. Distance-wise recognition results on the KITTI 3D object validation dataset (%).
Figure 8 offers a qualitative analysis of the recognition results, with cars represented by red bounding boxes, cyclists by blue bounding boxes, and pedestrians by green bounding boxes. As shown in Figure 8c,d, the recognition results clearly show that while PointPillars may fail to detect distant objects, ASCA-PointPillars can accurately recognize certain severely occluded or sparse objects at a distance. Only a few distant objects and those with excessively sparse point clouds remain undetected.
Figure 8. Visualization of recognition results for PointPillars and ASCA-PointPillars: (a) scene; (b) ground truth; (c) prediction of PointPillars; (d) prediction of ASCA-Pointpillars. Cars are represented by red bounding boxes, cyclists by blue bounding boxes, and pedestrians by green bounding boxes.
As shown in Figure 9, to demonstrate the effectiveness of RS-Aug, this article compares PointPillars trained with GT-Aug and with RS-Aug. Red bounding boxes represent cars, blue bounding boxes represent cyclists, and green bounding boxes represent pedestrians. From Figure 9c, it is evident that PointPillars trained with GT-Aug exhibits some misdetections (in the red boxes) and missed detections in certain scenarios. Conversely, as shown in Figure 9d, PointPillars trained with RS-Aug performs better at detecting pedestrians and cyclists with sparse surface point clouds, effectively reducing missed detections.
Figure 9. Recognition results of PointPillars using different data augmentation algorithms: (a) scene; (b) ground truth; (c) prediction of GT-Aug-PointPillars; (d) prediction of RS-Aug-PointPillars. Cars are represented by red bounding boxes, cyclists by blue bounding boxes, and pedestrians by green bounding boxes.

4.3. Ablation Experiment

This section analyzes the impact of each component on recognition accuracy in the KITTI 3D object validation dataset. The replication results of PointPillars on the validation dataset are taken as the baseline. As shown in Table 5, the average recognition accuracy under three levels of difficulty for cars, pedestrians, and cyclists improved by 1.14%, 0.96%, and 1.2%, respectively, after implementing the ASP module. This suggests that multi-scale pillar sampling is effective for recognizing small and distant objects. After implementing the CPA module, the average recognition accuracy for the three categories increased by 3.03%, 2.09%, and 2.74%, respectively. This indicates that the CPA module can effectively strengthen the feature correlation of point clouds in each pillar, enabling the network to learn richer contextual features, and thereby improving recognition accuracy. When combining these two modules, ASCA-PointPillars achieved an average improvement of 4.23%, 3.1%, and 3.58% in the three categories, respectively, achieving the highest average recognition accuracy for the car category. This demonstrates the effectiveness of the ASP and CPA modules. Using the RS-Aug algorithm alone increased the average recognition accuracy of the three categories by 2.59%, 2.49%, and 2.96%. When ASCA-PointPillars was combined with RS-Aug, it achieved the highest average recognition accuracy for pedestrians and cyclists, with improvements of 4.09%, 5.08%, and 5.07% in the three categories, respectively. This shows that RS-Aug can effectively improve the recognition performance of categories that have fewer instances in a dataset.
Table 5. The effect of each component on accuracy (%). “√” indicates that the component is used.

5. Conclusions

This study introduces ASCA-PointPillars, a novel object recognition algorithm. The algorithm leverages an ASP module sampling point clouds with multi-scale pillars to mitigate the spatial information loss typically associated with single-scale pillar sampling. Furthermore, a CPA module is proposed to establish interconnections among points within the same pillars. To address the issue of the imbalanced distribution of various categories in a dataset, an RS-Aug algorithm is also proposed. The experimental results show that the proposed ASCA-PointPillars can effectively improve the recognition performance of distant and smaller objects, and the proposed RS-Aug algorithm can effectively improve the recognition performance of categories that have fewer instances in a dataset. The recognition accuracy of ASCA-PointPillars for pedestrians exceeds that of other comparison algorithms, demonstrating its advantage in identifying small objects. However, the recognition accuracy of the algorithm for cars and cyclists does not reach the highest level. This may be because encoding point clouds of larger objects such as cars and cyclists into pillars leads to greater information loss. Therefore, future efforts will continue to focus on implementing measures to reduce information loss during the encoding process for large objects.

Author Contributions

Methodology, X.Z. and S.C.; validation, X.Z. and S.C.; formal analysis, Y.G. and J.Y.; resources, Y.G.; writing—original draft preparation, X.Z.; writing—review and editing, Y.G. and J.Y.; project administration, J.Y.; funding acquisition, Y.G. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Xi’an Scientific and Technological Projects (grant numbers: 23ZDCYJSGG0024-2022, 23ZDCYJSGG0011-2022, and 21RGZN0005) and by Key Research and Development Program of Shaanxi Province (grant number: 2024GX-YBXM-530).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  2. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
  3. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  4. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep Hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  5. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  6. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  7. Shi, W.; Rajkumar, R. Point-gnn: Graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  8. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  9. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  10. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35. [Google Scholar]
  11. Hu, J.S.; Kuai, T.; Waslander, S.L. Point density-aware voxels for lidar 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  12. Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-equivariant 3d object detection for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 13–14 February 2023; Volume 37. [Google Scholar]
  13. Rong, Y.; Wei, X.; Lin, T.; Wang, Y.; Kasneci, E. DynStatF: An Efficient Feature Fusion Strategy for LiDAR 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  14. Wang, H.; Shi, C.; Shi, S.; Lei, M.; Wang, S.; He, D.; Schiele, B.; Wang, L. Dsvt: Dynamic sparse voxel transformer with rotated sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  15. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  16. Li, J.; Luo, C.; Yang, X. PillarNeXt: Rethinking network designs for 3D object detection in LiDAR point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  17. Shi, G.; Li, R.; Ma, C. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022. [Google Scholar]
  18. Shi, G.; Li, R.; Ma, C. Pillar R-CNN for point cloud 3D object detection. arXiv 2023, arXiv:2302.13301. [Google Scholar]
  19. Lozano Calvo, E.; Taveira, B. TimePillars: Temporally-recurrent 3D LiDAR Object Detection. arXiv 2023, arXiv:2312.17260. [Google Scholar]
  20. Zhou, S.; Tian, Z.; Chu, X.; Zhang, X.; Zhang, B.; Lu, X.; Feng, C.; Jie, Z.; Chiang, P.Y.; Ma, L. FastPillars: A deployment-friendly pillar-based 3D detector. arXiv 2023, arXiv:2302.02367. [Google Scholar]
  21. Fan, L.; Yang, Y.; Wang, F.; Wang, N.; Zhang, Z. Super sparse 3d object detection. arXiv 2023, arXiv:2302.02367. [Google Scholar] [CrossRef]
  22. Fan, L.; Wang, F.; Wang, N.; Zhang, Z. Fsd v2: Improving fully sparse 3d object detection with virtual voxels. arXiv 2023, arXiv:2308.03755. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  24. Hu, X.; Duan, Z.; Huang, X.; Xu, Z.; Ming, D.; Ma, J. Context-aware data augmentation for lidar 3d object detection. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  25. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  27. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision, Proceedings of the ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  28. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015. [Google Scholar]
  29. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010. [Google Scholar]
  30. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  31. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
