1. Introduction
With ongoing advances in autonomous driving, virtual reality, and intelligent manufacturing, among other application scenarios, three-dimensional object detection has become a prominent technique in computer vision [1]. Advances in LiDAR manufacturing have gradually increased the density of its output point clouds, improving measurement accuracy. As a result, 3D object detection algorithms that take point clouds as input have become the mainstream technology in the vision field. Current 3D object detection on point clouds relies primarily on point-based and voxel-based methods. While PointNet [2] and VoxelNet [3] laid the groundwork for feature extraction from point clouds, the sparsity and complexity of the data make targets difficult to capture, leading to poor detection accuracy [4,5].
The PointPillars 3D object detection algorithm has gained wide popularity in industry owing to its high efficiency and accuracy [6]. The algorithm converts three-dimensional point cloud voxel [3] features into a two-dimensional image for detection, thereby greatly reducing computational cost. The main challenge at present is to improve detection accuracy while maintaining this efficiency, and scholars in various countries have proposed different methods to address it. Ryota et al. suggest using multiple voxel sizes and fusing the features after extraction [7]. Li et al. recommend incorporating an attention mechanism within the voxels and introducing relationship features between different point clouds [8]. Konrad et al. [9] replace the backbone with alternative feature extraction networks. These approaches, however, yield only marginal improvements in detection accuracy. Observing that a considerable share of the parameters in PointPillars is concentrated in the backbone network, Konrad et al. substituted lightweight backbones such as MobileNet and DarkNet for the original convolutions, and the results showed that this modification greatly enhances detection efficiency. After careful investigation, this study concludes that improving the backbone network (the 2D convolutional neural network) is the most effective way to maximize detection accuracy and efficiency: redesigning the down-sampling, introducing an attention mechanism, and improving the up-sampling in the backbone can significantly enhance both precision and efficiency. Compared with the existing literature, the major contributions of this work are summarized below; the structure of the new backbone network is shown in
Figure 1.
Enhancing the features of the pseudo-images generated by the algorithm via channel attention, spatial attention, and 1 × 1 2D convolution (see the sketch after this list).
FasterNet [10], a more lightweight network, replaces the original feature extraction network, which both enhances detection accuracy and substantially reduces the number of module parameters.
Replacing the original model's deconvolution (transposed convolution) with two alternative up-sampling methods and enhancing features via a proximity-scale sampling approach.
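As a concrete illustration of the first contribution, the following is a minimal sketch of channel attention and spatial attention followed by a 1 × 1 convolution, assuming a CBAM-style design; the module structure and tensor sizes are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of pseudo-image enhancement: channel attention, spatial
# attention, then a 1x1 conv. CBAM-style design assumed; names illustrative.
import torch
import torch.nn as nn


class PseudoImageEnhancer(nn.Module):
    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # 1x1 convolution re-mixes channels after attention.
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from per-pixel channel statistics.
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_conv(stats))
        return self.mix(x)


# Example: enhance a 64-channel PointPillars-style pseudo-image.
enhanced = PseudoImageEnhancer(64)(torch.randn(1, 64, 496, 432))
```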
2. Related Work
LiDAR data are commonly processed with Deep Convolutional Neural Networks (DCNNs), which integrate the entire processing flow and therefore carry high computational and storage complexity. Target detection algorithms based on DCNNs outperform traditional methods in detection accuracy and recognition rate [9]. Traditional 2D image detection algorithms, which use a camera as the data source, rely on an external light source and cannot precisely determine information such as the distance, position, depth, and angle of targeted vehicles and pedestrians. In contrast, LiDAR generates three-dimensional point cloud data that capture an object's position, distance, depth, and angle, making the data representation more faithful to reality. LiDAR also offers precise ranging and does not require visible light [11].
The unstructured and variable-size nature of point clouds prevents 3D target detectors from processing them directly; the point cloud must therefore be converted into a more compact structure through some form of representation [12]. Currently, there are two primary representations: point-based and voxel-based. The converted point cloud can then undergo feature extraction via convolutional layers or established backbone networks. Networks vary in their feature extraction capability and parameter count, so the choice of network should be evaluated case by case.
2.1. Point-Based and Voxel-Based Approaches to Target Detection
Qi et al. first proposed PointNet, which extracts features directly from the unordered point cloud. A T-Net sub-network predicts an affine transformation matrix to align the input points and their features, a multilayer perceptron extracts per-point features, and a symmetric function (max pooling) aggregates them, giving the network invariance to point order [2]. PointNet, however, does not capture local features well. To obtain more effective features, PointNet++ [13] employs a Hierarchical Set Abstraction (HSA) layer built from several Set Abstraction modules, each differing in the number of sampling points and the sampling radius; this design markedly improves local feature extraction. PointRCNN [14] uses PointNet++ for feature extraction, segments foreground from background based on the extracted features, predicts a 3D box for each foreground point, and finally refines the boxes of interest. While these approaches exploit the geometric features of the point cloud to the fullest and consequently improve detection performance, they demand extensive time and computational resources during the feature extraction phase [15].
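To make the symmetric-function idea concrete, the following is a minimal PointNet-style sketch in PyTorch: a shared MLP lifts each point independently, and a max pool produces a global feature invariant to point order. The layer sizes are illustrative, not PointNet's exact configuration.

```python
# Minimal PointNet-style sketch: shared per-point MLP + symmetric max pool.
import torch
import torch.nn as nn


class TinyPointNet(nn.Module):
    def __init__(self, in_dim: int = 3, feat_dim: int = 256):
        super().__init__()
        # The same MLP is applied identically to every point.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, feat_dim), nn.ReLU(inplace=True),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3). Max over points is a symmetric
        # function, so permuting the points leaves the output unchanged.
        per_point = self.mlp(points)           # (B, N, feat_dim)
        global_feat, _ = per_point.max(dim=1)  # (B, feat_dim)
        return global_feat


net = TinyPointNet()
pts = torch.randn(2, 1024, 3)
shuffled = pts[:, torch.randperm(1024)]
# Permutation invariance: both orderings give the same global feature.
assert torch.allclose(net(pts), net(shuffled))
```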
Transforming point clouds into voxels involves dividing the space occupied by the point cloud into blocks of fixed size and then extracting features from the points inside each voxel block. The VoxelNet algorithm uses a Voxel Feature Encoder (VFE) to encode and standardize features within each voxel, applies a 3D convolutional neural network to extract voxel features, and finally uses an RPN to generate detection boxes. Because of the large number of voxels, feature extraction is slow. Exploiting the sparsity of the point cloud, Yan et al. [16] proposed applying 3D sparse convolution to the regular voxels of VoxelNet, which significantly improved processing speed. Lang et al. [17] proposed PointPillars, which collapses voxels into vertical columns (pillars) whose height spans the entire point cloud space. A simple PointNet converts the pillars into a pseudo-image, allowing 2D convolutions to extract features while preserving 3D characteristics, which greatly increases operating speed. Thanks to its speed and strong accuracy, PointPillars is the most widely used 3D detection algorithm in industry. Voxel-based techniques, however, have lower detection accuracy than point-based methods, so maintaining detection efficiency while improving accuracy has become a research hotspot.
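The pillar-to-pseudo-image step can be sketched as follows, assuming pillar features have already been encoded by a small PointNet; the function and grid sizes are illustrative, not the original implementation.

```python
# Sketch: scatter encoded pillar features onto a dense BEV grid ("pseudo-
# image") so a standard 2D CNN can process them.
import torch


def scatter_to_pseudo_image(pillar_feats: torch.Tensor,
                            coords: torch.Tensor,
                            grid_h: int = 496, grid_w: int = 432) -> torch.Tensor:
    """pillar_feats: (P, C) features of P non-empty pillars.
    coords: (P, 2) integer (row, col) BEV grid index of each pillar."""
    c = pillar_feats.shape[1]
    canvas = pillar_feats.new_zeros(c, grid_h * grid_w)
    flat_idx = coords[:, 0] * grid_w + coords[:, 1]  # linearize grid index
    # Place each pillar at its grid cell (duplicates simply overwrite here;
    # real implementations keep one entry per non-empty pillar).
    canvas[:, flat_idx] = pillar_feats.t()
    return canvas.view(1, c, grid_h, grid_w)  # a 2D "image" for the CNN


# Example: 12,000 pillars with 64-channel features, as in the paper's setup.
feats = torch.randn(12000, 64)
coords = torch.stack([torch.randint(0, 496, (12000,)),
                      torch.randint(0, 432, (12000,))], dim=1)
pseudo = scatter_to_pseudo_image(feats, coords)  # (1, 64, 496, 432)
```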
2.2. Feature Extraction Networks
Feature extraction networks employ different blocks and depths to extract features. To prevent gradient explosion and vanishing gradients in deeper models, He et al. [18] proposed ResNet, whose residual connections allow network layers to reach significant depth. The MobileNet [19] algorithm reduces the number of network parameters by using depthwise separable convolutions and introducing shrinkage hyperparameters (width and resolution multipliers), providing excellent support for real-world deployment on devices with limited computational power. DarkNet employs a repeatedly stacked architecture of down-sampling convolutions and residual blocks; renowned for its speed and efficiency, particularly in the YOLO model series, this structure enables real-time or near real-time target detection on devices with limited computational power [20]. Feature extraction networks are progressing toward greater accuracy with fewer parameters, and choosing an appropriate network can significantly enhance detection efficiency and accuracy.
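The parameter saving behind MobileNet's depthwise separable convolution is easy to see in code; the block below is a generic sketch, not MobileNet's exact layer.

```python
# Depthwise separable convolution: per-channel 3x3 filter + 1x1 channel mix.
import torch.nn as nn


def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        # Pointwise: 1x1 conv mixes information across channels.
        nn.Conv2d(in_ch, out_ch, 1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


block = depthwise_separable(64, 128)
# Prints 9344; a plain 3x3 Conv2d(64, 128) alone has 73,856 parameters.
print(sum(p.numel() for p in block.parameters() if p.requires_grad))
```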
4. Results
In this study, we conducted experimental evaluations on the publicly available KITTI dataset, which comprises 7481 training samples and 7518 test samples captured from autonomous driving scenes. Following the split used by the PointPillars algorithm, we partitioned the training data into a training set of 3712 frames and a validation set of 3769 frames; the training set was used for training, and the validation set for the experimental studies. The KITTI dataset covers three categories: car, cyclist, and pedestrian. Each category is divided into three difficulty levels, easy, moderate, and hard, determined by factors such as 3D object size, occlusion level, and truncation level [28].
The experimental environment used in this study comprised Ubuntu 18.04 LTS, CUDA 11.4, Python 3.8, and PyTorch 1.13.1. For end-to-end training, the Adam optimizer was applied with a batch size of 8, a maximum of 100 epochs, and no ground-truth frame sampling. In the experiments, each pillar was set to a size of [0.16, 0.16] m in the X–Y plane, the maximum number of pillars was 12,000, and the global point cloud was randomly scaled within the range [0.95, 1.05].
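For reference, the hyperparameters above can be collected into a single configuration sketch; the keys are illustrative and not tied to any specific framework's config schema.

```python
# Training configuration as stated in the text; key names are illustrative.
train_cfg = dict(
    optimizer="Adam",
    batch_size=8,
    max_epochs=100,
    pillar_size_xy=(0.16, 0.16),        # metres in the X-Y plane
    max_num_pillars=12_000,
    global_scaling_range=(0.95, 1.05),  # random global point-cloud scaling
)
```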
4.1. Experimental Results and Analysis
The algorithms in this paper were tested on the KITTI dataset and evaluated using the official KITTI metric computed at 40 recall positions. The IoU (Intersection over Union, one of the evaluation indicators) threshold was set at 0.7 for cars and 0.5 for pedestrians and cyclists.
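For illustration, the sketch below shows how such an IoU threshold is applied when matching a predicted box to ground truth. Note that the official KITTI metric uses rotated boxes, so the axis-aligned computation here is a simplification.

```python
# Axis-aligned BEV IoU (simplified; KITTI's official metric rotates boxes).
def bev_iou(a: tuple, b: tuple) -> float:
    """a, b: axis-aligned BEV boxes as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# A car prediction counts as a match only if IoU >= 0.7; pedestrians and
# cyclists use the looser 0.5 threshold.
print(bev_iou((0, 0, 4, 2), (0.5, 0, 4.5, 2)))  # ~0.778 -> match at 0.7
```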
Table 2,
Table 3 and
Table 4 compare the results of this paper with those of the PointPillars algorithm on the KITTI dataset for the car, pedestrian, and cyclist categories. Detection performance is analyzed from three perspectives: BEV (accuracy of detection boxes in the bird's-eye view), 3D (accuracy of 3D detection boxes), and AOS (accuracy of the detected target's rotation angle).
The data in the tables above show that up-sampling with either Dysample or Carafe improves detection on the KITTI dataset at every difficulty level, although the gains are uneven across object types. Carafe up-sampling yields the more substantial improvement: in BEV, the cyclist score at easy difficulty rises from 81.44 to 85.64, and in 3D mode there is a gain of 6.51 at easy difficulty. In AOS mode, the most significant enhancement is observed for pedestrian, with a maximum improvement of 6.24 at moderate difficulty, and the gains at easy and hard difficulty are also apparent. Dysample shows its most noticeable improvement on pedestrian, but its gains are overall less pronounced than Carafe's, which may be attributed to its weaker dynamic feature extraction. Meanwhile, the precision of the optimized algorithm is significantly higher than that of other algorithms such as VoxelNet and SECOND.
It is apparent that the Dysample and Carafe up-sampling techniques differ in their effect on target feature reconstruction: if detection accuracy is the priority, Carafe up-sampling is the suitable option, whereas if detection efficiency matters more, Dysample may be more appropriate.
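As an illustration of the content-aware reassembly that distinguishes Carafe-style up-sampling from fixed interpolation, the following is a simplified, self-contained sketch: a small convolution predicts a per-location kernel, which is then applied over a k × k neighborhood of the low-resolution feature map. This follows the spirit of CARAFE but is not the original implementation.

```python
# Simplified content-aware up-sampling in the spirit of CARAFE.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCarafe(nn.Module):
    def __init__(self, channels: int, scale: int = 2, k: int = 5):
        super().__init__()
        self.scale, self.k = scale, k
        # Predict one k*k reassembly kernel per output (high-res) location.
        self.kernel_pred = nn.Conv2d(channels, scale * scale * k * k, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s, k = self.scale, self.k
        # Rearrange predicted kernels to high resolution and normalize.
        kernels = F.pixel_shuffle(self.kernel_pred(x), s)  # (B, k*k, sH, sW)
        kernels = F.softmax(kernels, dim=1)
        # k x k neighborhoods of the low-res input, copied to high resolution.
        neigh = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        neigh = neigh.repeat_interleave(s, dim=3).repeat_interleave(s, dim=4)
        # Weighted sum over the neighborhood = content-aware reassembly.
        return (neigh * kernels.unsqueeze(1)).sum(dim=2)  # (B, C, sH, sW)


up = SimpleCarafe(64)
out = up(torch.randn(1, 64, 32, 32))  # -> (1, 64, 64, 64)
```

Unlike bilinear interpolation or a transposed convolution with fixed weights, the reassembly kernel here is predicted from the feature content itself, which is what allows this family of methods to reconstruct sharper object boundaries.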
4.2. Ablation Experiments
In this section, we present ablation experiments to evaluate the key factors that affect detection accuracy.
The results of the ablation experiments, shown in
Table 5 and
Table 6, indicate that replacing each module separately yielded an improvement over the baseline configuration. Replacing the backbone network brought the most significant improvement, with FasterNet showing the most noticeable increase in detection accuracy. The attention mechanism module improved detection by nearly 1 point at all difficulty levels. Both up-sampling methods, Carafe and Dysample, brought significant improvement, with the gain more pronounced for Carafe.
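FasterNet's parameter savings stem from its partial convolution (PConv) block, in which only a fraction of the channels undergo spatial convolution while the rest pass through untouched. The sketch below assumes the published FasterNet design; names and sizes are illustrative.

```python
# Partial convolution (PConv), the core block of FasterNet: convolve only a
# subset of channels and carry the remainder through unchanged.
import torch
import torch.nn as nn


class PartialConv(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.conv_ch = int(channels * ratio)
        # Spatial mixing on a channel subset only -> far fewer FLOPs/params.
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        head, tail = x[:, :self.conv_ch], x[:, self.conv_ch:]
        # Untouched channels are re-mixed by later 1x1 layers in the network.
        return torch.cat([self.conv(head), tail], dim=1)


x = torch.randn(1, 64, 124, 108)
print(PartialConv(64)(x).shape)  # torch.Size([1, 64, 124, 108])
```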
4.3. Visualization Analysis
To demonstrate the method's effectiveness in improving detection accuracy, this section presents visualizations and analysis of the algorithm's results. Because the algorithm uses pure LiDAR data, the relevant information cannot be represented intuitively; the front-view camera image is therefore shown alongside the LiDAR detection results in the BEV view for comparison.
After analyzing the four scenes in
Figure 5, it is evident that the PointPillars network misidentifies point cloud shapes reflected from objects such as tree trunks and utility poles as cars or pedestrians, whereas the model with the improved backbone network has a significantly lower false detection rate. The improved backbone strengthens the features of the point cloud pseudo-images, while the lightweight network and the proximity-scale sampling method retain more semantic information and improve the module's feature extraction. As a result, the improved PointPillars reduces false and missed detections relative to the original version, ultimately improving network detection performance.
In the first scene, PointPillars identifies a far-off iron shed as a vehicle, an error the optimized algorithm avoids. On the left side, where there are many trees, PointPillars recognizes the trees as pedestrians; this is greatly reduced in the optimized algorithm, although one false detection remains.
In the second scene, PointPillars produces a false positive by identifying the iron ladder near the car as a car, an error the optimized schemes avoid. Both PointPillars and PP Carafe also produce a false positive nearby, mistaking a road sign for a person. The optimized schemes predict the orientation of the two cars in the upper left more accurately. All three schemes produce a false detection on the right near the red streetlight.
The third scene is straightforward. All three schemes detect the bicycle following the car. For the car directly in front, PointPillars produces one false positive (a lateral car), while the optimized schemes produce none. PointPillars is especially prone to false detections on the trees on the left side; among the optimized schemes, PP Carafe produces only one false positive there, and PP Dysample handles the trees on the left side without error.
The fourth scene is complex, involving multiple object types. PointPillars still produces significant false positives on the left side, and all three schemes miss detections directly ahead. PP Dysample has more severe detection errors on the left side, and for one vehicle it fails to discriminate the direction. For nearby vehicles, the optimized scheme markedly improves the predicted heading angle, giving much better direction prediction than PointPillars.
In summary, the optimization scheme in this paper produces better visualization results than PointPillars. The scheme nevertheless has limitations. It improves near objects more than distant targets, possibly because distant objects return few points, leading to errors; to address this, we suggest exploring a multi-frame aggregation method to increase the number of LiDAR points on distant objects. The enhancement of small targets also remains imperfect: after the point cloud is processed into an image, the connections between different voxels may be ignored, resulting in a loss of contextual information. Our next step will focus on point cloud encoding.
5. Conclusions
In this paper, we present a 3D object detection algorithm based on PointPillars with an enhanced backbone network. We first generate a point cloud pseudo-image and then employ channel and spatial attention to aggregate contextual information, optimize image features, and establish connections across channels and locations. In addition, the lightweight FasterNet network and proximity-scale up-sampling are introduced to strengthen the feature extraction capability of the convolutional neural network and preserve the integrity of deep point cloud features. The results show significant improvement over the PointPillars algorithm: by introducing the attention mechanism, replacing the feature extraction network, and adopting a new up-sampling method, target detection accuracy is significantly enhanced.
However, the experiment has a slight limitation: despite the improved detection accuracy, false detections and omissions cannot be distinguished at a glance, and in scenes with many pedestrians or bicycles, mutual interference may affect the detection results.