SAE3D: Set Abstraction Enhancement Network for 3D Object Detection Based Distance Features

With the increasing demand from unmanned driving and robotics, more attention has been paid to accurate point-cloud-based 3D object detection. However, due to the sparseness and irregularity of the point cloud, the most critical problem is how to utilize the relevant features more efficiently. In this paper, we propose a point-based object detection enhancement network that uses distance features to improve detection accuracy in 3D scene understanding. Firstly, the distance features are extracted from the raw point sets and fused with the raw reflectivity features of the point cloud to maximize the use of the information in the point cloud. Secondly, we enhance the distance features and raw features, which we collectively refer to as the self-features of the key points, in the set abstraction (SA) layers with the self-attention mechanism, so that the foreground points can be better distinguished from the background points. Finally, we revise the group aggregation module in the SA layers to enhance the feature aggregation effect of the key points. We conducted experiments on the KITTI and nuScenes datasets, and the results show that the proposed enhancement method improves detection accuracy over its baseline networks.


Introduction
With the development of unmanned driving and other technologies, understanding 3D scenes based on the point cloud has become a popular research topic. Compared to traditional images, point cloud data have unique advantages. The strong penetration of LiDAR makes the point cloud less susceptible to external factors such as weather and light. However, point clouds are also characterized by sparseness and disorder, and the reflectivity of LiDAR decreases as the measurement distance increases. This leads to poor characterization of distant objects, causing a drop in detection accuracy. How to deal with these characteristics of point clouds has become the key to improving accuracy in 3D detection tasks.
In recent years, to efficiently utilize the information provided by the point cloud, researchers have proposed a number of schemes, as shown in Figure 1. These are mainly divided into two types based on their processing methods: (a) grid-based methods, which partition the sparse points into regular voxel or pillar grids and process them through 3D or 2D convolutional networks; and (b) point-based methods, which perform feature learning directly on point sets, most often using set abstraction (SA) layers to sample the key points and aggregate features.
Compared to the point-based methods, the grid-based methods increase the computational speed of network inference, but also cause information loss during voxelization. Therefore, to ensure the full utilization of the information in point sets, an enhancement network for point-based methods is proposed in this paper.
Sensors 2024, 24, 26
The core of the point-based 3D object detection methods is the SA layer, which was first proposed by Qi et al. [1]. In prior research, the SA layer has been revised using many methodologies, and how to fully utilize the information of each point while reducing inference time has become a priority in point-based methods. In 3DSSD [2], to speed up inference, the researchers first adopted a single-stage 3D object detector and proposed a feature-based farthest point sampling module (F-FPS). This module utilizes the feature information of the point sets to sample key points in order to maintain adequate interior points of different foreground instances. SASA [3] uses a semantic-segmentation-based farthest point sampling module (S-FPS), which utilizes point cloud features to distinguish the foreground points from the background points through a small semantic segmentation module in order to better select key points. However, these algorithms only utilize the raw features of the point cloud, i.e., the reflectivity and the 3D coordinates that reveal the spatial information of each point, and distance characteristics are not taken into consideration. In actual measurement, due to the attenuation characteristics of LiDAR and the limitation of the observation angle, the reflectivity of the measured point decreases as the object moves further away, and the projection of the object in the point cloud also shrinks.
Therefore, based on the distance characteristics related to the point cloud, we propose three feature enhancement modules to more efficiently utilize the semantic information contained within the point cloud. Firstly, we propose the initial feature fusion module, in which the distance feature is extracted from the point cloud and incorporated with the raw features of each point. Secondly, we introduce a key point feature enhancement module. During the group aggregation in SA, the self-features of the key points are weakened, although they are crucial for distinguishing whether a key point is a foreground or background point. Therefore, after each sampling and aggregation step, the multi-head attention mechanism is used to strengthen the features of the key points and fuse them with the aggregated features. Finally, to enhance the effect of group aggregation in SA, we revise the original grouping module. In the original module, the points nearest to each sampled key point are taken to participate in feature aggregation. However, only the spatial location is considered, which may result in features belonging to different categories being mixed together during the aggregation process. This leads to a decrease in the performance of the semantic segmentation module before S-FPS, which in turn degrades the sampling effect of S-FPS. Therefore, we optimize the grouping module by selecting, from among the points closest to the key point, those with the closest features as the aggregation points.
In summary, the main contributions of this article are as follows:
• We propose a key points self-features enhancement module to enhance the self-features of the key points. In this module, we introduce the multi-head attention mechanism to enhance the raw features and distance features, so as to retain the semantic information of the key points as much as possible in each SA layer.
• We propose an initial feature fusion module to extract the distance features of the point cloud and fuse the distance features into the raw features of the point sets. This module makes the features of the distant points more significant and thus improves the detection accuracy of the distant instances.
• We revise the group aggregation module in the set abstraction. We make a second selection after the first selection of points within a fixed distance around the key point. In the second selection, we take the features into account to enhance the sampling effect of S-FPS.

Related Work
Since the growing volume of point cloud data brings huge challenges to existing point cloud processing networks, it is important to compress the point cloud before processing it. The compression algorithm used may affect the subsequent detection performance. For example, Sun X et al. [4] optimized the processing of large-scale point cloud data, and their algorithm in [5] further streamlines the network for point cloud processing. The algorithm in [6] makes the spatial distribution of the compressed point cloud more similar to that of the original point cloud, which is very useful for subsequent point cloud processing.
The point cloud compression algorithms mentioned above play a significant role in the point cloud detection algorithms we will discuss next.

Grid-Based Methods
Grid-based methods are mainly divided into two categories: voxel-based methods and pillar-based methods. In voxel-based methods, an irregular point cloud is first converted into regular voxels, which are then fed into the network. Voxel-Net [7] is a pioneering network that converts the point cloud into voxels and then utilizes 3D convolutional networks to predict 3D bounding boxes. Yan et al. [8] proposed 3D sparse convolution, which reduces the computation of traditional 3D convolution and greatly improves the detection efficiency of voxel-based detection networks. Voxel-Transformer [9] and Voxel Set Transformer [10] introduce modules such as Transformer [11] and Set Transformer [12], respectively, on the basis of voxels to improve the detection accuracy. SFSS-Net [13] filters background points before voxelization to reduce computational complexity. Pillar-based methods such as Point Pillars [14] divide the space into regular pillars, which are compressed and then fed into a 2D convolutional network, greatly increasing the network inference speed. Pillar Net [15] uses a sparse-convolution-based encoder network for spatial feature learning and a Neck module for high-level and low-level feature fusion to improve the accuracy of pillar-based detection methods. Pillar Next [16] compares different local point aggregators (pillar, voxel, and multi-view fusion) from the perspective of computational budget allocation, and shows that pillars can achieve better performance than voxels. Grid-based methods lose more semantic information in the process of converting an irregular point cloud into regular voxels or pillars. This may lead to poor final detection accuracy.

Point-Based Methods
Point-based methods generally perform feature extraction directly on the point sets. This approach obtains key points by sampling and aggregates the points around them by group aggregation for feature extraction. Point-based methods were first proposed by Qi et al. [17] and later improved and refined by Qi et al. [1]. Shi et al. [18] first proposed to extract the foreground points by segmentation and utilize the features of these points for bounding box regression to improve the detection accuracy. Yang et al. [2] utilized one-stage detection to improve the inference speed and proposed F-FPS to make the sampled key points closer to the foreground instances. SASA [3] predicts a score for each point with a small semantic segmentation module so that the abstracted point sets focus on object areas. Chen et al. [19] introduced density information from point clouds using a Multilayer Perceptron (MLP) and integrated it with the features extracted by grouping operations in the point-based method.
Since point-based 3D object detection processes the point sets directly, point-based methods incur high computational cost and long network inference time. However, relative to the voxel-based methods, point-based methods can maximize the retention of the semantic information of the point cloud and achieve higher detection accuracy. Therefore, this paper adopts a point-based object detection network and aims to utilize the original information of the point cloud more efficiently.

Proposed Methods
In this section, we introduce in detail the network architecture of the SAE3D proposed in this paper. This enhancement network consists of three main parts: an initial feature fusion module, a key points self-features enhancement (KSFE) module, and a revised group aggregation (RGA) module.
As shown in Figure 2, the overall architecture is a one-stage point-based 3D object detection network. Firstly, we define the raw points fed into the network as $P$; the initial feature fusion module extracts the distance features and integrates them with the initial features of each point in $P$. After integration, we feed $P$ into the backbone, which contains three SA layers, and we refer to the input of each SA layer as $P_1$. In each SA layer, we first sample the key points $K$ from $P_1$, and then we feed $K$ into the key points feature enhancement module to enhance the self-features of $K$. After enhancement, we obtain $K_1$. Finally, the revised group aggregation module is used to aggregate the points around $K_1$ to obtain the aggregated key points $K_2$. $K_2$ is the final output of each SA layer.

Figure 2. Overall architecture of SAE3D. The raw point cloud $P$ goes through the initial feature fusion module to obtain $P_1$, which is input to the backbone consisting of three SA (set abstraction) layers. In each SA layer, $P_1$ is first downsampled, then passed through the KSFE (key points self-features enhancement) module to obtain $K_1$, and finally through the RGA (revised group aggregation) module to obtain $K_2$.
After the backbone, to improve the prediction accuracy, this paper adopts the bounding box prediction mechanism of Vote-Net [20] to predict the bounding boxes, as SASA [3] does.
We will explain each module in detail below.

Initial Features Fusion Module
Before the SA layers, we utilize the initial features fusion module to extract the distance features and integrate them with the features of the raw point sets. The relevant features of the raw point cloud are very sensitive to the measurement distance. In actual measurement, the reflectivity of LiDAR decreases as the distance increases, so the features of long-distance points are not distinctive, which reduces the detection accuracy. Therefore, we believe that distance features are very important for improving target detection accuracy.

Distance Features
Traditional algorithms often involve calculations such as squares and roots when calculating distance. This costs a lot of computational resources if we directly let the distance represent the distance feature of each point in the point cloud. Therefore, we define the distance feature $DF_p$ of each point $p$ in the raw point set $P$ from its coordinates $(x_p, y_p, z_p)$ and a scaling factor Scale: we utilize the sum of the absolute values of the three-axis coordinates of $p$ to represent the distance of the point. The value of Scale is set in the experiments.
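As an illustration, the distance feature can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' implementation: it assumes that the scaling factor divides the sum of absolute coordinates, which the text does not state explicitly, and the function name `distance_feature` is hypothetical. The default of 120 is the Scale value reported for the KITTI experiments.

```python
import numpy as np

def distance_feature(points_xyz, scale=120.0):
    """Distance feature of each point: sum of absolute coordinates, rescaled.

    Assumption (not confirmed by the text): Scale divides the sum.
    No squares or square roots are needed, unlike the Euclidean distance.
    """
    points_xyz = np.asarray(points_xyz, dtype=np.float64)
    return np.abs(points_xyz).sum(axis=1) / scale

# Two example points: one close to the sensor, one far away.
points = np.array([[3.0, -4.0, 0.0],
                   [60.0, 30.0, -1.5]])
df = distance_feature(points)
print(df)  # the far point receives the larger distance feature
```

Under this assumed form, the feature grows monotonically with the distance of a point from the sensor along each axis, so distant points receive larger values.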

Feature Fusion
We process the initial feature fusion as shown in Figure 3. Since the reflectivity of each point decreases as the measurement distance increases, we add the distance features to the initial features of the point cloud to strengthen the representation of long-distance points. We then fuse the coordinates and the related features of the point cloud through a concatenation (splicing) operation. However, these features have not yet been processed sufficiently, so we use a Multilayer Perceptron (MLP) to further extract deep features.
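The fusion steps described here (add the distance feature to the raw feature, concatenate with the coordinates, then apply an MLP) might look as follows. This is only a sketch under stated assumptions: the distance feature is taken as the scaled sum of absolute coordinates with division by Scale, the MLP is reduced to a single randomly initialized linear layer with ReLU, and the names `fuse_initial_features`, `hidden`, and `seed` are hypothetical.

```python
import numpy as np

def fuse_initial_features(xyz, reflectivity, scale=120.0, hidden=16, seed=0):
    """Sketch of initial feature fusion for N points.

    Returns the fused (N, 4) input and the (N, hidden) output of a
    one-layer MLP with random (untrained) weights.
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    refl = np.asarray(reflectivity, dtype=np.float64).reshape(-1, 1)
    # Assumed form of the distance feature: sum of |x|, |y|, |z| over Scale.
    dist = np.abs(xyz).sum(axis=1, keepdims=True) / scale
    enhanced = refl + dist                           # add distance to raw feature
    fused = np.concatenate([xyz, enhanced], axis=1)  # splice with coordinates
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((fused.shape[1], hidden)) * 0.1
    bias = np.zeros(hidden)
    deep = np.maximum(fused @ weight + bias, 0.0)    # linear layer + ReLU
    return fused, deep

xyz = np.array([[3.0, -4.0, 0.0], [60.0, 30.0, -1.5]])
reflectivity = np.array([0.5, 0.1])
fused, deep = fuse_initial_features(xyz, reflectivity)
print(fused.shape, deep.shape)
```

In a trained network the weights would be learned; the random weights here only show the shapes of the tensors involved.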

Key Points Features Enhancement Module
In SA layers, the key points obtained from the sampling undergo feature aggregation with the surrounding points, and the self-features of the key points are diminished after aggregation with max or average pooling. However, each key point has its own unique features in the point cloud data, and these features contain important information, including where the key point is located in the point cloud and what kind of object the key point represents. Feature aggregation causes the loss of such information. Therefore, we propose a key points self-features enhancement module, as shown in Figure 4, which enhances the distance features and raw features of the key points and integrates them into the aggregated features.

Figure 4. Key points self-features enhancement module, where $N_1$ is the number of key points after sampling and $C_i$ is the number of feature channels in each stage. "C" stands for the concatenation operation and "+" stands for the numerical summation operation.

Feature Enhancement Module
In order to make the self-features of the key points distinctive, we adopt the multi-head self-attention mechanism to enhance the distance features and raw features of the key points. The self-attention algorithm essentially uses matrix multiplication to calculate the relationship between each patch and the other patches in the query. The specific formulas are as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (2)$$

$$Q = F W_q \qquad (3)$$

$$K = F W_k \qquad (4)$$

$$V = F W_v \qquad (5)$$

where $F$ is the self-features of the key points, $W_q$, $W_k$, and $W_v$ are the learnable weight matrices, and $d_k$ is the channel dimension of $K$. Equations (3)-(5) represent that $F$ obtains $Q$, $K$, and $V$ through three separate MLPs. After obtaining $Q$, $K$, and $V$, we use Equation (2) to obtain the attention features of each head. After that, we concatenate the outputs of the heads. Finally, we integrate the aggregated features of the key points with their self-features using an MLP to accomplish the enhancement of the self-features of the key points.
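The enhancement step can be illustrated with a small NumPy version of multi-head self-attention over the key-point self-features. This is a hedged sketch, not the authors' code: it uses the standard scaled dot-product attention of the Transformer, replaces the three MLPs with single weight matrices, and uses random untrained weights; the function names and the choice of two heads are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(F, Wq, Wk, Wv, num_heads):
    """Self-attention over N key points with C feature channels.

    F: (N, C) self-features; Wq, Wk, Wv: (C, C) projection matrices.
    Returns the concatenated head outputs (N, C) and the attention
    weights (num_heads, N, N).
    """
    n, c = F.shape
    if c % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    d = c // num_heads

    def split(x):
        # (N, C) -> (num_heads, N, d)
        return x.reshape(n, num_heads, d).transpose(1, 0, 2)

    Q, K, V = split(F @ Wq), split(F @ Wk), split(F @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (num_heads, N, N)
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                 # (num_heads, N, d)
    out = heads.transpose(1, 0, 2).reshape(n, c)     # concatenate the heads
    return out, attn

rng = np.random.default_rng(0)
F = rng.standard_normal((6, 8))                      # 6 key points, 8 channels
Wq, Wk, Wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out, attn = multi_head_self_attention(F, Wq, Wk, Wv, num_heads=2)
print(out.shape, attn.shape)
```

The output `out` would then be fused with the aggregated features of the key points by a further MLP, which is omitted here.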

Revised Group Aggregation Module
In the process of sampling key points, we follow the combined S-FPS and D-FPS sampling strategy, which is similar to that of SASA [3]. A small semantic segmentation module is adopted in the network structure to compute the classification score of each point, in order to distinguish between foreground and background points in the point cloud. The input features of the segmentation network are those obtained from the grouping of the point sets. In the general grouping operation, the selection of the points used for aggregation around the key points only considers the spatial distance from the key points and does not take into account the feature distance from the key points. We argue that this aggregation operation blurs the borderlines between different instances and reduces the effectiveness of the segmentation module in predicting the classification score of each point, thus affecting the sampling performance of S-FPS. To avoid these problems, we perform a second selection after selecting the points within a certain distance from each key point. In the second selection, we introduce the feature distance to ensure that the features of the selected points are similar to those of the key point. By doing so, we can enhance the performance of the segmentation in this network.

Group Aggregation Method
The particular operation is shown in Figure 5. First, we select $N$ points as a point set $P_N$ within the sphere with radius $R$ around the key point, and calculate the feature distance $D_f$ between each of these points and the key point from $f_{keypoints}$ and $f_n$, which represent the features of the key point and the features of the points around the key point, respectively. Before the calculation, these features go through a simple MLP to ensure that the feature channel is one-dimensional. After obtaining $D_f$, we select the $N_k$ points with the smallest $D_{fk}$ ($k = 1, 2, \ldots, N_k$) in $P_N$ as a point set $P_{N_k}$. $P_{N_k}$ is used for subsequent feature aggregation. In this way, we further strengthen the semantic information of the key points. This can help S-FPS to better distinguish the foreground points from the background points before sampling.
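The two-stage selection can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not fix: the feature distance is taken as the absolute difference of the one-dimensional features, the first stage keeps the spatially nearest points inside the ball, and the names `revised_grouping`, `n_first`, and `n_second` are hypothetical.

```python
import numpy as np

def revised_grouping(xyz, feat_1d, key_idx, radius, n_first, n_second):
    """Indices of the points aggregated around one key point.

    Stage 1: up to n_first points inside the ball of the given radius
             (nearest first; an assumption).
    Stage 2: among those, the n_second points whose one-dimensional
             feature is closest to that of the key point
             (absolute difference; an assumption).
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    feat_1d = np.asarray(feat_1d, dtype=np.float64)
    d_xyz = np.linalg.norm(xyz - xyz[key_idx], axis=1)
    in_ball = np.flatnonzero(d_xyz <= radius)
    first = in_ball[np.argsort(d_xyz[in_ball], kind="stable")][:n_first]
    d_feat = np.abs(feat_1d[first] - feat_1d[key_idx])
    return first[np.argsort(d_feat, kind="stable")][:n_second]

# Key point 0 at the origin; point 1 is spatially closest but has a very
# different feature; point 4 lies outside the ball.
xyz = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0],
                [0.2, 0.0, 0.0],
                [0.3, 0.0, 0.0],
                [5.0, 0.0, 0.0]])
feat = np.array([1.0, 5.0, 1.5, 0.75, 1.0])
group = revised_grouping(xyz, feat, key_idx=0, radius=1.0, n_first=4, n_second=3)
print(group)
```

In this toy example the spatially nearest neighbour (index 1) is dropped in the second stage because its feature differs strongly from that of the key point, which is the behaviour the revised module aims for at instance boundaries.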

Prediction Head
The overall architecture in this paper consists of three SA layers followed by a bounding box prediction network. As in SASA [3], our bounding box prediction network adopts the bounding box prediction mechanism from Vote-Net [20]. The voting point indicating the center of mass of the corresponding object is first computed from the candidate point features, and then the points in the vicinity of each voting point are aggregated to estimate the bounding box of the detected target.

Loss
The loss function in SAE3D is inherited from SASA [3]. The overall loss function combines four terms, $L_c$, $L_r$, $L_v$, and $L_{seg}$, where $L_c$ and $L_r$ are the losses for classification and regression, $L_v$ is the loss generated when calculating the vote in the point voting head proposed in Vote-Net [20], and $L_{seg}$ is the total segmentation loss proposed in SASA [3]. $L_c$ and $L_r$ are the traditional losses for object detection. They help the network better predict the bounding box and the class of the object to be detected. $L_v$ mainly serves to predict the center point of the object to improve the accuracy of bounding box prediction. $L_{seg}$ mainly supervises the semantic segmentation performed before S-FPS to better differentiate between foreground and background points. This can improve the sampling capability of S-FPS. Therefore, we adopt these loss functions to better train our model.

Datasets
The network we proposed is validated on the KITTI dataset and nuScenes dataset.

KITTI Dataset [21]
The KITTI dataset is a widely used public dataset in the field of computer vision, which is primarily utilized to study and evaluate tasks such as autonomous driving, scene understanding, and target detection. The dataset was collected on the streets of Karlsruhe, Germany, and comprises a wide range of urban driving scenarios. The KITTI dataset has become the mainstream standard for 3D object detection in traffic scenes because it provides realistic and representative data from real-world scenarios.
In the original KITTI dataset, each sample comprises multiple consecutive frames of point cloud data. In our experiment, a total of 7481 point clouds with 3D bounding boxes are included for training purposes, and 7518 samples are allocated for testing. We adopt the common setup in which the training samples are further subdivided into 3712 training samples and 3769 validation samples. Our experimental network is trained on the training split and evaluated on the validation split.

NuScenes Dataset [22]
The nuScenes dataset is one of the more challenging datasets for autonomous driving research. It comprises 380,000 LiDAR scans from 1000 scenes and is annotated with up to 10 object categories, together with 3D bounding boxes, object velocities, and attributes. The detection range is 360 degrees. The nuScenes dataset is evaluated using metrics such as the commonly used mean Average Precision (mAP) and the novel nuScenes Detection Score (NDS), which reflects the overall quality of measurements across multiple domains.
When preparing the nuScenes data, we combine the LiDAR points from the current key frame and the previous frames within 0.5 s, which involves up to 400 k LiDAR points in a single training sample. We then reduce the number of input LiDAR points. Specifically, we voxelize the point cloud from the key frame as well as the stacked previous frames with a voxel size of (0.1 m, 0.1 m, 0.1 m), then randomly select 16,384 and 49,152 voxels from the key frame and the previous frames, respectively. For each selected voxel, we randomly select one internal LiDAR point. A total of 65,536 points are fed into the network with 3D coordinates, reflectivity, and timestamps.
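The point reduction described above (voxelize, pick random voxels, keep one random point per voxel) can be sketched with NumPy. This is a minimal sketch and not the preprocessing code used in the experiments; the function name `voxel_subsample` is hypothetical, and the sketch simply returns fewer points when fewer voxels are occupied than requested, a case the text does not describe.

```python
import numpy as np

def voxel_subsample(points, voxel_size=0.1, num_voxels=16384, seed=0):
    """Keep one random point from each of up to num_voxels random voxels.

    points: (N, D) array whose first three columns are x, y, z.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    # Shuffle first so that the first point found in each voxel is a
    # uniformly random member of that voxel.
    perm = rng.permutation(len(points))
    _, first = np.unique(coords[perm], axis=0, return_index=True)
    per_voxel = perm[first]            # one point index per occupied voxel
    if len(per_voxel) > num_voxels:
        per_voxel = rng.choice(per_voxel, size=num_voxels, replace=False)
    return points[per_voxel]

# Toy example: 1000 points in a unit cube, 0.5 m voxels, keep 4 voxels.
rng = np.random.default_rng(1)
cloud = rng.uniform(0.0, 1.0, size=(1000, 4))   # x, y, z, reflectivity
sampled = voxel_subsample(cloud, voxel_size=0.5, num_voxels=4)
print(sampled.shape)
```

With the settings reported in the text, the function would be called once on the key frame with `num_voxels=16384` and once on the stacked previous frames with `num_voxels=49152`, and the two results concatenated.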

Evaluation Indicators
In the experiment on the KITTI dataset, two precision metrics are used. One is the 11-point interpolated average precision (AP) proposed by Gerard et al. [23], and the other is the average precision $AP|_{R_{40}}$ over 40 recall positions proposed by Simonelli et al. [24]. The Intersection over Union (IoU) threshold for all precision calculations is 0.70. The specific formulas of $AP|_R$ are as follows:

$$AP|_{R} = \frac{1}{|R|} \sum_{r \in R} p_{interp}(r)$$

$$p_{interp}(r) = \max_{r' : r' \geq r} p(r')$$

where $p(r)$ gives the precision at recall $r$. AP applies exactly 11 equally spaced recall levels, $R_{11} = \{0, 0.1, 0.2, \ldots, 1\}$, and $AP|_{R_{40}}$ applies the recall levels $R_{40} = \{1/40, 2/40, 3/40, \ldots, 1\}$. We mainly use AP as the accuracy indicator, and $AP|_{R_{40}}$ is applied in the ablation experiment in Section 4.5.
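To make the two metrics concrete, the interpolated average precision over a set of recall levels can be computed as in the sketch below. It follows the usual definition (at each recall level, take the maximum precision achieved at that recall or higher, then average over the levels); the function name and the toy precision-recall points are made up for illustration and are not results from the paper.

```python
import numpy as np

def interpolated_ap(recall, precision, recall_levels):
    """Average of the interpolated precision over the given recall levels."""
    recall = np.asarray(recall, dtype=np.float64)
    precision = np.asarray(precision, dtype=np.float64)
    total = 0.0
    for r in recall_levels:
        reachable = recall >= r
        # Interpolated precision: best precision at recall >= r, else 0.
        total += precision[reachable].max() if reachable.any() else 0.0
    return total / len(recall_levels)

R11 = np.linspace(0.0, 1.0, 11)      # {0, 0.1, ..., 1}
R40 = np.arange(1, 41) / 40.0        # {1/40, 2/40, ..., 1}

# Toy precision-recall points (illustrative only).
recall = [0.26, 0.56, 0.96]
precision = [1.0, 0.8, 0.5]
ap11 = interpolated_ap(recall, precision, R11)
ap40 = interpolated_ap(recall, precision, R40)
print(round(ap11, 4), round(ap40, 4))
```

The 40-point variant drops the recall level 0 and samples the curve more densely, which is why the two numbers differ for the same detector.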
In the nuScenes dataset, as mentioned above, we apply the NDS and mAP as the evaluation indicators. The NDS is expressed as follows:

$$NDS = \frac{1}{10}\left[5\,mAP + \sum_{mTP \in TP} \left(1 - \min(1, mTP)\right)\right]$$

where mTP is a mean True Positive metric and TP is the set of five such metrics: average translation error, average scale error, average orientation error, average velocity error, and average attribute error.
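The NDS can be computed from the mAP and the five true-positive error metrics with a few lines of Python. The sketch follows the standard nuScenes definition, in which each error is clipped to 1 and converted to a score before averaging with a five-fold weighted mAP; the function name and the example numbers are illustrative and are not results from the paper.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err))) / 10 over five TP errors.

    tp_errors: translation, scale, orientation, velocity, and attribute
    errors, in any order.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five true-positive error metrics")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Illustrative values only: an error above 1 contributes a score of 0.
nds = nuscenes_detection_score(0.5, [0.3, 0.25, 0.5, 1.2, 0.2])
print(nds)
```

Because the mAP carries half of the total weight, a detector with accurate boxes but poor velocity or attribute estimates can still differ noticeably in NDS from one with the same mAP.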

Experimental Setting
SAE3D is implemented based on OpenPCDet [25] and is trained on a single GPU. All experiments were performed on Ubuntu 16.04 with an NVIDIA RTX 2080 Ti.

Setting in KITTI Dataset
During the training process, the batch size is set to 2, and 16,384 points are randomly selected from the remaining points in each batch as input to the detector. In terms of network parameters, the number of key points in the three SA layers is set to 4096, 1024, and 512, respectively, and the scaling factor Scale for the distance feature is set to 120.
The Adam optimizer [26] and a periodically varying learning rate were adopted in the training for a total of 80 epochs, with the initial value of the learning rate set to 0.001. Additionally, we used three common data augmentation methods during training: random flipping of the X-axis with respect to the Y-axis, random scaling, and random rotation about the Z-axis.

Setting in nuScenes Dataset
During the training process, the batch size takes the value of 1. Adam optimizer and a periodically varying learning rate were adopted in the training for a total of 10 epochs, with the initial value of the learning rate set to 0.001.
To handle the huge number of points in the nuScenes dataset, four SA layers are adopted. The number of key points in these four SA layers is set to 16,384, 4096, 3072, and 2048, respectively.

Results
The detection performance of the SAE3D model is evaluated on the KITTI dataset and nuScenes dataset against some existing methods proposed in the literature.
In the KITTI dataset, the test set is categorized into three levels of difficulty, i.e., "Easy", "Moderate", and "Hard", based on the difficulty of detection. We take the 3D bounding box average precision (3D AP) of the "Car" category as the main evaluation metric, as this is usually adopted as the main indicator on the KITTI dataset. As shown in Table 1, compared with the baseline network SASA, the 3D AP is improved by 0.544% and 0.648% at the "Moderate" and "Hard" difficulty levels, respectively. The detailed precision improvements are shown in Section 4.5.
In the nuScenes dataset, as shown in Table 2, compared with the baseline network, SAE3D achieved 3.3% and 1.7% improvement in the indicators of NDS and mAP, respectively.

Enhancement Validation
To verify the enhancement effect of the network proposed in this paper, we utilize SASA [3] and Point-RCNN [18] as two baseline networks for testing on the KITTI dataset. Both baseline networks are point-based 3D object detection networks, where SASA [3] is a one-stage object detection network and Point-RCNN [18] is a two-stage object detection network. The experiments introduce the enhancement network proposed in this paper into both of these networks, effectively improving the detection performance of the original baseline networks.
Table 3 shows the improvement in the accuracy of the 3D bounding boxes of the "Car" category in the enhanced versions of SASA [3] and Point-RCNN [18], respectively. After the introduction of the enhancement network into SASA [3], the 3D AP of the "Car" category decreases slightly at the "Easy" difficulty, but increases by 0.544% and 0.648% at the "Moderate" and "Hard" difficulties, respectively.

Ablation Experiment
In this paper, ablation experiments are designed to verify the actual effect of each module. All modules are trained on the training set of the KITTI dataset and evaluated on the validation set for the "Car" category.
In this section, we add BBox AP, BEV AP, and AOS AP alongside 3D AP as evaluation indicators. BBox AP represents the average precision of the 2D bounding box, while BEV AP denotes the average precision of the detection boxes in bird's-eye view. These two indicators measure detection box accuracy from different perspectives, aiding in a better understanding of the spatial precision of the boxes predicted by our model. AOS AP stands for the average precision of the detected target's rotation angle, indicating the accuracy of the object orientation predicted by our model.
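The AP values in this section are interpolated averages over sampled recall positions, and the AP|R40 variant samples 40 recall positions. The sketch below illustrates the difference between the older 11-point sampling and the 40-point sampling as commonly defined for the KITTI benchmark. The function name and the toy precision-recall values are hypothetical, and the code is an illustration of the metric rather than the official evaluation script.

```python
def interpolated_ap(recalls, precisions, thresholds):
    """Average, over recall thresholds, of the best precision at recall >= threshold."""
    total = 0.0
    for t in thresholds:
        # Interpolated precision: the maximum precision achieved at any
        # recall level not lower than the threshold (0 if none is reached).
        reachable = [p for r, p in zip(recalls, precisions) if r >= t - 1e-12]
        total += max(reachable, default=0.0)
    return total / len(thresholds)


# 11-point sampling includes recall 0; 40-point sampling starts at 1/40.
R11 = [i / 10 for i in range(11)]
R40 = [i / 40 for i in range(1, 41)]

# Toy precision-recall curve with two operating points (hypothetical values).
toy_recalls = [0.25, 0.5]
toy_precisions = [1.0, 0.5]
ap_r11 = interpolated_ap(toy_recalls, toy_precisions, R11)
ap_r40 = interpolated_ap(toy_recalls, toy_precisions, R40)
```

On this toy curve the 11-point value is higher than the 40-point value, partly because the recall-0 sample always contributes the curve's best precision. This is one reason the two variants should not be compared directly with each other.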

Initial Feature Fusion Module
As shown in Table 4, the initial feature fusion module substantially improves the precision of the 3D bounding box. The improvement is most evident at the "Moderate" and "Hard" difficulty levels. Compared with the baseline network, the 3D bounding box accuracy improves by 0.551% and 0.811% at the "Moderate" and "Hard" levels, respectively. Additionally, the 2D bounding box accuracy improves by 0.186% and 0.811%, and the bounding box accuracy in BEV view improves by 0.257% and 1.048%, respectively.
As shown in Table 5, when using AP|R40, the 3D bounding box accuracy improves by 2.549% and 2.582% at the "Moderate" and "Hard" levels, respectively. The 2D bounding box accuracy improves by 1.976% and 0.533%, and the bounding box accuracy in BEV view improves by 0.295% and 2.257%, respectively.

Key Points Self-Feature Enhancement Module
As shown in Table 4, this module improves the detection accuracy of the 3D bounding box and of the bounding box in BEV view. The 3D bounding box accuracy improves by 0.339% and 0.746% at the "Moderate" and "Hard" levels, respectively, and the bounding box accuracy in BEV view improves by 0.118%, 0.565%, and 1.349% at the "Easy", "Moderate", and "Hard" levels, respectively.
As shown in Table 5, when using AP|R40, the 3D bounding box accuracy improves by 2.362% and 2.467% at the "Moderate" and "Hard" levels, respectively. The bounding box accuracy in BEV view improves by 1.778%, 1.906%, and 2.367% at the "Easy", "Moderate", and "Hard" levels, respectively.

Revised Group Aggregation Module
As shown in Table 4, this module improves BBox AP by 0.316% and 0.376% at the "Moderate" and "Hard" difficulty levels, respectively. Additionally, compared with the baseline network, the module improves other metrics such as 3D bounding box accuracy and orientation angle accuracy.
As shown in Table 5, when AP|R40 is used, BBox AP improves by 2.051% and 0.555% at the "Moderate" and "Hard" levels, respectively.

Detection Effect
Figure 6 shows the actual detection effect. Although a small number of objects are still missed, most of the vehicles are detected and the accuracy of the 3D bounding boxes is high.

Discussion
In this paper, we continue to explore the possibilities of point-based 3D object detection. Point cloud data are vast and contain a wealth of information, both useful and redundant. We believe that there is still underutilized information within the point cloud. Therefore, we proposed SAE3D. The results demonstrate that extracting more useful information and enhancing the relevant information in the point cloud can improve the final detection accuracy.

Conclusions
In this paper, we proposed SAE3D with three enhancement modules: an initial feature fusion module, a key points self-feature enhancement module, and a revised group aggregation module. We provide a detailed description of the design ideas and implementation of these modules. We conducted testing using the KITTI and nuScenes datasets and designed ablation experiments on the KITTI dataset to analyze the enhancement of each module in detail. The results demonstrate that all three enhancement modules contribute to improving detection accuracy. Our SAE3D suggests that there are still useful characteristics in point clouds that are not fully utilized, and that some of them can help extract information from point clouds more effectively. We believe that exploring additional potential characteristics of point clouds can further enhance 3D scene understanding.


Figure 1. Overview of related work.

Figure 2. Overall flowchart. The raw point cloud P goes through the initial feature fusion module to obtain P1, which is input to the backbone; the backbone consists of three SA (set abstraction) layers. P1 is first downsampled, then passed through the KSFE (key points self-feature enhancement) module to obtain K1, and finally through the RGA (revised group aggregation) module to obtain K2.

Figure 3. Initial features fusion module (where N stands for the number of input point clouds and F stands for the number of feature layers for each point in the output; "C" represents the stitching operation and "+" represents the numerical summing operation).

Figure 5. Revised group aggregation module (points with similar colors in the figure represent similar features, PN is the point obtained by the first selection around the key point, PNk is the point obtained by the second selection, and Poutput is the final point output after group aggregation).

Sensors 2024 , 15 Figure 6 .
Figure 6. Actual detection effect diagram in the KITTI dataset (left: pictures of the real scenes; right: the 3D bounding boxes predicted in the point cloud).

Table 1. The detection results of 3D AP for "Car" in KITTI.

Table 3. Enhancement effectiveness. Abbreviation: SAE3D, the distance-feature-based enhancement network proposed in this paper.

Table 4. Comparison of the general accuracy enhancement effect of different modules. Abbreviations: initial feature fusion module (I), KSFE module (K), and RGA module (F).

Table 5. Comparison of the AP|R40 enhancement effect of different modules. Abbreviations: initial feature fusion module (I), KSFE module (K), and RGA module (F).