AV PV-RCNN: Improving 3D Object Detection with Adaptive Deformation and VectorPool Aggregation



Introduction
With the continuous development of application fields such as autonomous driving, robotics and augmented reality, the demand for accurate, fast and reliable 3D object detection technology is becoming more and more urgent. Three-dimensional object detection aims to identify and locate different types of objects from a three-dimensional scene. Compared with traditional 2D object detection, 3D object detection can provide more accurate object position information and is also more robust because it is not easily disturbed by factors such as light and shadow. However, 3D object detection faces many technical challenges. Due to the sparsity and irregularity of point cloud data, it is very challenging to directly apply 2D object detection technology to process point clouds.
To address these challenges, existing 3D detection methods can be divided into three categories according to how they represent point clouds: grid-based methods, point-based methods, and methods that combine the two. Grid-based methods [1][2][3][4][5][6][7] usually convert irregular point clouds into regular representations, such as 3D voxels or 2D bird's-eye-view maps, which can be processed efficiently by 3D or 2D convolutional neural networks (CNNs) to learn features for 3D detection. Point-based methods [8][9][10][11][12][13][14] usually extract discriminative features directly from raw point clouds. Grid-based methods are generally more computationally efficient, but they suffer from information loss, which may reduce the localization accuracy of fine details. Point-based methods, in contrast, are usually more computationally expensive, but with point set abstraction a larger receptive field is easily obtained.
The main contributions of this work are as follows:
• Replacing the set abstraction of the Voxel Set Abstraction Module in PV-RCNN with the adaptive deformation module. Through the adaptive deformation module, the keypoints can be aligned with the most distinctive areas, and the most prominent features of objects of different scales can be adaptively gathered and focused, so that the model copes better with uneven point cloud density.
• Replacing the set abstraction in the RoI-grid pooling module in PV-RCNN with the VectorPool aggregation module. The VectorPool aggregation module uses independent kernel weights and channels to encode position-sensitive local features in different regions centered on grid points, which not only preserves the spatial structure of grid points well, but also reduces the consumption of computing resources.
• Using the context fusion module to perform feature selection on keypoints obtained directly from PointNet++ in PV-RCNN. Representative and discriminative features can be dynamically selected from local evidence through the context gating mechanism, which adaptively highlights relevant contextual features and thus facilitates the refinement of more accurate 3D candidate boxes.

Related Work
3D Object Detection with Grid-based Methods. To deal with irregular point clouds, most methods map the point cloud onto a bird's-eye-view grid or voxelize it. MV3D [1] projects point clouds onto a 2D bird's-eye-view grid and places a large number of predefined 3D anchors to generate 3D bounding boxes. AVOD [2] extends MV3D by introducing image features in the proposal generation stage, which improves 3D detection accuracy. VoxelNet [3] first voxelizes the entire point cloud, then uses PointNet [8] to extract the features of each voxel, and finally generates detection boxes through an RPN. Compared with VoxelNet, SECOND [4] replaces the 3D convolution with sparse convolution, which reduces the amount of computation and speeds up training. PointPillars [5] divides the point cloud into pillars (vertical columns), which reduces the number of voxels to be processed and improves detection speed. VoTr [6] is a transformer-based 3D backbone whose attention mechanism operates effectively on both empty and non-empty voxel positions, preserves the spatial position encoding, and can serve as an alternative to standard sparse convolutional layers. CenterPoint [7] follows the anchor-free idea: the first stage predicts the center point of each 3D box and regresses its size, orientation and velocity, while the second stage uses the center point features to regress the score of the detection box and refine it.
3D Object Detection with Point-based Methods. Instead of converting point clouds to grids, PointNet [8] and PointNet++ [9] process the point cloud directly for classification and segmentation, and can also serve as backbone networks for feature extraction. F-PointNet [10] was the first to use frustums for 3D object detection: candidate regions are generated by a 2D object detector, a 3D viewing frustum is extracted for each region using its depth information, the 3D instance is segmented by PointNet [8], and the coordinates are then transformed by a T-Net so that the central axis of the viewing frustum is orthogonal to the image plane before the final 3D bounding box is predicted. The first stage of PointRCNN [11] uses PointNet++ [9] to extract point cloud features, segments the point cloud of the entire scene into foreground and background points, and generates a small number of 3D candidate boxes from the foreground points; the second stage converts the pooled points of each candidate box into canonical coordinates, so as to better learn local spatial features for refining the 3D boxes. STD [12] uses point-based spherical anchors to generate more accurate candidate boxes, which reduces the number of anchors and greatly reduces the amount of computation. VoteNet [13] proposes a Hough voting strategy to better group object features. 3DSSD [14] adopts farthest point sampling based on feature distance, which excludes a large number of background points by exploiting semantic information. To avoid the redundancy caused by sampling on feature distance alone, it combines this with farthest point sampling based on Euclidean distance, and it removes the time-consuming FP (feature propagation) module and refinement module used with PointNet++, which greatly reduces the computational cost.
3D Object Detection with Combined Grid-based and Point-based Methods. PV-RCNN [15] combines voxel-based feature learning and point-based feature learning to generate high-quality 3D candidate boxes and to capture more accurate contextual information through flexible receptive fields, thereby improving 3D detection performance. Deformable PV-RCNN [16] improves the detection of long-distance objects by gathering unevenly distributed context information. PV-RCNN++ [17] uses sectorized proposal-centric sampling and local point feature aggregation, which improves both the inference speed and the detection performance of the model.

Method
In this section, we introduce the network structure of our model, which builds on PV-RCNN, as shown in Figure 1.
Figure 1. AV PV-RCNN network structure diagram.
We designed a network that adapts to different object sizes and reduces the consumption of computing resources. First, the set abstraction used for voxel feature extraction in PV-RCNN is replaced with an adaptive deformation module, which aggregates instance features of objects of different scales onto the keypoints. Second, the set abstraction used to aggregate keypoint features in the second stage of PV-RCNN is replaced by the VectorPool aggregation module, which explicitly encodes the spatial structure of the keypoint features. Finally, the context fusion module filters the keypoint features obtained directly from PointNet++ in PV-RCNN, dynamically selecting representative and discriminative features from local evidence and adaptively highlighting relevant contextual features. In the remainder of this section, we first give a brief introduction to the original PV-RCNN model and then describe in detail the voxel set abstraction module, deformable convolution, the adaptive deformation module, the context fusion module, the RoI-grid pooling module, and the VectorPool aggregation module.

PV-RCNN
PV-RCNN is the baseline model of our work. As shown in Figure 2, it is a two-stage 3D object detection model that combines grid-based and point-based representations. PV-RCNN uses 3D sparse convolution as the backbone for feature extraction and 3D candidate box generation. To make full use of the features of the entire scene, PV-RCNN proposes two steps: voxel-to-keypoint feature encoding and keypoint-to-grid-point feature extraction. Voxel-to-keypoint feature encoding is realized by the Voxel Set Abstraction Module. The 3D sparse convolution downsamples the voxelized point cloud by factors of 1, 2, 4, and 8, and the voxel features at each level represent the whole scene. The keypoints are obtained by applying the farthest point sampling (FPS) algorithm directly to the point cloud, and they use the set abstraction of PointNet++ to aggregate the voxel features at each downsampling level, so that the voxel features of the whole scene are encoded into a small set of keypoints. Keypoint-to-grid-point feature extraction is realized by the RoI-grid pooling module. For each generated 3D candidate box, 6 × 6 × 6 grid points are sampled, and each grid point uses the set abstraction of PointNet++ to aggregate the features of the surrounding keypoints, so that the grid points carry rich features of the entire scene. Finally, PV-RCNN predicts the final 3D boxes and their confidence from the grid point features.

Voxel Set Abstraction Module
The voxel set abstraction (VSA) module encodes the voxel features produced by the 3D sparse convolution into a set of keypoints; that is, each keypoint uses the set abstraction operation proposed in PointNet++ [9] to aggregate voxel features at multiple scales. Specifically, the farthest point sampling (FPS) algorithm is used to sample n keypoints $K = \{p_1, \cdots, p_n\}$ from the entire point cloud P, where n = 2048 for the KITTI dataset. FPS distributes the keypoints evenly over the entire point cloud, so that they represent the entire scene. Each keypoint is surrounded by voxel features obtained through the 3D sparse convolution, and it uses the set abstraction of PointNet++ directly to extract multi-scale features from these surrounding voxel features.
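The sampling step can be illustrated with a minimal pure-Python sketch of greedy farthest point sampling. It is an illustration under simple assumptions, not the GPU implementation used in PV-RCNN; the function names and the choice of the first point as the seed are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points, n_samples, start_index=0):
    """Return indices of n_samples points chosen greedily to be far apart.

    Assumption: the seed is points[start_index]; real implementations
    may choose the seed differently.
    """
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and len(points)")
    chosen = [start_index]
    # nearest[i] = squared distance from point i to its closest chosen point
    nearest = [squared_distance(p, points[start_index]) for p in points]
    while len(chosen) < n_samples:
        # The next keypoint is the point farthest from everything chosen so far.
        next_index = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(next_index)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], squared_distance(p, points[next_index]))
    return chosen


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
keypoint_indices = farthest_point_sampling(cloud, 3)
```

On this toy cloud the distant point at x = 10 is selected before the points clustered near the origin, which is the even-coverage behavior the keypoints rely on.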
Let $F^{(l_k)} = \{f_1^{(l_k)}, \cdots, f_{N_k}^{(l_k)}\}$ denote the set of non-empty voxel feature vectors in the k-th layer of the 3D sparse convolution, and let $V^{(l_k)} = \{v_1^{(l_k)}, \cdots, v_{N_k}^{(l_k)}\}$ denote the 3D coordinates of these non-empty voxels, where $N_k$ is the number of non-empty voxels in the k-th layer. For each keypoint $p_i$, we first search for the non-empty voxels within radius $r_k$ in the k-th layer to obtain the set of voxel-wise feature vectors of $p_i$:

$$S_i^{(l_k)} = \left\{ \left[ f_j^{(l_k)};\ v_j^{(l_k)} - p_i \right]^T \ \middle|\ \left\| v_j^{(l_k)} - p_i \right\|^2 < r_k,\ \forall v_j^{(l_k)} \in V^{(l_k)},\ \forall f_j^{(l_k)} \in F^{(l_k)} \right\} \quad (1)$$

where the local relative coordinates $v_j^{(l_k)} - p_i$ are concatenated to the voxel feature $f_j^{(l_k)}$. The voxel features in the neighboring voxel set $S_i^{(l_k)}$ of $p_i$ are then transformed by the set abstraction of PointNet++ [9] to generate the feature of keypoint $p_i$:

$$f_i^{(pv_k)} = \max\left\{ G\left( M\left( S_i^{(l_k)} \right) \right) \right\} \quad (2)$$

where M(·) represents random sampling of at most $T_k$ voxels from the neighboring voxel set $S_i^{(l_k)}$ to save computation, G(·) represents a simple MLP network, and max{·} represents max pooling over all neighboring voxel features in $S_i^{(l_k)}$ along each feature channel.
Finally, by concatenating the keypoint features of all layers, the final feature of keypoint $p_i$ is obtained:

$$f_i^{(pv)} = \left[ f_i^{(pv_1)}, f_i^{(pv_2)}, f_i^{(pv_3)}, f_i^{(pv_4)} \right], \quad i = 1, \cdots, n \quad (3)$$

where $f_i^{(pv_k)}$ is the feature of $p_i$ aggregated from the k-th layer. This completes the voxel set abstraction module. When the set abstraction of PointNet++ [9] extracts features from the surrounding neighborhood, multiple scales are used to aggregate the surrounding voxel features. Although this achieves good results, it cannot fully adapt to different object scales, different point cloud densities, clutter, and similar conditions, so some objects are not detected. For example, keypoints are sometimes far from the object or from its center. The features extracted at such keypoints represent the shape of the object poorly and do not adapt to objects of different sizes, which can easily cause wrong detections and reduce accuracy.
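The per-keypoint aggregation described in this section (ball query, concatenation of relative coordinates, then channel-wise max pooling) can be sketched as follows. This is a simplified illustration: the function name is hypothetical, `mlp` is a stand-in for the learned network G (identity when omitted), truncation to `max_neighbors` replaces the random sampling M, and `radius` is interpreted as a plain Euclidean radius.

```python
def aggregate_voxel_features(keypoint, voxel_centers, voxel_features,
                             radius, max_neighbors=None, mlp=None):
    """Pool the features of non-empty voxels that lie around one keypoint."""
    neighbors = []
    for center, feature in zip(voxel_centers, voxel_features):
        offset = [c - k for c, k in zip(center, keypoint)]
        if sum(o * o for o in offset) < radius * radius:
            # Concatenate the voxel feature with its position relative to the keypoint.
            neighbors.append(list(feature) + offset)
    if max_neighbors is not None:
        neighbors = neighbors[:max_neighbors]  # deterministic stand-in for M
    if mlp is not None:
        neighbors = [mlp(row) for row in neighbors]  # stand-in for G
    if not neighbors:
        return None  # no non-empty voxel inside the ball
    # Channel-wise max pooling over the neighbor set.
    return [max(channel) for channel in zip(*neighbors)]


centers = [(0.5, 0.0, 0.0), (0.0, 0.2, 0.0), (5.0, 5.0, 5.0)]
features = [[1.0, 2.0], [3.0, 0.5], [9.0, 9.0]]
pooled = aggregate_voxel_features((0.0, 0.0, 0.0), centers, features, radius=1.0)
```

Running the same function once per sparse-convolution layer, each with its own radius, and concatenating the results gives the multi-layer keypoint feature; the far voxel in the example is ignored because it lies outside the ball.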

Deformable Convolution
In 2D object detection, deformable convolution has shown its powerful ability. It adaptively shifts the sampling positions toward locations with richer feature information, so that it samples more informative features and adapts to objects of different sizes, as shown in Figure 3. The core of 2D convolution is the convolution kernel R, which is used to sample the feature map x and determines the size of the receptive field. For example, a 3 × 3 kernel is

$$R = \{(-1, -1), (-1, 0), \cdots, (0, 1), (1, 1)\} \quad (4)$$

For each position $p_0$ on the output feature map y, we have

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \quad (5)$$

where $p_n$ enumerates the positions in the convolution kernel R and w denotes the kernel weights.
In deformable convolution, the convolution kernel R is augmented with the offsets $\{\Delta p_n \mid n = 1, \cdots, N\}$ to obtain new sampling positions, where N = |R|. Equation (5) becomes:

$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \quad (6)$$

The kernel now samples at the positions $p_n + \Delta p_n$. Since the offset $\Delta p_n$ is usually fractional, Equation (6) is realized by bilinear interpolation as:

$$x(p) = \sum_{q} E(q, p) \cdot x(q) \quad (7)$$

where p represents any (fractional) position ($p = p_0 + p_n + \Delta p_n$ in Equation (6)), q enumerates all integral spatial positions in the feature map x, and E(·, ·) is the bilinear interpolation kernel. Note that E is two-dimensional. It is split into two 1D kernels:

$$E(q, p) = e(q_x, p_x) \cdot e(q_y, p_y) \quad (8)$$

where $e(a, b) = \max(0, 1 - |a - b|)$. The offsets are obtained by applying a convolutional layer and can take many different forms; Figure 4 lists three of them. The output offset field has 2N channels, corresponding to N two-dimensional offsets.
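To make the fractional sampling concrete, the following pure-Python sketch evaluates a 3 × 3 deformable convolution at a single output position, using the separable bilinear kernel e(a, b) = max(0, 1 − |a − b|). It is a toy illustration: the offsets are passed in by hand rather than predicted by a convolutional layer, positions outside the feature map contribute zero, and all names are hypothetical.

```python
def e(a, b):
    """1D bilinear kernel: nonzero only within one pixel of b."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(feature_map, px, py):
    """x(p) = sum over integer positions q of E(q, p) * x(q)."""
    total = 0.0
    for qy, row in enumerate(feature_map):
        for qx, value in enumerate(row):
            total += e(qx, px) * e(qy, py) * value
    return total


# 3 x 3 kernel positions R, listed row by row as (dx, dy).
KERNEL = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deformable_conv_at(feature_map, weights, offsets, p0):
    """y(p0) = sum over n of w(p_n) * x(p0 + p_n + delta_p_n)."""
    x0, y0 = p0
    total = 0.0
    for (dx, dy), w, (ox, oy) in zip(KERNEL, weights, offsets):
        total += w * bilinear_sample(feature_map, x0 + dx + ox, y0 + dy + oy)
    return total


feature_map = [[0.0, 1.0, 2.0],
               [3.0, 4.0, 5.0],
               [6.0, 7.0, 8.0]]
center_only = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
no_offsets = [(0.0, 0.0)] * 9
shifted = list(no_offsets)
shifted[4] = (0.5, 0.5)  # move the centre tap half a pixel right and down

plain = deformable_conv_at(feature_map, center_only, no_offsets, (1, 1))
deformed = deformable_conv_at(feature_map, center_only, shifted, (1, 1))
```

With zero offsets the centre tap reads the value at (1, 1); with the half-pixel offset it reads the average of the four surrounding values, which shows how a learned offset moves the sampling location without changing the kernel weights.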

Adaptive Deformation Module
The adaptive deformation module extends the core principle of deformable convolution to 3D, so that the keypoints can adaptively learn the characteristics of objects of different scales. As shown in Figure 5, in 3D the keypoints take the place of the regular-grid sampling positions used in 2D. First, each keypoint collects the non-empty voxels in its neighborhood and learns an offset and a new feature from them; this new feature is the feature of the deformed keypoint. The learned offset is then added to the original keypoint coordinates to obtain the deformed keypoint coordinates. Finally, the set abstraction of PointNet++ used in PV-RCNN extracts features at the deformed keypoints to obtain the final deformed keypoint features. Specifically, the n sampled keypoints $\{[v_i, f_i]\}_{i=1}^{n}$ have 3D positions $v_i$ and feature vectors $f_i$ corresponding to each of the layers Conv1, Conv2, Conv3 and Conv4, and our module computes the updated feature $f_i'$ as follows:

$$f_i' = \frac{1}{n} \, \mathrm{ReLU}\left( \sum_{j \in N(i)} W_{off} (f_i - f_j) \cdot (v_i - v_j) \right)$$

where $N(i)$ denotes the non-empty voxels in the neighborhood of the i-th keypoint, and $W_{off}$ is a weight matrix for learning keypoint offsets. Then, we obtain the new deformed keypoint positions as

$$v_i' = v_i + \tanh\left( W_{align} f_i' \right)$$

where $W_{align}$ is a weight matrix for learning keypoint position alignment. This is similar to the alignment in Mesh R-CNN [19] and PointDAN [20]. After obtaining the deformed keypoints and their features, we use the set abstraction of PointNet++ [9] in PV-RCNN to extract features at the deformed keypoints.

Context Fusion Module
The context fusion module uses a context gating mechanism to select the more representative point cloud features. It applies two independent linear layers to the keypoint features: the output of one layer is passed through a sigmoid function, while the output of the other is not. Multiplying the two streams strengthens the prominent features and suppresses the inconspicuous ones, which provides more representative features for the subsequent refinement of candidate boxes, as shown in Figure 6. Specifically, given the keypoint feature $f_i$, the gating weights are obtained as $g = \sigma(W_{gate} f_i + b_{gate})$, and the context-gated features are obtained as $f_i^g = g \odot (W_{fc} f_i)$, where ⊙ denotes element-wise multiplication and $W_{gate}$, $b_{gate}$, $W_{fc}$ are parameters learned from the data.
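The gating computation can be sketched in a few lines of pure Python. This is a minimal illustration with hand-set parameters rather than learned ones; the helper names `linear`, `sigmoid` and `context_gate` are hypothetical, and weights are stored as lists of rows.

```python
import math


def linear(weight, bias, x):
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def context_gate(f, w_gate, b_gate, w_fc):
    """Return g * (W_fc f) element-wise, with g = sigmoid(W_gate f + b_gate)."""
    gate = [sigmoid(z) for z in linear(w_gate, b_gate, f)]
    transformed = linear(w_fc, [0.0] * len(w_fc), f)  # this stream has no sigmoid
    return [g * t for g, t in zip(gate, transformed)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
feature = [1.0, -2.0]

half_open = context_gate(feature, zeros, [0.0, 0.0], identity)
selective = context_gate(feature, zeros, [100.0, -100.0], identity)
```

In the second call the gate is close to 1 for the first channel and close to 0 for the second, so the first channel passes through almost unchanged while the second is suppressed, which is the feature selection behavior described above.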

RoI-Grid Pooling Module
The RoI-grid pooling module uses the 3D candidate boxes generated from the 3D sparse convolution features to select a fixed number of grid points. Each grid point uses the set abstraction of PointNet++ [9] to aggregate the keypoint features in its neighborhood. Because the keypoint features contain rich information about the point cloud scene, each grid point also carries rich information, as shown in Figure 7.
Specifically, 6 × 6 × 6 grid points are uniformly sampled in each 3D candidate box, denoted as $G = \{g_1, \cdots, g_{216}\}$. Through the set abstraction of PointNet++, the keypoint features $F = \{f_1^{(p)}, \cdots, f_n^{(p)}\}$ are aggregated into the grid points. More precisely, we first determine the neighboring keypoints of grid point $g_i$ within radius r as:

$$\Psi = \left\{ \left[ f_j^{(p)};\ p_j - g_i \right]^T \ \middle|\ \left\| p_j - g_i \right\|^2 < r,\ \forall p_j \in K,\ \forall f_j^{(p)} \in F \right\}$$

where $p_j - g_i$ denotes the local relative position of the feature $f_j^{(p)}$ of keypoint $p_j$. Then, we use the set abstraction of PointNet++ [9] to aggregate the neighboring keypoint feature set Ψ and generate the feature of grid point $g_i$:

$$f_i^{(g)} = \max\left\{ G\left( M\left( \Psi \right) \right) \right\}$$

where M(·) represents random sampling of at most $T_k$ keypoints from the neighboring set Ψ to save computation, G(·) represents a simple MLP, and max{·} represents max pooling over all neighboring keypoint features in Ψ along each feature channel. After each grid point has obtained rich features from the surrounding keypoints, the features of all grid points in the same candidate box are passed through a two-layer MLP with 256 feature dimensions to obtain the 3D box prediction.
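The uniform placement of the 6 × 6 × 6 grid points can be sketched as follows. The sketch assumes that each grid point sits at the centre of one cell of the box and that the box is parameterized by its centre, size and heading angle about the z-axis; these conventions and the function name are assumptions made for illustration.

```python
import math


def roi_grid_points(center, size, yaw=0.0, grid_size=6):
    """Return grid_size**3 points uniformly placed inside a 3D box."""
    cx, cy, cz = center
    dx, dy, dz = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    points = []
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                # Cell-centre coordinates in the box frame.
                lx = ((i + 0.5) / grid_size - 0.5) * dx
                ly = ((j + 0.5) / grid_size - 0.5) * dy
                lz = ((k + 0.5) / grid_size - 0.5) * dz
                # Rotate about the z-axis by the heading angle, then translate.
                points.append((cx + lx * cos_y - ly * sin_y,
                               cy + lx * sin_y + ly * cos_y,
                               cz + lz))
    return points


grid = roi_grid_points(center=(0.0, 0.0, 0.0), size=(6.0, 6.0, 6.0))
rotated = roi_grid_points(center=(10.0, 2.0, -1.0), size=(3.9, 1.6, 1.56), yaw=0.3)
```

Each of the 216 points then acts as a query location for the neighboring keypoint features, and because the grid is symmetric its centroid coincides with the box centre regardless of the heading angle.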
In this module, the set abstraction operation of PointNet++ captures rich context information with a flexible receptive field; the receptive field can even extend beyond the boundary of a 3D candidate box to capture keypoint features outside it. However, set abstraction is very time-consuming and resource-consuming on large-scale point clouds, because it applies several shared-parameter MLP layers to every local point. In addition, the max pooling in set abstraction discards the spatial distribution of the local points, which greatly impairs the ability of the grid points to gather local features.

VectorPool Aggregation Module
The VectorPool aggregation module is well suited to local feature aggregation in large-scale point cloud scenes. First, the keypoints in the cubic neighborhood centered on a grid point are collected, and the neighborhood is divided into multiple sub-voxels. The feature of each sub-voxel is then extracted and encoded with its own kernel weights and channels, which yields local features that are sensitive to position. Finally, all channel features are concatenated into a single vector. This not only preserves the local information of the cubic neighborhood of the grid point, but also avoids the shared-parameter MLP and thus reduces the consumption of computing resources, as shown in Figure 8.
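Let $\mathcal{S} = \{[h_i, a_i] \mid h_i \in \mathbb{R}^{C_{in}}, a_i \in \mathbb{R}^3\}_{i=1}^{M}$ be the set of keypoints after voxel set abstraction, where $h_i$ is the feature and $a_i$ the 3D position of the i-th keypoint, M is the number of keypoints, and $C_{in}$ is the number of feature channels of the keypoints. Let $Q = \{q_k \mid q_k \in \mathbb{R}^3\}_{k=1}^{N}$ be the set of grid points generated from the 3D candidate boxes, where N is the number of grid points. Given a grid point $q_k$, we first determine the set of keypoints in its cubic neighborhood, which can be expressed as:

$$y_k = \left\{ [h_j, a_j] \ \middle|\ [h_j, a_j] \in \mathcal{S},\ \max(a_j - q_k) < 2\delta \right\} \quad (12)$$

where δ is half the side length of the cubic space and $\max(a_j - q_k) \in \mathbb{R}$ is the largest axis-aligned component of the 3D distance. We double the half-length of the cubic space around the grid point to include more keypoints, which benefits the local feature aggregation of this grid point.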
To generate position-sensitive features in the local 3D neighborhood centered on $q_k$, we split this neighborhood into $n_x \times n_y \times n_z$ small local sub-voxels. Inspired by PointNet++, we use an inverse-distance-weighted strategy to interpolate the features of the t-th sub-voxel from its three nearest neighbors in $y_k$, where $t \in \{1, \ldots, n_x \times n_y \times n_z\}$ is the index of the sub-voxel and $v_t \in \mathbb{R}^3$ denotes the corresponding sub-voxel center. We can then generate the features of the t-th sub-voxel as:

$$h_t = \frac{\sum_{i \in G_t} (w_i \cdot h_i)}{\sum_{i \in G_t} w_i}, \quad w_i = \left( \| a_i - v_t \| \right)^{-1} \quad (13)$$

where $[h_i, a_i] \in y_k$ and $G_t$ is the set of indices of the three nearest neighbors of $v_t$ in the neighbor set $y_k$ (i.e., $|G_t| = 3$). The result $h_t$ is the local feature encoding of the t-th local sub-voxel in the local cube.
Features in different local sub-voxels may represent very different local structures. Therefore, instead of encoding local features with a shared-parameter MLP as in PointNet++, we use separate local kernel weights to encode the different local sub-voxels and capture position-sensitive features:

$$U_t = \mathrm{Concat}\left( \{a_i - v_t\}_{i \in G_t},\ h_t \right) \times W_t \quad (14)$$

where $\{a_i - v_t\}_{i \in G_t} \in \mathbb{R}^{3 \times 3 = 9}$ represents the relative positions of the three nearest neighbors of $v_t$, Concat(·) is the concatenation operation that fuses the relative positions and features, and $W_t \in \mathbb{R}^{(9 + C_{in}) \times C_{mid}}$ is the learnable kernel weight of the t-th local sub-voxel, which encodes its feature into $C_{mid}$ channels. Different positions therefore have different learnable kernel weights for encoding position-sensitive local features. Finally, we sort the local sub-voxel features $U_t$ by their spatial order along each 3D axis and concatenate them in that order to generate the final local vector:

U = MLP(Concat(U 1 , U 2 , . . . , U n x ×n y ×n z )) (15)

where $U \in \mathbb{R}^{C_{out}}$. This in-order concatenation encodes structure-preserved local features by simply assigning the features at different locations to their corresponding feature channels, which naturally preserves the spatial structure of the local features in the neighborhood centered at $q_k$. The resulting local vector is then processed by several MLP layers, which encode the local features into $C_{out}$ feature channels for subsequent processing. Compared with set abstraction, the VectorPool aggregation module performs position-sensitive local feature encoding on different regions centered on the grid points through independent kernel weights and channels, which not only preserves the spatial structure of the grid points well, but also reduces the consumption of computing resources.

Dataset Description
To verify the performance of the AV PV-RCNN network structure proposed in this study, we conduct all experiments on the KITTI [21] dataset, which is currently one of the most popular 3D detection datasets for autonomous driving. The KITTI dataset consists of 7481 samples for training and validation and 7518 samples for testing, containing more than 200,000 3D object annotations. The dataset divides 3D objects into 8 categories, such as cars, pedestrians, and cyclists. The label information includes the category, 2D bounding box coordinates, 3D center point coordinates, 3D size, occlusion, truncation, and heading angle. According to the degree of occlusion and truncation, objects are classified into three difficulty levels: easy, moderate and hard. The 7481 training samples are further divided into 3712 samples for the training set and 3769 samples for the validation set. All models in this paper are trained on the training set of 3712 samples, and then tested and visualized on the validation set of 3769 samples.

Evaluation Metrics
We use average precision (AP) to measure the performance of different methods. In the evaluation process, we follow the official KITTI evaluation protocol, that is, an IoU threshold of 0.7 for cars and an IoU threshold of 0.5 for pedestrians and cyclists. Each object category has three difficulty levels (easy, moderate and hard), and AP generally decreases as the difficulty increases. All AP results are calculated with 40 recall positions. In addition, we also use the recall rate, inference speed (FPS), and memory usage (MB) to measure the performance of the model.
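As a rough illustration of AP with 40 recall positions, the sketch below averages the interpolated precision at the recall levels 1/40, 2/40, ..., 1. It assumes that the precision-recall points have already been computed for one class and difficulty level; the function name is hypothetical and the official evaluation code should be used for reported numbers.

```python
def average_precision_r40(recalls, precisions):
    """Mean interpolated precision over 40 equally spaced recall positions."""
    total = 0.0
    for k in range(1, 41):
        threshold = k / 40
        # Interpolated precision: best precision achieved at recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold - 1e-12]
        total += max(candidates, default=0.0)
    return total / 40


perfect = average_precision_r40([1.0], [1.0])
half_recall = average_precision_r40([0.5], [1.0])
mixed = average_precision_r40([0.25, 0.5, 1.0], [1.0, 0.8, 0.4])
```

A detector that never exceeds 50% recall is capped at an AP of 0.5 even with perfect precision, which is why recall is reported alongside AP in the experiments.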

Other Setting
All experiments in this paper are based on the OpenPCDet framework. We trained eight models: PointPillars [5], SECOND [4], PointRCNN [11], PartA2-Net [22], PV-RCNN [15], Deformable PV-RCNN [16], PV-RCNN++ [17], and AV PV-RCNN. All models are trained from scratch. For PV-RCNN and AV PV-RCNN (ours), the initial learning rate is set to 0.01, and the models are trained for 80 epochs with the Adam optimizer and batch_size = 6. We use Ubuntu 20.04 as the operating system, Python 3.7 as the programming language, PyTorch 1.7.1 as the deep learning framework, and an NVIDIA RTX3090 with CUDA 11.0 for training. For the KITTI dataset, the detection range is [0, 70.4] m along the x-axis, [−40, 40] m along the y-axis, and [−3, 1] m along the z-axis, and the point cloud is voxelized with a voxel size of (0.05 m, 0.05 m, 0.1 m) along the three axes.
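The detection range and voxel size determine the resolution of the voxel grid. The short calculation below is only an arithmetic illustration with a hypothetical helper name; it ignores any padding a particular framework may add.

```python
def grid_shape(ranges, voxel_size):
    """Number of voxels along each axis for the given ranges (min, max) in metres."""
    return tuple(round((hi - lo) / size) for (lo, hi), size in zip(ranges, voxel_size))


point_cloud_range = [(0.0, 70.4), (-40.0, 40.0), (-3.0, 1.0)]  # x, y, z in metres
voxel_size = (0.05, 0.05, 0.1)                                   # metres

shape = grid_shape(point_cloud_range, voxel_size)
total_cells = shape[0] * shape[1] * shape[2]
```

The grid has 1408 × 1600 × 40 cells, about 90 million in total, most of which are empty in a LiDAR scan; this is why the backbone relies on sparse convolution over non-empty voxels.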

Algorithm Comparison and Analysis
To verify the detection performance of our AV PV-RCNN model, in this section we compare our method with popular two-stage detectors, namely PointRCNN [11], PartA2-Net [22], PV-RCNN [15], PV-RCNN++ [17] and Deformable PV-RCNN [16], and with the single-stage detectors PointPillars [5] and SECOND [4]. The experimental results are shown in Table 1. Compared with PV-RCNN, our model improves by 5.23%, 3.57%, and 3.48% on the three difficulty levels of the cyclist category, and by 3.76%, 4.59%, and 4.22% on the three difficulty levels of the pedestrian category. Our model improves on PV-RCNN for almost all classes at all three difficulty levels, especially for cyclists and pedestrians. Compared with the remaining models, our model achieves relatively good results on all classes, except at the hard level of the car and cyclist categories. These results indicate that our network adapts to objects of different sizes: it not only shows better detection accuracy on larger objects such as cars, but also shows greater accuracy improvements on smaller objects such as cyclists and pedestrians. To further evaluate the performance of our model, we also analyzed the recall rate, memory usage, and inference speed (FPS). For the recall rate, as shown in Table 2, our model improves on PV-RCNN in almost all categories at the three difficulty levels: recall increases by 3.05% and 1.28% on the moderate and hard levels of the car category, by 1.73% on the easy level of the cyclist category, and by 4.12%, 4.08%, and 3.39% on the three difficulty levels of the pedestrian category. For memory, we analyzed the computing resource usage of our model and PV-RCNN, as shown in Table 3.
We measured GPU memory usage with batch_size = 1 and recorded it at iterations 100, 200, 300, and 400. The average memory usage of PV-RCNN is 8162.25 MB, and the average memory usage of our model is 7232 MB, a saving of about 11.4%, which shows that our model reduces the consumption of computing resources. The inference speed (FPS) measures how fast the model processes data: the larger the FPS, the faster the model processes samples. Both our model and PV-RCNN are trained on an RTX3090. As shown in Table 4, our model processes one more sample per second than PV-RCNN, and its inference speed is 33% higher than that of PV-RCNN.
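The reported saving follows directly from the two average memory figures, as the short check below shows. The last line is a back-of-envelope inference and not a value reported in the text: a gain of one sample per second that corresponds to a 33% speed-up implies a baseline of roughly 3 FPS.

```python
pv_rcnn_mb = 8162.25   # average GPU memory of PV-RCNN, from the text
ours_mb = 7232.0       # average GPU memory of AV PV-RCNN, from the text

saving = (pv_rcnn_mb - ours_mb) / pv_rcnn_mb
saving_percent = round(100 * saving, 1)

# Inferred, not reported: baseline FPS implied by "+1 FPS is a 33% gain".
implied_baseline_fps = 1 / 0.33
```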

Ablation Study
To verify the effectiveness of the adaptive deformation module, the VectorPool aggregation module and the context fusion module in improving detection accuracy, we conducted two sets of ablation experiments, as shown in Table 5 and Table 6, respectively. Table 5 focuses on the adaptive deformation module and the VectorPool aggregation module. The second row shows that after adding the adaptive deformation module, the three difficulty levels of the cyclist category increase by 1.65%, 3.39% and 4.13%, respectively, and the three difficulty levels of the pedestrian category increase by 3.50%, 4.27%, and 3.92%, respectively. The third row shows that after adding the VectorPool aggregation module, the three difficulty levels of the pedestrian category increase by 2.22%, 3.16% and 3.25%, respectively. The fourth row shows that after adding both the adaptive deformation module and the VectorPool aggregation module, the three difficulty levels of the cyclist category increase by 5.23%, 3.57%, and 3.48%, respectively, and the three difficulty levels of the pedestrian category increase by 3.76%, 4.59%, and 4.22%, respectively. These results demonstrate the effectiveness of the adaptive deformation module and the VectorPool aggregation module, especially for cyclists and pedestrians. With the adaptive deformation module alone, the average increase over the three difficulty levels is 3.05% for the cyclist category and 3.89% for the pedestrian category. With both the adaptive deformation module and the VectorPool aggregation module, the average increase over the three difficulty levels is 4.09% for the cyclist category and 4.19% for the pedestrian category.
Therefore, our adaptive deformation module can adaptively gather and focus on the most salient features of objects of different scales, so that the model copes better with uneven point cloud density, and the VectorPool aggregation module, which spatially encodes local features, also plays an important role in improving the performance of the model. To verify whether the context fusion module improves the performance of the model, we conducted a further ablation experiment, as shown in Table 6. After adding the context fusion module, our model improves by 2.00%, 2.86%, and 0.62% on the moderate level of the car, cyclist, and pedestrian categories, respectively. This confirms that the context gating mechanism can dynamically select representative and discriminative features from local evidence and highlight object features, providing relevant contextual information for the refinement stage.
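The averages quoted in the ablation discussion can be checked against the per-level gains. The snippet below is only an arithmetic check; it uses the full-model cyclist gains of 5.23%, 3.57% and 3.48% given in the algorithm comparison, and it compares with a tolerance of 0.01 because the reported averages appear to be truncated rather than rounded.

```python
def mean(values):
    return sum(values) / len(values)


# Gains over PV-RCNN in percentage points (easy, moderate, hard), from the text.
adm_cyclist = [1.65, 3.39, 4.13]
adm_pedestrian = [3.50, 4.27, 3.92]
full_cyclist = [5.23, 3.57, 3.48]
full_pedestrian = [3.76, 4.59, 4.22]

averages = {
    "ADM, cyclist": mean(adm_cyclist),
    "ADM, pedestrian": mean(adm_pedestrian),
    "ADM + VectorPool, cyclist": mean(full_cyclist),
    "ADM + VectorPool, pedestrian": mean(full_pedestrian),
}
```

The computed means agree with the reported 3.05%, 3.89%, 4.09% and 4.19% to within 0.01 percentage points.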
To test the actual detection quality of our model, we compare the detections of our model and PV-RCNN on the KITTI validation set. As shown in Figure 9, we randomly selected two samples for detection; the left side shows the detection result of PV-RCNN, and the right side shows the detection result of our model. The blue boxes indicate the ground-truth boxes, and the green boxes indicate the predicted boxes. It can be clearly seen from

Conclusions
In this paper, we find that the refinement methods used by current two-stage detectors cannot adequately adapt to different object scales, different point cloud densities, partial deformations and clutter, and that they consume excessive resources. We therefore propose a 3D object detection method that adapts to different object scales and aggregates local features with fewer resources. Specifically, we use three modules to improve the performance of the model and reduce the consumption of computing resources: the adaptive deformation module, the VectorPool aggregation module, and the context fusion module. First, the adaptive deformation module aligns the keypoints with the most distinctive areas and adaptively gathers and focuses on the most prominent features of objects of different scales, so that the model copes better with uneven point cloud densities. Second, the VectorPool aggregation module explicitly encodes spatial structure information, which not only improves model performance, but also consumes fewer computing resources. Finally, the context fusion module dynamically selects representative and discriminative features from raw points and highlights object features, providing relevant contextual information for the refinement stage. To test the effectiveness and versatility of these modules, we conducted experiments on single-stage and two-stage object detection algorithms. The experimental results show that the three modules proposed in this paper effectively address the inability of current two-stage detectors to adapt to different object scales, different point cloud densities, partial deformations and clutter, as well as their excessive resource consumption.
Funding: This work was supported by 2023 Central Government guidance for local science and technology development funds (basic research of free exploration) 2023JH6/100100066.

Data Availability Statement:
The data in this paper can be obtained by contacting the authors.

Conflicts of Interest:
The authors declare no conflict of interest.