3.1. Voxel Feature Encoding
The role of voxel feature encoding is to convert raw point cloud data into voxel features. The design paradigm in [8] divides voxelization into two stages: grouping and sampling. Given a point cloud $P = \{p_1, \ldots, p_N\}$, this process assigns the $N$ points to a buffer of size $K \times T \times F$, where $K$ is the maximum number of voxels, $T$ is the maximum number of points in each voxel, and $F$ is the number of feature channels. If the number of voxels, or the number of points within a voxel, exceeds the fixed capacity, subsampling is performed. Conversely, if the number of points or voxels is below the fixed capacity, the unused positions in the buffer are zero-padded. This procedure is referred to as fixed voxelization. Its limitations can be summarized as follows: (1) Points and voxels that exceed the specified capacity are discarded. Because LiDAR point density tends to be higher at locations containing key targets, fixed voxelization forces the model to discard information that is potentially useful for detection. (2) The random discarding of points and voxels can lead to unstable or jittery detection results. Therefore, by assigning a voxel grid coordinate index to each point and combining it with a deduplication function, we implemented dynamic voxelization [25]. This method abandons sampling points into a fixed number of voxels with fixed capacity and instead preserves the complete mapping between points and voxels. Consequently, the number of voxels and the number of points in each voxel are dynamic, depending on the mapping function and the actual distribution of the point cloud data.
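The point-to-voxel mapping behind dynamic voxelization can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this work: the function name `dynamic_voxelize`, its arguments, and the use of `np.unique` as the deduplication function are assumptions made for the example.

```python
import numpy as np

def dynamic_voxelize(points, pc_range_min, voxel_size):
    """Map every point to a voxel without dropping or padding anything.

    points: (N, F) array whose first three columns are x, y, z.
    pc_range_min: (3,) lower corner of the point cloud range (assumed input).
    voxel_size: (3,) edge lengths of one voxel (assumed input).
    """
    # Integer grid coordinate of the voxel that contains each point.
    grid = np.floor((points[:, :3] - pc_range_min) / voxel_size).astype(np.int64)
    # Deduplicate the grid coordinates: `voxels` lists each occupied voxel once,
    # `point_to_voxel` gives, for every point, the row of its voxel in `voxels`.
    voxels, point_to_voxel, counts = np.unique(
        grid, axis=0, return_inverse=True, return_counts=True)
    return voxels, point_to_voxel.reshape(-1), counts

# Three points with (x, y, z, reflectivity); the first two share a voxel.
points = np.array([[0.1, 0.1, 0.1, 0.5],
                   [0.2, 0.3, 0.1, 0.7],
                   [1.6, 0.1, 0.1, 0.2]])
voxels, point_to_voxel, counts = dynamic_voxelize(
    points, np.zeros(3), np.array([0.5, 0.5, 0.5]))
```

Neither the number of voxels nor the number of points per voxel is fixed in advance here; both follow from the data, which is the property the text attributes to dynamic voxelization.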
In addition, to obtain more fine-grained information from the point cloud, we introduced additional point features during the voxel feature encoding stage. Let the coordinates of a point in the coordinate system be $p = (x, y, z)$, and let the mean of $p$ over all points in the voxel it belongs to be $\bar{p}$; $\bar{p}$ can also be understood as the centroid of the voxel. Subtracting $\bar{p}$ from $p$ yields $f_c$:
$$f_c = p - \bar{p} \tag{1}$$
where $f_c$ is the offset of the point from the centroid of its own voxel. Let the geometric center of the voxel to which the point belongs be $v$ (which can be derived from the point cloud range and the chosen voxel size). Subtracting $v$ from $p$ gives $f_v$:
$$f_v = p - v \tag{2}$$
where $f_v$ is the offset of the point from the geometric center of the voxel it belongs to. The Euclidean distance $d$ from the point to the origin of the coordinate system is:
$$d = \lVert p \rVert_2 \tag{3}$$
where $\lVert \cdot \rVert$ denotes the norm and the subscript 2 selects the L2 norm, which is the Euclidean distance. We concatenated the additional point features with the original point features to obtain the final point feature $f$:
$$f = \mathrm{Concat}(x, y, z, r, f_c, f_v, d) \tag{4}$$
where $\mathrm{Concat}$ is the channel concatenation operation along the feature dimension, and $x$, $y$, $z$, $r$ are the original features representing the $x$-coordinate, $y$-coordinate, $z$-coordinate, and reflectivity of the point $p$, respectively. Reflectivity is the intensity of the laser pulse emitted by the LiDAR after it is reflected back from the surface of an object; it is stored as a floating-point value and reflects the laser reflection characteristics of that surface. The ultimate goal of voxel feature encoding is to obtain the features of each voxel. After obtaining the complete point feature $f$, a Feed-Forward Network was used to derive the corresponding voxel features, in a manner similar to the max-pooling operation in [20].
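The feature augmentation described above can be sketched in NumPy as follows. The function names, the argument layout, and the toy data are assumptions for illustration. The learned Feed-Forward Network is omitted, so `voxel_max_pool` only shows the per-voxel max-pooling step and is not the authors' encoder.

```python
import numpy as np

def augment_point_features(points, point_to_voxel, voxel_coords,
                           pc_range_min, voxel_size):
    """Append centroid offset, voxel-center offset and range to each point.

    points: (N, 4) array of x, y, z, reflectivity.
    point_to_voxel: (N,) index of the voxel each point belongs to.
    voxel_coords: (V, 3) integer grid coordinates of the occupied voxels.
    """
    xyz = points[:, :3]
    num_voxels = voxel_coords.shape[0]
    # Per-voxel centroid: sum the member points, then divide by the count.
    sums = np.zeros((num_voxels, 3))
    np.add.at(sums, point_to_voxel, xyz)
    counts = np.bincount(point_to_voxel, minlength=num_voxels)
    centroids = sums / counts[:, None]
    f_cluster = xyz - centroids[point_to_voxel]
    # Geometric voxel center derived from the range and the voxel size.
    centers = pc_range_min + (voxel_coords + 0.5) * voxel_size
    f_center = xyz - centers[point_to_voxel]
    # Euclidean distance from each point to the origin.
    dist = np.linalg.norm(xyz, axis=1, keepdims=True)
    return np.concatenate([points, f_cluster, f_center, dist], axis=1)

def voxel_max_pool(point_features, point_to_voxel, num_voxels):
    """Reduce per-point features to one feature per voxel by max-pooling."""
    pooled = np.full((num_voxels, point_features.shape[1]), -np.inf)
    np.maximum.at(pooled, point_to_voxel, point_features)
    return pooled

points = np.array([[0.1, 0.1, 0.1, 0.5],
                   [0.2, 0.3, 0.1, 0.7],
                   [1.6, 0.1, 0.1, 0.2]])
point_to_voxel = np.array([0, 0, 1])
voxel_coords = np.array([[0, 0, 0], [3, 0, 0]])
feats = augment_point_features(points, point_to_voxel, voxel_coords,
                               np.zeros(3), np.full(3, 0.5))
pooled = voxel_max_pool(feats, point_to_voxel, 2)
```

Each point ends up with 4 + 3 + 3 + 1 = 11 channels, matching the seven concatenated terms of the final point feature.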
3.2. Variable Sparse Convolution Network (VS-Conv)
(1) Residual Structure Design: The underlying structure of VS-Conv is still composed of regular sparse convolution operators and submanifold sparse convolution operators, which preserves the applicability and efficiency of the convolutional backbone network. However, to ease the training difficulties caused by very deep sparse convolution networks and to enable the network to learn more complex feature representations, we introduced a residual structure [41]. Given the input voxel features $F \in \mathbb{R}^{N \times C}$ (with $N$ being the number of features and $C$ being the feature channel length), the residual regular sparse convolution is computed as follows:
$$F_{\mathrm{out}} = \mathrm{SpConv}(F) \oplus \mathrm{SpConv}_{1}(F) \tag{5}$$
where $\mathrm{SpConv}$ represents the regular sparse convolution operation, $\oplus$ represents the residual connection, and $F_{\mathrm{out}}$ represents the features obtained from the residual regular sparse convolution operator. Regular sparse convolution can not only alter the number of feature channels of the input $F$ but also spatially downsample the original data and dilute its sparsity. Therefore, a regular sparse convolution $\mathrm{SpConv}_{1}$ with a kernel size of 1 is needed on the shortcut branch to downsample the original data $F$ with the same stride and to rescale the number of feature channels. Because sparse convolution stores its data in a hash table rather than in a dense tensor of the full spatial size, the features cannot be added directly in the residual connection. Instead, a sparse addition function must be used to add features at the same positions based on the coordinate indices of the sparse convolution tensors.
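The sparse addition used in the residual connection can be illustrated with a small NumPy sketch. Representing a sparse tensor as a pair of coordinate and feature arrays, and the name `sparse_add`, are assumptions for this example. A sparse convolution library would keep the coordinates in its own hash-table structure instead.

```python
import numpy as np

def sparse_add(coords_a, feats_a, coords_b, feats_b):
    """Add two sparse tensors given as (coordinates, features) pairs.

    Features at identical coordinates are summed; a coordinate present in
    only one input keeps that input's feature.
    """
    coords = np.concatenate([coords_a, coords_b], axis=0)
    feats = np.concatenate([feats_a, feats_b], axis=0)
    # Group rows by coordinate, then accumulate the features of each group.
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    out = np.zeros((uniq.shape[0], feats.shape[1]), dtype=feats.dtype)
    np.add.at(out, inverse.reshape(-1), feats)
    return uniq, out

coords_a = np.array([[0, 0, 0], [1, 0, 0]])
feats_a = np.array([[1.0], [2.0]])
coords_b = np.array([[1, 0, 0], [2, 0, 0]])
feats_b = np.array([[10.0], [20.0]])
sum_coords, sum_feats = sparse_add(coords_a, feats_a, coords_b, feats_b)
```

Only the shared coordinate (1, 0, 0) receives contributions from both branches; the other two positions pass through unchanged.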
Similarly, the residual submanifold sparse convolution is computed as follows:
$$F_{\mathrm{out}} = \mathrm{SubMConv}(F) + F \tag{6}$$
where $\mathrm{SubMConv}$ represents the submanifold sparse convolution operation, $+$ represents the residual connection, and $F_{\mathrm{out}}$ represents the features obtained from the residual submanifold sparse convolution operator. Submanifold sparse convolution does not alter the number of feature channels of the input $F$, nor does it change the spatial size or sparsity of the input data. Because of this, $\mathrm{SubMConv}(F)$ and $F$ can be added together directly in the residual connection.
(2) Variable Sparse Convolution Computation Module: Variable sparse convolution is an extension of the submanifold sparse convolution functionality. Before performing submanifold sparse convolution computations, variable sparse convolution expands the effective input locations through a preprocessing step. This creates bridges between non-adjacent positions, enabling feature exchange and expanding the receptive field during subsequent convolution calculations.
Let the input voxel features be $F$, with sparse tensor coordinate index $\mathrm{Idx}$ and kernel size $k$. Variable sparse convolution first applies a submanifold sparse convolution $\mathrm{SubMConv}_{p}$ to predict how the effective input positions should be expanded. The number of output channels of $\mathrm{SubMConv}_{p}$ is $2(k^3 - 1) + 1$. The channels of the output $O = \mathrm{SubMConv}_{p}(F)$ have different meanings: channels $1$ to $k^3 - 1$ hold the probabilities that the positions in the $k$-sized cube surrounding the current effective input position can be expanded into effective positions (the effective input position itself need not be predicted, so the count is $k^3 - 1$); channels $k^3$ to $2(k^3 - 1)$ hold the importance values of those same surrounding positions (again $k^3 - 1$ values); and the last channel holds the probability that the current effective input position is allowed to be expanded.
Extract the expansion probabilities $P_e$, the importance values $V$, and the allow-expansion probability $P_a$ in channel order:
$$P_e = O[1 : k^3 - 1], \quad V = O[k^3 : 2(k^3 - 1)], \quad P_a = O[2k^3 - 1] \tag{7}$$
where $O[\cdot]$ represents the operation that retrieves the positional feature data from the corresponding channels.
Pass $P_e$, $V$, and $P_a$ through the $\mathrm{Sigmoid}$ function to convert them into values between 0 and 1. Using the top-k method, sort the values of $P_a$ in descending order and select the Boolean masks $M_{\mathrm{top}}$ and $M_{\mathrm{bot}}$ for the top 50% and bottom 50% of the sequence, respectively. The indices where $M_{\mathrm{top}}$ is True are the positions identified as needing expansion, which divides the original voxel features into two parts, those to be expanded and those not to be expanded:
$$F_{e} = F[M_{\mathrm{top}}], \quad F_{n} = F[M_{\mathrm{bot}}] \tag{8}$$
where $F[M]$ is the value extracted using a Boolean mask $M$.
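The channel split and the top-50% selection can be sketched in NumPy as below. The function names are hypothetical, the prediction is replaced by random numbers, and rounding the top half down for an odd number of positions is an assumption. The sketch only illustrates the channel bookkeeping, not the learned convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_prediction(pred, k):
    """Split the (N, 2*k**3 - 1) prediction into its three channel groups."""
    m = k ** 3 - 1                           # neighbours in the k-sized cube
    p_expand = sigmoid(pred[:, :m])          # per-neighbour expansion probability
    importance = sigmoid(pred[:, m:2 * m])   # per-neighbour importance value
    p_allow = sigmoid(pred[:, -1])           # may this position expand at all?
    return p_expand, importance, p_allow

def top_half_mask(p_allow):
    """Boolean mask selecting the top 50% of positions ranked by p_allow."""
    n_top = p_allow.shape[0] // 2            # assumption: round down if odd
    order = np.argsort(-p_allow, kind="stable")   # descending order
    mask = np.zeros(p_allow.shape[0], dtype=bool)
    mask[order[:n_top]] = True
    return mask

k = 3
rng = np.random.default_rng(0)
pred = rng.normal(size=(6, 2 * k ** 3 - 1))  # stand-in for the conv output
p_expand, importance, p_allow = split_prediction(pred, k)
mask_top = top_half_mask(p_allow)
mask_bot = ~mask_top
```

For $k = 3$ the prediction has 53 channels: 26 expansion probabilities, 26 importance values, and one allow-expansion probability.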
Clearly, not all of the positions in the $k$-sized cube surrounding an original valid input position that needs to be expanded are worth expanding into additional valid positions. The expansion probability $P_e$ is used to screen these positions. Selecting from $P_e$ with the top-50% mask $M_{\mathrm{top}}$ gives the probabilities of the candidate positions. Applying a threshold $\tau$ then yields the Boolean mask $M_e$ of the positions that are actually expanded:
$$M_e = P_e[M_{\mathrm{top}}] > \tau \tag{9}$$
The indices in $M_e$ where the value is True are the valid positions for additional expansion. The importance values of these positions are obtained using the Boolean mask $M_e$:
$$V_e = V[M_{\mathrm{top}}][M_e] \tag{10}$$
where $[\cdot]$ denotes extraction using a Boolean mask, $V_e \in \mathbb{R}^{N_e}$, and $N_e$ is the number of additional expanded valid input positions. Although new valid input positions have been added, they have no features of their own, so they are filled with zeros. Conversely, the original valid input positions have features but no corresponding importance values. Because the importance values $V$ are mapped to values between 0 and 1 by the $\mathrm{Sigmoid}$ function, and the original valid input features are clearly the most important, their importance values are filled with ones. This process of extending additional input positions is illustrated in Figure 3.
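The expansion step can be sketched in NumPy as follows. The function `expand_positions`, the ordering of neighbour offsets within the prediction channels, and the restriction to odd kernel sizes are all assumptions made for the example. The text does not specify how channels map to offsets.

```python
import numpy as np

def expand_positions(coords, feats, p_expand, importance, top_mask, k, tau):
    """Append thresholded neighbour positions to the valid input positions."""
    r = k // 2
    # Offsets of the k-sized cube around a position, excluding the centre.
    offsets = np.array([(dx, dy, dz)
                        for dx in range(-r, r + 1)
                        for dy in range(-r, r + 1)
                        for dz in range(-r, r + 1)
                        if (dx, dy, dz) != (0, 0, 0)])
    # Keep neighbours of top-ranked positions whose probability exceeds tau.
    expand_mask = p_expand[top_mask] > tau            # (N_top, k**3 - 1)
    src, nbr = np.nonzero(expand_mask)
    new_coords = coords[top_mask][src] + offsets[nbr]
    new_importance = importance[top_mask][expand_mask]
    # New positions carry zero features; original positions get importance 1.
    all_coords = np.concatenate([coords, new_coords], axis=0)
    all_feats = np.concatenate(
        [feats, np.zeros((new_coords.shape[0], feats.shape[1]))], axis=0)
    all_importance = np.concatenate(
        [np.ones(coords.shape[0]), new_importance])
    return all_coords, all_feats, all_importance

coords = np.array([[5, 5, 5], [9, 9, 9]])
feats = np.ones((2, 2))
p_expand = np.zeros((2, 26))
p_expand[0, 0] = 0.9                     # one neighbour passes the threshold
importance = np.full((2, 26), 0.5)
top_mask = np.array([True, False])
all_coords, all_feats, all_importance = expand_positions(
    coords, feats, p_expand, importance, top_mask, k=3, tau=0.5)
```

The result may still contain duplicate coordinates when neighbouring positions expand into the same cell, which is why a deduplication step follows.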
After obtaining the completed features $\tilde{F}$ and the completed importance values $\tilde{V}$, the data must also be deduplicated. This is because the original valid input positions may be either isolated or adjacent. When some valid input positions are close to each other, the additional expansion positions within their surrounding $k$-sized cubes are likely to overlap. This occurs because VS-Conv, after supervised learning, tends to predict similar expansion positions. In addition to duplicates among the additional expansion positions, an additional expansion position may also coincide with an original valid input position, and each case needs to be handled separately. For $\tilde{F}$, since zero padding was used during completion, no additional numerical processing is required even if an original valid input position overlaps with an additional expansion position. For $\tilde{V}$, however, since padding with ones was used during completion, special handling is necessary if duplicate expansions occur.
Suppose a position $j$ is additionally expanded $n_j$ times. Summing the repeated importance values belonging to that position and dividing by the number of repetitions achieves deduplication:
$$\bar{V}_j = \frac{1}{n_j} \sum_{i \in S_j} \tilde{V}_i \tag{11}$$
where $S_j$ is the set of $n_j$ entries whose position index $\mathrm{Idx}_i$ has the same value $j$. In essence, this computes the average of all importance values belonging to the same position. However, this simple method overlooks an important piece of information in the data: positions with a larger $n_j$, which have been expanded multiple times, are more important than positions where $n_j = 1$. To exploit this property, we introduced a weight $\alpha_j$ for the deduplicated $\bar{V}_j$, which is calculated through the $\mathrm{Softmax}$ function:
$$\alpha_j = \mathrm{Softmax}(n)_j = \frac{\exp(n_j / t)}{\sum_{m=1}^{M} \exp(n_m / t)} \tag{12}$$
where the right-hand side is the formula of the $\mathrm{Softmax}$ function, $M$ is the number of distinct additional expansion positions, and $n_j$ is the number of repeated expansions at the corresponding position. The purpose of this operation is to transform the vector $n = (n_1, \ldots, n_M)$ into a probability distribution, so that positions with larger $n_j$ receive larger weights for $\bar{V}_j$. A key characteristic of $\mathrm{Softmax}$ is that it amplifies the differences between elements. To make the output probability distribution smoother, a temperature hyperparameter $t$ is introduced.
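The averaging and the temperature-scaled weighting can be sketched in NumPy as below. The name `dedup_importance` and the toy inputs are assumptions. The sketch covers duplicates among expanded positions only and returns the averaged importance and the weights separately, since how they are combined is not spelled out here.

```python
import numpy as np

def dedup_importance(coords, importance, temperature=1.0):
    """Average importance over duplicate positions and weight by repeat count."""
    uniq, inverse, counts = np.unique(
        coords, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    # Mean importance of all predictions that landed on the same position.
    mean_imp = np.bincount(inverse, weights=importance,
                           minlength=uniq.shape[0]) / counts
    # Temperature-scaled softmax over the repeat counts.
    logits = counts / temperature
    logits = logits - logits.max()           # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return uniq, mean_imp, weights, counts

coords = np.array([[1, 1, 1], [2, 2, 2], [1, 1, 1]])
importance = np.array([0.2, 0.8, 0.6])
uniq, mean_imp, weights, counts = dedup_importance(coords, importance)
# A higher temperature flattens the distribution over positions.
_, _, smooth_weights, _ = dedup_importance(coords, importance, temperature=10.0)
```

With a temperature of 1 the position expanded twice receives roughly 73% of the weight; raising the temperature to 10 pulls the two weights towards an even split.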
After removing duplicates, the importance values are multiplied by the features, and the results are concatenated along the feature dimension to give the voxel features after expanding the valid positions. This not only emphasizes the significance of the original feature learning space structure but also serves the function of network supervision.
The additionally expanded valid positions can only serve as a bridge. Therefore, after the valid input positions have been expanded, a submanifold sparse convolution must be appended as the output layer to enable data exchange between spatially non-contiguous features. However, this form also has a drawback. The effect of the local convolution depends heavily on the additionally expanded valid positions, so the overall performance of variable sparse convolution is largely limited by the last feature channel of the prediction output, that is, by how well the allow-expansion probability is learned. To address this issue, this study proposes the Down2Up module to replace the last submanifold sparse convolution layer in the variable sparse convolution computation module, as shown in Figure 4.
To capture long-range convolutional dependencies, the most intuitive approach is to shorten the distances between sparse voxel features. Regular sparse convolution achieves this by reducing the resolution, but it also dilutes the sparsity. Regular sparse deconvolution can be used to restore the sparsity: owing to its computational characteristics, it fully restores the number of valid computation positions while restoring the resolution. Leveraging this characteristic, the Down2Up module consists of a regular sparse convolution, a submanifold sparse convolution, and a regular sparse deconvolution. Assuming the input is $X$, the output $Y$ is:
$$Y = \mathrm{SpDeconv}(\mathrm{SubMConv}(\mathrm{SpConv}(X))) \tag{14}$$
where $\mathrm{SpDeconv}$ represents the regular sparse deconvolution operation.
To supervise the learning of the expansion probabilities $P_e$ and the allow-expansion probabilities $P_a$, which guide the expansion of effective positions, we leverage the 3D ground-truth boxes. Specifically, a binary mask G is generated for all voxel positions, where G[i] = 1 if voxel i falls inside any ground-truth box and G[i] = 0 otherwise. This mask serves as the supervision signal. The predictions $P_e$ and $P_a$ are trained using a binary cross-entropy (BCE) loss.
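The supervision signal can be illustrated with the following NumPy sketch. For brevity it assumes axis-aligned boxes given by their minimum and maximum corners, whereas real 3D ground-truth boxes typically also carry a heading angle. The function names and the toy values are hypothetical.

```python
import numpy as np

def inside_any_box(centers, boxes):
    """Binary mask G: 1 if a voxel center lies inside any box, else 0.

    centers: (N, 3) voxel centers.
    boxes: (B, 6) axis-aligned boxes as (xmin, ymin, zmin, xmax, ymax, zmax);
           the axis-aligned form is a simplifying assumption.
    """
    lo = boxes[None, :, :3]
    hi = boxes[None, :, 3:]
    c = centers[:, None, :]
    inside = np.all((c >= lo) & (c <= hi), axis=2)    # (N, B)
    return inside.any(axis=1).astype(np.float64)

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob)
                           + (1.0 - target) * np.log(1.0 - prob))))

centers = np.array([[0.5, 0.5, 0.5], [5.0, 5.0, 5.0]])
boxes = np.array([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
G = inside_any_box(centers, boxes)
loss = bce_loss(np.array([0.9, 0.1]), G)
```

Positions inside a box are pushed towards a probability of 1 and all others towards 0, so expansion is encouraged where objects are annotated.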
3.3. Spatial-Aware Density-Based Local Aggregation (SDLA)
In the two-stage network, the second stage refines the proposal boxes produced by the first-stage Region Proposal Network (RPN). To obtain features that represent the predicted boxes containing the detection targets, multiple local features must be aggregated so that the information dispersed in space is captured more comprehensively. Since set abstraction [3] is commonly used to accomplish this task, this step is also referred to as voxel set abstraction.
Set abstraction can be subdivided into a sampling layer, a grouping layer, and a PointNet [20] layer. We followed the design paradigm of [12], dividing each proposal region into uniformly distributed grid regions of the same size. A grid point was generated at the center of each region as a key point, which completes the sampling task. The grouping layer takes the sampled points as centers and generates the corresponding point sets through a query algorithm, where each group corresponds to a local area. A common query algorithm is Ball Query, which has a time complexity of $O(N)$ in the number of points $N$ and often requires complex pre-processing, such as VSA. Therefore, we adopted the voxel query method [11], which starts from the voxel positions that require range queries. This reduces the complexity while preserving valuable spatial structure information.
(1) Design of the Kernel Density Estimation Feature Module: Different from the ordinary PointNet layer, in order to enable the downstream network to better learn the point patterns of LiDAR and the density features of each detection target category, SDLA additionally introduces density information to enrich the semantic features of voxels.
The formula for kernel density estimation can be expressed as:
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} \mathcal{K}\left(\frac{\lVert x - x_i \rVert}{h}\right) \tag{15}$$
where $\hat{f}(x)$ represents the density estimate at point $x$, $x_i$ is the position of the $i$-th sample point (there are $n$ sample points in total), $h$ is the bandwidth parameter, which controls the smoothness of the estimate, and $\mathcal{K}$ is the kernel function. In this study, the Gaussian kernel was adopted, and its calculation formula is as follows:
$$\mathcal{K}(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) \tag{16}$$
The closer the points within a region are to each other, the greater the density at the corresponding positions and the higher the KDE value. Let the centroid positions of the voxel query set of a grid point be $X_c \in \mathbb{R}^{K \times 3}$ and the corresponding voxel features be $f \in \mathbb{R}^{K \times F}$, where $K$ is the size of the point set obtained from the voxel query and $F$ is the number of channels in the feature dimension. To estimate the local point cloud density, we performed KDE on the set of 3D points defined by $X_c$. Specifically, we treated the K centroids $x_1, \ldots, x_K$ as the sample points. For each centroid $x$, we calculated its density $\hat{f}(x)$ with respect to all other centroids in the set. Substituting the distances between the centroids into Formulas (15) and (16) gives $\rho \in \mathbb{R}^{K \times 1}$, and $f$ and $\rho$ are concatenated:
$$f_{\rho} = \mathrm{Concat}(f, \rho) \tag{17}$$
where $\mathrm{Concat}$ represents the concatenation operation in the feature dimension, $f_{\rho} \in \mathbb{R}^{K \times (F + 1)}$. By appending the estimated KDE values to the local voxel features, the density information is implicitly encoded into the features, enabling downstream tasks to learn the density features.
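The density feature can be sketched in NumPy as follows. The function name, the bandwidth value, and the toy centroids are assumptions. The sketch uses a one-dimensional Gaussian kernel applied to pairwise distances and keeps each centroid's own zero-distance term in the sum; dropping that term would only subtract the same constant from every density value.

```python
import numpy as np

def gaussian_kde_features(centroids, feats, bandwidth):
    """Append a Gaussian KDE value to each of the K queried voxel features.

    centroids: (K, 3) centroid positions of the queried voxels.
    feats: (K, F) voxel features.
    bandwidth: scalar h controlling the smoothness (assumed hyperparameter).
    """
    K = centroids.shape[0]
    # Pairwise Euclidean distances between all centroids, shape (K, K).
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Gaussian kernel evaluated on the bandwidth-scaled distances.
    u = dist / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    density = kernel.sum(axis=1) / (K * bandwidth)
    return np.concatenate([feats, density[:, None]], axis=1), density

centroids = np.array([[0.0, 0.0, 0.0],
                      [0.1, 0.0, 0.0],
                      [5.0, 0.0, 0.0]])
feats = np.ones((3, 4))
feats_with_density, density = gaussian_kde_features(centroids, feats, 1.0)
```

The two nearby centroids receive a higher density value than the isolated one, and the feature width grows from 4 to 5 channels.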
(2) Spatial-Aware Aggregation Network: The PointNet layer fuses features through a fully connected MLP and, via max-pooling, keeps only the maximum value in each feature channel across the local features as the final result. This retains only the features of the most salient positions in the local space and discards the information of the other positions. We addressed this issue by concatenating the features along the spatial dimension in sequence.
Suppose the feature set of a grouped voxel obtained at the grouping layer is $f \in \mathbb{R}^{K \times F}$. First, a channel summation operation is performed on the features. This operation reduces the number of channels in the feature dimension and decreases the computational workload of the subsequent neural network. Assume that the number of channels is reduced to $rc$; the tensor is then reshaped to $\mathbb{R}^{K \times g \times rc}$, where $g$ is the size of the dimension newly added by the reshape, $g = F / rc$. Then, summing along the newly added dimension gives $\hat{f} \in \mathbb{R}^{K \times rc}$.
The set $f$ is obtained through voxel query, which traverses the 3D space sequentially along the coordinate axis directions. Compared with the spherical neighborhood query, the set obtained by voxel query is spatially ordered. This means that the set itself has spatial semantics, and features with different index numbers represent the information in a particular direction relative to the queried key point. Therefore, SDLA uses spatial concatenation to directly concatenate the $K$ features $\hat{f}_i$ into a vector and uses independent parameters to learn the different regions:
$$f_{\mathrm{agg}} = \mathrm{Concat}(W_1 \hat{f}_1, W_2 \hat{f}_2, \ldots, W_K \hat{f}_K) \tag{18}$$
where $W_i$ is the independent parameter matrix for the feature $\hat{f}_i$ in the set $\hat{f}$, $\mathrm{Concat}$ represents the concatenation operation in the feature dimension, and $f_{\mathrm{agg}}$ is the resulting aggregated feature. Specifically, channel summation reduces the feature dimension from F to rc by splitting the feature vector into F/rc groups of rc channels and summing the groups channel by channel. This significantly reduces the number of parameters required for the independent weight matrices in the spatial concatenation step (Equation (18)), which would otherwise scale with K × F, a potentially prohibitive number. Channel summation therefore serves as a crucial efficiency-enabling module for spatial concatenation: without it, the parameter count and computational cost would become unmanageable, especially with a large K.
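Channel summation followed by spatial concatenation can be sketched in NumPy as below. The function names, the all-ones weights, and the output width per slot are assumptions for illustration. In the network the per-slot weight matrices would be learned parameters.

```python
import numpy as np

def channel_sum(feats, rc):
    """Reduce (K, F) features to (K, rc) by folding the channel axis."""
    K, F = feats.shape
    assert F % rc == 0, "F must be divisible by rc in this sketch"
    # Reshape to (K, F // rc, rc) and sum over the newly added axis.
    return feats.reshape(K, F // rc, rc).sum(axis=1)

def spatial_concat(reduced, weights):
    """Apply one weight matrix per query slot, then concatenate the results.

    reduced: (K, rc) channel-summed features in voxel-query order.
    weights: (K, rc, out) independent parameters, one matrix per slot.
    """
    projected = np.einsum("kc,kco->ko", reduced, weights)   # (K, out)
    return projected.reshape(-1)                             # (K * out,)

feats = np.arange(12.0).reshape(2, 6)        # K = 2 slots, F = 6 channels
reduced = channel_sum(feats, rc=3)
weights = np.ones((2, 3, 2))                 # stand-in for learned weights
aggregated = spatial_concat(reduced, weights)
```

Because every slot has its own matrix, the parameter count grows with K times the per-slot input width, which is why shrinking that width from F to rc beforehand matters.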
(3) RoI Grid-pooling: For the grid point features obtained from a certain proposal region, we did not directly flatten them and feed them into a fully-connected MLP to obtain the final features. The traditional RoI grid-pooling treats each grid as an independent unit, resulting in the lack of long-range dependency relationships among the features.
Therefore, we introduced the self-attention mechanism from the Transformer architecture to explicitly model the interaction relationships among grid points. Additionally, we incorporated point cloud density information during positional encoding, which enhances the model’s adaptability to regions with different densities and its precise perception of target categories.
Specifically, we first used the position index of a grid region, the number of points within the grid region, and the average KDE value of the points within the grid region as the input for positional encoding. We then increased the dimensionality of this input through a Feed-Forward Network. Finally, we added the positional encoding to the grid point features and fed the result into a standard Transformer Encoder to obtain the final outcome.
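The density-aware positional encoding and the attention step can be sketched in NumPy as follows. This is a single-head simplification with random stand-in weights: a standard Transformer Encoder additionally uses multiple heads, layer normalization, and a feed-forward sublayer. All names, sizes, and the scalar grid index are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def density_positional_encoding(pos_input, w1, b1, w2, b2):
    """Lift (index, point count, mean KDE) to the feature width with an FFN."""
    hidden = np.maximum(pos_input @ w1 + b1, 0.0)    # ReLU
    return hidden @ w2 + b2

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over grid points."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
G, C, H = 8, 16, 32          # grid points, feature channels, hidden width
grid_feats = rng.normal(size=(G, C))
# Positional input per grid point: (grid index, point count, mean KDE).
pos_input = np.stack([np.arange(G), rng.integers(0, 20, G),
                      rng.random(G)], axis=1).astype(np.float64)
w1, b1 = rng.normal(size=(3, H)), np.zeros(H)
w2, b2 = rng.normal(size=(H, C)), np.zeros(C)
x = grid_feats + density_positional_encoding(pos_input, w1, b1, w2, b2)
wq, wk, wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```

Each row of `attn` describes how strongly one grid point attends to every other grid point, which is the interaction that independent per-grid pooling lacks.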