Article

SpaA: A Spatial-Aware Network for 3D Object Detection from LiDAR Point Clouds

School of Computer Science and Technology, Xidian University, Xi’an 710126, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(8), 1104; https://doi.org/10.3390/rs18081104
Submission received: 7 February 2026 / Revised: 4 April 2026 / Accepted: 5 April 2026 / Published: 8 April 2026

Highlights

What are the main findings?
  • A novel SpaA network architecture is proposed, integrating two core innovations: Variable Sparse Convolution network (VS-Conv) and Spatial-aware Density-based Local Aggregation (SDLA), collectively enhancing spatial awareness in LiDAR-based 3D object detection.
  • Comprehensive experiments on the KITTI benchmark validate SpaA’s effectiveness, achieving 67.23% 3D mAP.
What are the implications of the main findings?
  • VS-Conv and SDLA establish a new method for feature extraction and detection in 3D object detection. This design is transferable to broader 3D vision tasks and may inspire efficient feature learning strategies based on point cloud data.
  • The marked improvement in detecting vulnerable road users directly strengthens perception reliability in autonomous driving systems, advancing real-world safety in complex urban environments. In traffic safety, vulnerable road users are non-motorized road users such as pedestrians and cyclists, who are more susceptible to injury in traffic collisions.

Abstract

Grid-based 3D object detection methods effectively leverage mature point cloud processing techniques and convolutional neural networks for feature extraction and object localization. However, unlike images in 2D object detection, point clouds are unevenly and sparsely distributed in space, so detection networks need a degree of spatial structural perception. Learning spatial information such as point cloud density and distribution patterns can significantly benefit 3D detection networks. This paper proposes a Spatial-aware Network for 3D object detection (SpaA). Building on the 3D sparse convolution network, we design a Variable Sparse Convolution network (VS-Conv) capable of perceiving the importance of locations. To address the fact that set abstraction operations completely ignore spatial structure during local feature aggregation, we propose a Spatial-aware Density-based Local Aggregation (SDLA) method. Experiments demonstrate that enhancing the spatial-awareness capability of detection networks is crucial for complex 3D object detection. Detection results on the KITTI dataset validate the effectiveness of our method. On the test set, SpaA achieves 3D AP values of 82.20%, 44.04%, and 70.34% for the Car, Pedestrian, and Cyclist categories, respectively, and a competitive 3D mAP of 67.23%, outperforming several published methods.

1. Introduction

In contemporary autonomous driving systems, the environment perception module plays a critical role: it acquires information about the surrounding environment and provides precise data support for path recognition and driving decisions. To obtain this information accurately and promptly, the perception module typically performs a large number of visual tasks, among which 3D object detection is indispensable [1]. In the context of autonomous driving, the primary objective of 3D object detection is to identify and localize potential obstacles, such as vehicles, pedestrians, and cyclists, to enable safe and reliable obstacle avoidance. Cameras and LiDAR are the two mainstream sensors that provide raw perception data. However, image-based detection with cameras is often susceptible to extreme weather and to the time of day. LiDAR sensors, in contrast, directly obtain the fine-grained 3D structure of the scene by emitting laser beams and measuring their reflections, offering more robust detection results [2].
In LiDAR-based 3D object detection, extracting spatial features of the targets effectively from sparse raw point clouds remains a central technical challenge. One direct approach is to process the points directly with the grouping and sampling operations of PointNet++ [3], as in [4,5,6,7]. However, this approach incurs high computational cost, which has motivated alternative approaches that convert the point cloud into voxels [8] and use 3D convolutional networks for feature extraction [9,10,11,12].
The structure of 3D sparse convolution is similar to 2D convolution, encompassing several stages of downsampling operations. However, tailored to the characteristics of point cloud data, 3D sparse convolution undergoes computational optimizations, typically consisting of regular and submanifold sparse convolutions [13]. Most existing sparse convolution backbone networks adopt the structure from SECOND [14], completing downsampling for each stage by stacking modules composed of one regular sparse convolution and two submanifold sparse convolutions [15]. Regular sparse convolution is only used in downsampling layers since it expands all sparse features, inevitably resulting in significant computational overhead. Submanifold sparse convolution ensures that input and output features maintain the same sparsity, but it hinders the exchange of information between spatially disconnected features [16], as illustrated in Figure 1.
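The difference between the two operators can be made concrete with a toy example. The sketch below is not taken from the paper: it works on a 2D grid with a stride-1, 3 × 3 kernel, and the function names are hypothetical. It only tracks which sites are active, showing that a regular sparse convolution dilates the active set while a submanifold sparse convolution keeps it fixed and therefore cannot connect two sites that do not share a kernel window.

```python
def regular_output_sites(active, ks=3):
    """Stride-1 regular sparse convolution (2D toy): an output site is
    active whenever its ks x ks window contains at least one active input."""
    r = ks // 2
    return {(x + dx, y + dy)
            for (x, y) in active
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)}


def submanifold_output_sites(active):
    """Submanifold sparse convolution: outputs exist only at active inputs."""
    return set(active)


def in_same_window(a, b, ks=3):
    """True if site b lies inside the ks x ks window centred on site a."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= ks // 2


active = {(0, 0), (0, 3)}                      # two disconnected active sites
dilated = regular_output_sites(active)         # sparsity is diluted
kept = submanifold_output_sites(active)        # sparsity is preserved
isolated = not in_same_window((0, 0), (0, 3))  # no information exchange
```

In this toy case the regular operator turns 2 active sites into 18, whereas the submanifold operator keeps 2 sites that never see each other within one layer.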
These limitations stem from traditional convolutional priors: during convolution, all input features are treated equally. While this is logical for 2D convolutional neural networks, it is not well-suited for 3D sparse features. Two-dimensional convolution is designed for structured data, where all pixels in the same layer typically share the same receptive field size. However, 3D sparse data exhibits varying sparsity and importance across space, making uniform processing of non-uniform data suboptimal.
Enabling convolutional networks to perceive spatial structure is essential. On the one hand, due to the LiDAR point pattern, targets at different distances exhibit different sparsity levels. On the other hand, foreground and background points differ significantly in importance, and convolutional networks should allocate more resources to extracting features from foreground points. The success of Focals Conv [17] supports this claim, but there is still room for improvement, as we will detail in Section 2.
The concept of set abstraction was first introduced in PointNet++ [3] for hierarchical point set feature learning. Its purpose is to perform feature aggregation on points or voxels within a neighborhood or spatial region. Due to its simplicity, this method has been widely adopted in grid-based 3D object detection networks [11,12]. Set abstraction involves sampling via the Farthest Point Sampling (FPS) [18] algorithm or k-nearest neighbors (k-NN) [19] algorithm, followed by ball queries for grouping. The resulting point sets are then fed into PointNet [20] for final feature extraction. However, the MLP in PointNet consumes significant computational resources while also losing spatial structural information.
Set abstraction’s inability to perceive spatial information manifests in two key aspects. First, the max-pooling operation in PointNet discards spatial distribution information of local features, impairing the representational capacity of aggregated local features [21]. Second, it overlooks the importance of point cloud density information. The LiDAR point pattern diverges with increasing distance, leading to non-uniformly sampled point clouds that are ill-suited for discrete voxel feature extraction. Point density also affects the detection of smaller objects like pedestrians and cyclists. These objects intersect with fewer LiDAR beams, resulting in smaller surface areas and poorer localization [22].
To address these issues, we designed a Spatial-aware Network for 3D object detection (SpaA). By learning importance through additional convolutional layers, the network dynamically adjusts its processing based on input features, assigning higher weights to particularly critical locations to enhance the proportion of valuable information in the features. The Spatial-aware Density-based Local Aggregation (SDLA) abandons the PointNet structure, which abstracts individual features via max-pooling. Instead, it leverages voxel properties to concatenate features along spatial dimensions, using independent parameters for different spatial locations to preserve spatial structure. To tackle the neglect of point cloud density, our method incorporates kernel density estimation (KDE) during local aggregation, computing density information among local features and adding it to the aggregated features. This enables downstream networks to learn the density characteristics of target objects for improved detection.
Our contributions are summarized as follows:
  • We propose the Variable Sparse Convolution Network (VS-Conv). The variable sparse convolution computation module in the backbone network enhances the network’s ability to extract point cloud features by dynamically expanding effective input locations, strengthening spatial information extraction and improving detection accuracy particularly for categories sensitive to spatial information.
  • We propose the Spatial-aware Density-based Local Aggregation (SDLA) method. By incorporating kernel density estimation and a spatial-aware aggregation network, we introduce additional density information. During local aggregation, features from different spatial locations are directly concatenated, enabling the network to better capture the distributional structure of local point clouds and improving the robustness of detection results.

2. Related Work

2.1. Grid-Based 3D Object Detection from Point Clouds

Grid-based 3D object detection methods divide the 3D space into regular grid representations. The core idea of these methods is to convert point cloud data into structured representations to facilitate feature extraction and object detection using deep learning models. The grid is typically represented in three forms: voxels, pillars, and bird’s eye view (BEV) feature maps.
Each voxel corresponds to a discrete spatial unit. When points are sampled into a specific grid unit, that voxel is marked as occupied. Due to the sparse nature of point clouds, most grid units are not actually occupied. VoxelNet [8], a pioneering work, utilizes sparse voxel grids and introduces a novel Voxel Feature Encoding (VFE) layer to extract features from the points within each voxel unit.
Similar voxel encoding strategies have been adopted by subsequent research [23,24]. Other studies have attempted to optimize the representation of voxels, such as multi-view voxels [25,26,27] and multi-scale voxels [28,29].
BEV feature maps are 2D dense representations constructed from a bird’s eye view, where each pixel unit corresponds to a specific range in the real space and contains features of the 3D point cloud data within that region. Common regional statistical features include binary occupancy [30], local point cloud height, and density [31,32].
The pillar representation divides the 3D space into vertical columns (pillars) of unbounded height. Pillar features can be aggregated from the points with PointNet [20] and then scattered back to construct a 2D BEV image for feature extraction. PointPillars [10], a groundbreaking work, introduced the pillar representation.
Determining the optimal grid cell size is a core challenge for all grid-based detection algorithms. Smaller cell sizes generate higher resolution grid representations, which help preserve finer structural features and are crucial for achieving high-precision 3D object detection. However, reducing the physical size of grid cells significantly increases the memory consumption of the grid representation. Effectively balancing the precision gains from smaller grid cells with the surge in memory usage remains a key issue that needs to be addressed in this field.
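The trade-off can be illustrated with simple arithmetic: the number of cells in a dense 3D grid grows with the inverse cube of the cell size. The sketch below assumes a detection range of 70.4 m × 80 m × 4 m and cell sizes of 0.1 m and 0.05 m. These values are illustrative assumptions, not settings reported in this paper, and the function name is hypothetical.

```python
def dense_cell_count(extent_m, cell_m):
    """Number of cells in a dense grid covering extent_m with cubic cells."""
    n = 1
    for length in extent_m:
        n *= round(length / cell_m)
    return n


extent = (70.4, 80.0, 4.0)                # assumed range in metres (x, y, z)
coarse = dense_cell_count(extent, 0.10)   # assumed 0.10 m cells
fine = dense_cell_count(extent, 0.05)     # assumed 0.05 m cells
growth = fine // coarse                   # halving the cell size in 3D
```

Halving the cell size multiplies the dense cell count by eight, which is why sparse representations that store only occupied cells are attractive at fine resolutions.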

2.2. Convolutional Networks in 3D Object Detection

To address the drawbacks of sparse 3D convolution networks and capture long-range dependencies, some studies have directly transplanted 2D convolution techniques into the 3D object detection domain, such as large kernel convolutional neural networks or Transformers [33]. However, applying ordinary large kernel convolution networks to 3D feature learning may lead to issues like overfitting and reduced efficiency. Some research [34,35] has attempted to solve this problem through weight-sharing patterns, but efficiency remains low. The authors of [36,37] have directly replaced sparse convolutions with Transformers, but they do not hold advantages in terms of performance and efficiency.
Another branch of research introduces dynamic mechanisms on top of sparse convolutions. Since sparse convolutions are designed for grid-based 3D object detection, these methods inherit their performance advantages. Deformable PV-RCNN [38] predicts offsets for feature sampling in 3D object detection by applying Deformable Convolution [39]. In contrast, Focals Conv [17] makes the spatial sparsity of the output features learnable. However, Focals Conv focuses entirely on distinguishing foreground points from background points and neglects the distinctions among foreground points, which are necessary for classifying foreground objects. It employs attention multiplication and an objective loss to supervise importance values. This design makes full use of prior knowledge and successfully enhances foreground feature extraction, but because the supervision signal is tied directly to foreground–background classification, the network tends to raise the weights of all positions regarded as foreground. Importance is thereby conflated with semantic class, which can blur the boundaries between different object categories. To address these issues, VS-Conv adds modules that allocate different weights to the expanded feature outputs, giving the convolution network more refined learning capabilities.

3. Method

SpaA is a grid-based two-stage 3D object detection network whose overall structure resembles some classical methods [11,12]. In the first stage, an RPN [40] similar to that of SECOND [14] extracts proposal regions. The difference is a novel sparse convolution backbone, VS-Conv, which extracts deep voxel features more effectively, as shown in Figure 2. In the second stage, we replace the traditional set abstraction operation with the efficient SDLA method, which preserves local spatial structural information. The following subsections describe the structure of SpaA in sequence.

3.1. Voxel Feature Encoding

The role of voxel feature encoding is to convert raw point cloud data into voxel features. The design paradigm in [8] divides voxelization into two stages: grouping and sampling. Given a point cloud $P = \{p_1, \ldots, p_N\}$, this process assigns the $N$ points to a buffer of size $K \times T \times F$, where $K$ is the maximum number of voxels, $T$ is the maximum number of points in each voxel, and $F$ is the number of feature channels. If the number of voxels or the number of points within a voxel exceeds the fixed capacity, subsampling is performed. Conversely, if the number of points or voxels is less than the fixed capacity $T$ or $K$, the unused positions in the buffer are zero-padded. This process is referred to as fixed voxelization. Its limitations can be summarized as follows: (1) Points and voxels that exceed the specified capacity are discarded. Due to the characteristics of LiDAR, point density is higher at locations with key targets, so fixed voxelization forces the model to discard information that is potentially useful for detection. (2) The random discarding of points and voxels can lead to unstable or jittery detection results. Therefore, we implemented dynamic voxelization [25] by assigning a voxel grid coordinate index to each point and applying a deduplication function. This method abandons the approach of sampling points into a fixed number of voxels with fixed capacity and instead preserves the complete mapping between points and voxels. Consequently, the number of voxels and the number of points in each voxel are dynamic, depending on the specific mapping function and the actual distribution of the point cloud.
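The index-plus-deduplication idea behind dynamic voxelization can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the arguments `pc_min` (lower bound of the point cloud range) and `voxel_size` are hypothetical, and `np.unique` stands in for the deduplication function.

```python
import numpy as np


def dynamic_voxelize(points, pc_min, voxel_size):
    """Map every point to a voxel without dropping or padding anything.

    points: (N, 3+) array whose first three columns are x, y, z.
    Returns the occupied voxel grid coordinates (M, 3) and, for each
    point, the row index of its voxel (N,).
    """
    coords = np.floor((points[:, :3] - pc_min) / voxel_size).astype(np.int64)
    # Deduplicate grid coordinates; the inverse index is the point-to-voxel map.
    voxel_coords, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return voxel_coords, point_to_voxel.reshape(-1)


pts = np.array([[0.1, 0.1, 0.1],
                [0.2, 0.1, 0.1],
                [1.5, 0.1, 0.1]])
voxel_coords, point_to_voxel = dynamic_voxelize(pts, np.zeros(3), np.ones(3))
```

Both the number of voxels and the number of points per voxel follow from the data, so no point is discarded and no buffer is zero-padded.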
In addition, to obtain more fine-grained information from the point cloud, we introduced additional point features during the voxel feature encoding stage. Let the coordinates of a point be $P_{xyz} \in \mathbb{R}^{1 \times 3}$, and let the mean of $P_{xyz}$ over all points in the voxel it belongs to be $V_{mean} \in \mathbb{R}^{1 \times 3}$. $V_{mean}$ can also be understood as the centroid of the voxel. Subtracting $P_{xyz}$ from $V_{mean}$ yields $P_{clump} \in \mathbb{R}^{1 \times 3}$:
$$P_{clump} = V_{mean} - P_{xyz}$$
where $P_{clump}$ is the difference between the point and the centroid of its own voxel. Let the physical center of the voxel to which the point belongs be $V_{center} \in \mathbb{R}^{1 \times 3}$ (which can be derived from the point cloud range and the chosen voxel size). Subtracting $P_{xyz}$ from $V_{center}$ yields $P_{center} \in \mathbb{R}^{1 \times 3}$:
$$P_{center} = V_{center} - P_{xyz}$$
where $P_{center}$ is the difference between the position of the point and the physical center of the voxel it belongs to. The Euclidean distance $P_{dis}$ from the point to the origin of the coordinate system is
$$P_{dis} = \mathrm{norm}_2(P_{xyz})$$
where $\mathrm{norm}_2(\cdot)$ is the norm along a specified dimension, and the subscript 2 denotes the L2 norm, i.e., the Euclidean distance. We concatenated the additional point features with the original point features to obtain the final point feature $P_{fea} \in \mathbb{R}^{N \times 11}$:
$$P_{fea} = [x, y, z, r, P_{clump}, P_{center}, P_{dis}]$$
where $[\cdot, \cdot]$ is the concatenation operation along the feature dimension, and $x, y, z, r$ are the original features: the $x$-, $y$-, and $z$-coordinates and the reflectivity of the point $p$. The reflectivity is the intensity of the laser pulse emitted by the LiDAR after it is reflected by the surface of an object. It is stored as a floating-point value and reflects the laser reflection characteristics of the object’s surface. The ultimate goal of voxel feature encoding is to obtain the features of each voxel. After obtaining the complete point feature $P_{fea}$, a Feed-Forward Network was used to derive the corresponding voxel features, in a manner similar to the max-pooling operation in [20].
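The construction of the 11-channel point feature can be sketched as follows. This is a minimal NumPy illustration, assuming each point already carries the row index of its voxel (`point_to_voxel`) and that the voxel center is `pc_min + (index + 0.5) * voxel_size`. The function and argument names are hypothetical, and the learned Feed-Forward Network is omitted.

```python
import numpy as np


def augment_point_features(points, point_to_voxel, voxel_coords, pc_min, voxel_size):
    """Build [x, y, z, r, P_clump(3), P_center(3), P_dis(1)] for each point.

    points: (N, 4) array of x, y, z, reflectivity.
    point_to_voxel: (N,) row index into voxel_coords for each point.
    voxel_coords: (M, 3) integer grid coordinates of the occupied voxels.
    """
    xyz = points[:, :3]
    num_voxels = voxel_coords.shape[0]
    # Per-voxel centroid V_mean via a scatter-add followed by a division.
    counts = np.bincount(point_to_voxel, minlength=num_voxels).astype(xyz.dtype)
    sums = np.zeros((num_voxels, 3), dtype=xyz.dtype)
    np.add.at(sums, point_to_voxel, xyz)
    v_mean = sums / counts[:, None]
    # Physical voxel center V_center (assumed convention, see lead-in).
    v_center = pc_min + (voxel_coords + 0.5) * voxel_size
    p_clump = v_mean[point_to_voxel] - xyz       # P_clump = V_mean - P_xyz
    p_center = v_center[point_to_voxel] - xyz    # P_center = V_center - P_xyz
    p_dis = np.linalg.norm(xyz, axis=1, keepdims=True)
    return np.concatenate([points[:, :4], p_clump, p_center, p_dis], axis=1)


pts = np.array([[0.0, 0.0, 0.0, 0.5],
                [1.0, 0.0, 0.0, 0.5]])
fea = augment_point_features(pts, np.array([0, 0]), np.array([[0, 0, 0]]),
                             np.zeros(3), np.full(3, 2.0))
```

The channel count is 4 + 3 + 3 + 1 = 11, matching $P_{fea} \in \mathbb{R}^{N \times 11}$.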

3.2. Variable Sparse Convolution Network VS-Conv

(1) Residual Structure Design: VS-Conv’s underlying structure is still composed of regular sparse convolution operators and submanifold sparse convolution operators, ensuring the applicability and efficiency of the convolution backbone network. However, to address the training challenges caused by the excessive depth of sparse convolution networks and to enable the network to learn more complex feature representations, we introduced a residual structure [41]. Given the input voxel features $X \in \mathbb{R}^{M \times F}$ (with $M$ being the number of features and $F$ the feature channel length), the residual regular sparse convolution is computed as follows:
$$\hat{X} = \mathrm{Conv}_{sp1}(X) \oplus \mathrm{Conv}_{sp2}(X)$$
where $\mathrm{Conv}_{sp}$ represents the regular sparse convolution operation, $\oplus$ represents the residual connection, and $\hat{X} \in \mathbb{R}^{M \times F}$ represents the features obtained from the residual regular sparse convolution operator. Regular sparse convolution can not only alter the number of feature channels of the input $X$ but also spatially downsample the original data and dilute its sparsity. Therefore, $\mathrm{Conv}_{sp2}(\cdot)$ with a kernel size of 1 is needed to spatially downsample the original data $X \in \mathbb{R}^{M \times F}$ with the same stride and to scale the number of feature channels. Since sparse convolution stores its data in a hash table rather than in a tensor of the same spatial size, the features cannot be added directly during the residual connection. Instead, a sparse addition function must be used to add features at the same positions based on the coordinate indices of the sparse convolution tensor.
Similarly, the residual submanifold sparse convolution is computed as follows:
$$\hat{X} = \mathrm{Conv}_{subm}(X) \oplus X$$
where $\mathrm{Conv}_{subm}(\cdot)$ represents the submanifold sparse convolution operation, $\oplus$ represents the residual connection, and $\hat{X} \in \mathbb{R}^{M \times F}$ represents the features obtained from the residual submanifold sparse convolution operator. Submanifold sparse convolution does not alter the number of feature channels of the input $X$, nor does it change the spatial size or sparsity of the input data. Based on this characteristic, $\mathrm{Conv}_{subm}(X)$ and $X$ can be added together directly during the residual connection.
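The sparse addition used in the residual connection can be sketched with a plain dictionary playing the role of the hash table. This is an illustrative NumPy version with a hypothetical function name. It is not the sparse convolution library routine used by the authors, and it handles the general case in which the two branches do not share exactly the same coordinates.

```python
import numpy as np


def sparse_add(coords_a, feats_a, coords_b, feats_b):
    """Add two sparse feature sets by matching their voxel coordinates.

    coords_*: (M, 3) integer coordinates; feats_*: (M, F) features.
    Rows may be stored in different orders in the two inputs.
    """
    table = {}
    for c, f in zip(map(tuple, coords_a), feats_a):
        table[c] = f.copy()
    for c, f in zip(map(tuple, coords_b), feats_b):
        # Sum where the position already exists, otherwise insert it.
        table[c] = table[c] + f if c in table else f.copy()
    keys = sorted(table)
    return np.array(keys), np.stack([table[k] for k in keys])


coords_main = np.array([[0, 0, 0], [1, 0, 0]])
feats_main = np.array([[1.0, 1.0], [2.0, 2.0]])
coords_skip = np.array([[1, 0, 0], [0, 0, 0]])   # same sites, different order
feats_skip = np.array([[10.0, 10.0], [20.0, 20.0]])
out_coords, out_feats = sparse_add(coords_main, feats_main, coords_skip, feats_skip)
```

A row-wise addition of the two feature arrays would pair the wrong voxels here, which is why the coordinates must drive the addition.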
(2) Variable Sparse Convolution Computation Module: Variable sparse convolution is an extension of the submanifold sparse convolution functionality. Before performing submanifold sparse convolution computations, variable sparse convolution expands the effective input locations through a preprocessing step. This creates bridges between non-adjacent positions, enabling feature exchange and expanding the receptive field during subsequent convolution calculations.
Assume that the input voxel feature $X \in \mathbb{R}^{M \times F}$ corresponds to the sparse tensor coordinate index $\mathrm{Coords} \in \mathbb{R}^{M \times 3}$ and that the kernel size is $k_s$. To expand the effective input positions, the variable sparse convolution applies a submanifold sparse convolution $\mathrm{Conv}_{Imp}$ whose output channel count is $(k_s^3 - 1) \times 2 + 1$. The channels of its output $\mathrm{Imps} \in \mathbb{R}^{M \times (2k_s^3 - 1)}$ carry different meanings. Channels $1$ to $k_s^3 - 1$ hold the probabilities that the positions of the $k_s^3$-sized cube surrounding the current effective input position can be expanded into effective positions (the effective input position itself needs no prediction, so the count is $k_s^3 - 1$). Channels $k_s^3$ to $2(k_s^3 - 1)$ hold the importance values of the same surrounding positions (again $k_s^3 - 1$ values). The last channel holds the probability that the current effective input position is allowed to be expanded.
Extract $\mathrm{Imps}$ in channel order:
$$\mathrm{Mask}_{pro} = \mathrm{Imps}[1 : k_s^3 - 1], \qquad \mathrm{Mask}_{imp} = \mathrm{Imps}[k_s^3 : 2(k_s^3 - 1)], \qquad \mathrm{Mask}_{v} = \mathrm{Imps}[-1]$$
where $[\,:\,]$ represents the operation that retrieves the feature data from the corresponding channels.
Pass $\mathrm{Mask}_{pro}$, $\mathrm{Mask}_{imp}$, and $\mathrm{Mask}_{v}$ through the $\mathrm{Sigmoid}$ function to convert them into values between 0 and 1. Using the topk method, sort the values of $\mathrm{Mask}_{v}$ in descending order and select the Boolean masks $\mathrm{Indices}_{fore} \in \mathbb{R}^{M \times 1}$ and $\mathrm{Indices}_{back} \in \mathbb{R}^{M \times 1}$ for the top 50% and bottom 50% of the sequence, respectively. The indices where $\mathrm{Indices}_{fore}$ is True are the positions identified as necessary for expansion, which divides the original voxel features into two parts, those to be expanded and those not to be expanded:
$$X_{fore} = X[\mathrm{Indices}_{fore}], \qquad X_{back} = X[\mathrm{Indices}_{back}]$$
where $[\cdot]$ denotes extraction with a Boolean mask.
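The channel split and the top-50% selection can be sketched in NumPy as below. This is a hedged illustration with hypothetical function names, in which a random-free toy array stands in for the learned output of the importance convolution. It uses 0-based, end-exclusive slicing, so the slice bounds differ by one from the 1-based channel ranges in the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def split_imps(imps, ks=3):
    """Split an (M, 2*ks**3 - 1) prediction into the three masks."""
    n = ks ** 3 - 1                       # neighbours excluding the centre
    assert imps.shape[1] == 2 * n + 1
    mask_pro = sigmoid(imps[:, :n])       # expansion probabilities
    mask_imp = sigmoid(imps[:, n:2 * n])  # importance values
    mask_v = sigmoid(imps[:, -1])         # "may this site expand" score
    return mask_pro, mask_imp, mask_v


def fore_back_split(mask_v, ratio=0.5):
    """Boolean masks for the top `ratio` of sites by mask_v and the rest."""
    m = mask_v.shape[0]
    k = int(np.ceil(m * ratio))
    order = np.argsort(-mask_v, kind="stable")   # descending order
    fore = np.zeros(m, dtype=bool)
    fore[order[:k]] = True
    return fore, ~fore


imps = np.zeros((4, 53))                  # ks = 3 gives 2 * 27 - 1 channels
imps[:, -1] = [2.0, -1.0, 0.5, -3.0]
mask_pro, mask_imp, mask_v = split_imps(imps)
fore, back = fore_back_split(mask_v)
x = np.arange(8.0).reshape(4, 2)          # toy voxel features
x_fore, x_back = x[fore], x[back]
```

With $k_s = 3$ each site predicts 26 expansion probabilities, 26 importance values, and one expansion score.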
Clearly, the positions of the $k_s^3$-sized cube surrounding an original valid input position that needs to be expanded are not necessarily all worth expanding into additional valid positions. In this case, $\mathrm{Mask}_{pro}$ provides the probability values used to screen these positions. Separating $\mathrm{Mask}_{pro}$ with $\mathrm{Indices}_{fore}$ gives the values corresponding to the positions that need to be expanded. Applying a threshold $\mathrm{threshold}$ then yields the Boolean mask $\mathrm{Mask}_{pro}^{fore}$ of the positions that are actually expanded:
$$\mathrm{Mask}_{pro}^{fore} = \{x \in \mathrm{Mask}_{pro} \mid x > \mathrm{threshold}\}$$
The indices in $\mathrm{Mask}_{pro}^{fore}$ where the value is True represent valid positions for additional expansion. The importance values of these positions are obtained using the Boolean mask $\mathrm{Mask}_{pro}^{fore}$:
$$\mathrm{Mask}_{imp}^{fore} = \mathrm{Mask}_{imp}[\mathrm{Indices}_{fore}][\mathrm{Mask}_{pro}^{fore}]$$
where $[\cdot]$ denotes extraction with a Boolean mask and $\mathrm{Mask}_{imp}^{fore} \in \mathbb{R}^{M \times 1}$, with $M$ here denoting the number of additionally extended valid input positions. Although new valid input positions have been extended, these positions do not have features of their own, so they are filled with 0 values. Conversely, the original valid input positions have features but no corresponding importance values. Since $\mathrm{Mask}_{imp}$ is converted into values between 0 and 1 by the $\mathrm{Sigmoid}$ function, and the original valid input features are clearly the most important, their importance is filled with 1 values. This process of extending additional input positions is illustrated in Figure 3.
After obtaining the complete $X_{fore}$ and $\mathrm{Mask}_{imp}^{fore}$, the data must also be deduplicated, because the original valid input positions may be either discrete or adjacent. When some valid input positions are close to each other, the additional expansion positions within their surrounding $k_s^3$-sized cubes are likely to coincide, since VS-Conv, after supervised learning, tends to predict similar expansion positions. Besides duplicates among additional expansion positions, an additional expansion position may also coincide with an original valid input position, and these situations need to be handled individually. For $X_{fore}$, since zero-value padding was used during the completion stage, no additional numerical processing is required even if the original valid input positions and the additional expansion positions overlap. However, for $\mathrm{Mask}_{imp}^{fore}$, since one-value padding was used during the completion stage, special handling is necessary if duplicate expansions occur.
Assuming a position $j$ is additionally extended $\mathrm{count}_j$ times, deduplication can be achieved by summing the repeated $\mathrm{Mask}_{imp}^{fore}$ values belonging to that position and dividing by the number of repetitions:
$$\mathrm{Mask}_{imp,j}^{fore} = \frac{\sum_{i \in S_j} \mathrm{Mask}_{imp,i}^{fore}}{\mathrm{count}_j}, \qquad S_j = \{\mathrm{Mask}_{imp,i}^{fore} \in \mathrm{Mask}_{imp}^{fore} \mid \mathrm{Coords}_i = j\}$$
where $S_j$ is the set of $\mathrm{Mask}_{imp}^{fore}$ entries whose position index $\mathrm{Coords}_i$ equals $j$. In essence, this computes the average of all $\mathrm{Mask}_{imp}^{fore}$ values belonging to the same position. However, this simple method overlooks an important piece of information inherent in the data: positions with a larger $\mathrm{count}$ value, which have been extended multiple times, are more important than positions where $\mathrm{count} = 1$. To better utilize this property, we introduced a weight for the deduplicated $\mathrm{Mask}_{imp}^{fore}$, calculated with the Softmax with Temperature function:
$$\mathrm{softmax}_T(\mathrm{count}_i) = \frac{e^{\mathrm{count}_i / T}}{\sum_{j=1}^{N} e^{\mathrm{count}_j / T}}, \qquad \mathrm{Mask}_{imp}^{fore} = \mathrm{softmax}_T(\mathrm{count}_j) \times \mathrm{Mask}_{imp,j}^{fore}$$
where $\mathrm{softmax}_T(\cdot)$ is the Softmax with Temperature function, $N$ is the number of distinct additional extension positions, and $\mathrm{Count} = \{\mathrm{count}_1, \ldots, \mathrm{count}_N\}$ contains the number of repeated extensions at the corresponding positions. The purpose of this operation is to transform the vector $\mathrm{Count}$ into a probability distribution so that larger values of $\mathrm{count}$ give larger weights to the corresponding $\mathrm{Mask}_{imp}^{fore}$ values. A key characteristic of Softmax is that it amplifies the differences between elements. To make the output probability distribution smoother, a temperature hyperparameter $T$ is introduced here.
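The deduplication and count-based reweighting can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name is hypothetical, repeated positions are grouped with `np.unique`, and the maximum logit is subtracted before exponentiation for numerical stability, which does not change the softmax result.

```python
import numpy as np


def dedup_importance(coords, imp, temperature=1.0):
    """Average importance over repeated positions, then reweight by count.

    coords: (K, 3) integer positions, possibly containing repeats.
    imp: (K,) importance values of the expanded positions.
    Returns unique positions, weighted importance, and the softmax weights.
    """
    uniq, inverse, counts = np.unique(coords, axis=0,
                                      return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros(len(uniq))
    np.add.at(sums, inverse, imp)
    mean_imp = sums / counts                 # average per position
    logits = counts / temperature            # softmax with temperature T
    logits = logits - logits.max()           # stability shift only
    weights = np.exp(logits) / np.exp(logits).sum()
    return uniq, weights * mean_imp, weights


coords = np.array([[0, 0, 0], [0, 0, 0], [1, 1, 1]])
imp = np.array([0.2, 0.6, 0.8])
uniq, weighted, w_sharp = dedup_importance(coords, imp, temperature=1.0)
_, _, w_smooth = dedup_importance(coords, imp, temperature=10.0)
```

A position expanded twice receives a larger weight than one expanded once, and raising the temperature narrows the gap between the two weights.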
After removing duplicates, multiply the importance values by the features. This not only emphasizes the significance of the original feature learning space structure but also serves the function of network supervision.
X o u t = [ X f o r e × M a s k i m p f o r e , X b a c k ]
where [ , ] represents the concatenation operation in the feature dimension, and X o u t represents the voxel features after expanding the valid positions. The additionally expanded valid positions can only serve as a bridge. Therefore, after completing the expansion of the valid input positions, the last step requires appending a submanifold sparse convolution as the output layer to facilitate the data exchange between spatially non-contiguous features. However, this form also has drawbacks. The effect of local convolution is overly dependent on the additionally expanded valid positions, which causes the overall performance of the variable sparse convolution to be largely limited by the last feature channel of C o n v I m p , that is, the learning outcome of M a s k v . To address this issue, this study proposes the Down2Up module to replace the last submanifold sparse convolution layer in the variable sparse convolution computation module, as shown in Figure 4.
To capture long-range convolutional dependencies, the most intuitive approach is to shorten the distances between sparse voxel features. Regular sparse convolution achieves this by reducing the resolution, but it also dilutes the sparsity at the same time. To address this issue, regular sparse deconvolution can be used to restore the sparsity. Due to its computational characteristics, regular sparse deconvolution will fully restore the number of valid computation positions while restoring the resolution. Leveraging this characteristic, the Down2Up module consists of a regular sparse convolution, a submanifold sparse convolution, and a regular sparse deconvolution. Assuming the input is $X \in \mathbb{R}^{M \times F}$, then the output is $\hat{X} \in \mathbb{R}^{M \times F}$:
$$ \hat{X} = Conv_{spInv}(Conv_{subm}(Conv_{sp}(X))) $$
where $Conv_{spInv}(\cdot)$ represents the regular sparse deconvolution operation.
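The effect of the Down2Up sequence on the set of active positions can be illustrated with a one-dimensional toy example. The sketch below tracks only which sites are active, not the feature values. The kernel size, stride, padding, and the helper names `strided_conv_sites`, `interacting_pairs`, and `down2up_sites` are illustrative assumptions and not the configuration used in the paper.

```python
def strided_conv_sites(active, kernel=3, stride=2, padding=1):
    """Active output sites of a regular (strided) sparse convolution in 1D.

    Output site j covers inputs [j*stride - padding, j*stride - padding + kernel - 1]
    and becomes active if any active input lies inside that window.
    """
    out = set()
    for i in active:
        lo = -(-(i + padding - kernel + 1) // stride)  # ceil division
        hi = (i + padding) // stride                   # floor division
        out.update(range(max(lo, 0), hi + 1))
    return out


def interacting_pairs(active, kernel=3):
    """Pairs of distinct active sites that fall inside one submanifold kernel window."""
    reach = kernel // 2
    sites = sorted(active)
    return [(a, b) for a in sites for b in sites if a < b and b - a <= reach]


def down2up_sites(active):
    """Site bookkeeping for: regular conv -> submanifold conv -> inverse conv."""
    low_res = strided_conv_sites(active)  # resolution drops, gaps between sites shrink
    restored = set(active)                # the inverse conv restores the input sites
    return low_res, restored


full_res = {2, 5}                          # two voxels too far apart for a kernel of 3
low_res, restored = down2up_sites(full_res)
print(interacting_pairs(full_res))         # pairs connected at full resolution
print(sorted(low_res), interacting_pairs(low_res))
print(sorted(restored))                    # same active set as the input
```

In this toy case the two sites cannot exchange information at full resolution, become neighbours after downsampling, and the active set returned at the end matches the input, which mirrors the motivation given for Down2Up.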
To supervise the learning of $Mask_{pro}$ and $Mask_v$, which guide the expansion of effective positions, we leverage the 3D ground-truth boxes. Specifically, a binary mask $G$ is generated for all voxel positions, where $G[i] = 1$ if the voxel $i$ falls inside any ground-truth box, and 0 otherwise. This mask serves as the supervision signal. The predictions $Mask_{pro}$ and $Mask_v$ are trained using a binary cross-entropy (BCE) loss.
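A minimal sketch of this supervision signal is given below. It assumes axis-aligned boxes given as (cx, cy, cz, dx, dy, dz), whereas KITTI boxes also carry a heading angle that a full implementation would need to handle. The function names are hypothetical.

```python
import math


def inside_box(center, box):
    """True if a voxel centre lies in an axis-aligned box (cx, cy, cz, dx, dy, dz).

    Assumption: the box heading (yaw) is ignored in this sketch.
    """
    return all(abs(center[a] - box[a]) <= box[a + 3] / 2 for a in range(3))


def ground_truth_mask(voxel_centers, gt_boxes):
    """Binary mask G: 1 if the voxel falls inside any ground-truth box, else 0."""
    return [1.0 if any(inside_box(c, b) for b in gt_boxes) else 0.0
            for c in voxel_centers]


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and the mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


voxels = [(1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
boxes = [(0.0, 0.0, 0.0, 4.0, 2.0, 2.0)]
G = ground_truth_mask(voxels, boxes)
print(G, bce_loss([0.9, 0.1], G))
```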

3.3. Spatial-Aware Density-Based Local Aggregation (SDLA)

In a two-stage network, the second stage refines the proposal boxes produced by the first-stage Region Proposal Network (RPN). To obtain features that represent the proposal box containing each detection target, multiple local features are aggregated so that information dispersed in space is captured more comprehensively. Since set abstraction [3] is commonly used to accomplish this task, this step is also referred to as voxel set abstraction.
Set abstraction can be subdivided into a sampling layer, a grouping layer, and a PointNet [20] layer. We followed the design paradigm of [12], dividing the proposal regions into 6 × 6 × 6 uniformly-distributed grid regions of the same size. A grid point was generated at the center of each region as a key point to perform the sampling task. The function of the grouping layer was to take the sampled points as the center points and generate corresponding point sets through a query algorithm, where each group corresponds to a local area. A common query algorithm is the Ball Query, which has a time complexity of O ( N ) and often requires complex pre-work, such as VSA. Therefore, we adopted the voxel query method [11], which calculates starting from the voxel positions requiring range queries. This reduces the complexity and at the same time preserves valuable spatial structure information.
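The sampling and grouping steps can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the proposal box is treated as axis-aligned, non-empty voxels are stored in a plain dictionary, and the names `grid_points`, `voxel_query`, `query_range`, and `max_k` are assumptions.

```python
from itertools import product


def grid_points(box, n=6):
    """Centres of n x n x n equal cells inside an axis-aligned box (cx, cy, cz, dx, dy, dz)."""
    cx, cy, cz, dx, dy, dz = box
    return [(cx - dx / 2 + (i + 0.5) * dx / n,
             cy - dy / 2 + (j + 0.5) * dy / n,
             cz - dz / 2 + (k + 0.5) * dz / n)
            for i, j, k in product(range(n), repeat=3)]


def voxel_query(point, voxel_table, voxel_size, origin, query_range=1, max_k=16):
    """Collect features of non-empty voxels around the voxel that contains `point`.

    voxel_table maps an integer index (ix, iy, iz) to a feature id. The work done
    depends on the size of the query window, not on the total number of voxels.
    """
    idx = tuple(int((point[a] - origin[a]) // voxel_size[a]) for a in range(3))
    found = []
    offsets = range(-query_range, query_range + 1)
    for off in product(offsets, repeat=3):  # fixed, axis-ordered traversal
        key = (idx[0] + off[0], idx[1] + off[1], idx[2] + off[2])
        if key in voxel_table:
            found.append(voxel_table[key])
            if len(found) == max_k:
                break
    return found


keys = grid_points((0.0, 0.0, 0.0, 6.0, 6.0, 6.0))
table = {(0, 0, 0): 0, (1, 0, 0): 1, (5, 5, 5): 2}
print(len(keys), voxel_query((0.5, 0.5, 0.5), table, (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)))
```

Because the offsets are visited in a fixed order along the coordinate axes, the position of a feature in the returned list reflects its direction relative to the key point.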
(1) Design of the Kernel Density Estimation Feature Module: Different from the ordinary PointNet layer, in order to enable the downstream network to better learn the point patterns of LiDAR and the density features of each detection target category, SDLA additionally introduces density information to enrich the semantic features of voxels.
The formula for kernel density estimation can be expressed as:
$$ \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) $$
$\hat{f}_h(x)$ represents the density estimation value at point $x$. $x_i$ is the position of the $i$-th sample point (there are a total of $n$ sample points in all). $h$ is the bandwidth parameter, which controls the smoothness of the estimation, and $K$ is the kernel function. In this study, the Gaussian Kernel was adopted, and its calculation formula is as follows:
$$ K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}u^{2}\right) $$
The closer the points within a certain region are to one another, the greater the density at the corresponding positions and the higher the KDE value. Assume that the centroid position of the voxel query set of a certain grid point in the coordinate system is $V_{mean} \in \mathbb{R}^{K \times 3}$, and the corresponding voxel feature is $G_{fea} \in \mathbb{R}^{K \times F}$, where $K$ here denotes the size of the point set obtained from the voxel query (not the kernel function) and $F$ is the number of channels in the feature dimension. To estimate the local point cloud density, we performed KDE on the set of 3D points defined by $V_{mean}$. Specifically, we treated the $K$ centroids $x_i \in V_{mean}$ as the sample points. For each centroid $x$, we calculated its density $\hat{f}_h(x)$ with respect to all other centroids in the set. Substituting the distances between the centroids in $V_{mean}$ into the KDE and Gaussian kernel formulas above (Formulas (15) and (16)) yields $KDE_{fea} \in \mathbb{R}^{K \times 1}$, which is concatenated with $G_{fea} \in \mathbb{R}^{K \times F}$:
$$ \hat{G}_{fea} = [G_{fea}, KDE_{fea}] $$
where $[\cdot, \cdot]$ represents the concatenation operation in the feature dimension, and $\hat{G}_{fea} \in \mathbb{R}^{K \times (F + 1)}$. By appending the estimated KDE values to the local voxel features, the density information is implicitly encoded into the features, enabling downstream tasks to learn the density features.
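The density feature can be sketched in a few lines. The example below evaluates the Gaussian KDE at every centroid, using the Euclidean distance between centroids divided by the bandwidth as the kernel argument. Two points are assumptions of this sketch rather than statements from the paper: the sum includes the query centroid itself, and the one-dimensional normalization $1/(nh)$ is kept for 3D points. The function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_features(centroids, h):
    """Density estimate at each of the K centroids, one value per centroid."""
    n = len(centroids)
    feats = []
    for x in centroids:
        total = sum(gaussian_kernel(math.dist(x, xi) / h) for xi in centroids)
        feats.append(total / (n * h))
    return feats


def append_density(g_fea, kde_fea):
    """Concatenate a (K x F) feature list with (K x 1) densities into (K x (F + 1))."""
    return [list(g) + [k] for g, k in zip(g_fea, kde_fea)]


v_mean = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0)]
g_fea = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
g_hat = append_density(g_fea, kde_features(v_mean, h=0.25))
print(g_hat)
```

In this example the two centroids that lie close together receive a higher density value than the isolated one.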
(2) Spatial-aware Network Aggregation Network: The PointNet network fuses features through a fully-connected MLP and extracts the unique maximum value from the same feature channel of each local feature via the max-pooling operation as the final result. This method only retains the features of the most important positions in the local space while discarding the information of other positions. We addressed this issue by concatenating the features in the spatial dimension in sequence.
Suppose the feature set of a certain grouped voxel obtained at the grouping layer is $G_{fea} \in \mathbb{R}^{K \times F}$. First, perform a channel summation operation on the features. This operation aims to reduce the number of channels in the feature dimension and decrease the computational workload of the subsequent neural network. Assume that the number of channels is reduced to $r_c$, then reshape the tensor to generate $G_{fea} \in \mathbb{R}^{K \times T \times r_c}$, where $T$ is the size of the newly added dimension after reshaping the tensor, $T = F / r_c$. Then, sum $G_{fea}$ along the newly added feature dimension to obtain $G_{fea} \in \mathbb{R}^{K \times r_c}$.
The set $G_{fea}$ is obtained through voxel query. The feature set obtained by voxel query is acquired by sequentially traversing in the 3D space along the coordinate axis directions. Compared with the spherical neighborhood query, the set obtained by voxel query has spatial orderliness. This means that the set itself has spatial semantics, and the features with different index numbers represent the information of a certain orientation of the queried key points. Therefore, SDLA uses the spatial concatenation method to directly concatenate the $K$ features in $G_{fea}$ into a vector, and uses independent parameters to learn different regions.
$$ \hat{G}_{fea}^{i} = W_i \times G_{fea}^{i}, \quad G_{fea}^{i} \in G_{fea}, \quad i = 1, 2, \ldots, K $$
$$ SA_{fea} = [\hat{G}_{fea}^{1}, \hat{G}_{fea}^{2}, \ldots, \hat{G}_{fea}^{K}] $$
$W_i$ is the independent parameter matrix for the feature $G_{fea}^{i}$ in the set $G_{fea}$, $[\cdot, \cdot]$ represents the concatenation operation in the feature dimension, and $SA_{fea}$ is the resulting aggregated feature. Specifically, channel summation reduces the feature dimension from $F$ to $r_c$ by splitting each feature vector into $T = F / r_c$ groups of $r_c$ channels and summing the groups element-wise. This significantly reduces the number of parameters required for the independent weight matrices $W_i$ in the spatial concatenation step (Equation (18)), which would otherwise scale with $K \times F$, a potentially prohibitive number. Therefore, channel summation serves as a crucial efficiency-enabling module for spatial concatenation. Without channel summation, the parameter count and computational cost of spatial concatenation would become unmanageable, especially with a large $K$.
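Channel summation and spatial concatenation can be sketched with plain lists. The sketch assumes a row-major reshape, so that channel $c$ of group $t$ sits at index $t \cdot r_c + c$, and it uses small hand-written weight matrices in place of learned parameters. The names `channel_summation` and `spatial_concatenation` and the output width `d` are hypothetical.

```python
def channel_summation(g_fea, r_c):
    """(K x F) -> (K x r_c): view each row as T = F / r_c groups and sum the groups."""
    out = []
    for row in g_fea:
        if len(row) % r_c != 0:
            raise ValueError("F must be divisible by r_c")
        t = len(row) // r_c
        out.append([sum(row[g * r_c + c] for g in range(t)) for c in range(r_c)])
    return out


def spatial_concatenation(g_fea, weights):
    """Apply an independent (r_c x d) matrix to each of the K rows, then concatenate.

    Returns a flat vector of length K * d, so position i keeps its own slot.
    """
    sa_fea = []
    for row, w in zip(g_fea, weights):
        d = len(w[0])
        sa_fea.extend(sum(row[c] * w[c][j] for c in range(len(row))) for j in range(d))
    return sa_fea


g_fea = [[1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 0]]  # K = 2, F = 6
reduced = channel_summation(g_fea, r_c=2)         # T = 3 groups of 2 channels
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]    # one (2 x 2) matrix per position
print(reduced, spatial_concatenation(reduced, weights))
```

With output width `d`, the per-position matrices need K * r_c * d parameters after channel summation, compared with K * F * d without it.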
(3) RoI Grid-pooling: For the 6 × 6 × 6 grid point features $SA_{fea}$ obtained from a certain proposal region, we did not directly flatten them and feed them into a fully-connected MLP to obtain the final features. The traditional RoI grid-pooling treats each grid as an independent unit, resulting in the lack of long-range dependency relationships among the features.
Therefore, we introduced the self-attention mechanism from the Transformer architecture to explicitly model the interaction relationships among grid points. Additionally, we incorporated point cloud density information during positional encoding, which enhances the model’s adaptability to regions with different densities and its precise perception of target categories.
Specifically, we first used the position index of a grid region, the number of points within the grid region, and the average KDE value of the points within the grid region as the input for positional encoding. We then increased the dimensionality through a Feed-Forward Network. Finally, we added the positional encoding to the feature $SA_{fea}$ and fed the result into a standard Transformer Encoder to obtain the final output.

4. Experiments

4.1. Datasets and Evaluation Metrics

KITTI dataset: The dataset includes nearly 15,000 finely annotated stereo images and their corresponding point cloud data. Among them, the training set contains 7481 sets of samples, and the test set has 7518 sets of samples.
The dataset conducts 3D annotation for three types of targets: Car, Pedestrian, and Cyclist. In total, it includes 80,256 annotation instances that have undergone strict quality control. The KITTI dataset was captured using a Velodyne HDL-64E LiDAR sensor (Velodyne LiDAR, Inc., San Jose, CA, USA), which operates at a scan rate of 10 Hz, emits approximately 1.3 million points per second, and has a vertical field of view of 26.8° with 64 beams (≈0.4° angular resolution) and a horizontal resolution of 0.08–0.35° depending on rotation speed. The effective range is up to 120 m. Based on the training set annotations, the three target classes exhibit distinct physical dimensions and point cloud statistics: a typical Car measures about 4.7 m (L) × 1.8 m (W) × 1.5 m (H) and receives 150–200 LiDAR points at a distance of 20 m; a Pedestrian is roughly 0.8 m × 0.7 m × 1.7 m and yields only 20–30 points at the same range; a Cyclist (bicycle with rider) is about 1.7 m × 0.6 m × 1.6 m and produces 40–60 points. The point density on object surfaces decays approximately with the inverse square of the distance, dropping from about 0.5 points/cm² at 30 m to 0.1 points/cm² at 60 m for a car. These sensor and target characteristics directly influence the design choices of our method, such as voxel size, KDE bandwidth, and convolution kernel size.
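As a quick consistency check on the approximate figures quoted above, the inverse-square relation and the beam spacing can be worked out directly. The assumption that the 64 beams are spread evenly over the vertical field of view is ours and is used only for this estimate.

```latex
\rho(60\,\mathrm{m}) \approx \rho(30\,\mathrm{m}) \left(\frac{30}{60}\right)^{2}
  = 0.5 \times 0.25 = 0.125\ \mathrm{points/cm^{2}} \approx 0.1\ \mathrm{points/cm^{2}}

\Delta\theta_{\mathrm{vertical}} \approx \frac{26.8^{\circ}}{64 - 1} \approx 0.43^{\circ} \approx 0.4^{\circ}
```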
The KITTI dataset divides the difficulty level of 3D object detection into three grades: Easy, Moderate, and Hard. The division of these grades is mainly based on the visibility, occlusion degree, and truncation degree of the targets in the images.
The average precision (AP) is obtained by calculating the mean of the precision values on the precision–recall (PR) curve. This metric is commonly used in the field of object detection. The formula is as follows:
$$ AP = \int_{0}^{1} P(r)\, dr $$
$P(r)$ represents the PR curve. When calculating the AP value, due to the complexity of integral calculation, interpolation methods are often used in practical applications to calculate the AP value:
$$ AP|_{R} = \frac{1}{|R|} \sum_{r \in R} \rho_{interp}(r), \qquad \rho_{interp}(r) = \max_{r' : r' \geq r} P(r') $$
where $P(r)$ is the precision when the recall rate is $r$, and $R$ takes values uniformly between 0 and 1. The officially recommended evaluation metric is the 3D AP with 40 recall sampling points (R40), and we adopted this setting in our experiments.
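The interpolated AP can be sketched as follows. The sketch assumes the commonly used R40 recall grid {1/40, 2/40, ..., 1} and takes the PR curve as two matched lists; it is an illustration rather than the official KITTI evaluation code, and the function name is hypothetical.

```python
def interpolated_ap(recalls, precisions, num_points=40):
    """Mean of interpolated precision over evenly spaced recall levels.

    For each recall level r, the interpolated precision is the highest precision
    reached at any recall >= r, or 0 if the curve never reaches r.
    """
    total = 0.0
    for i in range(1, num_points + 1):
        r = i / num_points
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / num_points


# A toy PR curve with three operating points.
recalls = [0.25, 0.5, 1.0]
precisions = [1.0, 0.8, 0.4]
print(interpolated_ap(recalls, precisions))
```

For this toy curve, ten recall levels see a precision of 1.0, ten see 0.8, and twenty see 0.4, so the result is 26/40 = 0.65.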

4.2. Setting Detail

(1) The baseline model used throughout this paper for comparison is a re-implementation of the PV-RCNN [12] architecture, which serves as a strong representative of grid-based two-stage methods. The baseline employs a standard sparse convolutional backbone as described in SECOND [14], consisting of four stages of residual blocks (each comprising one regular sparse convolution followed by two submanifold sparse convolutions) for downsampling. It uses a Voxel Region Proposal Network (RPN) identical to [14] to generate initial proposals. For the second stage, it uses the standard set abstraction operation [3] with max-pooling for feature aggregation. All training hyperparameters (e.g., learning rate schedule, optimizer, data augmentation) are kept identical for both the baseline and SpaA to ensure a fair comparison. The specific modifications from this baseline—namely, the introduction of VS-Conv in the backbone and SDLA in the second stage—constitute our proposed SpaA network.
(2) Details related to the method: In the SDLA method, we introduced a multi-scale fusion operation, that is, voxel set abstraction was performed on the outputs of the convolutional network at different down-sampling stages. The voxel query involved has a query range of 4 on the $x$, $y$, and $z$ axes. The query radius in the third down-sampling stage is 0.8, and the query radius in the fourth down-sampling stage is 1.6. In the RoI grid-pooling stage, the Transformer Encoder uses a single-head attention mechanism and has only one encoder layer.
(3) Training and testing details on KITTI: The effective point cloud range of the detector, denoted as $(x_{min}, y_{min}, z_{min}, x_{max}, y_{max}, z_{max})$, is [0, −40, −3, 70.4, 40, 1] m. All experiments were trained on a single NVIDIA RTX 4080 GPU with a batch size of two. The software environment was Ubuntu 20.04, with CUDA 11.3, cuDNN 8.2.0, and spconv 2.1.25. We employed the Adam optimizer in conjunction with the One Cycle learning rate strategy for training. The momentum parameter was set to 0.9, the initial learning rate was 0.01, the weight decay coefficient was 0.01, and the gradient norm clipping threshold was 10.
The parameters of the One-Cycle learning rate strategy are as follows: The initial and final values of the momentum were 0.95 and 0.85, respectively. The proportion of the learning rate increase period to the total training period was 0.4. The ratio of the initial learning rate to the maximum learning rate was 10.
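A one-cycle schedule with these proportions can be sketched as below. Several choices are assumptions of this sketch and are not confirmed by the text: the value 0.01 is treated as the peak learning rate, the start value is the peak divided by 10, both phases use cosine interpolation, and the final value `lr_final` is a hypothetical parameter.

```python
import math


def one_cycle_lr(step, total_steps, lr_max=0.01, div_factor=10.0,
                 pct_start=0.4, lr_final=1e-7):
    """Learning rate at `step` for a two-phase one-cycle schedule.

    Phase 1 (first pct_start of training): rise from lr_max / div_factor to lr_max.
    Phase 2 (remainder): decay from lr_max to lr_final.
    Assumption: cosine interpolation is used in both phases.
    """
    warm = int(total_steps * pct_start)
    if step < warm:
        frac = step / max(warm, 1)
        start, end = lr_max / div_factor, lr_max
    else:
        frac = (step - warm) / max(total_steps - warm, 1)
        start, end = lr_max, lr_final
    return end + (start - end) * (1.0 + math.cos(math.pi * frac)) / 2.0


for s in (0, 20, 40, 70, 100):
    print(s, one_cycle_lr(s, total_steps=100))
```

The momentum would follow the mirrored pattern, moving from 0.95 to 0.85 while the learning rate rises and back again while it decays.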
The list of learning rate decay periods was [35, 45], and the learning rate decay factor for each decay was 0.1. The lower bound of the learning rate was set to 0.0000001, and no learning rate warm-up was used.
For the hyperparameters of training epochs and confidence threshold, which have significant impacts (our detector is relatively sensitive to the confidence threshold parameter), we adopted the method of comparative experiments to select the optimal parameters. Eventually, the number of training epochs was set to 100, and the confidence threshold was set to 0.2.

4.3. Quantitative Comparison on the Dataset

(1) Validation Set: Table 1 compares the performance of SpaA and the baseline on the KITTI validation set. Under the Moderate difficulty level, SpaA improves the AP over the baseline in every category. For the Car category, the AP increases by 0.62, a relative increase of 0.72%. For the Pedestrian category, the AP rises by 1.52, a relative increase of 2.27%. For the Cyclist category, the AP increases by 1.61, a relative increase of 2.10%. The improvement is most pronounced for the Cyclist and Pedestrian categories, which are sensitive to spatial structure information.
Figure 5 compares the PR curves of SpaA and the baseline on the KITTI dataset. Our algorithm outperforms the baseline in all categories, and the improvement is more pronounced in the Pedestrian and Cyclist categories. For the Cyclist category in particular, the precision of SpaA is clearly higher than that of the baseline at high recall. This indicates that SpaA performs well when positive-class samples are scarce: it identifies positive samples effectively while reducing false positives, a property that is crucial in object detection, where positive samples are far fewer than negative samples.
(2) Test Set: Table 2 compares SpaA with other models on the KITTI test set. The advantage of SpaA lies in its overall detection performance, with an mAP higher than that of all other methods listed. SpaA shows no obvious weakness in any detection category, which benefits from the high-quality upstream data provided by the convolutional network. Notably, its detection performance for the Cyclist category is markedly better than that of the other methods. We attribute this to the more comprehensive spatial perception learned by SpaA, which strengthens detection of the Cyclist and Pedestrian categories.
Figure 6 compares the detection performance of SpaA and the baseline at different distances. Except for a negative AP gain for the Cyclist category at distances beyond 30 m, SpaA improves on the baseline in all cases. In addition, the AP gains for the Cyclist and Pedestrian categories are clearly greater than those for the Car category, and the AP gains in long-distance scenarios are generally greater than those in short-distance scenarios. This suggests that SpaA has learned the characteristics of point density across the different scenarios of point-cloud-based 3D object detection.
The point cloud distribution of the LiDAR becomes sparser as the distance increases, and the point cloud density also affects the detection of smaller objects such as pedestrians and cyclists. The baseline network ignores the predictable relationship between the point density and the distance of the LiDAR sensor and its importance in category judgment.

4.4. Qualitative Comparison on the Dataset

(1) Detection Results: Figure 7 gives a visual comparison of the detection results of SpaA and the baseline from a 3D perspective. In the areas marked with red dotted lines in (a) and (b), the number of false-positive prediction boxes produced by SpaA is significantly reduced, and almost all the prediction boxes for the Pedestrian and Cyclist categories are accurate true positives.
In addition, the visualization results marked with orange dash-dotted lines in (c) and (d) show that the baseline may miss ground-truth boxes. For example, in (c) the baseline fails to detect a ground-truth target of the Pedestrian category at that position. SpaA resolves both the missed and the false detections in this frame, which indicates that SpaA better processes and identifies detection targets that are sensitive to density information, such as those in the Pedestrian category, and improves the robustness of the model.
Figure 8 shows a visual comparison of the detection results of SpaA and the baseline projected onto a 2D image. By observing the area marked with a red dotted line in the figure, we can see that SpaA reduces the misjudgment of detection targets at a relatively long distance. At the same time, it detects the Cyclist category targets that the baseline missed, thus improving the performance of the object detection network in long-distance scenarios.
Figure 9 shows the projection of voxel feature positions onto a 2D plane at one downsampling stage of a 3D sparse convolution backbone network on the KITTI dataset. (a) shows the projection for a general 3D sparse convolution, while (b) shows the projection for VS-Conv. As the red boxed area in (a) shows, traditional sparse convolution networks do not expand the feature positions of a region merely because local features might belong to detection targets; instead, they treat every input region equally. In (b), the voxel features lie on a regular grid-like foundation, with some additional voxel features generated on top of it; these result from VS-Conv dynamically expanding the effective input positions. As the red boxed area in (b) shows, VS-Conv learns the characteristics of different detection targets and predicts whether local features belong to detection targets. When it predicts that they might, VS-Conv adaptively expands the target features without introducing redundant background features. This allows locally valuable features to be extracted more effectively and also addresses the inability of traditional sparse convolutions to handle spatially disconnected voxel features.

4.5. Ablation Study

We conducted experiments on the KITTI validation set (Moderate difficulty level).
(1) VS-Conv: As shown in Table 3, introducing the complete variable sparse convolution computation module yields a clear increase in AP. VS-Conv dynamically expands the effective input positions and learns the important features of spatial structure information, which effectively improves the accuracy for the Cyclist and Pedestrian categories, which are sensitive to spatial information.
Meanwhile, $Conv_{Imp}$ offers a greater improvement than Down2Up and plays the leading role in the variable sparse convolution computation module.
(2) Use stages: As shown in Table 4, in the 3D sparse convolutional backbone network, introducing the variable sparse convolution module at stages (1, 2, 3) provides the most significant improvement to the overall performance of the network.
(3) Bandwidth: In kernel density estimation, the bandwidth is a very crucial parameter, which determines the smoothness of the estimation. The selection of the bandwidth directly affects the quality of kernel density estimation. Specifically, the bandwidth determines the scope of influence of each data point on the density estimation. A larger bandwidth makes the estimation smoother because the influence scope of each data point is wider, thus reducing fluctuations overall. On the contrary, a smaller bandwidth makes the estimation rougher because it is more sensitive to local changes in data points.
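The smoothing effect of the bandwidth can be seen on a small one-dimensional example. The sample positions below are made up for illustration: three points form a cluster and one point is isolated. The sketch compares the estimated density in the gap between them with the density at the cluster centre for three bandwidth values.

```python
import math


def kde_1d(x, samples, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return s / (len(samples) * h * math.sqrt(2.0 * math.pi))


def valley_to_peak(samples, h, valley=0.6, peak=0.1):
    """Density in the gap relative to density at the cluster centre."""
    return kde_1d(valley, samples, h) / kde_1d(peak, samples, h)


samples = [0.0, 0.1, 0.2, 1.0]  # a tight cluster and one isolated point
for h in (0.15, 0.25, 0.35):
    print(h, round(valley_to_peak(samples, h), 3))
```

As the bandwidth grows, the ratio rises towards 1: the gap fills in and the estimate becomes smoother, while a small bandwidth leaves sharp peaks around the individual points.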
As shown in Table 5, when the bandwidth is 0.15, the accuracy of only the Car category increases while that of other categories decreases. The reason may be that the excessively small bandwidth leads to over-fitting, that is, the estimated density curve overly reflects the noise and random fluctuations in the data. When the bandwidth is 0.35, the decrease in accuracy may be due to over-large bandwidth causing under-fitting, that is, important features and patterns in the data are ignored, which results in the newly introduced KDE features failing to play any role in class discrimination.
(4) SDLA: As shown in Table 6, when only the channel summation is introduced, the parameter count decreases and the inference speed (frames per second) increases owing to the reduced number of feature channels, but the overall accuracy decreases. This is because the channel summation module splits the high-dimensional features and then adds them element-wise, which inevitably leads to the loss of some feature information.
When spatial concatenation is introduced as well, the downstream network learns the semantics of features at different positions, so the computational load is reduced and the model's performance improves at the same time. This shows that the role of the channel summation module is to reduce the computational load and the number of feature channels, so that the feature vector does not become excessively long after the subsequent spatial concatenation.

5. Discussion

The experimental results presented in Section 4 demonstrate that the proposed SpaA network achieves consistent improvements over baseline methods on the KITTI benchmark, particularly in detecting pedestrian and cyclist categories. These categories are inherently more challenging due to their smaller physical size, fewer LiDAR return points, and higher sensitivity to spatial distribution. The observed performance gains can be attributed to two key innovations: the VS-Conv backbone and the SDLA module, which collectively enhance spatial awareness in both feature extraction and local aggregation stages.
The VS-Conv module introduces a dynamic mechanism that expands effective input positions based on learned importance. As illustrated in Figure 9, this leads to a more adaptive distribution of active voxel features, enabling the network to focus on regions likely to contain foreground objects without incurring the computational overhead of dense convolution. The ablation study in Table 3 further confirms that both the $Conv_{Imp}$ and Down2Up components contribute positively, with $Conv_{Imp}$ playing a dominant role. Notably, applying VS-Conv in the first three downsampling stages yields the best performance (Table 4), suggesting that early-stage spatial expansion is more beneficial, while over-expansion in deeper stages may introduce noise or redundant computations.
The SDLA module addresses two critical limitations of conventional set abstraction: the loss of spatial structure due to max-pooling and the neglect of point density variations. By incorporating kernel density estimation (KDE) and replacing PointNet with a spatial concatenation strategy, SDLA preserves local spatial semantics and encodes density cues into the aggregated features. Table 6 shows that the combination of channel summation and spatial concatenation achieves the best trade-off between accuracy and efficiency, reducing parameter count while improving detection performance. The sensitivity analysis in Table 5 highlights the importance of bandwidth selection in KDE, with an optimal value of 0.25 balancing over-smoothing and over-fitting.
From a qualitative perspective, Figure 7 and Figure 8 demonstrate that SpaA significantly reduces false positives and missed detections, especially in long-range and occluded scenarios. These improvements are critical for real-world autonomous driving, where the reliability of perception systems directly impacts safety. The distance-based analysis in Figure 6 further supports the effectiveness of SDLA, as the performance gain is more pronounced at longer ranges where point cloud sparsity becomes more severe.
Despite these advantages, SpaA is not without limitations. First, the dynamic expansion mechanism in VS-Conv introduces additional computational overhead, particularly in the $Conv_{Imp}$ module, which requires extra convolution operations and deduplication steps. Although the current implementation maintains real-time performance (Table 6), further optimization may be required for deployment on embedded platforms. Second, the SDLA module relies on voxel query with fixed grid partitions, which may not adapt well to highly irregular object shapes or extreme aspect ratios. Third, the current evaluation is limited to the KITTI dataset, which, while widely used, may not fully represent the diversity of real-world driving environments, such as dense urban scenes or adverse weather conditions.
Future work could explore several directions. One promising avenue is the integration of temporal information across consecutive frames to improve detection consistency and leverage motion cues. Another direction is the extension of VS-Conv to other sparse convolution architectures, such as those used in semantic segmentation or multi-modal fusion. Additionally, adaptive bandwidth selection for KDE, either learned or dynamically adjusted per region, could further enhance the robustness of SDLA.
In summary, the proposed SpaA network introduces effective mechanisms for spatial-aware feature learning in LiDAR-based 3D object detection. The combination of VS-Conv and SDLA addresses key limitations of existing grid-based methods, leading to improved accuracy and robustness, particularly for spatially sensitive and sparsely represented categories. It is important to note that this work focuses on improving spatial feature extraction from a single-frame LiDAR point cloud. The temporal dynamics inherent in sequential point cloud data, while crucial for a complete autonomous driving system, are beyond the scope of this paper and represent a promising direction for future research.

6. Conclusions

This paper primarily focuses on enhancing the feature extraction capability and detection accuracy in 3D object detection tasks based on point cloud data by optimizing sparse convolutions and leveraging information such as point density and spatial structure. The proposed VS-Conv outperforms traditional sparse convolution backbone networks in handling point cloud inputs by dynamically learning the importance of spatial information. In place of conventional set abstraction, SDLA introduces additional density features, assigns independent parameters to features from different spatial locations, and directly concatenates them, enabling the aggregated features to express spatial structures. The results on the KITTI test set demonstrate that SpaA is highly competitive with existing methods, particularly in the challenging detection of pedestrians and cyclists.

Author Contributions

Writing—review and editing, Supervision, Conceptualization, J.S.; Writing—review and editing, Visualization, Formal analysis, Data curation, C.Z. (Chu Zhang); Software, Editing, C.Z. (Cheng Zhang); Visualization, Data curation, L.S.; Validation, R.W.; Supervision, K.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available upon request.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VS-Conv: Variable Sparse Convolution Network
SDLA: Spatial-aware Density-based Local Aggregation
FPS: Farthest Point Sampling
KDE: Kernel Density Estimation
FFN: Feed-Forward Network

References

  1. Liang, Z.; Huang, Y. Survey on deep learning-based 3D object detection in autonomous driving. Trans. Inst. Meas. Control 2023, 45, 761–776.
  2. Wu, Y.; Wang, Y.; Zhang, S.; Ogai, H. Deep 3D Object Detection Networks Using LiDAR Data: A Review. IEEE Sens. J. 2021, 21, 1152–1171.
  3. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
  4. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 770–779.
  5. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-Based 3D Single Stage Object Detector. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048.
  6. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep Hough Voting for 3D Object Detection in Point Clouds. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9277–9286.
  7. Chen, Y.; Liu, S.; Shen, X.; Jia, J. Fast Point R-CNN. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9775–9784.
  8. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499.
  9. Shi, G.; Li, R.; Ma, C. PillarNet: Real-Time and High-Performance Pillar-Based 3D Object Detection. In Computer Vision—ECCV; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 35–52.
  10. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705.
  11. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209.
  12. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
  13. Graham, B.; Engelcke, M.; van der Maaten, L. 3D Semantic Segmentation With Submanifold Sparse Convolutional Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9224–9232.
  14. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337.
  15. Zhang, G.; Chen, J.; Gao, G.; Li, J.; Liu, S.; Hu, X. SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection. arXiv 2024.
  16. Zhang, G.; Chen, J.; Gao, G.; Li, J.; Hu, X. HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds. arXiv 2023.
  17. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437.
  18. Eldar, Y.; Lindenbaum, M.; Porat, M.; Zeevi, Y.Y. The farthest point strategy for progressive image sampling. IEEE Trans. Image Process. 1997, 6, 1305–1315.
  19. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  20. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
  21. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  22. Hu, J.S.K.; Kuai, T.; Waslander, S.L. Point Density-Aware Voxels for LiDAR 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8469–8478. [Google Scholar]
  23. Zhu, B.; Jiang, Z.; Zhou, X.; Li, Z.; Yu, G. Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection. arXiv 2019. [Google Scholar] [CrossRef]
  24. Ge, R.; Ding, Z.; Hu, Y.; Wang, Y.; Chen, S.; Huang, L.; Li, Y. AFDet: Anchor Free One Stage 3D Object Detection. arXiv 2020. [Google Scholar] [CrossRef]
  25. Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds. In Proceedings of the Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2020; pp. 923–932. [Google Scholar]
  26. Chen, Q.; Sun, L.; Cheung, E.; Yuille, A.L. Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 21224–21235. [Google Scholar]
  27. Deng, S.; Liang, Z.; Sun, L.; Jia, K. VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8448–8457. [Google Scholar]
  28. Ye, M.; Xu, S.; Cao, T. HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1631–1640. [Google Scholar]
  29. Wang, T.; Zhu, X.; Lin, D. Reconfigurable Voxels: A New Representation for LiDAR-Based Point Clouds. In Proceedings of the 2020 Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2021; pp. 286–295. [Google Scholar]
  30. Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-Time 3D Object Detection From Point Clouds. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660. [Google Scholar]
  31. Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; García, F.; De La Escalera, A. BirdNet: A 3D Object Detection Framework from LiDAR Information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3517–3523. [Google Scholar] [CrossRef]
  32. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  34. Chen, Y.; Liu, J.; Qi, X.; Zhang, X.; Sun, J.; Jia, J. Scaling up Kernels in 3D CNNs. arXiv 2022, arXiv:2206.10555. [Google Scholar]
  35. Lu, T.; Ding, X.; Liu, H.; Wu, G.; Wang, L. LinK: Linear Kernel for LiDAR-Based 3D Perception. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1105–1115. [Google Scholar]
  36. Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.-X.; Zhao, H.; Wang, F.; Wang, N.; Zhang, Z. Embracing Single Stride 3D Object Detector with Sparse Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8458–8468. [Google Scholar]
  37. Wang, H.; Shi, C.; Shi, S.; Lei, M.; Wang, S.; He, D.; Schiele, B.; Wang, L. DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13520–13529. [Google Scholar]
  38. Bhattacharyya, P.; Czarnecki, K. Deformable PV-RCNN: Improving 3D Object Detection with Learned Deformations. arXiv 2020. [Google Scholar] [CrossRef]
  39. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  40. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. STD: Sparse-to-Dense 3D Object Detector for Point Cloud. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
  43. Shi, S.; Wang, Z.; Wang, X.; Li, H. Part-A2 Net: 3D Part-Aware and Aggregation Neural Network for Object Detection from Point Cloud. arXiv 2019, arXiv:1907.03670. [Google Scholar]
  44. Chen, Q.; Sun, L.; Wang, Z.; Jia, K.; Yuille, A. Object as Hotspots: An Anchor-Free 3D Object Detection Approach via Firing of Hotspots. In Computer Vision—ECCV 2020; SpringerLink: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  45. Chen, J.; Han, Y.; Yan, Z.; Qian, J.; Li, J.; Yang, J. RagNet3D: Learning distinguishable representation for pooled grids in 3D object detection. Neurocomputing 2025, 635, 129841. [Google Scholar] [CrossRef]
  46. Xu, Q.; Zhong, Y.; Neumann, U. Behind the Curtain: Learning Occluded Shapes for 3D Object Detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2893–2901. [Google Scholar] [CrossRef]
  47. Zhang, Y.; Chen, J.; Huang, D. CAT-Det: Contrastively Augmented Transformer for Multi-Modal 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 908–917. [Google Scholar]
Figure 1. Computation with different sparse convolution operators. The left diagram shows a sparse input in which P1 and P2 are the active input positions and all other positions are zero. The convolution kernel is 3 × 3, with a stride of 1 and a padding of 0. The middle (blue) output is the result of regular sparse convolution, and the right (green) output is the result of submanifold sparse convolution. A1 and A2 are the convolution results for P1 and P2, respectively, and A1A2 denotes the sum of the convolution results for P1 and P2. Regular sparse convolution dilutes the sparsity of the input, whereas submanifold sparse convolution preserves it.
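The difference between the two operators in Figure 1 can be illustrated by tracking only which output positions become active. The following is a minimal sketch, not the authors' implementation: the 5 × 5 grid size, the coordinates chosen for P1 and P2, and the function names are assumptions made for illustration, while the 3 × 3 kernel, stride 1, and padding 0 follow the caption.

```python
# Sketch: active output positions under regular vs. submanifold sparse
# convolution (3x3 kernel, stride 1, padding 0, as in Figure 1).
# The grid size and the positions of P1 and P2 are illustrative assumptions.

def regular_active(active, size, k=3):
    """Output (r, c) is active if ANY active input lies in its k x k window."""
    out = size - k + 1  # output side length for stride 1, padding 0
    return {
        (r, c)
        for r in range(out)
        for c in range(out)
        if any((r + dr, c + dc) in active for dr in range(k) for dc in range(k))
    }

def submanifold_active(active, size, k=3):
    """Output (r, c) is active only if the input at the window centre is active."""
    out = size - k + 1
    half = k // 2
    return {
        (r, c)
        for r in range(out)
        for c in range(out)
        if (r + half, c + half) in active
    }

SIZE = 5
p1, p2 = (1, 1), (2, 3)  # hypothetical coordinates for P1 and P2
inputs = {p1, p2}

reg = regular_active(inputs, SIZE)
sub = submanifold_active(inputs, SIZE)

# Outputs reached by both P1 and P2 (the "A1A2" cells in the caption).
shared = regular_active({p1}, SIZE) & regular_active({p2}, SIZE)

print("active inputs:        ", len(inputs))
print("regular sparse conv:  ", len(reg), "active outputs")
print("submanifold conv:     ", len(sub), "active outputs")
print("outputs fed by both:  ", sorted(shared))
```

With these assumed positions, the regular operator activates every output whose window touches P1 or P2, while the submanifold operator keeps exactly one active output per active input, which is the sparsity-preserving behaviour the caption describes.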
Figure 2. The VS-Conv sparse convolution backbone network. ResSubMConv3D is the residual submanifold sparse convolution operator, ResSparseConv3D is the residual regular sparse convolution operator, and FlexConv3D is the variable sparse convolution computation module. A circled cross denotes a residual connection. Through the four convolution stages conv1, conv2, conv3, and conv4, the 3D sparse convolution backbone downsamples the input voxel features by 8× overall.
Figure 3. The working principle of the variable sparse convolution computation module. $Mask_{v}$ selects the original valid input features that can be expanded, $X_{fore}$, together with their position indices $Indices_{fore}$. $Mask_{imp}$ and $Mask_{pro}^{fore}$ then give the importance values $Mask_{imp}^{fore}$ of the expanded positions.
Figure 4. The structure of the Down2Up module. Downsampling with regular sparse convolution reduces the distance between input features that were originally non-adjacent, so a subsequent submanifold sparse convolution can exchange information between them. A regular sparse deconvolution then restores the original input layout without changing the original sparsity of the data.
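The idea behind Down2Up can be traced on a toy example. The sketch below is a simplified 1D illustration under stated assumptions, not the module itself: each active site carries a set of source labels instead of a numeric feature vector, downsampling uses a stride of 2, the submanifold kernel has size 3, and all function names are hypothetical.

```python
# Sketch of the Down2Up idea on a 1D sparse signal.
# Assumptions (not from the paper): stride-2 downsampling, kernel size 3,
# and "features" modelled as sets of source labels so mixing is easy to see.

def submanifold_mix(feats, k=3):
    """Each active site gathers from active neighbours inside the kernel.
    No new sites are created, so sparsity is unchanged."""
    half = k // 2
    return {
        x: set().union(*(feats[n] for n in range(x - half, x + half + 1) if n in feats))
        for x in feats
    }

def downsample(feats, stride=2):
    """Regular sparse convolution with stride: sites collapse onto a coarser grid."""
    coarse = {}
    for x, src in feats.items():
        coarse.setdefault(x // stride, set()).update(src)
    return coarse

def upsample_to(coarse, original_sites, stride=2):
    """Sparse deconvolution restricted to the original active sites."""
    return {x: set(coarse[x // stride]) for x in original_sites}

# Two active sites that are too far apart for a size-3 kernel to connect.
feats = {2: {"P1"}, 5: {"P2"}}

direct = submanifold_mix(feats)                 # no exchange at full resolution
coarse = downsample(feats)                      # sites 2 and 5 become 1 and 2
mixed = submanifold_mix(coarse)                 # neighbours now, so they exchange
restored = upsample_to(mixed, feats.keys())     # back on the original sites

print("direct:  ", direct)
print("restored:", restored)
```

In this toy setting the set of active sites after the module equals the set before it, while each site now carries information from both inputs.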
Figure 5. Comparison of recall–precision curves between our SpaA and baseline on KITTI validation set (moderate split).
Figure 6. The gain in detection performance of SpaA compared to baseline at different distances on the KITTI dataset.
Figure 7. Qualitative results of SpaA on the KITTI test set. (a,c) are the visualization results of the baseline in a 3D perspective, while (b,d) are the visualization results of SpaA in a 3D perspective. The ground truth, car, cyclist, and pedestrian are represented in the figure using dark blue, green, yellow, and light blue colors, respectively.
Figure 8. Qualitative results of SpaA on the KITTI test set. (a) is the projection of the baseline detection result on the 2D image, and (b) is the projection of the SpaA detection result on the 2D image. Cars, cyclists, and pedestrians are represented by green, blue, and purple colors, respectively, in the picture.
Figure 9. (a) the projection of a general 3D sparse convolution. (b) the projection of VS-Conv.
Table 1. Three-dimensional detection results on the KITTI validation set. R40 denotes AP computed at 40 recall positions. The improved results are in bold. Columns report 3D AP (R40) in % for each class and difficulty level.

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard | 3D mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 92.48 | 85.75 | 83.33 | 74.25 | 66.87 | 61.62 | 91.92 | 76.69 | 72.15 | 78.34 |
| SpaA (Ours) | 92.91 | 86.37 | 85.81 | 75.66 | 68.39 | 63.23 | 93.70 | 78.30 | 73.97 | 79.82 |
Table 2. Three-dimensional detection results on the KITTI test set. The best methods are in bold. Only published methods are reported. The listed 3D mAP values correspond to the mean of the nine AP values in each row (Easy, Moderate, and Hard for the Car, Pedestrian, and Cyclist categories). Methods that do not report results for the Pedestrian or Cyclist categories (denoted by “-”) have no mAP entry and are not directly compared in the discussion. Columns report 3D AP (R40) in % for each class and difficulty level.

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard | 3D mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| PointRCNN [4] | 86.96 | 75.64 | 70.70 | 47.98 | 39.37 | 36.01 | 74.96 | 58.82 | 52.53 | 60.33 |
| PointPillars [10] | 82.58 | 74.31 | 68.99 | 51.45 | 41.92 | 38.89 | 77.10 | 58.65 | 51.92 | 60.65 |
| STD [42] | 87.95 | 79.71 | 75.09 | 53.29 | 42.47 | 38.35 | 78.69 | 61.59 | 55.30 | 63.60 |
| Part A2 [43] | 87.81 | 78.49 | 73.51 | 53.10 | 43.35 | 40.06 | 79.17 | 63.52 | 56.93 | 63.99 |
| PV-RCNN [12] | 90.25 | 81.43 | 76.82 | 52.17 | 43.29 | 40.29 | 78.60 | 63.71 | 57.65 | 64.91 |
| HotSpotNet [44] | 87.60 | 78.31 | 73.34 | 53.10 | 45.37 | 41.47 | 82.59 | 65.95 | 59.00 | 65.19 |
| Voxel-RCNN [11] | 90.90 | 81.62 | 77.06 | - | - | - | - | - | - | - |
| RagNet3D [45] | 88.74 | 81.91 | 77.45 | - | - | - | 83.84 | 68.55 | 61.94 | - |
| Focals Conv [17] | 90.55 | 82.28 | 77.59 | - | - | - | - | - | - | - |
| BtcDet [46] | 90.64 | 82.86 | 78.09 | 47.80 | 41.63 | 39.30 | 82.81 | 68.68 | 61.81 | 65.96 |
| CAT-Det [47] | 89.87 | 81.32 | 76.68 | 54.26 | 45.44 | 41.94 | 83.68 | 68.81 | 61.45 | 67.05 |
| SpaA (Ours) | 90.40 | 82.20 | 77.41 | 50.16 | 44.04 | 41.17 | 86.01 | 70.34 | 63.34 | 67.23 |
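The 3D mAP column of Table 2 can be checked against the per-class values. The sketch below assumes that the column is the unweighted mean of the nine AP values in a row, rounded to two decimals; this assumption reproduces the tabulated mAP for the three rows checked here. The AP values are copied from the table, and the helper name `mean_ap` is hypothetical.

```python
# Sketch: reproduce the 3D mAP column of Table 2 from the per-class AP values.
# Assumption: mAP is the unweighted mean of all nine entries
# (Easy, Mod., Hard for Car, Ped., Cyc.), rounded to two decimals.

rows = {
    "PointRCNN": [86.96, 75.64, 70.70, 47.98, 39.37, 36.01, 74.96, 58.82, 52.53],
    "CAT-Det":   [89.87, 81.32, 76.68, 54.26, 45.44, 41.94, 83.68, 68.81, 61.45],
    "SpaA":      [90.40, 82.20, 77.41, 50.16, 44.04, 41.17, 86.01, 70.34, 63.34],
}

def mean_ap(values):
    """Unweighted mean of the AP values, rounded as in the table."""
    return round(sum(values) / len(values), 2)

maps = {name: mean_ap(values) for name, values in rows.items()}
for name, value in maps.items():
    print(f"{name:10s} 3D mAP = {value:.2f}")

# Margin of SpaA over the strongest listed competitor with a full row.
margin = round(maps["SpaA"] - maps["CAT-Det"], 2)
print(f"SpaA vs. CAT-Det: +{margin:.2f}")
```

Under this assumption the computed values match the table (60.33, 67.05, and 67.23), and the margin of SpaA over CAT-Det is 0.18 percentage points.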
Table 3. The ablation results of the VS-Conv on the KITTI validation set.

| Method | Conv_Imp | Down2Up | Car | Ped. | Cyc. |
|---|---|---|---|---|---|
| Baseline | | | 85.75 | 66.87 | 76.69 |
| VS-Conv | | | 86.43 | 67.47 | 77.55 |
| | | | 86.08 | 67.13 | 77.16 |
| | | | 86.55 | 67.69 | 77.71 |
Table 4. Ablation results for the stages in which VS-Conv is used, on the KITTI validation set.

| Method | Stages | Car | Ped. | Cyc. |
|---|---|---|---|---|
| Baseline | | 85.75 | 66.87 | 76.69 |
| VS-Conv | (1,) | 85.88 | 67.13 | 77.43 |
| | (1, 2,) | 86.52 | 67.31 | 77.19 |
| | (1, 2, 3,) | 86.55 | 67.69 | 77.71 |
| | (1, 2, 3, 4) | 85.63 | 67.40 | 77.47 |
Table 5. Ablation results for the bandwidth of the SDLA on the KITTI validation set.

| Method | Bandwidth | Car | Ped. | Cyc. |
|---|---|---|---|---|
| Baseline | | 85.75 | 66.87 | 76.69 |
| SDLA | 0.15 | 85.93 | 66.81 | 76.54 |
| | 0.20 | 85.86 | 66.99 | 76.75 |
| | 0.25 | 86.15 | 67.08 | 76.84 |
| | 0.30 | 86.01 | 67.03 | 76.79 |
| | 0.35 | 85.69 | 66.75 | 76.58 |
Table 6. The ablation results of the SDLA on the KITTI validation set. (CS represents channel summation operation and SC represents spatial concatenation operation.)

| Method | CS | SC | Car | Ped. | Cyc. | Param. | FPS |
|---|---|---|---|---|---|---|---|
| Baseline | | | 85.75 | 66.87 | 76.69 | 11,535,456 | 5.1 |
| SDLA | | | 85.60 | 66.56 | 76.52 | 11,398,590 | 5.7 |
| | | | 86.21 | 67.32 | 77.34 | 11,445,298 | 5.4 |
| | | | 86.31 | 67.26 | 77.48 | 11,308,432 | 6.0 |

Song, J.; Zhang, C.; Zhang, C.; Song, L.; Wang, R.; Xie, K. SpaA: A Spatial-Aware Network for 3D Object Detection from LiDAR Point Clouds. Remote Sens. 2026, 18, 1104. https://doi.org/10.3390/rs18081104

