1. Introduction
As the demand for advanced robotics and autonomous driving technologies increases, precise 3D object detection becomes essential for enabling these systems to function safely and effectively in various environments [1,2,3]. LiDAR, in comparison to RGB images, is particularly suited for 3D object detection under challenging environmental and lighting conditions. It provides high-accuracy point cloud data with detailed 3D information about objects, making LiDAR a primary sensor in autonomous driving and robotics applications. Deep learning-based 3D object detection methods can be broadly divided into voxel-based [4,5,6,7] and point-based [8] approaches. Voxel-based methods transform point clouds into 3D voxels or 2D grids, enabling the application of sparse convolutions for feature extraction, which improves computational efficiency. However, the voxelization process and the use of sparse convolutions can result in a loss of important 3D spatial details, which may reduce detection accuracy [5,9,10]. On the other hand, point-based approaches, such as PointNet [11] and its variants [12,13], operate directly on the original point clouds, preserving fine-grained 3D information and enabling more flexible receptive field selection, which leads to improved detection. Despite their advantage in maintaining spatial accuracy, these methods typically use farthest-point sampling (FPS) to select keypoints, which can miss foreground points and produce less effective proposals due to imbalanced point distributions. Since foreground points are essential for accurate detection, their presence among the sampled keypoints directly influences performance. To address this issue, methods such as [14,15] have been proposed to capture a larger number of keypoints, improving the distinction between foreground and background.
This paper introduces the semantic-guided multi-feature attention aggregation (SMA2) network, a novel approach that combines the foreground point extraction capability of semantic sampling with the feature representation strength of a sparse convolution backbone. By fusing semantically enhanced keypoints, voxels, and BEV features, SMA2 captures both local structures and global information in point cloud data, enhancing object detection accuracy. The network adopts a series of progressive steps for feature fusion, detailed as follows.
To extract more valuable foreground points, Yang et al. [16] utilized point cloud feature information for downsampling and employed Distance-FPS (D-FPS) to retain foreground points, thereby improving classification accuracy. Wu et al. [17] proposed a semantic-based strategy for foreground point extraction, which focuses on acquiring important spatial and positional information while minimizing computational cost. Extracting foreground points from a large pool of background points relies heavily on semantic information. Drawing inspiration from Part-A2 [7], our method uses point cloud semantic segmentation results as prior knowledge to guide the detector in extracting more foreground points. Specifically, each point is assigned a semantic segmentation class label based on its position relative to the 3D ground-truth annotation box.
To achieve this, a novel Keypoint Attention Enhancement (KAE) module is introduced. The semantic segmentation scores are used as weights for semantic-guided sampling, extracting foreground points from raw point clouds. Simultaneously, segmented semantic features are aggregated with the sampled foreground points via cross-attention, forming keypoint semantic enhancement features. Unlike FPS [12], S-FPS [18], and Sectorized-Centric Keypoint Sampling [15], this approach maximizes the retention of keypoint features from foreground points, enhancing attention mechanisms for better detection performance.
After obtaining semantic-enhanced keypoint features, the challenge becomes how to aggregate keypoint and multi-scale features. We aim to model the relationship between keypoints and multi-scale sparse voxels. Previous works [12,19] in 3D object detection have constructed these relationships using point–voxel fusion, maximum pooling layers [14,20,21], and graph relationships [19,22]. The advantage of these designs is that contextual information and dependencies can be captured, greatly enhancing the ability to identify fine-grained patterns. However, the key challenge is to mine the correlation between keypoint features and multi-scale voxel features while effectively fusing the two together. Inspired by the transformer architecture [23,24], a multi-feature attention aggregation (MFAA) module is proposed, which consists of three components: keypoints, BEV, and multi-scale sparse voxel features. By leveraging self-attention mechanisms, the MFAA module adaptively attends to relevant features from each representation, enabling the precise fusion of local and global information. The keypoint query guides attention, focusing on pertinent regions while ensuring efficient communication between features extracted from keypoints, BEV, and voxel grids.
The MFAA module facilitates cross-feature interactions, allowing keypoint features to be enhanced by semantic information and effectively fused with multi-scale voxel representations. This process captures both local and global spatial relationships, improving object detection accuracy, particularly in complex scenes. Additionally, hierarchical feature interactions at different scales help the model to focus on both fine details and broader contextual patterns, resulting in more robust detection performance.
The main contributions of this work are summarized as follows:
A Keypoint Attention Enhancement (KAE) module is introduced to capture more valuable foreground points from the raw point cloud, enabling the model to focus accurately on areas containing small objects;
We propose a multi-feature attention aggregation (MFAA) module, designed to aggregate keypoints and their corresponding voxel features to generate a comprehensive feature representation. This method effectively leverages the complementarity between point cloud and voxelized representations;
The proposed keypoint query allows for the direct extraction of voxel features near the keypoints, eliminating the need to traverse all voxels and thereby improving computational efficiency.
Extensive experiments demonstrate that SMA2 achieves competitive performance on the widely used KITTI 3D object detection benchmark. Furthermore, the method has been validated for robustness on the Waymo and DAIR-V2X-V validation sets.
The structure of this paper is as follows.
Section 2 surveys related work on LiDAR-based 3D object detection, covering point-based, voxel-based, and hybrid methods. In Section 3, we introduce the SMA2 network, which integrates semantic-aware modules with multi-feature attention mechanisms. Section 4 details the experimental setup, datasets, and implementation specifics. Section 5 reports quantitative results on the KITTI, Waymo, and DAIR-V2X-V benchmarks. Section 6 presents a comprehensive analysis, including ablation studies and module evaluations. Section 7 compares the inference performance of SMA2 against existing approaches.
3. Methodology
3.1. The Overview of Our Method
The pipeline of SMA2 is shown in Figure 1. The network is composed of three main components. First, the foreground point extraction module utilizes the Spconv–Unet encoder–decoder architecture, as depicted in Figure 2. This network combines sparse convolution blocks with submanifold convolution blocks to learn discriminative voxel features. The encoder employs three sparse convolution blocks to downsample the input voxel space by a factor of eight, effectively capturing essential feature information. The second component, the Keypoint Attention Enhancement module, samples the raw point clouds based on the detected point categories from the first stage. A self-attention mechanism is then applied between the voxelized sampling points and semantic space features, computing the offsets between location features and input features to refine keypoint selection. Finally, the multi-feature attention aggregation module leverages a 3D sparse CNN to extract voxel features, which are subsequently compressed into 2D BEV features. The keypoints, BEV features, and sparse voxel features are then passed through the transformer aggregation module, where they are integrated to optimize detection performance.
3.2. Foreground Points Extraction
To segment foreground points, an Spconv–Unet encoder–decoder network is used to learn discriminative voxel features through sparse and submanifold convolution blocks. The encoder downsamples the voxel space using three sparse convolution layers followed by two submanifold convolution layers to capture high-level features efficiently. The decoder then restores the original spatial resolution using four upsampling blocks, aiming to recover non-empty voxel features while maintaining computational efficiency. This architecture enables effective segmentation of foreground points by handling sparse point cloud data efficiently.
We then adopt a voxel-based data representation as input, dividing the raw point clouds into regular small voxels with a fixed spatial resolution. Each voxel represents the characteristics of the points contained in its grid cell. The mean of the coordinates of the points within a non-empty voxel is first calculated as the initial value of the voxel feature.
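As a concrete illustration of this mean-pooling initialization, the sketch below voxelizes a point cloud with NumPy. The voxel size and point cloud range are illustrative placeholders, not necessarily the exact values used in our experiments.

```python
import numpy as np

def voxelize_mean(points, voxel_size=(0.05, 0.05, 0.1),
                  pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Assign each point to a voxel and initialize non-empty voxel features
    with the mean of the point coordinates/features inside that voxel.

    points: (N, 3+) array of x, y, z (plus optional extras such as intensity).
    Returns integer voxel coordinates and the mean feature of each voxel.
    """
    voxel_size = np.asarray(voxel_size)
    pc_min, pc_max = np.asarray(pc_range[:3]), np.asarray(pc_range[3:])

    # Keep only points inside the detection range.
    mask = np.all((points[:, :3] >= pc_min) & (points[:, :3] < pc_max), axis=1)
    pts = points[mask]

    # Integer voxel index of every point.
    coords = np.floor((pts[:, :3] - pc_min) / voxel_size).astype(np.int64)

    # Group points by voxel and average their features.
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    feats = np.zeros((uniq.shape[0], pts.shape[1]))
    counts = np.bincount(inverse, minlength=uniq.shape[0])
    np.add.at(feats, inverse, pts)
    feats /= counts[:, None]
    return uniq, feats
```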
Given the size $(l, w, h)$, the orientation $\theta$ in the bird's-eye view, and the center position $(x_c, y_c, z_c)$ of a 3D ground-truth box, represented as $(x_c, y_c, z_c, l, w, h, \theta)$, we compute the relative position of each foreground point using Equation (1). The relative position of a foreground point, denoted as $(\tilde{x}, \tilde{y}, \tilde{z})$, is expressed as
$$ (\tilde{x}, \tilde{y}, \tilde{z}) = \left( \frac{(x - x_c)\cos\theta + (y - y_c)\sin\theta}{l} + 0.5,\; \frac{-(x - x_c)\sin\theta + (y - y_c)\cos\theta}{w} + 0.5,\; \frac{z - z_c}{h} + 0.5 \right), \tag{1} $$
where $(x, y, z)$ are the original coordinates of the foreground point and the center of the relative position is $(0.5, 0.5, 0.5)$.
For each foreground point within a 3D bounding box, the process begins by shifting the point coordinates relative to the box center $(x_c, y_c, z_c)$, so that the position is described in the object's local coordinate system. A 2D rotation matrix with angle $\theta$ is then applied to the point in the x–y plane to align the local coordinate frame with the object's orientation in the bird's-eye view. This rotation ensures that all objects are standardized to a common heading direction. The resulting offsets, along with the vertical difference, are normalized by the bounding box dimensions $(l, w, h)$ to map the point's relative position into a canonical coordinate space within the range $[0, 1]$. This normalization centers the object at $(0.5, 0.5, 0.5)$, ensuring a consistent geometric representation regardless of object scale or orientation. It should be noted that this transformation is based on the LiDAR coordinate system defined in the KITTI dataset, a right-handed system where the x-axis points forward, the y-axis points left, and the z-axis points upward.
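A minimal sketch of this canonical transformation is given below, assuming the box is provided as (x_c, y_c, z_c, l, w, h, theta) in the KITTI LiDAR frame; the function name and interface are illustrative.

```python
import numpy as np

def canonical_coords(points, box):
    """Map foreground points into the canonical [0, 1]^3 box frame.

    points: (N, 3) foreground point coordinates in the LiDAR frame.
    box:    (7,)   ground-truth box (x_c, y_c, z_c, l, w, h, theta).
    """
    xc, yc, zc, l, w, h, theta = box
    shifted = points - np.array([xc, yc, zc])        # translate to the box center

    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # Rotate by -theta in the x-y plane to align with the box heading.
    local_x = shifted[:, 0] * cos_t + shifted[:, 1] * sin_t
    local_y = -shifted[:, 0] * sin_t + shifted[:, 1] * cos_t
    local_z = shifted[:, 2]

    # Normalize by the box dimensions; the box center maps to (0.5, 0.5, 0.5).
    rel = np.stack([local_x / l, local_y / w, local_z / h], axis=1) + 0.5
    return rel
```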
The 3D ground-truth boxes inherently encode semantic categories for their associated foreground points, while a significant imbalance exists between the foreground and background point distributions [7]. To alleviate this class imbalance, we use the focal loss [44], defined by
$$ L_{\text{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), $$
where
$$ p_t = \begin{cases} p, & \text{for a foreground point}, \\ 1 - p, & \text{otherwise}, \end{cases} $$
and $p$ represents the predicted probability of classifying a point as foreground or background, with the hyperparameters set to $\alpha = 0.25$ and $\gamma = 2$.
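A compact PyTorch sketch of this binary focal loss is shown below. The `logits` and `labels` arguments are assumed to be per-point foreground scores and 0/1 targets; alpha and gamma follow the values stated above.

```python
import torch

def binary_focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Per-point focal loss for foreground/background segmentation.

    logits: (N,) raw scores for the foreground class.
    labels: (N,) binary targets (1 = foreground, 0 = background).
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(labels > 0, p, 1.0 - p)        # probability of the true class
    alpha_t = torch.where(labels > 0,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))
    return loss.mean()
```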
3.3. Semantic FPS
Farthest-point sampling (FPS) [45] is commonly employed for point cloud sampling, aiming to maintain the spatial structure and balance of the original data. It excels at selecting a subset of informative points that reflect the overall geometry of the scene. Nevertheless, in outdoor LiDAR scenarios, where point clouds can be dense and unevenly distributed, FPS may struggle to consistently capture representative points. This limitation can lead to missing critical features, such as small or distant objects, ultimately affecting the accuracy of 3D object detection.
To address this challenge, we propose a foreground-aware distance-weighted sampling strategy, inspired by [17]. The key idea is to incorporate foreground information into the FPS process so that points contributing to object detection are prioritized. Specifically, we leverage a foreground point segmentation module, which classifies each point as either foreground or background. This classification assigns different importance to points, with foreground points receiving higher weights during sampling. A two-layer Multi-Layer Perceptron (MLP) is employed to classify points. Given $N$ input points with feature vectors $F = \{f_i\}_{i=1}^{N}$, each of dimension $L$, the MLP predicts a foreground score $s_i$ for each point, representing the probability of being part of the foreground. The foreground score $s_i$ for the $i$-th point is computed as follows:
$$ s_i = \sigma\big(\Phi(f_i)\big), $$
where $\Phi(\cdot)$ represents the segmentation module, which maps the input point-wise features $f_i$ to foreground scores $s_i$, and $\sigma(\cdot)$ is the sigmoid activation function. A score near 1 implies a strong association with the foreground, whereas a value closer to 0 indicates a higher likelihood of being part of the background. Once the foreground scores are obtained, we incorporate them into the FPS process to refine the selection of keypoints. Let $d_i$ represent the distance from each unselected point to the existing keypoints in the keypoint set. In each iteration, the point with the largest semantically weighted distance is selected as a keypoint. The weight is computed by multiplying the foreground score $s_i$ with the distance $d_i$, as follows:
$$ w_i = s_i \cdot d_i, $$
where $w_i$ represents the weighted score for the $i$-th point. By incorporating the foreground score, we prioritize points that are both far from the existing keypoints and likely to belong to the foreground, ensuring that informative points are selected. The point with the largest weighted score $w_i$ is added to the keypoint set $K$, which is iteratively updated as
$$ K \leftarrow K \cup \{\, p_{i^{*}} \,\}, \qquad i^{*} = \arg\max_{i \notin K} w_i. $$
This process allows us to adaptively select keypoints based on both their spatial distribution and their importance in representing the foreground, resulting in more efficient and effective point cloud sampling.
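The sampling loop can be summarized with the NumPy sketch below. It is a simplified O(N·M) version of the weighted farthest-point selection described above, not the optimized implementation used in practice; starting from the highest-scoring point is one convenient initialization choice.

```python
import numpy as np

def semantic_fps(points, fg_scores, num_keypoints):
    """Foreground-aware farthest-point sampling.

    points:     (N, 3) point coordinates.
    fg_scores:  (N,)   foreground probabilities s_i from the segmentation MLP.
    Returns the indices of the selected keypoints.
    """
    n = points.shape[0]
    selected = np.zeros(num_keypoints, dtype=np.int64)
    dist = np.full(n, np.inf)                 # distance to the current keypoint set

    # Initialize with the point most likely to be foreground.
    selected[0] = int(np.argmax(fg_scores))
    for k in range(1, num_keypoints):
        last = points[selected[k - 1]]
        dist = np.minimum(dist, np.linalg.norm(points - last, axis=1))
        weighted = fg_scores * dist           # semantically weighted distance w_i = s_i * d_i
        selected[k] = int(np.argmax(weighted))
    return selected
```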
3.4. Keypoint Attention Enhancement
Our proposed Keypoint Attention Enhancement module dynamically enhances 3D keypoint features through the hierarchical fusion of geometric and semantic cues, enabling robust representation learning in complex scenes. As shown in Figure 3, given an input set of 3D keypoints $K = \{p_i\}_{i=1}^{N}$, where $N$ denotes the number of points and each point $p_i$ is represented by its $(x_i, y_i, z_i)$ coordinates, the module operates as follows.
The raw keypoints $K$ are encoded into high-dimensional semantic–geometric features using a feature extractor (e.g., an MLP or PointNet-based backbone). This yields a feature map
$$ F = \{f_i\}_{i=1}^{N}, \qquad f_i \in \mathbb{R}^{C}, $$
where $C$ is the feature dimension. Each $f_i$ encapsulates local geometric structures and global semantic contexts.
To prioritize discriminative regions, we compute a global attention map $b = (b_1, \dots, b_N)$ by fusing voxel foreground point features and semantic keypoint features using vector pool aggregation [15]. The attention weights are normalized via softmax:
$$ b_i = \frac{\exp(a_i)}{\sum_{j=1}^{N}\exp(a_j)}, $$
where $a_i$ is the fused score of the $i$-th keypoint and $b_i$ reflects its global significance. The feature map $F$ is refined through a fully connected (FC) layer with layer normalization (LN):
$$ \hat{F} = \mathrm{LN}\big(\mathrm{FC}(F)\big). $$
The attention map $b$ dynamically weights $\hat{F}$ via element-wise multiplication:
$$ \tilde{F} = b \odot \hat{F}. $$
For each keypoint $i$, we enhance its representation using a two-layer network with ReLU and LayerNorm:
$$ f_i^{\,\prime} = \mathrm{LN}\big(W_2\,\mathrm{ReLU}(W_1 \tilde{f}_i)\big), $$
where $\odot$ denotes element-wise multiplication and $W_1$, $W_2$ are learnable weights.
The final output consolidates discriminative geometric and semantic features.
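The module can be sketched in PyTorch as follows. The class name is illustrative, and the pre-softmax scoring is abstracted into a single linear layer standing in for the vector pool aggregation of [15].

```python
import torch
import torch.nn as nn

class KeypointAttentionEnhancement(nn.Module):
    """Simplified sketch of the KAE module: softmax attention over keypoints
    followed by attention-weighted refinement with a two-layer network."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)      # stand-in for vector pool aggregation
        self.fc = nn.Linear(channels, channels)
        self.ln_fc = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.LayerNorm(channels))

    def forward(self, keypoint_feats):
        # keypoint_feats: (N, C) fused semantic-geometric keypoint features.
        a = self.score(keypoint_feats)           # (N, 1) pre-softmax scores
        b = torch.softmax(a, dim=0)              # global attention map over keypoints
        f_hat = self.ln_fc(self.fc(keypoint_feats))   # FC + LayerNorm refinement
        f_tilde = b * f_hat                      # element-wise attention weighting
        return self.mlp(f_tilde)                 # two-layer enhancement
```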
3.5. Multi-Feature Attention Aggregation
3.5.1. Multi-Scale Voxel Feature Group
Sparse voxels enable higher spatial resolution by leveraging a smaller number of occupied voxels, thereby preserving fine-grained geometric structures. This characteristic is particularly advantageous for capturing subtle object boundaries and small-scale scene details in 3D perception tasks. However, voxelization inherently introduces quantization artifacts and information loss due to the discretization of irregular point clouds. To mitigate this issue, we adopt a multi-scale voxel feature aggregation strategy that captures point cloud information across multiple spatial resolutions and enriches context representation.
Instead of enhancing the sparse voxel backbone via isolated stages as in [46,47], we construct a hierarchical feature aggregation framework that aligns sparse voxel features with BEV (bird's-eye view) features at multiple scales. Specifically, we extract downsampled sparse voxel features from four stages with strides of 1×, 2×, 4×, and 8×. These features are progressively aligned to a common BEV resolution and concatenated with coarse-resolution voxel features. This results in a multi-scale hierarchical feature representation that captures both fine and coarse spatial information, which is then used to guide feature propagation to higher scales. The resulting non-empty voxel features centered around keypoints are aggregated to enhance semantic richness. To handle discrepancies in feature dimensions across scales, we apply a channel reduction step using sparse 3D convolution (SparseConv), ensuring consistent feature dimensionality before fusion.
This process not only bridges the resolution gap among features at different levels but also facilitates efficient information flow across scales. Moreover, our design allows for contextual feature enhancement in both dense and sparse regions by adaptively integrating voxel features with varying levels of granularity. This proves particularly beneficial for detecting small or distant objects, where fine-scale features provide critical cues. By fusing multi-scale voxel features into a unified representation, our method achieves a more comprehensive understanding of the scene while maintaining computational efficiency inherent to sparse convolutional networks.
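A simplified dense sketch of this alignment is shown below: features from each stage are reduced to a common channel width with 1×1 convolutions and resized to a shared BEV resolution before concatenation. In the actual network these operations run on sparse tensors; dense 2D tensors and the default channel counts are used here purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBEVAlign(nn.Module):
    """Align per-stage BEV feature maps to a common resolution and width."""

    def __init__(self, in_channels=(16, 32, 64, 64), out_channels=64):
        super().__init__()
        # 1x1 convolutions reduce every stage to the same channel dimension.
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats, target_hw):
        # feats: list of (B, C_s, H_s, W_s) BEV maps from the 1x, 2x, 4x, 8x stages.
        aligned = []
        for conv, f in zip(self.reduce, feats):
            f = conv(f)
            # Resize every stage to the shared (coarse) BEV resolution.
            f = F.interpolate(f, size=target_hw, mode='bilinear', align_corners=False)
            aligned.append(f)
        return torch.cat(aligned, dim=1)   # multi-scale hierarchical representation
```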
3.5.2. BEV Feature Map Representation
In the 3D voxel CNN branch, we first extract the 8× downsampled voxel feature map of dimensions $\frac{L}{8} \times \frac{W}{8} \times \frac{H}{8} \times C$, where $L$, $W$, and $H$ represent the length, width, and height of the input voxel grid, respectively. To effectively transform the voxel features into a bird's-eye view (BEV) representation, we stack the features along the z-axis, aggregating the three-dimensional voxel features into a two-dimensional BEV feature map of size $\frac{L}{8} \times \frac{W}{8} \times \left(\frac{H}{8} \cdot C\right)$. This representation not only preserves the spatial geometry of the scene but also ensures higher computational efficiency. Subsequently, to obtain the semantic embeddings of the downsampled points in the BEV space, we employ bilinear interpolation to extract the corresponding feature vectors from the BEV feature map, denoted as $F_{\text{bev}} \in \mathbb{R}^{N \times C}$, where $N$ is the number of points and $C$ represents the number of channels. This process enhances the contextual information of the point cloud representation, providing more robust features for subsequent fusion and detection tasks.
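The bilinear lookup of per-point BEV embeddings can be written as follows; the function name and the `bev_stride`/`voxel_size_xy` parameters mapping LiDAR coordinates to BEV grid indices are illustrative assumptions.

```python
import torch

def interpolate_bev_features(bev_map, points_xy, pc_min_xy, voxel_size_xy, bev_stride=8):
    """Bilinearly sample per-point features from a BEV feature map.

    bev_map:   (C, H, W) BEV features (W along x, H along y).
    points_xy: (N, 2) point coordinates in the LiDAR frame.
    """
    # Continuous BEV grid coordinates of each point.
    u = (points_xy[:, 0] - pc_min_xy[0]) / (voxel_size_xy[0] * bev_stride)
    v = (points_xy[:, 1] - pc_min_xy[1]) / (voxel_size_xy[1] * bev_stride)

    u0, v0 = torch.floor(u).long(), torch.floor(v).long()
    u1, v1 = u0 + 1, v0 + 1
    C, H, W = bev_map.shape
    u0c, u1c = u0.clamp(0, W - 1), u1.clamp(0, W - 1)
    v0c, v1c = v0.clamp(0, H - 1), v1.clamp(0, H - 1)

    # Bilinear weights from the fractional offsets.
    wu = (u - u0.float()).clamp(0, 1)
    wv = (v - v0.float()).clamp(0, 1)
    f00 = bev_map[:, v0c, u0c]
    f01 = bev_map[:, v0c, u1c]
    f10 = bev_map[:, v1c, u0c]
    f11 = bev_map[:, v1c, u1c]
    feats = (f00 * (1 - wu) * (1 - wv) + f01 * wu * (1 - wv)
             + f10 * (1 - wu) * wv + f11 * wu * wv)
    return feats.t()          # (N, C) per-point BEV embeddings
```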
3.5.3. Transformer-Based Multi-Feature Aggregation
We aim to obtain a more accurate and comprehensive data representation by aggregating multi-feature information. The voxel features and BEV features are concatenated to obtain multi-scale aggregation features, where the dimension of each class of feature is equal to $C$. Inspired by the linear combination of inputs using relevance weights in the attention mechanism [48,49], two input matrices $X$ and $Y$ are interacted and weighted by correlation scores in the self-attention mechanism. The attention output layer is defined as
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\; K = YW_K,\; V = YW_V, $$
where the matrices $Q$, $K$, and $V$ correspond to the query, key, and value, respectively, while $W_Q$, $W_K$, and $W_V$ represent their respective linear projections and $d_k$ is the key dimension. As illustrated in Figure 4, the multi-feature aggregation method incorporates fused feature maps that combine multi-scale voxel groups and semantic keypoint information, which are then processed through self-attention. The key distinction is that we compute the attention weights between the semantic keypoint features (serving as queries) and the multi-scale aggregation features (serving as keys) using the query and key feature vector matrices. By applying a weighted sum over the values, we efficiently fuse the resulting outputs into the final aggregated feature representation.
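The keypoint-query attention can be sketched as below: semantic keypoint features act as queries, while the concatenated voxel/BEV aggregation features act as keys and values. The class name, channel width, and single-head form are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MultiFeatureAttention(nn.Module):
    """Keypoint-query attention over multi-scale voxel/BEV aggregation features."""

    def __init__(self, channels=64):
        super().__init__()
        self.w_q = nn.Linear(channels, channels)
        self.w_k = nn.Linear(channels, channels)
        self.w_v = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, keypoint_feats, ms_feats):
        # keypoint_feats: (N, C) semantic keypoint features (queries).
        # ms_feats:       (M, C) multi-scale voxel + BEV aggregation features (keys/values).
        q = self.w_q(keypoint_feats)
        k = self.w_k(ms_feats)
        v = self.w_v(ms_feats)
        # Correlation scores between keypoints and aggregated features.
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)   # (N, M)
        return attn @ v                                        # (N, C) fused output
```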
3.6. Loss Functions
Our approach can be trained in an end-to-end manner through the RPN and R-CNN stages, optimized using a multi-task loss function $L_{\text{total}}$, defined as follows:
$$ L_{\text{total}} = L_{\text{seg}} + L_{\text{rpn}} + L_{\text{rcnn}}. $$
The segmentation loss $L_{\text{seg}}$ is calculated using binary cross-entropy to extract the semantic features of foreground points. Following the approach in [7], the RPN loss $L_{\text{rpn}}$ is composed of three components: the object classification loss, the box localization regression loss, and the corner loss:
$$ L_{\text{rpn}} = \lambda_1 L_{\text{cls}} + \lambda_2 L_{\text{reg}} + \lambda_3 L_{\text{corner}}, $$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent the weight coefficients of the above three sub-tasks. The Smooth-L1 loss [15] is adopted for calculating $L_{\text{reg}}$; the regression targets are computed from the relative offsets between the anchor and the ground truth:
$$ \Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta \theta = \theta^{gt} - \theta^{a}, $$
where $d^{a} = \sqrt{(l^{a})^{2} + (w^{a})^{2}}$ denotes the bird's-eye-view diagonal of the anchor box. The regression loss $L_{\text{reg}}$ can then be defined as
$$ L_{\text{reg}} = \sum_{r \in \{x, y, z, l, w, h, \theta\}} \mathrm{SmoothL1}\big(\widehat{\Delta r}, \Delta r\big), $$
and the classification loss $L_{\text{cls}}$ can be expressed by the focal loss as follows:
$$ L_{\text{cls}} = -\alpha (1 - p_t)^{\gamma} \log(p_t), $$
where the hyperparameters $\alpha$ and $\gamma$ require manual tuning and $p_t$ represents the classification prediction. In this study, $\alpha$ is set to 0.25, while $\gamma$ is set to 2. The corner loss $L_{\text{corner}}$ is calculated using the sine-error loss [5] for angle regression. In the refinement stage, $L_{\text{rcnn}}$ is used as the loss for classification and localization. Its objective is to filter proposals using the ground truth during the Region of Interest (RoI) process. This loss consists of three parts: the classification confidence loss $L_{\text{conf}}$, the location regression loss $L_{\text{loc}}$, and the box corner loss $L_{\text{corner}}$, and it is defined as follows:
$$ L_{\text{rcnn}} = L_{\text{conf}} + L_{\text{loc}} + L_{\text{corner}}. $$
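The anchor-relative regression targets and the Smooth-L1 regression term can be written compactly as in the sketch below; the encoding follows the residual definitions above, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_box_targets(anchors, gt_boxes):
    """Anchor-relative regression targets for (x, y, z, l, w, h, theta).

    anchors, gt_boxes: (N, 7) tensors of (x, y, z, l, w, h, theta).
    """
    xa, ya, za, la, wa, ha, ta = anchors.unbind(dim=1)
    xg, yg, zg, lg, wg, hg, tg = gt_boxes.unbind(dim=1)
    da = torch.sqrt(la ** 2 + wa ** 2)          # BEV diagonal of the anchor
    targets = torch.stack([(xg - xa) / da,
                           (yg - ya) / da,
                           (zg - za) / ha,
                           torch.log(lg / la),
                           torch.log(wg / wa),
                           torch.log(hg / ha),
                           tg - ta], dim=1)
    return targets

def regression_loss(box_preds, anchors, gt_boxes):
    """Smooth-L1 loss between predicted residuals and encoded targets."""
    targets = encode_box_targets(anchors, gt_boxes)
    return F.smooth_l1_loss(box_preds, targets)
```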
4. Experiment
In this section, we assess the performance of SMA2 for LiDAR-based 3D object detection on the KITTI, Waymo, and DAIR-V2X-V datasets, complemented by efficiency analysis and ablation studies.
4.1. Datasets and Evaluation Metric
KITTI Dataset: A widely recognized benchmark for 3D object detection. The detection range in the LiDAR coordinate system is restricted according to common practice. The dataset contains 7481 training and 7518 testing LiDAR scans. The standard split is used, with 3712 samples for training and 3769 samples for validation. The purpose of this split is to ensure that frames from the same sequence are distributed as independently as possible between the training and validation sets. For each object category, the detection results are evaluated under three standard regimes: easy, moderate, and hard, defined according to object size, occlusion state, and truncation level. Figure 5 shows the distribution of the different object categories in the dataset. The Car and Pedestrian categories contain a large number of samples, while Truck, Person (sitting), Tram, and Misc have relatively few instances. However, for comparison with mainstream methods, we only report results for the Car, Pedestrian, and Cyclist categories.
Waymo Dataset: The Waymo Open dataset is one of the most extensive and high-resolution datasets available for autonomous driving research, featuring a diverse set of sensor data and complex traffic scenarios. It consists of 798 training sequences (about 158k point cloud samples) and 202 validation sequences (approximately 40k point cloud samples), each with 360-degree field-of-view annotations. Performance evaluation is carried out using metrics like mean average precision (mAP) and mean average precision with heading angle (mAPH). Predictions are classified into two levels: LEVEL_1, which includes 3D labels with more than five LiDAR points, and LEVEL_2, which includes labels with at least one LiDAR point. The detection range spans [−75.2 m, 75.2 m] along the X and Y axes, and [−2 m, 4 m] along the Z axis. Raw point clouds are voxelized with a resolution of (0.1 m, 0.1 m, 0.15 m).
DAIR-V2X-V Dataset: The DAIR-V2X-V dataset is a pioneering large-scale, multi-modal resource designed for cooperative vehicle–road autonomous driving research. It features data gathered from real-world scenarios and includes both 2D and 3D annotations. The dataset contains 22,325 image frames and an equal number of point cloud frames, with 3D annotations for 15 common road obstacles. The dataset is split into training, validation, and testing sets in a 5:2:3 ratio, with evaluation performed on the validation set. Consistent with KITTI, the evaluation uses bounding box average precision. The Vehicle class is evaluated using intersection-over-union (IoU) thresholds of [0.7, 0.5, 0.5] to account for the varying difficulty levels in the evaluation process.
4.2. Implementation Details
In line with the standard practices adopted by recent works [5,6,14,26], evaluation on the validation set is conducted using 11 recall positions to compute average precision (AP), while the KITTI test benchmark utilizes 40 recall positions. As a result, 11-point AP is used for validation and 40-point AP for testing. For the final test benchmark submission, the complete KITTI training set is employed to train the SMA2 model. The performance is assessed using two key metrics: 3D average precision (3D AP) and bird's-eye view average precision (BEV AP). The model is trained using the Adam optimizer [50] with a weight decay of 0.01 and a momentum of 0.9. The training process spans 70 epochs. The learning rate is initialized at 0.01, and a step decay strategy is employed, reducing the learning rate by a factor of 0.1 at the 40th and 60th epochs. The batch size is set to eight for all experiments. To generate the final predictions during inference, the process begins by filtering the initial 3D proposals through a non-maximum suppression (NMS) step with an IoU threshold of 0.7, retaining the top-100 candidates. These selected proposals are subsequently refined via RoI Grid Pooling, which integrates detailed keypoint features to enhance spatial representation. After refinement, a second NMS with a stricter threshold of 0.1 is performed to remove duplicate detections and produce the final outputs. In the keypoint semantic enhancement module, the S-FPS module precedes the self-attention (SA) layers, where 16,384 points are sampled from the raw point cloud as input. The initialization stage includes a voxel-based submodule consisting of two 3 × 3 × 3 3D convolutional layers, both with a stride of 1 and a padding of 1.
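For reference, the main training and inference hyperparameters listed above can be collected in a configuration dictionary such as the one below; the key names are illustrative and do not correspond to any released configuration file.

```python
# Illustrative configuration distilled from the implementation details above.
TRAIN_CONFIG = {
    "optimizer": "adam",
    "weight_decay": 0.01,
    "momentum": 0.9,
    "epochs": 70,
    "lr": 0.01,                     # step decay by 0.1 at epochs 40 and 60
    "lr_decay_epochs": [40, 60],
    "lr_decay_factor": 0.1,
    "batch_size": 8,
    "num_sampled_points": 16384,    # S-FPS input keypoints
}

INFERENCE_CONFIG = {
    "rpn_nms_iou": 0.7,             # first-stage NMS threshold
    "rpn_topk": 100,                # proposals kept for refinement
    "final_nms_iou": 0.1,           # post-refinement duplicate removal
}
```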
In our training framework, we adopt Smooth-L1 loss for bounding box regression and focal loss for classification. These losses are applied to both the region proposal network (RPN) and the region-based CNN (RCNN) stages. Empirically, we observe that the RPN loss stabilizes rapidly during early training, effectively generating high-quality object proposals. This allows the RCNN loss to focus on refining detection results with improved precision. The two losses interact in a complementary manner, and no conflicting gradients or training instability were observed. This stable convergence behavior confirms the compatibility and mutual reinforcement between the RPN and RCNN stages in our framework.
6. Inference Analysis
Table 11 summarizes the inference efficiency and detection performance of SMA2 across three benchmark datasets. The model is trained for 70 epochs on KITTI, 60 epochs on 10% of the Waymo training split, and 60 epochs on DAIR-V2X-V. All experiments are conducted using an Intel i7-7820X CPU and a single GTX 1080Ti GPU with a batch size of 1. For consistency, the number of proposals generated by the region proposal network (RPN) is fixed: K = 90 for KITTI and DAIR-V2X-V, and K = 275 for Waymo.
Compared to PV-RCNN, SMA2 achieves faster inference, reducing runtime by 18.1%, 15.5%, and 12.7% on KITTI, Waymo, and DAIR-V2X-V, respectively. In addition, it delivers improved detection accuracy, achieving gains of +2.43% AP on KITTI, +3.55% mAPH on Waymo, and +0.76% Car 3D AP on DAIR-V2X-V.
We analyze the runtime overhead of the proposed keypoint-based querying pipeline. As illustrated in Figure 7, MFAA accounts for approximately 60% of the runtime, followed by KAE (28%) and S-FPS (12%). Despite the additional computation introduced by MFAA, the overall increase in runtime is marginal (∼6.7%), which we consider a favorable trade-off for the observed ∼5% AP gain.
8. Conclusions and Discussion
In this paper, we present SMA2, a unified framework that fuses keypoint, bird’s-eye view (BEV), and sparse voxel features for enhanced 3D object detection. To begin with, we design a Keypoint Attention Enhancement module that extracts discriminative local keypoints from foreground points by applying semantic-guided sampling and self-attention to segmented foreground features. Then, to capture the interactions between keypoints and non-empty voxels, we propose a multi-feature attention aggregation module that performs keypoint-guided feature fusion across multiple representations. Experimental results on the KITTI dataset show that SMA2 achieves superior performance over existing two-stage detectors. Moreover, it exhibits strong robustness and generalization on the Waymo and DAIR-V2X-V validation sets.
While our method demonstrates strong performance overall, it still struggles with highly occluded pedestrian instances. In future work, we plan to incorporate morphological features, such as human shape priors and skeleton-based keypoints, to enhance the model's awareness of structural cues and reduce false positives in complex scenes. Meanwhile, we will explore different point cloud representations and fusion methods in multi-modal scenes.
Beyond current benchmarks, the proposed SMA2 model can be extended to a wide range of 3D perception tasks. In autonomous driving, it enables the accurate and real-time detection of vehicles, pedestrians, and cyclists. In robotics, it supports fine-grained scene understanding to facilitate safe navigation. In augmented reality (AR) or digital twin systems, SMA2 can contribute to precise environment reconstruction and interactive understanding.
To comprehensively assess the robustness of the method, it is important to consider the impact of varying LiDAR sensors and scanning conditions. Differences in LiDAR sensor configurations, scanning density, and point cloud acquisition conditions can significantly affect detection performance in real-world scenarios. Therefore, incorporating data from different LiDAR setups and evaluating how these variations influence model performance, particularly in spatial information integration during classification decisions, will be essential for improving the method’s applicability in diverse environments.
Furthermore, integrating contextual information, such as semantic context or point cloud density, using Graph Convolution Networks (GCNs) or attention-based mechanisms, could enhance the model’s ability to capture local features and improve classification stability. This is especially important for small object detection. Leveraging these additional features aims to improve detection accuracy, particularly in complex scenarios involving small targets, such as pedestrians or cyclists. These future directions will be further explored in ongoing research.