Semantic-Guided Multi-Feature Attention Aggregation Network for LiDAR-Based 3D Object Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Abstract and Introduction (Lines 1–20 and 21–50)
- [Line 3–6]: The problem of foreground-background imbalance is well-motivated, but the novelty of the proposed method is under-emphasized. It reads more like an extension than a paradigm shift.
Suggestion: Clearly state what makes SMA2 different from or better than existing multi-feature fusion methods like PV-RCNN++, SA-SSD, etc.
- [Line 12–15]: The term “more discriminative proposals” is vague.
Suggestion: Replace with a quantifiable outcome such as “higher Intersection-over-Union (IoU) detection accuracy” or “increased recall of small objects.”
- [Line 35–41]: While you acknowledge voxelization causes information loss, no metric or experimental citation is provided to support this.
Suggestion: Back up with references or quantitative examples showing degradation.
- Related Work (Lines 102–197)
- [Line 102–146]: While multiple works are cited, the section lacks synthesis and comparative insight.
Suggestion: Add a table comparing methods (PointRCNN, PV-RCNN, Part-A2, SMA2) based on voxelization, feature aggregation, and semantic segmentation usage.
- [Line 172–197]: The attention-based section is informative but fails to link directly to SMA2’s transformer-based aggregation.
Suggestion: Explicitly position how SMA2’s MFAA differs from, say, PointFormer or VoTr.
- Methodology (Lines 198–316)
- [Line 203–208]: The Spconv-Unet encoder-decoder structure is not illustrated well in Fig. 1.
Suggestion: Provide a zoomed inset or detailed block diagram.
- [Line 223–230]: Equation (1) is dimensionally correct, but lacks an intuitive explanation.
Suggestion: Add a geometric diagram explaining rotation matrix and its relation to LiDAR coordinate systems.
- [Line 237–240]: Focal loss is used, but no justification for α=0.25, γ=2 is given.
Suggestion: Discuss whether this is dataset-specific or tuned via cross-validation.
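For concreteness, α and γ are exactly the knobs in the standard binary focal loss; a minimal sketch (the function name and scalar formulation here are illustrative, not the paper's implementation) shows how the defaults α=0.25, γ=2 down-weight easy examples:

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p in (0, 1).

    alpha balances foreground vs. background classes; gamma down-weights
    well-classified (easy) examples. gamma=0 reduces to alpha-weighted
    cross-entropy.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) contributes far less loss than a hard one (p = 0.1):
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

Whether these defaults (inherited from the original focal-loss paper) remain optimal for KITTI's foreground/background ratio is the point the authors should address.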
- [Line 252–261]: The Semantic-FPS method is promising, but the computational cost is not discussed.
Suggestion: Report time or FLOPs comparison between S-FPS vs traditional FPS.
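To illustrate the trade-off being requested: a toy comparison of plain FPS against a semantics-weighted variant (a stand-in for the paper's S-FPS; the score-weighting scheme below is an assumption, not taken from the manuscript) shows both the extra per-point multiply and the shift of selected samples toward foreground points:

```python
def fps(points, k, scores=None):
    """Furthest point sampling over 3D points.

    If per-point foreground scores are given, squared distances are scaled
    by them, biasing selection toward high-score (foreground) points --
    a toy stand-in for semantic-guided FPS.
    """
    n = len(points)
    dist = [float("inf")] * n
    selected = [0]
    for _ in range(k - 1):
        last = points[selected[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if scores is not None:
                d *= scores[i]  # the only extra cost over plain FPS
            dist[i] = min(dist[i], d)
        selected.append(max(range(n), key=lambda i: dist[i]))
    return selected

# Three near foreground points plus two distant background points:
pts = [(0, 0, 0), (0.1, 0, 0), (0.2, 0, 0), (10, 0, 0), (20, 0, 0)]
plain = fps(pts, 3)                                  # grabs the far background
guided = fps(pts, 3, scores=[1, 1, 1, 1e-5, 1e-5])   # stays on foreground
```

Both variants are O(nk); the weighted form adds one multiply per point per round, so the interesting cost is the segmentation network producing the scores, which is what a FLOPs comparison should capture.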
- [Line 274–276]: Equation (6) is overloaded with nested functions (ReLU, FC, LN) — no intermediate shapes or dimensions are mentioned.
Suggestion: Specify dimensions of F, wi, and Fkey_i to avoid ambiguity during implementation.
- [Line 307–316]: Attention formula and projection matrices are correct, but a diagram showing query/key/value matrices and flow would help.
Suggestion: Enhance Fig. 3 with arrows indicating matrix operations and residual paths.
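For reference while redrawing the figure, the scaled dot-product attention that the projection matrices feed into is softmax(QKᵀ/√d)V; this pure-Python sketch with toy dimensions (no learned projections, helper names my own) only illustrates the query/key/value flow the diagram should show:

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

# A query aligned with the first key attends almost entirely to the first value:
out = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```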
- Loss Function (Lines 317–334)
- [Line 327–329]: The usage of Smooth-L1 and focal loss is appropriate, but the interaction between Lrpn and Lrcnn losses is not studied.
Suggestion: Include a plot or table showing how loss values evolve during training.
- Experiment Section (Lines 335–493)
- [Line 343–347]: No details on the balance of object classes in KITTI subsets.
Suggestion: Add a class distribution chart for better understanding of evaluation fairness.
- [Line 373–380]: The training parameters (70 epochs, learning rate) are generic.
Suggestion: Provide learning rate decay policy and batch size used.
- [Line 398–403]: SMA2 improvement over PV-RCNN is marginal (+0.41%), but not statistically validated.
Suggestion: Conduct a Wilcoxon signed-rank test or other statistical comparison across 5 runs.
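The suggested test is small enough to run exactly for five paired runs. The sketch below (pure Python, function name my own, exact enumeration of all sign patterns under the null) also makes visible a caveat: with only n = 5 pairs the smallest attainable two-sided p-value is 2/2⁵ = 0.0625, so 5 runs can never reach p < 0.05.

```python
from itertools import product

def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples
    (e.g. per-run AP of SMA2 vs. PV-RCNN). Returns (W, p)."""
    d = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(d)
    # Rank absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w = min(w_plus, sum(ranks) - w_plus)
    # Exact null distribution: every sign pattern is equally likely.
    total = sum(
        1 for signs in product([0, 1], repeat=n)
        if min((s_w := sum(r for s, r in zip(signs, ranks) if s)),
               sum(ranks) - s_w) <= w
    )
    return w, total / 2 ** n

# Hypothetical AP values over 5 runs (not from the paper):
w, p = wilcoxon_signed_rank([91.2, 91.5, 91.1, 91.4, 91.3],
                            [90.8, 90.9, 90.7, 91.0, 90.6])
```

Even when SMA2 wins all 5 runs, p = 0.0625 here, so the authors may need more runs (or a complementary test) to claim significance at the 0.05 level.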
- [Line 424–426]: Results on DAIR-V2X-V are promising but lack object-wise granularity.
Suggestion: Include per-class AP scores for all 15 object classes in supplementary or Table 3.
- Ablation Study (Lines 427–481)
- [Line 439–444]: Table 5’s insight on over-sampling (4096 keypoints) is helpful, but no visual analysis or error heatmap is provided.
Suggestion: Add a failure case visualization comparing 2048 vs 4096 samples.
- [Line 455–459]: KAE improvement over vector-pooling is minimal on pedestrians.
Suggestion: Discuss whether this is due to morphology or label noise in the KITTI dataset.
- [Line 473–479]: Voxel query vs keypoint query shows a ~5% gain, but GPU memory or runtime tradeoff is omitted.
Suggestion: Add bar chart comparing runtime per module (S-FPS, KAE, MFAA).
- Conclusion and Limitations (Lines 504–515)
- [Line 504–514]: The conclusion highlights strengths but underplays limitations, e.g., failure in occluded pedestrian detection.
Suggestion: Expand on how your future work will use morphological features (e.g., shape priors, skeleton-based keypoints) to reduce false positives.
Comments on the Quality of English Language
The quality of presentation and sentence formation is fine.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes the SMA² network, which combines keypoint-guided semantic features and sparse voxel features using novel attention modules to enhance 3D object detection through more effective feature aggregation.
- Firstly, the manuscript exhibits a high similarity index (over 37%), which must be reduced to below 10% to meet publication standards.
- In the paper, the author did not pay adequate attention to punctuation, capitalization at the beginning of sentences, and spacing. All abbreviations (except well-known ones) should be defined upon first use. The word 'we' has been used excessively throughout the manuscript and should be reduced for better clarity and formal tone.
- Add the workflow of the other sections at the end of the Introduction section. The text should precede the figures and tables within the same section or subsection. This issue needs to be resolved. Table 1 should be moved to Section 5. In Section 5, carefully verify the numbering of all tables to ensure accuracy. The explanation for Table 3 is missing and should be added to the text. Figure 4 should be moved to Section 7, and labels such as (a), (b), etc., should be added for clarity.
- How does the keypoint attention enhancement (KAE) module select local keypoints from the foreground points? What role does semantic-guided sampling play in the KAE module? What is meant by 'keypoint query' in the context of the MFAA module? What advantages does combining keypoint, BEV, and voxel features offer over using them independently? What are the potential applications of the SMA² model in 3D object detection or scene understanding?
- How large or small must an object be in order to be detectable? What are the limitations of the proposed method? Additionally, how does environmental lighting affect the detection process?
- What happens when multiple objects are located very close to each other? This point should be explained. How many objects can the proposed method detect simultaneously?
- Overall, the presentation of the results and the explanation of the methods can be improved. It is recommended that the authors enhance the structure and readability of the paper and provide clearer explanations of the results. The abstract and conclusion should also be revised. Additional experimental results, similar to Figure 4, are needed, along with a discussion on how the parameters are selected to achieve the desired outcomes using the proposed method.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper titled "Semantic-guided multi-feature attention aggregation network for LiDAR-based 3D object detection" provides a professional approach to 3D object detection.
The authors present a novel framework called SMA2 (Semantic-guided Multi-feature Attention Aggregation) that addresses a fundamental challenge in LiDAR-based 3D object detection: the imbalance between foreground and background points in point cloud data. This imbalance typically hinders detection accuracy, especially for smaller objects like pedestrians and cyclists.
The paper's main innovation lies in its two-module approach. First, the Keypoint Attention Enhancement (KAE) module uses semantic segmentation to guide the extraction of foreground points from raw point clouds. This is a significant improvement over traditional sampling methods like Furthest Point Sampling (FPS), which don't distinguish between foreground and background points. The second key contribution is the Multi-Feature Attention Aggregation (MFAA) module, which uses a transformer-based architecture with a keypoint query mechanism that effectively combines information from keypoints, Bird's Eye View (BEV) features, and multi-scale sparse voxel features.
Several advantages of the proposed approach stand out.
First, the semantic-guided sampling strategy demonstrates a superior ability to retain valuable foreground points compared to conventional methods. The experiments clearly show this translates to better detection performance, especially for smaller objects like pedestrians and cyclists, which have historically been challenging for LiDAR-based systems.
Second, the key point query mechanism in the MFAA module represents an elegant solution to a computational efficiency problem. By allowing direct retrieval of voxel features in the neighborhoods of key points without traversing all voxels, the authors achieve both better performance and lower computational cost. The ablation studies convincingly demonstrate that key point query outperforms voxel query approaches.
Third, the framework shows impressive generalizability across multiple datasets. While many papers in this field focus exclusively on KITTI, the authors validate their approach on Waymo and DAIR-V2X-V datasets as well, showing consistent improvements over baseline methods. This suggests the approach addresses fundamental challenges in LiDAR-based detection rather than merely optimizing for a specific dataset.
The experimental results are comprehensive and compelling. On the KITTI test set, SMA2 achieves state-of-the-art performance for car detection with 91.03% in the "easy" category, outperforming previous methods. The inference time analysis also shows a 12-18% reduction in runtime compared to the PV-RCNN baseline across all datasets, which is significant for real-time applications like autonomous driving.
In the article on Semantic-guided multi-feature attention aggregation network for LiDAR-based 3D object detection, there are several disadvantages of the proposed approach that should be addressed.
The authors claim that their method effectively detects objects of various categories, including pedestrians; however, the results in Table 1 show that for the "Pedestrian" category, their SMA2 method demonstrates lower performance compared to other methods such as VoPiFNet. The authors themselves acknowledge this in Section 7, noting: "In terms of detecting small objects such as pedestrians, it can be seen that some unlabeled targets and cyclist targets have been mistakenly detected". This drawback relates to the key point extraction methodology, and the authors should develop a specialized approach for small objects with fewer points.
The authors propose using Semantic-FPS for key point sampling but insufficiently explore potential computational complexity issues. Table 5 shows how performance changes with the number of key points, but it doesn't investigate the trade-off between accuracy and computational requirements. As LiDAR point density increases in real-world conditions, the computational complexity of S-FPS may become problematic. The authors should supplement Section 5.4.1 with an analysis of their method's computational efficiency at different input data densities.
In Section 3.3, the authors describe using a sigmoid function to calculate the foreground score, but don't discuss the limitations of this approach when processing complex scenes with partial object overlaps. Values close to the threshold (0.5) may lead to unstable classification in boundary cases. This drawback relates to the foreground point extraction module, and the authors should develop a more robust mechanism for scenes with significant object overlaps.
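The instability described here is easy to demonstrate. In this toy sketch (threshold and dead-band values are illustrative, not from the paper), logits on either side of zero flip the hard foreground/background decision, while a small margin defers the decision instead:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(logit, threshold=0.5, margin=0.0):
    """Hard foreground/background decision with an optional dead-band.

    With margin=0 this is plain thresholding; a positive margin returns
    'uncertain' for scores near the threshold instead of a brittle flip.
    """
    s = sigmoid(logit)
    if s > threshold + margin:
        return "foreground"
    if s < threshold - margin:
        return "background"
    return "uncertain"

# Near the threshold, a tiny logit perturbation flips the hard decision:
classify(0.01)               # foreground
classify(-0.01)              # background
# A dead-band defers the decision instead of flipping it:
classify(0.01, margin=0.05)  # uncertain
```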
The proposed MFAA (Multi-Feature Attention Aggregation) method uses a transformer architecture for feature aggregation, but the authors don't address the quadratic complexity of the attention mechanism relative to the number of input elements. Section 3.5.3 doesn't discuss how their method scales for very dense point clouds. This is especially important for real-time systems such as autonomous vehicles, where latency is critical. The authors should modify Section 3.5.3 to address the potential scalability issues of their method.
In Section 6 and Table 9, the authors discuss inference speed but don't analyze memory usage, which is an important aspect for deployment on embedded systems of autonomous vehicles with limited resources. Execution time has improved, but this analysis should be supplemented with an assessment of memory requirements, especially considering the use of multiple data representations (key points, voxels, BEV), which may require significant memory.
Finally, although the authors demonstrate performance improvements on three datasets, they don't consider the impact of different LiDAR data collection conditions (e.g., varying scanning density, different LiDAR sensors) on the proposed method. This lack of discussion relates to Section 5, and the authors should include an analysis of their method's reliability under different LiDAR configurations and scanning conditions to better assess its applicability in diverse real-world scenarios.
The article introduces a novel approach to 3D object detection using LiDAR data, which is both original and highly relevant to the field of autonomous driving and robotics.
The SMA2 network contributes significantly to the field by leveraging semantic information to guide the detection process, which is a departure from traditional methods that rely solely on geometric features.
The conclusions drawn in the article are well-supported by the experimental results presented.
The references cited in the article are relevant and appropriately support the discussions and comparisons made. They cover a wide range of existing methods and technologies in 3D object detection, providing a solid foundation for the proposed work.
Regarding the tables and figures, they are well-presented and provide clear insights into the performance and effectiveness of the SMA2 network.
In conclusion, SMA2 makes significant contributions to the field of LiDAR-based 3D object detection through its innovative attention mechanisms and semantic-guided point sampling. The performance improvements and computational efficiency gains demonstrated across multiple datasets suggest this approach could have practical value for autonomous driving systems and other robotics applications requiring accurate 3D perception.
Comments for author File: Comments.pdf
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The current version is significantly improved and can be accepted for publication. I thank the authors for their effort.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have made the necessary changes in the revised version of the manuscript. Please also check the numbering in the workflow and other sections.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf