RSCS6D: Keypoint Extraction-Based 6D Pose Estimation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript proposes RSCS6D, a novel method for estimating 6D pose from RGB-D images, focussing on keypoint cloud extraction. The core contribution lies in combining edge- and curvature-based keypoint detection with semantic segmentation to produce a compact and informative 3D keypoint cloud. The paper also introduces a pruned, lightweight version (RSCS6D_Pruned) using Network Slimming to reduce computational costs. Main contributions: introduction of the RSCS algorithm for keypoint cloud extraction from RGB-D data; integration of RGB edge detection (Sobel operator) and depth curvature information; a lightweight variant (RSCS6D_Pruned) optimised for real-time industrial applications.
The work is technically original and highly relevant to the field of computer vision, especially in robotics and AR contexts. While the use of Sobel and curvature for feature extraction is not novel per se, their combined and structured application in a 6D pose estimation framework shows clear novelty and effectiveness.
Strengths: strong justification for replacing dense point clouds with keypoint clouds; excellent results on the LineMOD benchmark dataset; the lightweight version is valuable for practical deployment. Suggestions: better highlight what fundamentally differentiates RSCS6D from G2L-Net beyond performance; briefly acknowledge related keypoint detection work outside the 6D pose domain for context.
The methodology is well structured, technically sound, and built on state-of-the-art practices (semantic segmentation via PIDNet, PCA-based curvature estimation, Kabsch for rotation, etc.). The mathematical formulation is clear, the division of tasks (segmentation, point cloud generation, pose estimation) is logical, and the use of interpretable keypoint clouds is well motivated. The curvature extraction section (3.3.3) is detailed but could be streamlined, including a brief discussion on potential failure modes (e.g., segmentation errors, sensor noise).
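For readers unfamiliar with the Kabsch step the reviewer mentions, it computes the optimal rigid rotation between two corresponded point sets via an SVD of their cross-covariance. The following is a minimal NumPy sketch of the standard algorithm, not the manuscript's implementation; the function name and test usage are illustrative only.

```python
import numpy as np

def kabsch(P, Q):
    """Optimal rotation R (det = +1) minimising sum ||R @ p_i - q_i||^2
    over corresponded point sets P, Q, each of shape (N, 3)."""
    P = P - P.mean(axis=0)            # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])        # guard against reflections
    return Vt.T @ D @ U.T
```

With noisy correspondences the same SVD solution gives the least-squares rotation, which is why it is a common final step in keypoint-based pose pipelines.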
The manuscript is generally well written, though certain sections are overly detailed and could benefit from editing for conciseness. The figures are helpful and well labelled. Suggestions: reduce redundancy in the methodological sections (especially 3.3), standardise mathematical notation formatting (e.g., equation spacing, LaTeX-like presentation), fix minor grammatical issues and formatting inconsistencies in tables and figures, include evaluations in multi-object or dynamic scenes for broader applicability, add error heatmaps or qualitative failure cases to enrich the evaluation section. The paper would benefit from minor revisions related to presentation, editorial clarity, and additional discussion of limitations and broader applicability.
Author Response
Comments 1: Suggestions: better highlight what fundamentally differentiates RSCS6D from G2L-Net beyond performance;
Response 1: We have included a dedicated comparison section in the revised manuscript (Section 3.4) to better highlight the fundamental differences between RSCS6D and G2L-Net. The main distinctions are RSCS6D's use of the PIDNet algorithm for semantic segmentation, which provides more accurate global localization, and its use of a keypoint cloud for pose estimation, which enables more efficient and accurate estimation. The revised section can be found in the last paragraph of Section 3.4 on page 11.
Comments 2: briefly acknowledge related keypoint detection work outside the 6D pose domain for context.
Response 2: We have restructured Section 2.2 into 2.2.1 and 2.2.2, supplementing related keypoint detection work outside the 6D pose domain to offer better context. The updated section is on page 3, Section 2.2.
Comments 3: The curvature extraction section (3.3.3) is detailed but could be streamlined, including a brief discussion on potential failure modes (e.g., segmentation errors, sensor noise). Suggestions: reduce redundancy in the methodological sections (especially 3.3),
Response 3: We have streamlined Section 3.3.3. The method of Curvature Calculation Using 2D Neighborhood-Based PCA is briefly introduced in the first paragraph of Section 3.3.3 on page 7. Additionally, potential failure modes, such as segmentation errors, sensor noise, and missing depth values, are discussed in the last paragraph of Section 3.3.3 on page 8.
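For context on the 2D neighborhood-based PCA curvature mentioned in this response, the common surface-variation estimate takes the smallest eigenvalue of a local covariance matrix over the eigenvalue sum. The sketch below is a generic illustration of that idea, not the manuscript's exact formulation; the function name and patch handling are assumptions.

```python
import numpy as np

def pca_curvature(neighbourhood):
    """Surface-variation curvature of a local patch of 3D points:
    smallest covariance eigenvalue over the eigenvalue sum.
    Returns 0 for a perfect plane, up to 1/3 for isotropic scatter."""
    pts = np.asarray(neighbourhood, dtype=float)
    centred = pts - pts.mean(axis=0)
    cov = centred.T @ centred / len(pts)          # 3x3 covariance
    eigvals = np.linalg.eigvalsh(cov)             # ascending order
    total = eigvals.sum()
    return 0.0 if total == 0 else eigvals[0] / total
```

High values flag strongly curved regions, which is what makes curvature a useful complement to RGB edges when selecting keypoints.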
Comments 4: standardise mathematical notation formatting (e.g., equation spacing, LaTeX-like presentation),
Response 4: We have standardized mathematical notation formatting across the manuscript, particularly in Section 3.3.4 on pages 9-10, ensuring consistent equation spacing and LaTeX-like presentation.
Comments 5: fix minor grammatical issues and formatting inconsistencies in tables and figures,
Response 5: We have standardized the formatting of all tables and figures throughout the manuscript. For instance, Figure 1 on page 3 and Table 2 on page 15 have been formatted uniformly to address minor grammatical issues and formatting inconsistencies.
Comments 6: include evaluations in multi-object or dynamic scenes for broader applicability,
Response 6: In the field of 6D pose estimation, there is a significant challenge: the lack of high-quality datasets for multi-object or dynamic scenes. This restricts algorithm evaluation and application in complex real scenarios. We call for community efforts to develop more 6D pose estimation datasets that include multi-object and dynamic interactions. This would provide researchers with richer resources to test and enhance algorithm performance, better adapting them to real-world settings.
Comments 7: add error heatmaps or qualitative failure cases to enrich the evaluation section.
Response 7: We enriched the evaluation section by examining the limitations of our algorithm, which can be found in Section 4.7 on page 20.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a 6D pose estimation framework that takes RGB-D images as input and outputs 6D point clouds. The goal is to reduce data redundancy and accelerate the convergence of the neural network.
In the introduction, the Sobel algorithm is used for edge detection. However, it would be beneficial to consider alternative methods such as the Prewitt or Canny algorithms. Additionally, did you experiment with different kernel sizes for Sobel?
Chapter 3 defines and explains the image processing steps; in Section 3.4 it would be nice to assign better names to the components that you simply call "Module".
The experiment section is well structured; did you perform any hyper-parameter tuning in order to optimize the network?
Lastly, the conclusion should be expanded to better summarize your findings.
Author Response
Comments 1: In introduction, the Sobel algorithm is used for edge detection. However, it would be beneficial to consider alternative methods such as Prewitt or Canny algorithms. Additionally, Did you experiment with different kernel size with Sobel?
Response 1: In this paper, the Sobel operator is selected for edge detection due to its high computational efficiency and easy implementation. It delivers satisfactory results on images with distinct edge features and clear textures. However, we also acknowledge the advantages of other edge detection methods, such as the Prewitt and Canny operators. During experimentation, we explored the impact of different Sobel kernel sizes, including 3×3, 5×5, and 7×7, on keypoint extraction. It is important to note that the pose measurement algorithm proposed in this paper relies on the number of keypoints; thus, the chosen keypoint extraction algorithm must ensure a sufficient quantity of feature points for the subsequent pose measurement. For an in-depth discussion of the keypoint extraction algorithm and the experimental results, please refer to the last two paragraphs of Section 3.3.1 on page 7.
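As a minimal illustration of the kernel-size discussion, the sketch below applies the standard 3×3 Sobel masks and thresholds the gradient magnitude. The helper, threshold, and function names are illustrative assumptions, not the paper's implementation; larger masks (5×5, 7×7) would trade edge localisation for noise robustness.

```python
import numpy as np

# Standard 3x3 Sobel masks for horizontal and vertical gradients.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def correlate2d(img, kernel):
    """Valid-mode 2D cross-correlation (kernel not flipped; the flip
    only changes the gradient's sign, not its magnitude)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_edges(img, threshold=1.0):
    """Boolean edge mask from the Sobel gradient magnitude."""
    gx = correlate2d(img, SOBEL_X)
    gy = correlate2d(img, SOBEL_Y)
    return np.hypot(gx, gy) > threshold
```

The resulting mask marks the pixels that an edge-based keypoint selector would keep; the threshold directly controls how many feature points survive.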
Comments 2: Chapter 3 defines and explains the image processing steps; in Section 3.4 it would be nice to assign better names to the components that you simply call "Module".
Response 2: In the algorithm proposed in this paper, we have assigned more appropriate names to the components in the pose estimation phase. Specifically, in Section 3.4 on page 11, we divided the RSCS 6D pose estimation part into the Key Vector Cluster Module, Rotation Prediction Module, and Rotation Residual Prediction Module. This naming more clearly reflects the function and role of each module, facilitating a better understanding of the entire pose estimation process.
Comments 3: The experiment section is well structured, did you perform any hyper-parameters tuning in order to optimize the network?
Response 3: In the experiment, we did adjust the network's hyperparameters to boost its performance, enhancing the accuracy and efficiency of pose measurement. The hyperparameter selection method is detailed in the third paragraph of Section 4.1 on page 14. We drew on hyperparameter settings from other established methods in the field and made targeted adjustments based on our network architecture and task objectives. This ensured reasonable hyperparameter settings, enabling the network to achieve the desired pose measurement accuracy while maintaining high computational efficiency.
Comments 4: Lastly the conclusion should be better expanded to summarize your findings.
Response 4: To better summarize our findings, we have expanded the conclusion section, which can be found in Chapter 5 on page 20.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents a novel method for 6D object pose estimation from RGB-D images, named RSCS6D. The approach introduces a cloud extraction algorithm (RSCS) that combines edge and curvature-based keypoints derived from RGB and depth images. This is a technically sound and well-motivated paper that contributes meaningfully to the 6D pose estimation literature. It balances accuracy, efficiency, and interpretability, addressing key challenges in both research and industrial contexts. The inclusion of a lightweight model variant makes it particularly relevant for embedded applications.
I have several questions that I would like the author to answer:
- As the authors said, all experiments are conducted on the LineMOD dataset. Can the authors discuss why they chose this dataset, and whether they considered other benchmarks such as YCB-Video, T-LESS, or BOP when making the decision?
- Although the proposed method uses RGB-D input, can the authors discuss how they handle noise or missing depth values, which are common in real-world applications?
- Could the authors include a direct comparison in the evaluation table with Transformer-based methods like DFTr?
- The paper cites Transformer-based methods like DFTr, but does not include a direct comparison in the evaluation table.
Author Response
Comments 1: As the authors said, all experiments are conducted on the LineMOD dataset. Can the authors discuss why they chose this dataset, and whether they considered other benchmarks such as YCB-Video, T-LESS, or BOP when making the decision?
Response 1: We chose the LineMOD dataset for experiments due to its wide application in 6D pose estimation, rich object textures, and diverse poses, which facilitate comparison and validation. We considered other datasets like YCB-Video, T-LESS, and BOP but found LineMOD more suitable for our study. In the future, we plan to test our algorithm on more complex datasets to verify its generalization ability.
Comments 2: Although the proposed method uses RGB-D input, can the authors discuss how they handle noise or missing depth values, which are common in real-world applications?
Response 2: Noise and missing depth values are indeed common in real-world RGB-D data. We have added a discussion of these potential failure modes, including segmentation errors, sensor noise, and missing depth values, in the last paragraph of Section 3.3.3 on page 8.
Comments 3: Could the authors include a direct comparison in the evaluation table with Transformer-based methods like DFTr? The paper cites Transformer-based methods like DFTr, but does not include a direct comparison in the evaluation table.
Response 3: Including Transformer-based methods like DFTr in the evaluation table would indeed add value by offering readers a clearer comparative view. Here is an updated table with DFTr included:
On the LineMOD dataset, DFTr achieves remarkable accuracy with a mean ADD of 99.8%, reaching 100% for nine objects. This highlights its superior pose measurement precision. Nevertheless, DFTr's speed (21 FPS) pales in comparison with RSCS6D's 40 FPS, and DFTr demands high-end hardware, limiting its use in real-time industrial applications. By comparison, RSCS6D is a lightweight, industrial-oriented method running at 40 FPS, making it more suitable for real-time industrial scenarios. Given the high hardware requirements of Transformer-based methods, RSCS6D shows clear practical advantages. This direct comparison enables readers to grasp RSCS6D's edge in speed and applicability, as well as DFTr's accuracy superiority, thus aiding method-selection decisions.
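For context, the ADD figures quoted above follow the standard Average Distance of Model points metric: the model is transformed by both the ground-truth and the predicted pose, and the mean point-to-point distance is taken. A minimal sketch under that standard definition (the function name and sample points are illustrative):

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean Euclidean distance between the model points transformed
    by the ground-truth pose and by the predicted pose.
    model_pts: (N, 3); R_*: (3, 3) rotations; t_*: (3,) translations."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return float(np.linalg.norm(gt - pred, axis=1).mean())
```

On LineMOD a pose is commonly counted correct when ADD falls below 10% of the object diameter; the per-object accuracy percentages are the fraction of test frames passing that check.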
Author Response File: Author Response.pdf