Enhancing Intelligent Robot Perception with a Zero-Shot Detection Framework for Corner Casting
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study proposes a zero-shot detection framework based on GroundingDINO to enhance intelligent robot perception in shipping container corner casting detection, validated through comparative experiments with SSD models. The innovative integration of Referring Expression Comprehension (REC) and Additional Feature Keywords (AFK) reduces computational overhead and enables real-time deployment, demonstrating practical value. However, further elaboration is needed on experimental details (e.g., hardware performance metrics), depth of statistical analysis, and comprehensiveness of literature coverage.
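As background for the zero-shot setup summarised above, the sketch below shows one common way to prompt GroundingDINO with free text such as "corner casting hole", using the Hugging Face transformers port. It is a minimal illustration of zero-shot prompting under assumed settings (model ID, image path, prompt wording, default thresholds), not the authors' actual pipeline.

```python
# Minimal zero-shot detection sketch with GroundingDINO via Hugging Face transformers.
# Model ID, image path, and prompt wording are illustrative assumptions,
# not the configuration used in the manuscript.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("container.jpg")                   # hypothetical corner-casting photograph
prompt = "corner casting hole. shipping container."   # phrases are lowercase and period-separated

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections with matched phrases.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    target_sizes=[image.size[::-1]],
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{label}: {score:.2f} at {box.tolist()}")
```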
1. How robust is GroundingDINO in complex environments (e.g., varying lighting, occlusion)? Were extreme scenarios tested to evaluate detection stability?
2. Critical metrics such as real-time processing speed and memory consumption on the Raspberry Pi are not provided. Without them, how is the framework's suitability for resource-constrained edge computing validated?
3. How do the AFK keywords (e.g., "extreme," "only") specifically enhance detection accuracy? Is there quantitative analysis to support their effectiveness?
4. Why were only SSD models selected for comparison? Were other zero-shot models (e.g., CLIP) or lightweight architectures (e.g., YOLO) considered to strengthen the conclusions?
5. While Cohen's d = 2.2 indicates a very large effect size, how does this translate into practical improvements (e.g., detection success rate)? Is there alignment with industry standards or operational requirements?
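As a point of reference for comment 5, Cohen's d can be recomputed directly from the per-image detection scores of the two pipelines. The sketch below uses the standard pooled-standard-deviation form; the score lists are hypothetical placeholders, not values from the manuscript.

```python
# Minimal sketch of the effect-size calculation referenced in comment 5.
# The score values below are hypothetical placeholders, not data from the paper.
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (standard two-sample form)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical per-image detection scores for the two pipelines.
grounding_dino_scores = [0.82, 0.79, 0.88, 0.84, 0.81, 0.86]
ssd_scores            = [0.61, 0.58, 0.66, 0.63, 0.60, 0.64]

print(f"Cohen's d = {cohens_d(grounding_dino_scores, ssd_scores):.2f}")
```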
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript describes a zero-shot object detection framework for corner casting detection in shipping container operations. In particular, GroundingDINO with Referring Expression Comprehension (REC) and Additional Feature Keywords (AFK) is used and is compared against three Supervised Object Detection (SSD) methods, outperforming them.
Some issues might be considered:
1. In Section 2, it is stated that GroundingDINO is used as a zero-shot detection (ZSD) method. Figure 1, for example, shows the input together with the accompanying input text "Corner hole Container". The corresponding output in Figure 3, however, has two labels, "Shipping container" and "Corner casting hole".
Given this, it might be clarified how GroundingDINO was (pre)trained, if any such (pre)training was done. In particular, is GroundingDINO able to understand "Corner casting hole" out of the box?
2. In Figure 4, it is stated that the addition of AFK can turn unsuccessful decisions into successful ones. Larger versions of the illustrative figures (Figures 7 to 14) might be presented as supplementary material.
3. In Section 4.1, the container photograph dataset is introduced as containing 482 images. It might be clarified how the photographs were annotated. Example images with ground-truth annotations should also be presented.
4. Section 3.2 for mAP might be Section 4.4.
5. For the SSD methods compared against in Tables 7 to 12, it is assumed that training of the SSD models had to be performed. The training dataset and methodology used, as well as the validation set used for all methods (GroundingDINO and SSD), should be clarified.
6. In Section 6, there is another validation on COCO2017, which does not seem specific to shipping containers. The relevance of this section should be clarified further.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
We thank the authors for largely addressing our previous comments.
However, it is not clear what is meant by "We have carefully considered the suggestion to move the mAP evaluation to Section 4.4, but we believe it is more appropriate to retain it in Section 3.2". In the manuscript, Section 4.2 on page 13 is "Hyperparameter", followed by Section 3.2 "mean Average Precision (mAP)" and then Section 4.5 "Detection scores" on page 14. Should Sections 4.3 and 4.4 not follow Section 4.2?
Author Response
We apologize for the oversight. The mAP evaluation section was incorrectly labeled as Section 3.2. It should correctly continue the numbering and be labeled as Section 4.3.
We have amended the manuscript accordingly to reflect the correct section numbering. Thank you for highlighting this important correction.