Intelligent Human–Robot Interaction Assistant for Collaborative Robots
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The research combines and presents existing technologies (YOLOv11, Leap Motion, RANSAC, ICP).
The authors do not present significant contributions; alternatives already exist in the field, such as ABB Wizard Easy Programming or Dex-Net 4.0, which are more advanced and flexible.
On the experimental side there are technical limitations and reduced practicality:
- There is over-reliance on specific hardware (Leap Motion, projector, Intel RealSense). With the described configuration, the applicability in industry is limited.
- Performance is below current standards: YOLOv11 achieves only 96% accuracy, which is below the level of models such as Faster R-CNN, and the grasping algorithm (PointNetGPD) has only an 80% success rate, which is not acceptable in industry.
- There are also problems with object pose detection: RANSAC and ICP can produce errors for symmetric objects. An adaptive learning mechanism is needed to bring flexibility to the system.
The methodology presented in the paper is weak and the validation is insufficient:
- The dataset is somewhat insufficient (only 261 images are analyzed), and there is no real-world testing (with variable illumination or background noise).
- A comparison with other advanced systems is clearly missing from the paper.
Overall, the paper does not demonstrate a clear innovation. The research has major technical limitations and the experimental validation is weak.
Without significant improvements, it cannot be accepted for publication.
Author Response
Comment 1: The research combines and presents existing technologies (YOLOv11, Leap Motion, RANSAC, ICP). The authors do not present significant contributions; alternatives already exist in the field, such as ABB Wizard Easy Programming or Dex-Net 4.0, which are more advanced and flexible. On the experimental side there are technical limitations and reduced practicality: there is over-reliance on specific hardware (Leap Motion, projector, Intel RealSense), and with the described configuration the applicability in industry is limited.
Response 1: Thank you for your detailed and constructive feedback. The system does not rely exclusively on specific hardware components. The required functionalities—application display, motion control, and environmental scanning—can be implemented using alternative equipment with similar characteristics. For example, motion control can be achieved using a camera or another sensor, provided it includes an appropriate motion recognition algorithm supplied by the manufacturer. The main components can be replaced by equivalent devices with comparable specifications.
Comment 2: Performance is below current standards: YOLOv11 achieves only 96% accuracy, which is below the level of models such as Faster R-CNN, and the grasping algorithm (PointNetGPD) has only an 80% success rate, which is not acceptable in industry.
Response 2: The accuracy achieved by the proposed system meets the primary requirements defined for this research. Industrial requirements vary depending on the specific application and type of parts being handled by the robot. The proposed system was developed and tested in SmartTechLab, an environment designed for educational and experimental research, aimed at training students in smart enterprise solutions. Future improvements, including enhanced accuracy and adaptability, are outlined in the Discussion section as part of our research roadmap.
Comment 3: There are also problems with object pose detection: RANSAC and ICP can produce errors for symmetric objects. An adaptive learning mechanism is needed to bring flexibility to the system.
Response 3: The limitations of RANSAC and ICP in handling symmetric objects have been acknowledged and discussed in the Discussion section (lines 572-581).
Comment 4: The dataset is somewhat insufficient (only 261 images are analyzed), and there is no real-world testing (with variable illumination or background noise).
Response 4: Testing under additional scenarios would produce further material covering another part of the research. Your recommendation will be taken into account during the ongoing development of the system.
Comment 5: A comparison with other advanced systems is clearly missing from the paper.
Response 5: Thank you for your recommendation. The main comparisons are introduced in the Introduction section, and additional points are discussed in the Discussion section.
Reviewer 2 Report
Comments and Suggestions for Authors
The study presents the Collaborative Robotics Assistant (CobRA) system, an intelligent HRI interface that utilizes machine vision and convolutional neural networks for real-time object detection and interactively visualizes the control interface using a projector, simplifying the programming of collaborative robots in dynamic industrial environments.
I have the following comments and suggestions for improving the article:
Keywords
Please revise the keywords to include more relevant terms that will facilitate easier retrieval of the article in databases. For example, "industrial growth" and "process innovation" are too general; they should be more closely aligned with the content of the paper.
Introduction
- Line 58: Sources are missing. Is this just the authors' opinion, or is it a generally accepted fact? If the latter, please provide a citation.
- There is inconsistency in terminology: sometimes "collaborative robot" is used, while in other places "cobot" appears. Please unify the terminology or provide a definition explaining why the abbreviation "cobot" is used and its relevance.
- A clear definition of safety aspects according to ISO 15066 is missing. Since this standard is crucial for collaborative robots, I recommend including it and briefly outlining its key requirements.
Research Objectives and Contributions
In the section:
The approach proposed in this paper aims to overcome the current limitations by integrating neural network technologies and machine vision systems into collaborative robot control systems. The development of such solutions will increase the level of automation and simplify the introduction of cobots into enterprises.
The authors describe the proposed approach but do not specify the expected outcomes and contributions of their research. What are the concrete advantages compared to other systems?
Methodology
- There are formatting inconsistencies in spacing between sections.
- If the methodology consists of multiple steps, it would be beneficial to illustrate them graphically, such as with a flowchart.
- A detailed description of the materials and technologies used is missing. To ensure clarity regarding the system’s precision and performance, please specify the components and algorithms applied.
Formatting of Equations and Figures
- The equations are not properly arranged and need better formatting.
- Section 2.2.2 is titled "yollow"—this needs correction, and the topic of this subsection should be clearly defined.
- A system diagram is missing—how are the different system components (camera, projector, robot) connected? While the algorithm is described, it would be beneficial for the reader to see a visual representation.
Calibration and Technical Specifications
- There is no image of the calibration process—what markers are used? Is this calibration method sufficiently accurate?
- Technical specifications for the camera (RGB-D camera, Leap Motion) are missing—what is the resolution, accuracy, and other parameters?
Results and Graphs
- The graphs are blurry and difficult to read—they should be redesigned to be clearer and more visually informative.
- If the authors claim that the solution is "cost-effective and versatile," this needs to be supported with data or at least referenced from other research studies.
Discussion
The authors developed a system that can recognise a wide range of gestures, allowing the operator to intuitively send commands to the robot with low latency. However, the proposed gesture and voice control system is limited to the robot's movement only, and the proposed system does not consider the possibility of controlling the robot's end effector. Therefore, the proposed system cannot completely replace traditional teaching pendants.
- This is an important limitation—but is this supported by research, or is it just the authors' observation?
- It would be beneficial to compare this approach with other existing systems that address similar challenges.
Conclusion
- The article requires revision in terms of both content and formatting.
- Missing elements: a better graphical representation of the methodology, a structured explanation of the motivation, a clear definition of research objectives, and expected outcomes.
- The literature review is too limited (only 17 sources)—I recommend expanding the section on the analysis of the current state of research and existing literature.
Suggested References for Expansion
- Camera-Based Method for Identification of the Layout of a Robotic Workcell
- Vision Systems for a UR5 Cobot on a Quality Control Robotic Station
Final Recommendation
Overall, I recommend revising the article to improve its clarity, technical accuracy, and readability.
Author Response
Comment 1: The study presents the Collaborative Robotics Assistant (CobRA) system, an intelligent HRI interface that utilizes machine vision and convolutional neural networks for real-time object detection and interactively visualizes the control interface using a projector, simplifying the programming of collaborative robots in dynamic industrial environments.
I have the following comments and suggestions for improving the article:
Please revise the keywords to include more relevant terms that will facilitate easier retrieval of the article in databases. For example, "industrial growth" and "process innovation" are too general; they should be more closely aligned with the content of the paper.
Response 1: Thank you for your valuable comment. The selected keywords classify the research according to the Sustainable Development Goals (SDGs).
Comment 2: Line 58: Sources are missing. Is this just the authors' opinion, or is it a generally accepted fact? If the latter, please provide a citation.
Response 2: The sentence on lines 58-60 summarizes the information from the reviewed articles. However, the original citations were inadvertently removed during the revision process. We have now restored the references at the end of the sentence.
Comment 3: There is inconsistency in terminology: sometimes "collaborative robot" is used, while in other places "cobot" appears. Please unify the terminology or provide a definition explaining why the abbreviation "cobot" is used and its relevance.
Response 3: The definition of "cobots" is provided in the first sentence of the first paragraph: “In recent years, there has been a rapid growth in the use of collaborative robots (cobots) in industry.” The use of terms was unified and changed throughout the text.
Comment 4: A clear definition of safety aspects according to ISO 15066 is missing. Since this standard is crucial for collaborative robots, I recommend including it and briefly outlining its key requirements.
Response 4: Thank you for your suggestion. We have incorporated a definition of safety aspects according to ISO/TS 15066 in lines 133-138. Specifically, we have outlined how the ABB YuMi collaborative robot complies with this standard by implementing passive risk reduction methods, such as lightweight links, limited motor power, and using soft materials to minimize the risk of injury in case of a collision.
Comment 5: In the section: The approach proposed in this paper aims to overcome the current limitations by integrating neural network technologies and machine vision systems into collaborative robot control systems. The development of such solutions will increase the level of automation and simplify the introduction of cobots into enterprises. The authors describe the proposed approach but do not specify the expected outcomes and contributions of their research. What are the concrete advantages compared to other systems?
Response 5: Thank you for your comment. The expected outcomes and contributions of our research, including the concrete advantages of the proposed system compared to existing solutions, are thoroughly discussed in Section 4: Discussion.
Comment 6: There are formatting inconsistencies in spacing between sections.
Response 6: Thank you for your observation. We have carefully reviewed the manuscript and corrected all formatting inconsistencies in the spacing between sections.
Comment 7: If the methodology consists of multiple steps, it would be beneficial to illustrate them graphically, such as with a flowchart.
Response 7: Thank you for your suggestion. All methodology steps are already illustrated in the manuscript. Adding an additional block diagram would not provide further value and would only clutter the text unnecessarily.
Comment 8: A detailed description of the materials and technologies used is missing. To ensure clarity regarding the system’s precision and performance, please specify the components and algorithms applied.
Response 8: Thank you for your comment. A description of the materials used is provided in Figure 1 and the accompanying text. The software components are detailed in Section 2, and the technical parameters of the components have been added in Section 2.
Comment 9: The equations are not properly arranged and need better formatting.
Response 9: Thank you for your observation. All equations have been reviewed and reformatted according to the template guidelines.
Comment 10: Section 2.2.2 is titled "yollow"—this needs correction, and the topic of this subsection should be clearly defined.
Response 10: Thank you for your observation. The title of Section 2.2.2 has been corrected, and the name of the neural network has been updated to YOLOv11.
Comment 11: A system diagram is missing—how are the different system components (camera, projector, robot) connected? While the algorithm is described, it would be beneficial for the reader to see a visual representation.
Response 11: Thank you for your suggestion. However, in this case, a system diagram would not provide additional value, as all components (camera, projector, robot) are directly connected to the computer via cables. The system's operational algorithm is illustrated in Figure 2, and the functioning algorithms of each module are provided in subsequent figures throughout the text. Adding more diagrams would unnecessarily overload the manuscript with redundant visuals.
Comment 12: There is no image of the calibration process—what markers are used? Is this calibration method sufficiently accurate?
Response 12: Thank you for your insightful comment. ArUco markers were chosen as a reliable and accurate method for determining the position of objects relative to the camera (robot and projector); they are widely used in scientific and industrial applications, as well as in virtual and augmented reality. ArUco markers are popular because their use requires no prior preparation of datasets and is not as sensitive to illumination quality. The study [1] aimed to evaluate the recognition accuracy of ArUco markers. The authors developed a measurement bench that used an industrial camera to detect the marker position, with Laser Doppler Vibrometry used for verification. They determined that recognition could achieve high accuracy (up to 0.009 mm) when high dilation and high fill percentages were used.
The second study [2] compared marker-based methods and keypoint-based methods (via neural networks) for object pose recognition. Among the key advantages of ArUco markers is that there is no need to create a dataset, which for high accuracy should be based not only on synthetic images but also on real photos; moreover, running a neural network requires powerful hardware, while ArUco marker recognition is far less resource intensive. In addition, the ArUco method showed high accuracy for displacement along the z-axis (by a significant margin), but was slightly inferior to YOLOv8 along the X and Y axes (Figure 1) [2].
After the marker is detected, the pose of the objects is determined using the solvePnP function available in the OpenCV library. The internal parameters of the camera, such as the focal length and the offset from the centre, can be obtained through the Intel RealSense SDK in the case of an Intel RealSense camera.
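A minimal sketch of this marker-based pose estimation step is shown below, assuming Python with OpenCV 4.7+ and its aruco module; the marker dictionary, marker size, image file, and intrinsic values are hypothetical placeholders and are not taken from the manuscript.

```python
import cv2
import numpy as np

# Camera intrinsics (fx, fy, cx, cy). For an Intel RealSense camera these can
# be read from the RealSense SDK; the numbers below are placeholders only.
camera_matrix = np.array([[615.0,   0.0, 320.0],
                          [  0.0, 615.0, 240.0],
                          [  0.0,   0.0,   1.0]])
dist_coeffs = np.zeros(5)          # assume negligible lens distortion

marker_length = 0.05               # marker side length in metres (assumed)
# 3D corners of a single marker in its own frame (top-left, then clockwise)
obj_points = np.array([[-marker_length / 2,  marker_length / 2, 0.0],
                       [ marker_length / 2,  marker_length / 2, 0.0],
                       [ marker_length / 2, -marker_length / 2, 0.0],
                       [-marker_length / 2, -marker_length / 2, 0.0]])

image = cv2.imread("scene.png")    # hypothetical input frame
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
corners, ids, _ = detector.detectMarkers(image)

if ids is not None:
    # Estimate the pose of the first detected marker relative to the camera
    ok, rvec, tvec = cv2.solvePnP(obj_points,
                                  corners[0].reshape(-1, 2).astype(np.float64),
                                  camera_matrix, dist_coeffs)
    if ok:
        R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation matrix of the marker
        print("marker id:", ids[0][0], "translation (m):", tvec.ravel())
```

The resulting rvec and tvec describe the marker pose in the camera frame; in a setup like the one described, they would subsequently be transformed into the robot or projector coordinate system.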
The projector calibration process is shown in Figure 10 [3]. The calibration of the robot's coordinate system relative to the camera was performed using an ArUco grid board, following the methodology presented in [3].
[1] https://iopscience.iop.org/article/10.1088/2631-8695/ac1fc7/meta
[2] https://ieeexplore.ieee.org/abstract/document/10711641
[3] https://www.mdpi.com/1424-8220/20/17/4825
Comment 13: Technical specifications for the camera (RGB-D camera, Leap Motion) are missing—what is the resolution, accuracy, and other parameters?
Response 13: Thank you for your comment. The technical specifications for the camera are provided in lines 145-151, and the Leap Motion parameters have been added in lines 155-156.
Comment 14: The graphs are blurry and difficult to read—they should be redesigned to be clearer and more visually informative.
Response 14: Thank you for your comment. Due to the formatting constraints of the template, the size of the graphs is limited, which may affect their clarity. However, high-resolution versions of all figures are provided as supplementary files, ensuring that all details remain fully readable.
Comment 15: If the authors claim that the solution is "cost-effective and versatile," this needs to be supported with data or at least referenced from other research studies.
Response 15: Thank you for your comment. The references supporting this claim have been added to the text in line 544.
Comment 16: The authors developed a system that can recognise a wide range of gestures, allowing the operator to intuitively send commands to the robot with low latency. However, the proposed gesture and voice control system is limited to the robot's movement only, and the proposed system does not consider the possibility of controlling the robot's end effector. Therefore, the proposed system cannot completely replace traditional teaching pendants. This is an important limitation—but is this supported by research, or is it just the authors' observation? It would be beneficial to compare this approach with other existing systems that address similar challenges.
Response 16: The authors appreciate your comment. Comparing our approach with existing systems is challenging because our solution introduces a novel method that differs from conventional gesture and voice control approaches. While traditional systems primarily focus on robot movement, our method integrates a novel interaction model that enhances both usability and adaptability.
Comment 17: The article requires revision in terms of both content and formatting. Missing elements: a better graphical representation of the methodology, a structured explanation of the motivation, a clear definition of research objectives, and expected outcomes.
Response 17: We have reviewed and revised the article to improve both content and formatting. The methodology is already illustrated in Figures 2-4 throughout the text, and an additional graphical representation would not provide further value; it would only introduce redundancy. The motivation, research objectives, and expected outcomes have been clarified and structured more explicitly in the revised version. All changes in the revised manuscript are marked in red.
Comment 18: The literature review is too limited (only 17 sources)—I recommend expanding the section on the analysis of the current state of research and existing literature.
Response 18: The authors appreciate your recommendation. The literature review has been expanded, now including 27 sources, and providing a more comprehensive analysis of the current state of research.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors,
You can find my report in the attached PDF file.
Comments for author File: Comments.pdf
Author Response
Comment 1: I suggest using different symbols for radial and tangential distortions in Equation (3-4) and (5-6). For example xr,distorted and xt,distorted.
Response 1: Thank you for your insightful suggestion. We have revised the notation for radial and tangential distortions. These modifications have been implemented in Equations (2), (3), (5), and (6).
Comment 2: Equation (7). Could you describe the meanings of R_robot and T_robot in the text? Are R_robot and T_robot scalars, vectors, or matrices?
Response 2: Thank you for your comment. The meanings of R_robot and T_robot have been explicitly described in lines 244-247.
Comment 3: Equation 8. Are the variables T in the second part of the equation matrices or vectors? Additionally, could you explain the meaning of the multiplier “·”? Does it represent the inner product of matrices?
Response 3: Thank you for your comment. The transformation parameters are calculated and combined into a single homogeneous transformation matrix based on all collected data. The "·" symbol denotes matrix multiplication, not an inner product. This clarification has been added in lines 255-258.
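For readers following the notation, the generic form of a homogeneous transformation and its composition by matrix multiplication is sketched below; this illustrates the standard convention referred to in the response and is not a reproduction of Equation (8) from the manuscript.

```latex
T = \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 4},
\qquad
T_{A \to C} = T_{B \to C} \cdot T_{A \to B},
```

where R is a 3 × 3 rotation matrix, t is a 3 × 1 translation vector, and "·" denotes ordinary matrix multiplication.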
Comment 4: Figure 4. At the end of the workflow, there is a 'Start.' You can use 'End,' or after the 'output final transformation,' you can loop back to 'Start.'
Response 4: Thank you for your suggestion. The label "Start" at the end of the workflow in Figure 4 has been changed to "Stop". The updated figure can be found in line 339.
Comment 5: Equation 12. Please explain the term W_conv.
Response 5: Thank you for your comment. The explanation of W_conv has been provided in lines 355-357, where it is defined as a convolutional filter in a neural network.
Comment 6: Line 371: Please correct ‘iou’ with ‘IoU’ (Intersection over Union)
Response 6: Thank you for your observation. The term "iou" has been corrected to "IoU" (Intersection over Union) in lines 382-383.
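For completeness, the standard definition of IoU is given below; this generic formula is provided for the reader's convenience and is not quoted from the manuscript.

```latex
\mathrm{IoU}(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
```

where A and B denote the predicted and ground-truth bounding-box regions, respectively.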
Comment 7: "The loss function measures the difference between predicted and actual grasp classifications" Could you please include an equation that describes the relationship mentioned above? The values along the Y-axis seem somewhat unclear.
Response 7: Thank you for your suggestion. A new loss curve has been provided under Figure 15b. The detailed mathematical description of the relationship would require additional pages and significantly shift the focus of the article. However, in brief, we applied the Binary Cross-Entropy (BCE) Loss function in our implementation, ensuring accurate measurement representation in the graph. For further details on BCE Loss, please refer to the official PyTorch documentation: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html.
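As an illustration of the loss referred to above, a minimal PyTorch sketch using torch.nn.BCELoss is given below; the tensor values and variable names are hypothetical and are not taken from the authors' training code.

```python
import torch
import torch.nn as nn

# Binary Cross-Entropy between predicted grasp-success probabilities and labels.
criterion = nn.BCELoss()

# Hypothetical sigmoid outputs of a grasp classifier, in [0, 1]
predictions = torch.tensor([0.92, 0.15, 0.78, 0.05])
# Hypothetical ground-truth labels (1 = successful grasp, 0 = failed grasp)
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss = criterion(predictions, targets)   # mean BCE over the batch
print(f"BCE loss: {loss.item():.4f}")
```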
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I would like to thank you for the exemplary answers; the points I raised have been incorporated, and I have no further comments on the publication. I therefore accept this contribution as a reviewer.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors,
The manuscript has been improved. I do not have further comments.