Teleoperation System for Service Robots Using a Virtual Reality Headset and 3D Pose Estimation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This work presents a well-structured and timely contribution to immersive teleoperation for service robots, integrating consumer-grade VR with vision-based 3D pose estimation in a practical pipeline. The approach is conceptually clear and methodically described, offering a reproducible low-cost framework that bridges human motion capture and robotic actuation. While the core idea of coupling MediaPipe with an RGB-D sensor for real-time joint mapping is not entirely novel, the integration within a complete VR telepresence system—and its validation on a physical service robot—adds tangible value to the field. The writing is generally coherent and follows academic conventions, though certain technical sections could benefit from tighter articulation and a more critical discussion of limitations.
Major
1-Axial rotation estimation remains a well-identified weakness in vision-only pipelines, yet the analysis here stops short of quantifying error margins or proposing immediate mitigations beyond future IMU integration.
2-The workaround for head pose estimation—using a colored patch on the VR headset—is clever but inherently limited in robustness. The dependence on consistent lighting and color segmentation is not thoroughly evaluated; a short discussion on how this method performs under different environmental conditions would help readers assess its practical applicability.
3-Experimental validation, while demonstrative, lacks systematic metrics such as task completion time, accuracy rates, or user feedback. Including a simple quantitative evaluation (e.g., success rate in grasping trials, latency measurements) would substantiate the qualitative findings and provide a clearer benchmark for future improvements.
Minor
1-Formula formatting is inconsistent in places (for instance, Equation (2)).
2-Several acronyms are used without full form at first mention (e.g., HSV on page 10 appears directly as “HSV mask”; MCP on page 8 is introduced only within parentheses). Expanding these upon first use would improve readability.
Author Response
Major
Comments 1: Axial rotation estimation remains a well-identified weakness in vision-only pipelines, yet the analysis here stops short of quantifying error margins or proposing immediate mitigations beyond future IMU integration.
Response 1: We agree that axial rotation estimation is a critical limitation of vision-based pipelines. To address this, a dedicated subsection entitled “Limitations of Vision-Based Axial Rotation Estimation” was added to the Materials and Methods section. This subsection provides a systematic analysis of elbow and wrist yaw estimation, identifies failure modes and operating conditions, and clarifies why task-level evaluation was adopted instead of per-joint angular error metrics. The discussion also motivates IMU integration as a concrete extension.
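For illustration, the flexion angle at a joint can be recovered from three 3D landmarks, and the sketch below also makes the axial ambiguity concrete: rotating the forearm about its own axis leaves all three points unchanged, so the angle is blind to yaw. This is a minimal sketch; the function name and landmark values are illustrative, not the manuscript's implementation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Flexion angle (radians) at joint b formed by 3D landmarks a-b-c."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos_t, -1.0, 1.0)))

# Example: shoulder-elbow-wrist landmarks forming a right angle at the elbow.
# Any axial (yaw) rotation of the forearm about the elbow-wrist axis leaves
# these three points collinear with themselves, hence unobservable here.
shoulder, elbow, wrist = [0, 1, 0], [0, 0, 0], [1, 0, 0]
angle = joint_angle(shoulder, elbow, wrist)  # ~pi/2
```

This is why flexion/extension is reliably estimated from landmarks alone, while axial rotation would need an extra cue such as an IMU.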
Comments 2: The workaround for head pose estimation—using a colored patch on the VR headset—is clever but inherently limited in robustness. The dependence on consistent lighting and color segmentation is not thoroughly evaluated; a short discussion on how this method performs under different environmental conditions would help readers assess its practical applicability.
Response 2: We acknowledge the reviewer’s concern regarding the robustness of the color-based head pose estimation method. A new paragraph was added that explicitly discusses the operating conditions, environmental dependencies, and limitations of the proposed workaround, clarifying its intended use as a low-cost and practical solution rather than a general-purpose head tracking method.
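The lighting dependence can be made concrete with a minimal sketch of hue-based segmentation. The manuscript's pipeline presumably thresholds an HSV image (e.g., with OpenCV's `inRange`); the stdlib-only version below, with its `hsv_mask` function and threshold values, is purely illustrative. Because lighting changes mainly shift saturation and value rather than hue, a patch can drop below the `s_min`/`v_min` thresholds under dim or washed-out illumination, which is exactly the failure mode discussed.

```python
import colorsys
import numpy as np

def hsv_mask(rgb_image, h_range, s_min=0.4, v_min=0.2):
    """Binary mask of pixels whose hue lies in h_range (0-1 scale)."""
    rows, cols, _ = rgb_image.shape
    mask = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            r, g, b = rgb_image[i, j] / 255.0
            hh, ss, vv = colorsys.rgb_to_hsv(r, g, b)
            mask[i, j] = (h_range[0] <= hh <= h_range[1]
                          and ss >= s_min and vv >= v_min)
    return mask

# Toy frame: one saturated red pixel (the "colored patch") on a dark background
frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[0, 0] = [255, 0, 0]
patch = hsv_mask(frame, (0.0, 0.05))
centroid = np.argwhere(patch).mean(axis=0)  # patch center in pixel coordinates
```

The centroid of the mask is the quantity a tracker of this kind would feed into head pose estimation.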
Comments 3: Experimental validation, while demonstrative, lacks systematic metrics such as task completion time, accuracy rates, or user feedback. Including a simple quantitative evaluation (e.g., success rate in grasping trials, latency measurements) would substantiate the qualitative findings and provide a clearer benchmark for future improvements.
Response 3: To address the lack of explicit experimental metrics, a new subsection entitled “Experimental Evaluation Criteria” was added to the Experiments and Results section. Evaluation is now framed around task completion success, execution consistency, and operational stability, which are appropriate for system-level teleoperation assessment. The text was updated to explicitly reflect repeatability and success of grasping tasks, while clarifying why joint-level angular error metrics were not adopted.
Minor
Comments 1: Formula formatting is inconsistent in places (for instance, Equation (2)).
Response 1: Equation (2) was reformulated to ensure consistent mathematical notation and to avoid ambiguity in the use of multiplication operators. Variable definitions were also clarified to improve readability and consistency with the remaining equations.
Comments 2: Several acronyms are used without full form at first mention (e.g., HSV on page 10 appears directly as “HSV mask”; MCP on page 8 is introduced only within parentheses). Expanding these upon first use would improve readability.
Response 2: All acronyms were revised and expanded at their first occurrence in the manuscript (e.g., HSV, MCP, IMU, RGB-D, VR) to improve clarity and comply with journal style guidelines.
Reviewer 2 Report
Comments and Suggestions for Authors
- In the abstract, the novelty and contributions have not been clarified. Conclusions and improvements have to be added.
- The authors have mentioned that "To improve robustness and fine-grained motion replication…." in the abstract. This is not instructive in the current study. The future perspective has to be stated later, at the end of the manuscript.
- In this study, the introduction requires more related works to be added. I suggest (Anti-disturbance control design of Exoskeleton Knee robotic system for rehabilitative care).
- In this study, I suggest the proposed methodology to be supported by a general block diagram to visualize the general concept of this article.
- The weakness of previous works has to be presented. The motivation due to this weakness has to be presented.
- The authors mentioned that "devices, the Intel® RealSense™ D455 offers a built-in method, get_distance(), which provides direct access to depth measurements in meters". The authors have not mentioned the specifications of this device.
- The resolution of Figure 2 is low. The authors have to improve the quality of this figure. A deeper discussion has to be presented.
- The authors have not presented numerical results. This study is enriched by explanatory photos, but how can one evaluate the precision and other metrics?
- There are no signals which give evident evaluation.
- It is interesting to conduct a comparison study between other schemes (in the literature) and the proposed method.
- The authors have not presented any dynamic description or specification of the robot teleoperation.
- The communication techniques (telecommunications) have not been mentioned. The authors have not addressed this topic.
- The virtual reality has to be supported by force control. The force exerted by operator has to be reflected on the actuated robot.
- The contributions have to be listed in detail and in points (bullets) at the end of the introduction.
- The conclusion is descriptive. It is void of quantitative and numerical improvements and comparisons.
- The manuscript includes high redundancy, and the general information has to be reduced.
- Some references are online resources. Please rely on reputed journals and textbooks (if necessary). Some of these references are just links to videos!
Author Response
Comments 1: In the abstract, the novelty and contributions have not been clarified. Conclusions and improvements have to be added.
Response 1: The Introduction was revised to explicitly discuss limitations of existing VR-based teleoperation and vision-based pose estimation systems, clarifying the motivation for the proposed approach. In addition, a dedicated paragraph listing the main contributions of this work was added in bullet-point form at the end of the Introduction to clearly highlight the novelty and scope of the paper.
Comments 2: The authors have mentioned that "To improve robustness and fine-grained motion replication…." in the abstract. This is not instructive in the current study. The future perspective has to be stated later, at the end of the manuscript.
Response 2: We agree that future perspectives should not appear in the Abstract. The sentence describing future integration of IMUs and adaptive calibration/filtering was removed from the Abstract and retained only in the “Conclusions and Future Work” section, where it is discussed as a clear extension of the present vision-based system.
Comments 3: In this study, the introduction requires more related works to be added. I suggest (Anti-disturbance control design of Exoskeleton Knee robotic system for rehabilitative care).
Response 3: The Related Work section was expanded with additional state-of-the-art studies on humanoid robot teleoperation, motion imitation, and immersive human–robot interaction. In particular, we added representative works addressing RGB-D-based teleoperation with legged locomotion, full-body motion capture combined with VR-based telepresence, and optimization-based motion retargeting for expressive humanoid control. These additions strengthen the contextualization of the proposed approach within closely related teleoperation frameworks while maintaining alignment with the scope and objectives of the present study.
Comments 4: In this study, I suggest the proposed methodology to be supported by a general block diagram to visualize the general concept of this article.
Response 4: A new system overview subsection was added to the Materials and Methods section, together with a high-level block diagram illustrating the overall teleoperation architecture and data flow between sensing, processing, communication, and actuation modules.
Comments 5: The weakness of previous works has to be presented. The motivation due to this weakness has to be presented.
Response 5: The Introduction was revised to explicitly discuss limitations of existing VR-based teleoperation and vision-based pose estimation systems, clarifying the motivation for the proposed approach. In addition, a dedicated paragraph listing the main contributions of this work was added in bullet-point form at the end of the Introduction to clearly highlight the novelty and scope of the paper.
Comments 6: The authors mentioned that "devices, the Intel® RealSense™ D455 offers a built-in method, get_distance(), which provides direct access to depth measurements in meters". The authors have not mentioned the specifications of this device.
Response 6: Technical specifications of the Intel® RealSense™ D455 and Microsoft Kinect for Xbox 360 cameras were added to the Materials and Methods section, to better contextualize the sensing setup used in the proposed system.
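For readers unfamiliar with depth sensing, the step from a per-pixel depth reading (as returned by `get_distance()` on the device) to a 3D point uses the standard pinhole back-projection that librealsense exposes as `rs2_deproject_pixel_to_point`. The sketch below implements that formula directly; the intrinsics values are placeholders for illustration only, not actual D455 calibration parameters.

```python
def deproject(u, v, depth_m, fx, fy, ppx, ppy):
    """Back-project pixel (u, v) with depth in meters to camera-frame XYZ."""
    x = (u - ppx) / fx * depth_m
    y = (v - ppy) / fy * depth_m
    return x, y, depth_m

# Hypothetical intrinsics for illustration (not actual D455 values)
fx = fy = 600.0
ppx, ppy = 320.0, 240.0

# On the device, depth_m would come from depth_frame.get_distance(u, v)
point = deproject(320, 240, 1.5, fx, fy, ppx, ppy)  # principal point -> (0, 0, 1.5)
```

Combining this back-projection with the 2D joint landmarks is what lifts the pose estimate into metric 3D coordinates.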
Comments 7: The resolution of Figure 2 is low. The authors have to improve the quality of this figure. Deep discussion has to be presented.
Response 7: The readability of the calibration plot was improved by increasing the figure size and font resolution. In addition, the discussion associated with this figure (now Figure 3) was expanded to provide a deeper analysis of the depth calibration process, its sensitivity to noise and non-linearities, and its impact on joint reconstruction accuracy. This revision clarifies how depth estimation limitations propagate through the teleoperation pipeline and how they are mitigated at the task level through continuous visual feedback.
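A depth calibration of the kind discussed above can be sketched as a least-squares fit between raw sensor readings and ground-truth distances. The data pairs and the linear correction model below are illustrative assumptions, not the manuscript's actual calibration; the polynomial degree would be raised if the residuals showed the non-linearities mentioned in the revised discussion.

```python
import numpy as np

# Hypothetical calibration pairs: raw sensor depth vs. ground truth (meters)
raw   = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
truth = np.array([0.52, 1.03, 1.55, 2.08, 2.61])  # mild, slightly non-linear bias

# Fit a correction model truth ~ a*raw + b by least squares
a, b = np.polyfit(raw, truth, 1)
corrected = a * raw + b
rmse = float(np.sqrt(np.mean((corrected - truth) ** 2)))  # residual error (m)
```

The residual error after correction is the quantity that propagates into joint reconstruction, which is why the revised text analyzes its sensitivity to noise.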
Comments 8: The authors have not presented numerical results. This study is enriched by explanatory photos, but how can one evaluate the precision and other metrics?
Response 8: To strengthen experimental evaluation, the manuscript was revised to explicitly define task-level evaluation criteria and to clarify performance outcomes in the experimental sections. Rather than focusing on joint-level angular error, evaluation is framed around task completion, execution consistency, and operational stability, which are appropriate for system-level teleoperation assessment. The Conclusions section was also revised to reflect these evaluation outcomes more explicitly.
Comments 9: There are no signals which give evident evaluation.
Response 9: To strengthen experimental evaluation, the manuscript was revised to explicitly define task-level evaluation criteria and to clarify performance outcomes in the experimental sections. Rather than focusing on joint-level angular error, evaluation is framed around task completion, execution consistency, and operational stability, which are appropriate for system-level teleoperation assessment. The Conclusions section was also revised to reflect these evaluation outcomes more explicitly.
Comments 10: It is interesting to conduct a comparison study between other schemes (in the literature) and the proposed method.
Response 10: A qualitative comparison with representative VR-based teleoperation approaches discussed in the manuscript was added in a dedicated discussion subsection preceding the conclusions. This comparison positions the proposed system with respect to existing schemes, highlighting differences in sensing strategy, validation context, and practical deployment considerations.
Comments 11: The authors have not presented any dynamic description or specification of the robot teleoperation.
Response 11: A dynamic description of the teleoperation process was added to the experimental section, clarifying how human motion, robotic actuation, and visual feedback interact continuously in a closed-loop manner during VR-based teleoperation.
Comments 12: The communication techniques (telecommunications) have not been mentioned. The authors have not addressed this topic.
Response 12: A description of the communication architecture was added to the Materials and Methods section, specifying how joint commands and sensor data are exchanged between processing units and the robotic platform using a message-based communication layer.
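As a concrete illustration of a message-based layer of this kind, joint commands can be serialized into self-describing messages. The manuscript's actual protocol and schema are not reproduced here; the JSON field names and helper functions below are purely hypothetical.

```python
import json

def encode_joint_command(joint_angles_rad, seq, stamp_s):
    """Serialize a joint command into a compact JSON message.

    Field names ("seq", "stamp", "joints") are illustrative only,
    not the manuscript's actual message schema.
    """
    return json.dumps({
        "seq": seq,                                   # sequence number for ordering
        "stamp": stamp_s,                             # sender timestamp (seconds)
        "joints": [round(a, 4) for a in joint_angles_rad],
    })

def decode_joint_command(payload):
    """Parse a joint command message back into (seq, joint angles)."""
    msg = json.loads(payload)
    return msg["seq"], msg["joints"]

wire = encode_joint_command([0.1, -0.5, 1.2], seq=42, stamp_s=12.3)
seq, joints = decode_joint_command(wire)
```

Sequence numbers and timestamps allow the receiver to drop stale or out-of-order commands, a common safeguard in closed-loop teleoperation.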
Comments 13: The virtual reality has to be supported by force control. The force exerted by operator has to be reflected on the actuated robot.
Response 13: We agree that force or haptic feedback can further enhance immersion and precision in VR-based teleoperation. The manuscript was revised to explicitly acknowledge the absence of force feedback as a limitation of the current system and to clarify that this choice was motivated by the goal of maintaining a low-cost and lightweight vision-based framework. Force feedback is now clearly identified as a promising direction for future work.
Comments 14: The contributions have to be listed in detail and in points (bullets) at the end of the introduction.
Response 14: The Introduction was revised to explicitly discuss limitations of existing VR-based teleoperation and vision-based pose estimation systems, clarifying the motivation for the proposed approach. In addition, a dedicated paragraph listing the main contributions of this work was added in bullet-point form at the end of the Introduction to clearly highlight the novelty and scope of the paper.
Comments 15: The conclusion is descriptive. It is void of quantitative and numerical improvements and comparisons.
Response 15: To strengthen experimental evaluation, the manuscript was revised to explicitly define task-level evaluation criteria and to clarify performance outcomes in the experimental sections. Rather than focusing on joint-level angular error, evaluation is framed around task completion, execution consistency, and operational stability, which are appropriate for system-level teleoperation assessment. The Conclusions section was also revised to reflect these evaluation outcomes more explicitly.
Comments 16: The manuscript includes high redundancy, and the general information has to be reduced.
Response 16: The manuscript was carefully revised to reduce redundancy and streamline the presentation. Repeated explanations of system operation were removed, particularly in the results and discussion sections, and generic introductory statements were eliminated to improve clarity and focus on the core scientific contributions.
Comments 17: Some references are online resources. Please rely on reputed journals and textbooks (if necessary). Some of these references are just links to videos!
Response 17: The reference list was revised to improve academic quality and consistency. Non-scholarly and incomplete references were removed, and online resources were reformulated where appropriate. The remaining references now primarily rely on peer-reviewed journals, conferences, theses, and established institutional sources.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
All comments have been answered well.
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Editor,
I have read the revised version of the article and I have found that the authors have addressed all my concerns and the article is considerably enhanced. I recommend the acceptance of paper in your reputed Journal.
Thank you for your kind trust… With best regards,
Prof. Dr. Amjad J. Humaidi
The University of Technology-Iraq

