Peer-Review Record

Interactive Teleoperation of an Articulated Robotic Arm Using Vision-Based Human Hand Tracking

Biomimetics 2026, 11(2), 151; https://doi.org/10.3390/biomimetics11020151

by Marius-Valentin Drăgoi¹,*, Aurel-Viorel Frimu², Andrei Postelnicu¹, Roxana-Adriana Puiu¹, Gabriel Petrea¹ and Alexandru Hank³

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 29 January 2026 / Revised: 13 February 2026 / Accepted: 17 February 2026 / Published: 19 February 2026

(This article belongs to the Special Issue Recent Advances in Bioinspired Robot and Intelligent Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper introduces an economical vision-based teleoperation system that enables real-time control of an articulated robotic arm using human hand tracking, utilizing a standard laptop camera. Also, a real-time landmark estimation pipeline is used to get information about hand pose and gestures. Meanwhile, a calibration-based control strategy then maps a set of compact kinematic descriptors to robotic joint commands. Finally, a human-in-the-loop research with 42 people tests the system using a standardized pick-and-place task that measures task success rate, completion time, positional accuracy, and a subjective evaluation of usefulness and intuitiveness.

To further enhance the manuscript’s quality, the authors should consider the following issues:

1. Figure 2 appears to take up too much space. The figure is recommended to be modified.
2. The experiments in this paper are too simplistic to demonstrate the effectiveness of the proposed control method. It is recommended to supplement the experiments, for example, by varying the size or type of grasped objects, increasing environmental complexity, and adjusting ambient light intensity (as light intensity significantly affects camera performance). This will enhance the generalization capability of the proposed control method.
3. Table 1 presents data less intuitively. It is recommended to use more curves to more clearly illustrate the results of each experiment.
4. The mapping relationship between robotic arms and human arms is a critical aspect of teleoperation. How does this paper map human hand movements to robotic arm motions and perform pose estimation? Sections 2.4.2 and 2.4.3 provide only brief introductions without robotic arm modeling, relevant mapping algorithms, calibration-based control strategies, or equations. The authors are supposed to elaborate on these aspects in greater detail.
5. The experiments in this paper involve position-based servo control of a robotic arm. However, the paper lacks curves or tables related to the control algorithm data, such as servo motor angles, current curves, PWM signals, etc. It is recommended to add relevant figures.
6. The paper mentions in the Discussion section that “other approaches enhance robustness through wearable sensing (e.g., IMUs, gloves, or myoelectric armbands), which can improve control precision but introduce additional hardware and calibration overhead”. The author should compare and highlight the advantages of this proposed economical method over other high-precision approaches by quantifying metrics such as cost analysis, control accuracy, and success rate. This would strengthen the persuasiveness of the proposed method’s effectiveness.
7. There are many typos and errors in the paper. Please go through the paper carefully and correct it.

Author Response

Comment 1: This paper introduces an economical vision-based teleoperation system that enables real-time control of an articulated robotic arm using human hand tracking, utilizing a standard laptop camera. Also, a real-time landmark estimation pipeline is used to get information about hand pose and gestures. Meanwhile, a calibration-based control strategy then maps a set of compact kinematic descriptors to robotic joint commands. Finally, a human-in-the-loop research with 42 people tests the system using a standardized pick-and-place task that measures task success rate, completion time, positional accuracy, and a subjective evaluation of usefulness and intuitiveness.

Response 1: We sincerely thank Reviewer 1 for the thoughtful summary of our work and for the constructive feedback. We have carefully revised the manuscript to address all comments listed below. In the updated version, all changes made specifically in response to Reviewer 1 are highlighted in green, while blue and yellow highlights correspond to revisions introduced for the previously addressed reviewers.

Comment 2: To further enhance the manuscript’s quality, the authors should consider the following issues:
1. Figure 2 appears to take up too much space. The figure is recommended to be modified.

Response 2: We thank the Reviewer for this suggestion. Figure 2 is intended as a high-level workflow overview and its readability relies on text labels. In the MDPI online format, figures can be opened and zoomed interactively, which mitigates the need to enlarge the figure within the manuscript body. Nevertheless, we reviewed the layout and ensured that Figure 2 is placed and scaled consistently with the journal template while preserving legibility.

Comment 3: 2. The experiments in this paper are too simplistic to demonstrate the effectiveness of the proposed control method. It is recommended to supplement the experiments, for example, by varying the size or type of grasped objects, increasing environmental complexity, and adjusting ambient light intensity (as light intensity significantly affects camera performance). This will enhance the generalization capability of the proposed control method.

Response 3: We thank the Reviewer for this valuable suggestion and we agree that broader task and environment variation would strengthen generalization claims. The present study was designed as a reproducible baseline with a standardized pick-and-place protocol and first-time user evaluation (n = 42). To address this concern, we explicitly stated in the Limitations section that the current evaluation uses a single lightweight object in a constrained setup and that camera-only tracking is sensitive to environmental factors such as lighting and occlusions. We further clarified that future work will extend the experimental protocol by varying object size/material and type, increasing environmental complexity (e.g., clutter), and systematically adjusting ambient illumination levels to quantify robustness and generalization.

Comment 4: 3. Table 1 presents data less intuitively. It is recommended to use more curves to more clearly illustrate the results of each experiment.

Response 4: We thank the Reviewer for this suggestion. We agree that distribution-style plots provide a more intuitive view than a participant-by-participant table alone. This has already been addressed in the revised manuscript by adding curve-based visualizations (distribution plots) that summarize the main experimental outcomes (completion time, number of attempts, ease-of-use, and placement error), complementing the table of individual trials. Please note that these plots were introduced earlier in the revision process at the request of a previous reviewer and are therefore highlighted in yellow.

Comment 5: 4. The mapping relationship between robotic arms and human arms is a critical aspect of teleoperation. How does this paper map human hand movements to robotic arm motions and perform pose estimation? Sections 2.4.2 and 2.4.3 provide only brief introductions without robotic arm modeling, relevant mapping algorithms, calibration-based control strategies, or equations. The authors are supposed to elaborate on these aspects in greater detail.

Response 5: We thank the Reviewer for highlighting the importance of the human–robot mapping in teleoperation. In our system, hand pose/landmark estimation is obtained using the established MediaPipe Hands pipeline (off-the-shelf, real-time landmark detection from monocular RGB). We do not modify or retrain the underlying landmark estimator; rather, our methodological contribution focuses on the control layer built on top of these landmarks, namely (i) the computation of compact kinematic descriptors, (ii) a calibration-based strategy to personalize the mapping from descriptors to robot actuation channels, and (iii) stabilization/bounding policies to ensure smooth and safe teleoperation. We have revised Sections 2.4.2–2.4.3 to make this separation (pose estimation vs. descriptor/mapping/control) explicit and clearer.
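For readers following this exchange, the calibration-based mapping named in item (ii) can be illustrated with a minimal sketch. It assumes a simple per-channel linear mapping between a calibrated descriptor range and a servo angle range, with clamping as the bounding policy; the class name, field names, and numeric ranges are hypothetical and are not taken from the manuscript.

```python
from dataclasses import dataclass


@dataclass
class ChannelCalibration:
    """Per-user calibration for one descriptor-to-servo channel (hypothetical)."""
    descriptor_min: float  # descriptor value recorded at one end of the user's range
    descriptor_max: float  # descriptor value recorded at the other end
    angle_min: float       # servo angle (degrees) assigned to descriptor_min
    angle_max: float       # servo angle (degrees) assigned to descriptor_max


def map_descriptor(value: float, cal: ChannelCalibration) -> float:
    """Linearly map a descriptor to a servo angle, clamped to the calibrated range."""
    span = cal.descriptor_max - cal.descriptor_min
    if span == 0:
        raise ValueError("calibration range is empty")
    t = (value - cal.descriptor_min) / span
    t = min(1.0, max(0.0, t))  # bounding: never command outside the calibrated range
    return cal.angle_min + t * (cal.angle_max - cal.angle_min)


# Assumed example: normalized palm x-position drives the base joint.
base = ChannelCalibration(descriptor_min=0.2, descriptor_max=0.8,
                          angle_min=0.0, angle_max=180.0)
print(map_descriptor(0.5, base))   # mid-range descriptor, roughly mid-range angle
print(map_descriptor(1.2, base))   # out-of-range input is clamped to angle_max
```

A linear map is only one possible choice; the point of the sketch is that personalization enters through the recorded descriptor range rather than through any change to the landmark estimator.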

Comment 6: 5. The experiments in this paper involve position-based servo control of a robotic arm. However, the paper lacks curves or tables related to the control algorithm data, such as servo motor angles, current curves, PWM signals, etc. It is recommended to add relevant figures.

Response 6: We thank the Reviewer for this suggestion. The focus of this paper is on evaluating a human-in-the-loop teleoperation interface using task-level metrics (success rate, completion time, placement error, and subjective ease-of-use), rather than on low-level actuator characterization. The robotic arm is driven via position setpoints (target servo angles), and the manuscript already reports the key control/update settings used during experiments, including the stabilization strategy (temporal smoothing), the minimum inter-command interval, the minimum angle-change threshold for sending updates, command bounds, and the servo refresh rate (see Section 2.4.4). Since the current prototype was not instrumented with current sensing or oscilloscope-grade PWM logging, detailed current/PWM waveform plots are outside the scope of the present experimental setup.
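The update settings listed in this response (temporal smoothing, minimum inter-command interval, minimum angle-change threshold, command bounds) can be combined in a single gate in front of the servo interface. The following is a sketch under stated assumptions: exponential smoothing stands in for the unspecified smoothing method, and all default values are placeholders rather than the settings reported in Section 2.4.4.

```python
from typing import Optional


class CommandGate:
    """Decides whether a new servo setpoint should be sent (illustrative only)."""

    def __init__(self, alpha: float = 0.3, min_interval_s: float = 0.05,
                 min_delta_deg: float = 2.0,
                 lower_deg: float = 0.0, upper_deg: float = 180.0) -> None:
        self.alpha = alpha                    # smoothing factor (assumed EMA)
        self.min_interval_s = min_interval_s  # minimum time between commands
        self.min_delta_deg = min_delta_deg    # minimum change worth sending
        self.lower_deg = lower_deg            # command bounds
        self.upper_deg = upper_deg
        self._smoothed: Optional[float] = None
        self._last_sent: Optional[float] = None
        self._last_time: Optional[float] = None

    def update(self, target_deg: float, now_s: float) -> Optional[float]:
        """Return the angle to send, or None if the update is suppressed."""
        target = min(self.upper_deg, max(self.lower_deg, target_deg))
        if self._smoothed is None:
            self._smoothed = target
        else:
            self._smoothed += self.alpha * (target - self._smoothed)
        # Rate limiting: drop updates that arrive too soon after the last command.
        if self._last_time is not None and now_s - self._last_time < self.min_interval_s:
            return None
        # Deadband: drop updates that barely differ from the last command sent.
        if self._last_sent is not None and abs(self._smoothed - self._last_sent) < self.min_delta_deg:
            return None
        self._last_sent = self._smoothed
        self._last_time = now_s
        return self._smoothed


gate = CommandGate(alpha=0.5, min_interval_s=0.1)
for t, angle in [(0.0, 90.0), (0.05, 100.0), (0.2, 100.0)]:
    print(t, gate.update(angle, t))
```

Logging the values returned by such a gate would also yield the servo-angle traces the Reviewer asked about, without any additional instrumentation of the hardware.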

Comment 7: 6. The paper mentions in the Discussion section that “other approaches enhance robustness through wearable sensing (e.g., IMUs, gloves, or myoelectric armbands), which can improve control precision but introduce additional hardware and calibration overhead”. The author should compare and highlight the advantages of this proposed economical method over other high-precision approaches by quantifying metrics such as cost analysis, control accuracy, and success rate. This would strengthen the persuasiveness of the proposed method’s effectiveness.

Response 7: We thank the Reviewer for this suggestion. We agree that quantifying advantages strengthens the discussion. In the revised manuscript, we already report participant-level quantitative performance for our system (success rate, completion time, and placement error) and we added a quantitative benchmark table (Table 3) summarizing the comparable metrics available in the main related interfaces discussed in the paper (e.g., success rate/time and accuracy proxies as reported). Regarding cost and setup overhead, the proposed method is intentionally economical because it requires no dedicated sensing hardware beyond a standard laptop camera (wearable-free) and therefore avoids the additional device cost and user instrumentation/calibration steps typical for glove/IMU/EMG-based approaches. Absolute price comparisons can vary substantially with component choice and local availability; thus, we emphasize the hardware requirement differential (camera-only vs. additional wearable sensors) and the reported task-level quantitative outcomes (Table 3) as the most objective basis for comparison in this study.

Comment 8: 7. There are many typos and errors in the paper. Please go through the paper carefully and correct it.

Response 8: Thank you for this remark. We carefully proofread the revised manuscript and corrected typographical, punctuation, and minor grammatical issues throughout the paper (including the specific wording refinements introduced during the revision process). All edits made specifically in response to Reviewer 1 are highlighted in green, while blue and yellow highlights correspond to revisions introduced for the previously addressed reviewers.

Reviewer 2 Report

Comments and Suggestions for Authors

biomimetics-4152636

In the manuscript with the title "Interactive Teleoperation of an Articulated Robotic Arm Using Vision-Based Human Hand Tracking", the authors present a vision-based teleoperation system that enables real-time control of a 5-DOF articulated robotic arm with a gripper using human hand tracking from a standard laptop camera. The system leverages MediaPipe Hands for landmark estimation, computes compact kinematic descriptors (palm position, apparent hand scale, wrist rotation, hand pitch, and pinch gesture), and maps them to joint-level servo commands via a calibration-based strategy. Commands are transmitted over a lightweight HTTP-based network interface to a Raspberry Pi embedded controller driving PCA9685-controlled servos. The system is evaluated in a human-in-the-loop study with 42 participants performing a standardized pick-and-place task, achieving an 88% success rate (37/42), a mean completion time of 53.48 +/- 18.51 s, a mean placement error of 6.73 +/- 3.11 cm (successful trials only), and a mean ease-of-use score of 2.67 +/- 1.20 on a 1-5 scale. The manuscript addresses a relevant problem in accessible human-robot interaction and contributes a reproducible, low-cost teleoperation pipeline with participant-level evaluation data. However, the biomimetic dimension claimed in the abstract and keywords is insufficiently developed throughout the text.
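To make the five descriptors in this summary concrete, the sketch below derives them from the 21 landmarks that MediaPipe Hands returns per hand. The landmark indices follow the MediaPipe hand model (0 wrist, 4 thumb tip, 5 index MCP, 8 index tip, 9 middle MCP, 17 pinky MCP); the descriptor formulas and the pinch threshold are assumptions made for illustration and may differ from the definitions used in the manuscript.

```python
import math
from typing import Sequence, Tuple

Landmark = Tuple[float, float, float]  # normalized (x, y, z)

WRIST, THUMB_TIP, INDEX_MCP, INDEX_TIP, MIDDLE_MCP, PINKY_MCP = 0, 4, 5, 8, 9, 17


def hand_descriptors(lm: Sequence[Landmark], pinch_ratio: float = 0.35) -> dict:
    """Compute compact descriptors from 21 hand landmarks (assumed definitions)."""
    if len(lm) != 21:
        raise ValueError("expected 21 hand landmarks")
    wrist, middle = lm[WRIST], lm[MIDDLE_MCP]
    index_mcp, pinky_mcp = lm[INDEX_MCP], lm[PINKY_MCP]

    # Palm position: centroid of three stable palm landmarks in the image plane.
    palm_x = (wrist[0] + index_mcp[0] + pinky_mcp[0]) / 3.0
    palm_y = (wrist[1] + index_mcp[1] + pinky_mcp[1]) / 3.0

    # Apparent hand scale: wrist to middle-MCP distance, a proxy for camera distance.
    scale = math.hypot(middle[0] - wrist[0], middle[1] - wrist[1])

    # Wrist rotation: in-plane angle of the knuckle line (index MCP to pinky MCP).
    roll_deg = math.degrees(math.atan2(pinky_mcp[1] - index_mcp[1],
                                       pinky_mcp[0] - index_mcp[0]))

    # Hand pitch: relative depth of the middle MCP with respect to the wrist.
    pitch_deg = math.degrees(math.atan2(middle[2] - wrist[2], scale)) if scale > 0 else 0.0

    # Pinch: thumb-tip to index-tip distance, normalized by hand scale.
    pinch_dist = math.dist(lm[THUMB_TIP][:2], lm[INDEX_TIP][:2])
    pinch = scale > 0 and (pinch_dist / scale) < pinch_ratio

    return {"palm": (palm_x, palm_y), "scale": scale,
            "roll_deg": roll_deg, "pitch_deg": pitch_deg, "pinch": pinch}
```

Normalizing the pinch distance by hand scale keeps the gesture threshold roughly independent of how far the hand is from the camera, which is one reason a scale descriptor is useful in a monocular setup.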

The manuscript has some limitations and weaknesses that need to be addressed and improved. Here are some specific recommendations and suggestions for each section of the manuscript:

Abstract

1. The Abstract is missing quantitative representation of the results.

Introduction

2. The introduction follows a "parade of references" structure (lines 75-90) where individual papers are summarized sequentially rather than synthesized thematically. The authors should reorganize the literature review to group related work by contribution type (e.g., vision-based approaches, wearable-based approaches, gesture classification methods) and draw comparative conclusions across studies, rather than listing them one by one.

3. The research gap statement (lines 91-97) needs sharper differentiation from closely related works. The authors must clarify precisely what distinguishes their contribution from the existing literature.

4. The manuscript lacks a dedicated aims paragraph at the end of the introduction. The contribution statement (lines 98-105) blends with the gap identification. A clear, separate paragraph explicitly listing the specific objectives and expected contributions of the study should be added.

Materials and Methods

5. Section numbering is inconsistent throughout the manuscript: Sections 2.1 and 2.2 use a period after the number, while Sections 2.3 and 2.4 omit it. This formatting inconsistency should be corrected to follow a uniform convention.

6. The 3D printing parameters for the PLA components are not clear.

7. MediaPipe Hands configuration parameters are not reported.

8. No information is provided about the laptop used for the operator-side processing. Since the system is presented as running on "consumer hardware," specifying the minimum and tested hardware is essential for reproducibility.

9. No demographic information is reported for the 42 participants (age range, gender distribution, prior experience with robotics or teleoperation, handedness). These factors can substantially influence both performance and ease-of-use ratings and should be documented.

Results and Discussion

10. Table 1 occupies nearly two full pages with raw participant-level data but is not accompanied by any summary visualization. The authors should add box plots or histograms for the key metrics (completion time, placement error, ease-of-use) to help readers quickly grasp the distributions and identify outliers.

11. The statistical analysis is limited to descriptive statistics (mean and standard deviation). No inferential tests are performed. The authors should consider: (a) correlation analysis between number of attempts and completion time; (b) comparison of metrics between successful and unsuccessful participants; (c) analysis of whether ease-of-use correlates with task performance. Even exploratory analyses would strengthen the contribution.

12. No system performance metrics are reported (actual frame rate, tracking dropout rate, command transmission latency, servo response time). These are standard for teleoperation system evaluations and their absence makes it difficult to diagnose the sources of user variability.

13. The Discussion section contains informal or machine-translated phrasing that reduces clarity. Examples include: "practical words" (line 394). Additionally, there is a typographical error on line 380: "n contrast" should read "In contrast." The entire Discussion should be carefully proofread for language quality.

14. The Discussion compares with related work only qualitatively. No quantitative benchmarking against other teleoperation systems is provided (e.g., success rates, completion times, or accuracy from comparable studies). Even approximate comparisons from published data would contextualize the results more effectively.

Conclusions

15. The Conclusions section largely restates the results without synthesizing new insight.

The manuscript requires a major revision before it can be considered for publication.

Author Response

Comments 1: In the manuscript with the title "Interactive Teleoperation of an Articulated Robotic Arm Using Vision-Based Human Hand Tracking", the authors present a vision-based teleoperation system that enables real-time control of a 5-DOF articulated robotic arm with a gripper using human hand tracking from a standard laptop camera. The system leverages MediaPipe Hands for landmark estimation, computes compact kinematic descriptors (palm position, apparent hand scale, wrist rotation, hand pitch, and pinch gesture), and maps them to joint-level servo commands via a calibration-based strategy. Commands are transmitted over a lightweight HTTP-based network interface to a Raspberry Pi embedded controller driving PCA9685-controlled servos. The system is evaluated in a human-in-the-loop study with 42 participants performing a standardized pick-and-place task, achieving an 88% success rate (37/42), a mean completion time of 53.48 +/- 18.51 s, a mean placement error of 6.73 +/- 3.11 cm (successful trials only), and a mean ease-of-use score of 2.67 +/- 1.20 on a 1-5 scale. The manuscript addresses a relevant problem in accessible human-robot interaction and contributes a reproducible, low-cost teleoperation pipeline with participant-level evaluation data. However, the biomimetic dimension claimed in the abstract and keywords is insufficiently developed throughout the text.

Response 1: We sincerely thank the Reviewer for the thorough and constructive assessment of our manuscript and for the positive remarks regarding the relevance of accessible human–robot interaction and the inclusion of participant-level evaluation data. We carefully addressed each comment, and all revisions in the manuscript are highlighted in yellow for Reviewer 2 for ease of inspection, while blue and green highlights correspond to revisions introduced for the other reviewers.

Comments 2: The manuscript has some limitations and weaknesses that need to be addressed and improved. Here are some specific recommendations and suggestions for each section of the manuscript:
Abstract
1. The Abstract is missing quantitative representation of the results.

Response 2: Thank you for this important suggestion. We agree that the abstract should report the main outcomes quantitatively. The Abstract has been updated accordingly.

Comments 3: Introduction
2. The introduction follows a "parade of references" structure (lines 75-90) where individual papers are summarized sequentially rather than synthesized thematically. The authors should reorganize the literature review to group related work by contribution type (e.g., vision-based approaches, wearable-based approaches, gesture classification methods) and draw comparative conclusions across studies, rather than listing them one by one.

Response 3: Thank you for this important observation. We revised the introduction to synthesize related work thematically instead of listing studies sequentially. The revised text groups prior contributions into (i) vision-based camera-only approaches, (ii) wearable-based sensing approaches, and (iii) gesture/voice interfaces for industrial or collaborative robot command definition, and it explicitly articulates the trade-offs across these categories.

Comments 4: 3. The research gap statement (lines 91-97) needs sharper differentiation from closely related works. The authors must clarify precisely what distinguishes their contribution from the existing literature.

Response 4: We agree with the Reviewer and strengthened the gap statement to more explicitly distinguish our work from closely related studies. The revised text clarifies that our contribution is an end-to-end, camera-only teleoperation pipeline on consumer hardware, paired with a participant-level physical manipulation evaluation on a low-cost articulated arm, including attempt counting and failure handling under a standardized protocol.

Comments 5: 4. The manuscript lacks a dedicated aims paragraph at the end of the introduction. The contribution statement (lines 98-105) blends with the gap identification. A clear, separate paragraph explicitly listing the specific objectives and expected contributions of the study should be added.

Response 5: Thank you for this observation. We added a dedicated aims paragraph at the end of the Introduction, separated from the gap identification, explicitly listing the objectives and the expected contributions.

Comments 6: Materials and Methods
5. Section numbering is inconsistent throughout the manuscript: Sections 2.1 and 2.2 use a period after the number, while Sections 2.3 and 2.4 omit it. This formatting inconsistency should be corrected to follow a uniform convention.

Response 6: We agree and corrected the formatting so that all section and subsection headings follow a uniform numbering convention throughout the manuscript.

Comments 7: 6. The 3D printing parameters for the PLA components are not clear.

Response 7: We thank the reviewer for this suggestion. To improve reproducibility, we added a dedicated table summarizing the FDM printing parameters used for manufacturing all PLA components of the robotic arm and enclosure. Section 2.1 has been updated.

Comments 8: 7. MediaPipe Hands configuration parameters are not reported.

Response 8: We thank the reviewer for this suggestion. We have added the exact MediaPipe Hands configuration parameters used in our implementation, including the selected operating mode and confidence thresholds, to improve reproducibility of the hand-tracking module. Section 2.4.2 has been updated.
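As a pointer to what such a configuration looks like, the sketch below lists the keyword arguments accepted by the legacy `mediapipe.solutions.hands.Hands` constructor together with a small sanity check. The values are library defaults, except `max_num_hands`, which is set to 1 here as an assumption for single-operator control; they are not the settings reported in Section 2.4.2.

```python
# Keyword arguments of the legacy MediaPipe Hands solution.
# Values are illustrative defaults, NOT the authors' reported configuration.
HANDS_CONFIG = {
    "static_image_mode": False,       # video mode: detect once, then track across frames
    "max_num_hands": 1,               # assumption: one operator hand (library default is 2)
    "model_complexity": 1,            # 0 = lighter model, 1 = full model
    "min_detection_confidence": 0.5,  # threshold for the initial palm detection
    "min_tracking_confidence": 0.5,   # threshold below which detection is re-run
}


def validate_hands_config(cfg: dict) -> bool:
    """Check value ranges before handing the configuration to the tracker."""
    for key in ("min_detection_confidence", "min_tracking_confidence"):
        if not 0.0 <= cfg[key] <= 1.0:
            raise ValueError(f"{key} must lie in [0, 1]")
    if cfg["model_complexity"] not in (0, 1):
        raise ValueError("model_complexity must be 0 or 1")
    if cfg["max_num_hands"] < 1:
        raise ValueError("max_num_hands must be at least 1")
    return True


validate_hands_config(HANDS_CONFIG)
# With the mediapipe package installed, the dictionary would be passed as:
# hands = mediapipe.solutions.hands.Hands(**HANDS_CONFIG)
```

Raising the two confidence thresholds generally trades fewer spurious detections for more frequent tracking loss, which is why reporting them matters for reproducing the user study.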

Comments 9: 8. No information is provided about the laptop used for the operator-side processing. Since the system is presented as running on "consumer hardware," specifying the minimum and tested hardware is essential for reproducibility.

Response 9: We thank the reviewer for highlighting reproducibility details. We have added the operator-side tested hardware/software configuration (CPU, RAM, GPU, OS) and the webcam capture settings used during the experiments. Section 2.2 has been updated.

Comments 10: 9. No demographic information is reported for the 42 participants (age range, gender distribution, prior experience with robotics or teleoperation, handedness). These factors can substantially influence both performance and ease-of-use ratings and should be documented.

Response 10: We acknowledge this point. Participants self-identified as students in the consent form; however, to preserve anonymity and minimize personal data collection, no demographic attributes (e.g., age, gender, handedness, prior experience) were recorded or linked to the reported dataset. We clarified this explicitly in Section 2.4.6 and noted it as a limitation.

Comments 11: Results and Discussion
10. Table 1 occupies nearly two full pages with raw participant-level data but is not accompanied by any summary visualization. The authors should add box plots or histograms for the key metrics (completion time, placement error, ease-of-use) to help readers quickly grasp the distributions and identify outliers.

Response 11: We agree. We added distribution visualizations (new Figure 7) for the key participant-level outcomes to complement Table 1 and to make dispersion and outliers easier to interpret at a glance. The Results section has been updated.

Comments 12: 11. The statistical analysis is limited to descriptive statistics (mean and standard deviation). No inferential tests are performed. The authors should consider: (a) correlation analysis between number of attempts and completion time; (b) comparison of metrics between successful and unsuccessful participants; (c) analysis of whether ease-of-use correlates with task performance. Even exploratory analyses would strengthen the contribution.

Response 12: Thank you for highlighting the need to clarify the statistical reporting. We have revised the Results section to explicitly state that descriptive statistics are reported as mean ± sample standard deviation (SD). We also clarified that placement error is computed only over successful trials (n = 37), while all other measures are summarized over the full cohort (n = 42). These clarifications are highlighted in yellow in the revised manuscript.
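The reporting convention stated here (mean ± sample SD) and the exploratory correlation suggested in item (a) of the comment can both be computed with the Python standard library. The sketch below is illustrative: the attempt counts and completion times are made-up placeholder values, not data from the study; only the 37/42 success count is taken from the record.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_sd(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Placeholder data for illustration only (NOT the study's measurements).
attempts = [1, 1, 2, 3, 2, 1]
completion_s = [35.0, 41.0, 58.0, 80.0, 62.0, 39.0]

m, sd = mean_sd(completion_s)
print(f"completion time: {m:.2f} ± {sd:.2f} s")
print(f"r(attempts, time) = {pearson_r(attempts, completion_s):.2f}")

# Success rate as reported in the record: 37 successful participants out of 42.
print(f"success rate = {100 * 37 / 42:.1f}%")
```

With ordinal variables such as attempt counts or 1-5 ratings, a rank-based coefficient such as Spearman's would be a reasonable alternative to Pearson's r.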

Comments 13: 12. No system performance metrics are reported (actual frame rate, tracking dropout rate, command transmission latency, servo response time). These are standard for teleoperation system evaluations and their absence makes it difficult to diagnose the sources of user variability.

Response 13: We thank the Reviewer for this important suggestion. We agree that system-level performance metrics (e.g., effective frame rate, tracking dropout rate, end-to-end latency, and servo response time) are valuable for diagnosing user variability. In the current prototype, these signals were not explicitly logged during the user study (the setup was focused on task-level outcomes under a standardized protocol). To support reproducibility, we already report the key control/update settings used during experiments (update-rate limiting, minimum change thresholds, and smoothing/stabilization policy; see Section 2.4.4). We also added this logging as a clear direction for future work, where we plan to instrument the pipeline to report frame rate, dropout events, and communication/actuation latency alongside task-level metrics.
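The instrumentation described as future work could be as lightweight as a per-frame log. The following is a hypothetical sketch, not part of the reported system: it records a timestamp, whether a hand was tracked, and an optional command round-trip latency for each frame, and then derives effective frame rate, dropout rate, and mean latency.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PipelineLog:
    """Per-frame log for system-level teleoperation metrics (illustrative)."""
    frame_times: List[float] = field(default_factory=list)
    tracked: List[bool] = field(default_factory=list)
    latencies_s: List[float] = field(default_factory=list)

    def record_frame(self, t_s: float, hand_found: bool,
                     latency_s: Optional[float] = None) -> None:
        self.frame_times.append(t_s)
        self.tracked.append(hand_found)
        if latency_s is not None:
            self.latencies_s.append(latency_s)

    def effective_fps(self) -> float:
        """Processed frames per second over the logged interval."""
        if len(self.frame_times) < 2:
            return 0.0
        duration = self.frame_times[-1] - self.frame_times[0]
        return (len(self.frame_times) - 1) / duration if duration > 0 else 0.0

    def dropout_rate(self) -> float:
        """Fraction of frames in which no hand was tracked."""
        if not self.tracked:
            return 0.0
        return self.tracked.count(False) / len(self.tracked)

    def mean_latency_ms(self) -> float:
        """Mean command round-trip latency in milliseconds."""
        if not self.latencies_s:
            return 0.0
        return 1000.0 * sum(self.latencies_s) / len(self.latencies_s)


log = PipelineLog()
for i, found in enumerate([True, True, False, True, True]):
    log.record_frame(i * 0.1, found, latency_s=0.02 if found else None)
print(log.effective_fps(), log.dropout_rate(), log.mean_latency_ms())
```

In a live system the timestamps would come from a monotonic clock such as `time.monotonic()`, and the latency from timing each HTTP request to the embedded controller.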

Comments 14: 13. The Discussion section contains informal or machine-translated phrasing that reduces clarity. Examples include: "practical words" (line 394). Additionally, there is a typographical error on line 380: "n contrast" should read "In contrast." The entire Discussion should be carefully proofread for language quality.

Response 14: We thank the reviewer for highlighting language issues in the Discussion section. We have carefully proofread and edited the Discussion section to correct all typos. The revised text improves clarity and consistency while preserving the original meaning.

Comments 15: 14. The Discussion compares with related work only qualitatively. No quantitative benchmarking against other teleoperation systems is provided (e.g., success rates, completion times, or accuracy from comparable studies). Even approximate comparisons from published data would contextualize the results more effectively.

Response 15: We thank the Reviewer for this suggestion. We agree that quantitative context strengthens the Discussion. Accordingly, we added a quantitative benchmarking table in the Discussion (Table 3) that summarizes our main outcomes (success rate, completion time, and placement error) alongside the quantitative metrics available in the main related interfaces already cited and discussed in the same section ([7], [8], [21]) (e.g., success rate/time and accuracy proxies as reported). We also added a brief note clarifying that direct one-to-one comparison is limited by differences in robots, tasks, and metric definitions across studies, but that the reported values provide an indicative context for interpreting our results.

Comments 16: Conclusions
15. The Conclusions section largely restates the results without synthesizing new insight.
The manuscript requires a major revision before it can be considered for publication.

Response 16: We thank the Reviewer for this comment. We revised the Conclusions section to reduce repetition and better synthesize the main insights. Specifically, we integrated the brief implementation detail about the stabilization/update policy into the opening summary (instead of keeping it as a standalone methods-style paragraph) and added a short synthesis highlighting what drives performance variability in first-time use (re-grasp events and failure recovery under monocular ambiguity and intermittent landmark loss). We also explicitly position the work as an accessible, camera-only baseline supported by participant-level quantitative results and the benchmarking context in Table 3.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors consider an important topic of the robot teleoperation using only the camera signals. The authors’ ideas are clearly presented, and simulations validate the proposed techniques. I find the paper accurately written and easy-to-follow and have only a few comments listed below.

Major comments:

1. The authors exemplify the proposed methods on a robot that has six degrees of freedom: five revolute joints + one gripper:
- If possible, I recommend the authors present the kinematic scheme of this robot to better understand the robot architecture.
- Why did the authors consider this specific robot? Are its six degrees of freedom sufficient to simulate the hand movements and perform the desired pick-and-place tasks?
2. According to Sect. 2.4.3, during the teleoperation, each movement of the hand corresponds to the motion in a specific servo motor. Wouldn’t it be more natural to directly control the motion of the robot end-effector and then resolve the motor motions by solving the inverse kinematic problem?
3. In Sect. 2.4.6, I recommend the authors give more details on the distance between the initial pose of the grasped box and its final location. This information will be useful to compare this distance with the obtained placement errors.

Minor comments:

P. 2, l. 89. Is abbreviation MDPI appropriate here?
P. 5, Fig. 2. The authors introduce this figure here but do not give any comments about it. I recommend the authors give some details about this figure. The authors may also consider placing this figure later in the paper.
P. 8, Fig. 5. The figure shows the cooling fan needs 12 V as a power supply. The illustrated battery, however, provides only 7.4 V. Is it sufficient to supply the fan?
P. 11, l. 284. The authors introduce the 1–5 scale to estimate the ease of use. Does value 1 correspond to the most difficult case and value 5—to the easiest case?
P. 14, l. 380. There is a typo: “n contrast.”

Author Response

Comments 1: The authors consider an important topic of the robot teleoperation using only the camera signals. The authors’ ideas are clearly presented, and simulations validate the proposed techniques. I find the paper accurately written and easy-to-follow and have only a few comments listed below.

Response 1: We sincerely thank the Reviewer for the positive assessment and for recognizing the relevance of camera-only robot teleoperation. We appreciate the feedback regarding clarity and the easy-to-follow presentation. We have carefully addressed all comments listed below and revised the manuscript accordingly. All newly added or modified text in response to Reviewer 3 is highlighted in blue in the revised manuscript, while yellow and green highlights correspond to revisions introduced for the other reviewers.

Comments 2: Major comments:
1. The authors exemplify the proposed methods on a robot that has six degrees of freedom: five revolute joints + one gripper:
• If possible, I recommend the authors present the kinematic scheme of this robot to better understand the robot architecture.
• Why did the authors consider this specific robot? Are its six degrees of freedom sufficient to simulate the hand movements and perform the desired pick-and-place tasks?

Response 2: We thank the Reviewer for this useful remark. We carefully rechecked the manuscript and confirm that it did not state anywhere that the robot has six DOF; the “six” refers only to the six actuation/communication channels (five revolute joints plus a separately actuated gripper). The platform comprises five revolute joints (5R) that define the arm kinematics, while the sixth channel actuates the gripper (open/close). To make the robot architecture clearer, we added a concise kinematic description of the joint chain (J1–J5) and explicitly stated that the gripper is an additional actuator rather than an independent DOF of the arm. Regarding the choice of platform, we selected this specific low-cost 5R + gripper arm because it is widely available, easy to reproduce, and sufficient for the standardized pick-and-place protocol used in our human study (reach–grasp–transport–release). While this configuration does not aim to replicate full human-hand dexterity, it provides adequate motion capability to execute the targeted manipulation primitives required by the task. We also added a short scope statement noting that extending the approach to higher-DOF manipulators and richer hand-like dexterity is an important direction for future work.

Comments 3: 2. According to Sect. 2.4.3, during the teleoperation, each movement of the hand corresponds to the motion in a specific servo motor. Wouldn’t it be more natural to directly control the motion of the robot end-effector and then resolve the motor motions by solving the inverse kinematic problem?

Response 3: We thank the Reviewer for this insightful suggestion. We agree that end-effector (task-space) control with inverse kinematics (IK) is often a more natural abstraction for teleoperation. In this study, however, we intentionally used a lightweight descriptor-to-joint mapping (Section 2.4.3) to keep the pipeline simple and reproducible on low-cost hardware, and to avoid reliance on precise kinematic modeling/calibration that can be challenging for hobby-grade arms (e.g., backlash and tolerance variability). Moreover, our camera-only setup provides primarily image-plane motion cues and a coarse depth proxy, so a direct joint-level mapping offered a pragmatic and stable control baseline for the standardized pick-and-place protocol. We have added a brief discussion acknowledging IK-based end-effector control as a natural alternative and a clear direction for future work.
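The descriptor-to-joint mapping mentioned in Response 3 can be illustrated with a minimal sketch. It assumes a linear, calibration-based mapping with clamping to the servo limits; the record does not state the exact mapping used in Section 2.4.3, and the function name, parameter names, and the 0–180 degree servo range are hypothetical.

```python
def map_descriptor_to_angle(value, cal_min, cal_max,
                            angle_min=0.0, angle_max=180.0):
    """Map one hand descriptor to one servo angle (illustrative only).

    value            -- current descriptor reading (e.g. a normalized hand
                        position in the image); hypothetical input
    cal_min, cal_max -- descriptor range recorded during calibration
    angle_min/max    -- assumed servo limits in degrees
    """
    if cal_max == cal_min:
        raise ValueError("calibration range must be non-zero")
    # Normalize the reading to [0, 1] relative to the calibrated range.
    t = (value - cal_min) / (cal_max - cal_min)
    # Clamp so readings outside the calibrated range cannot exceed the limits.
    t = min(1.0, max(0.0, t))
    # Linear interpolation between the servo limits.
    return angle_min + t * (angle_max - angle_min)


if __name__ == "__main__":
    # A reading halfway through the calibrated range maps to mid-travel.
    print(map_descriptor_to_angle(0.5, 0.0, 1.0))
```

Under these assumptions each joint needs only two calibration values and no kinematic model of the arm, which is the trade-off against IK-based end-effector control described in the response.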

Comments 4: 3. In Sect. 2.4.6, I recommend the authors give more details on the distance between the initial pose of the grasped box and its final location. This information will be useful to compare this distance with the obtained placement errors.

Response 4: We thank the Reviewer for this suggestion. We agree that reporting the start-to-target separation provides useful context for interpreting the placement errors. Accordingly, we updated Sect. 2.4.6 to explicitly state that the center-to-center distance between the start area and the target area was 30 cm. The newly added text is highlighted in blue in the revised manuscript.
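The comparison the Reviewer asks for can be sketched as a ratio of placement error to the 30 cm transport distance. The helper name below is hypothetical, and the error value in the usage line is a placeholder for illustration, not a result from the study.

```python
START_TO_TARGET_CM = 30.0  # center-to-center distance stated in Response 4


def relative_placement_error(error_cm, transport_cm=START_TO_TARGET_CM):
    """Return the placement error as a fraction of the transport distance."""
    if error_cm < 0:
        raise ValueError("placement error must be non-negative")
    if transport_cm <= 0:
        raise ValueError("transport distance must be positive")
    return error_cm / transport_cm


if __name__ == "__main__":
    # Placeholder error of 3 cm, chosen only to show the calculation.
    print(f"{relative_placement_error(3.0):.1%}")
```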

Comments 5: Minor comments:
4. P. 2, l. 89. Is abbreviation MDPI appropriate here?

Response 5: Thank you for pointing this out. We carefully rechecked the indicated location (p. 2, l. 89) and the abbreviation “MDPI” does not appear in the manuscript body there. The only occurrence of “MDPI” was in the standard “Disclaimer/Publisher’s Note” text automatically included by the MDPI template at the end of the document.

Comments 6: 5. P. 5, Fig. 2. The authors introduce this figure here but do not give any comments about it. I recommend the authors give some details about this figure. The authors may also consider placing this figure later in the paper.

Response 6: Thank you for this helpful suggestion. We agree that Figure 2 should be explicitly discussed where it is introduced. Accordingly, we added a brief explanatory paragraph next to Figure 2, summarizing the main stages of the workflow and the two-module architecture, and we included a forward reference to Section 2.4.1, where the workflow is described in detail.

Comments 7: 6. P. 8, Fig. 5. The figure shows the cooling fan needs 12 V as a power supply. The illustrated battery, however, provides only 7.4 V. Is it sufficient to supply the fan?

Response 7: Thank you for noticing this. The cooling fan used in our enclosure is rated for 12 V, but in the current prototype it is intentionally powered directly from the 7.4 V battery (i.e., operated under-voltage). This was a deliberate design choice to reduce component count and wiring complexity, since the Raspberry Pi does not require maximum airflow for the operating conditions of our experiments. Even at 7.4 V, the fan provides sufficient airflow for the intended thermal management in our setup. To avoid confusion, we clarified this point in the manuscript/figure.

Comments 8: 7. P. 11, l. 284. The authors introduce the 1–5 scale to estimate the ease of use. Does value 1 correspond to the most difficult case and value 5—to the easiest case?

Response 8: Thank you for the clarification request. We updated the manuscript to explicitly define the direction of the 1–5 ease-of-use scale. In our questionnaire, 1 corresponds to “very difficult” and 5 corresponds to “very easy”. Section 2.4.6 has been updated accordingly.

Comments 9: 8. P. 14, l. 380. There is a typo: “n contrast.”

Response 9: Thank you for spotting this typo. We corrected “n contrast” to “By contrast” in the revised manuscript. (This correction was already included in the previous revision; newly added text is highlighted in blue, while earlier corrections appear in yellow.)

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have satisfactorily addressed the reviewer's comments. In its current form, the paper is acceptable for publication.

Reviewer 2 Report

Comments and Suggestions for Authors

As the revisions are satisfactory, the manuscript is recommended for acceptance.
