Article
Peer-Review Record

Human-Machine Interaction: A Vision-Based Approach for Controlling a Robotic Hand Through Human Hand Movements

Technologies 2025, 13(5), 169; https://doi.org/10.3390/technologies13050169
by Gerardo García-Gil, Gabriela del Carmen López-Armas * and José de Jesús Navarro, Jr. *
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 19 March 2025 / Revised: 9 April 2025 / Accepted: 21 April 2025 / Published: 23 April 2025
(This article belongs to the Special Issue Image Analysis and Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I find the research interesting. Watching the videos, it is visible that the movement of the four fingers (index, middle, ring, and little) is not as precise as it could be, as shown by the detection presented in the second video. The movement was rather between two positions: straightened and bent.

There are some elements to be improved in the paper before submission:

  1. Line 298 - unnecessary repetition of "the principal point".
  2. Line 421 - you define (D), (IP), and (P) and afterwards use (FD) and (PF), which are undefined.
  3. Line 489 - equation (12): please delineate what s() and c() mean; they are explained in the original work you are citing.
  4. Line 615 - Figure 10: there are two (b) pictures and no (d) picture in Figure 10.
  5. Line 656 - Figure 11, caption of (d): it seems to have been automatically translated, with mistakenly chosen words.
  6. Line 675 - you write: "Figure 12(a-c) below shows how you move in our 3D workspace." Please rewrite to explain what is moving; I'm sure it is not about the movement of the reader of the paper in your space.

 

Author Response

We thank the Reviewer for the thoughtful and constructive comments. Your corrections and extremely valuable suggestions have helped us significantly improve the quality and clarity of our manuscript. Below, we address each of your comments point by point; all corrections have been incorporated into the revised manuscript.

Comment 1: Line 298 – Unnecessary repetition of the "principal point".

Response 1:

 Original:

 The line from the center of the camera perpendicular to the image plane is called the principal axis. The point where the principal axis meets the image plane is called the principal point P, which is the principal point. The camera's center is located here at the origin of the coordinates.

Revised:

The line from the center of the camera perpendicular to the image plane is called the principal axis. The point where the principal axis intersects the image plane is called the principal point (P). The camera center is located at the origin of the coordinate system.
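For reference, in the standard pinhole model the principal point appears as the offset (p_x, p_y) in the camera intrinsic matrix. A compact statement of this well-known form (generic notation, not copied from the manuscript):

```latex
% Standard pinhole projection; K is the intrinsic matrix and
% (p_x, p_y) are the pixel coordinates of the principal point P.
\[
K =
\begin{pmatrix}
f_x & 0   & p_x \\
0   & f_y & p_y \\
0   & 0   & 1
\end{pmatrix},
\qquad
\tilde{\mathbf{x}} \sim K \,[\, R \mid \mathbf{t} \,]\, \tilde{\mathbf{X}} .
\]
```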

Comment 2: Line 421 - you define (D), (IP), and (P) and afterwards use (FD) and (PF), which are undefined.

Response 2:

 Original:

Each of these fingers, except the thumb, has three bones called the distal phalanx (D), intermediate phalanx (IP), and proximal phalanx (P). Only the FD and the PF are found in the thumb case. The connections between these bones, known as joints, include the distal interphalangeal (DIP).

Revised:

The human hand comprises five fingers: the thumb, index, middle, ring, and little fingers. Each of these fingers, except for the thumb, consists of three bones: the distal phalanx (DP), intermediate phalanx (IP), and proximal phalanx (PP). The thumb, however, has only two: the distal phalanx and the proximal phalanx.

Comment 3: Line 489 - equation (12): please delineate what s() and c() mean; they are explained in the original work you are citing.

Response 3:

Revised:

Each homogeneous transformation matrix Ai is defined using the standard formula, as shown in Eq. (12), where the functions s(θi) and c(θi) represent the sine and cosine of θi, respectively.
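For completeness, and assuming Eq. (12) follows the classical Denavit-Hartenberg convention (the manuscript's exact equation is not reproduced here), the standard form of each Ai is:

```latex
% Classical DH homogeneous transform; s() and c() abbreviate sine and cosine,
% and (theta_i, d_i, a_i, alpha_i) are the usual DH parameters of joint i.
\[
A_i =
\begin{pmatrix}
c(\theta_i) & -s(\theta_i)\,c(\alpha_i) &  s(\theta_i)\,s(\alpha_i) & a_i\,c(\theta_i) \\
s(\theta_i) &  c(\theta_i)\,c(\alpha_i) & -c(\theta_i)\,s(\alpha_i) & a_i\,s(\theta_i) \\
0           &  s(\alpha_i)              &  c(\alpha_i)              & d_i \\
0           & 0                         & 0                         & 1
\end{pmatrix}
\]
```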

Comment 4: Line 615 - Figure 10. There are two (b) pictures and no (d) picture in Figure 10.

Response 4: Corrected.

Comment 5: Line 656 - Figure 11 caption of (d). It seems to be automatically translated with mistakenly chosen words.

Response 5: Corrected.

Comment 6: Line 675 - You write: "Figure 12(a-c) below shows how you move in our 3D workspace." - please rewrite to explain what is moving; I'm sure it is not about the movement of the reader of the paper in your space.

 Response 6:

Figure 12(a-c) below shows how the index finger of the robot, in side and front views, moves in its 3D workspace. Figure 12c shows the index finger of the robotic version in blue versus that of the human version in red. Quantitative and qualitative experimental results that support the effectiveness of the computer vision-based control system are presented. It is noteworthy that Figure 12 matches the workspace in Figure 8.

[Figure: index finger workspace]

Once again, we are grateful for your valuable feedback. If you have any additional comments, we are fully open to addressing them. The corrections have been incorporated into the revised and attached manuscript. 

Kind regards,

Gerardo García, PhD

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

There are numerous studies that have focused on controlling a robotic arm using natural gestures, especially hand gestures. Some have leveraged computer vision using an infrared depth sensor (Kinect, Leap Motion, or another similar device), while others have used traditional IMUs (Inertial Measurement Units), such as accelerometers to measure movement along the axes and gyroscopes to measure rotation around them.

The authors made use of MediaPipe (an open-source framework developed by Google), which is a well-established library for hand tracking, also used in multiple similar research papers.

In this paper, the authors propose using a simple web camera (Logitech C920) and eliminating the depth sensors to simplify the setup. Since the aim of the research is to identify various hand poses, in some cases a depth sensor (such as Kinect, Intel RealSense, or Leap Motion) would capture 3D hand poses better and would also manage occlusion better. Depth sensors also provide accurate results under varying lighting conditions but, as the authors note, they are more expensive, require more processing power, and are bulkier, making the whole setup considerably more complex.

The introduction and related-works sections are well defined and include a good number of recently published papers. Some related works presented by the authors make use of CNN-based approaches for gesture detection and classification, while others use pre-trained models optimized for real-time hand tracking, such as MediaPipe, which does not rely on deep learning training from scratch.

In Figure 1, I would suggest adjusting the images within Stage 1 to better fill the whole figure. The hand poses used in this figure also look odd: the hand is positioned on a background with shadows and, judging by their appearance, the images seem to be 3D renders of various hand poses. Adding images taken directly from the webcam, as presented in Figure 5, would be much better, as it would enhance the clarity and plausibility of the figure.

In Figure 3, the camera illustrated is not the Logitech C920 that was used; it is an eMeet C960 or a similar camera. I would suggest maintaining the same hardware images in all figures, as those are the devices that were actually used. The proposed work will no doubt function with both cameras, as they share similar technical specifications.

In Figure 6, the technical drawing presented on the right (uHand UNO robotic hand) has been heavily scaled on the vertical axis. Many images are available on the web, and it is obvious that the image has been heavily scaled and deformed; the 174 mm dimension is also taken from the side view of the same hand. The image currently used by the authors therefore does not correctly reflect the dimensions of the 2D technical drawing, particularly the 174 mm measurement. Please also consider adding a reference for each image that was not created by the authors.

Author Response

We sincerely appreciate your comments and suggestions, which have significantly contributed to improving our manuscript, especially regarding the figures addressed at the end of this response. We acknowledge that any manuscript can always be improved, and we understand that the assessment "needs improvement" across all categories is broad. However, for the aspects we consider essential, we proceed below to justify or modify the content as appropriate. The following topics are addressed:

  1. Does the introduction provide sufficient contextual information and include all relevant references?
  2. Is the research design appropriate?
  3. Are the methods adequately described?
  4. Are the results clearly presented?
  5. Are the conclusions supported by the results?

Response 1: The reviewer comments that the introduction "needs improvement".

The introduction clearly establishes the importance of human-machine interaction in today’s context, highlighting its crucial role in a variety of technological applications, including automation and medicine. By referencing recent research and emerging technologies, the narrative aligns with current trends in the field and market needs, which is essential to capture the interest of readers and reviewers.

With 43 carefully selected references, the section provides a broad and in-depth foundation that supports the claims and the study’s proposal. These references cover a wide range of approaches, from traditional methods to recent innovations, demonstrating a thorough literature review and a solid understanding of ongoing debates in the field. The inclusion of works published as recently as 2025 indicates that the manuscript incorporates the latest developments, further contributing to its relevance.

The literature review presents various approaches, such as computer vision and sensor-based methods, allowing a comprehensive comparison of existing gesture recognition techniques. This multidimensional perspective not only enriches the discussion but also positions the study within a broader context, highlighting its potential contributions to the literature.

Both the introduction and related work sections provide not only theoretical insights but also link to practical real-world applications. This increases the relevance of the work and may inspire future research and applications based on the findings discussed. The text is well-structured and coherent, helping readers follow the flow of ideas effectively. The logical organization aids in identifying gaps in the existing literature and how the proposed study aims to address them.

The manuscript clearly differentiates the proposed study from previous works, which allows reviewers and readers to appreciate the originality and importance of the contribution.
Of course, we are open to expanding or adjusting additional details if greater clarity or depth is deemed necessary.

Response 2: The reviewer suggests that the research design "needs improvement".

The research design presented in the manuscript is appropriate and well-grounded, as it aligns coherently with the objectives defined in the introduction. The main goal is clearly stated: to develop an innovative system to control a robotic arm through hand gestures and movements, eliminating reliance on sensors or traditional physical controls. To this end, MediaPipe and computer vision techniques are employed, enhancing human-machine interaction by accurately recognizing and replicating hand movements.
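To make the capture step concrete, the following minimal sketch shows how hand landmarks can be extracted with MediaPipe Hands and OpenCV. It is an illustration only, not the manuscript's implementation; the camera index and confidence thresholds are assumptions:

```python
import cv2
import mediapipe as mp

# Minimal sketch: extract 21 normalized hand landmarks per frame.
# Camera index (0) and confidence thresholds are illustrative assumptions.
mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)

with mp_hands.Hands(max_num_hands=1,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV delivers BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            # Landmark 8 is the index fingertip in MediaPipe's hand model.
            print(f"index tip: ({lm[8].x:.3f}, {lm[8].y:.3f}, {lm[8].z:.3f})")
cap.release()
```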

The methodology successfully integrates theory and practice, combining a solid modeling approach—including the pinhole camera model, direct and inverse kinematics, and Jacobian matrix construction—with rigorous experimental implementation. The use of a 2D camera to capture and process gestures, along with quantitative validation through error metrics (e.g., RMSE, MAE, and MAPE), reflects a comprehensive design encompassing both technical and empirical aspects.
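As an illustration of the error metrics named above (our own sketch, not the authors' evaluation code):

```python
import numpy as np

def error_metrics(angles_true, angles_pred):
    """RMSE, MAE, and MAPE between reference (human) and replicated (robot)
    joint-angle sequences, in the same units (e.g., degrees)."""
    t = np.asarray(angles_true, dtype=float)
    p = np.asarray(angles_pred, dtype=float)
    e = t - p
    rmse = np.sqrt(np.mean(e ** 2))
    mae = np.mean(np.abs(e))
    mape = 100.0 * np.mean(np.abs(e / t))  # assumes no zero reference angles
    return rmse, mae, mape
```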

The "Benchmark Experiments" section is particularly relevant as it focuses on replicating the movements of the human index finger, providing valuable insights into the system's performance. Furthermore, comparisons with existing methods in the literature allow the study to be placed within a broader research context, reinforcing the robustness of the design.

Another important feature of the design is its consideration of environmental conditions, using technologies that minimize the influence of external factors such as lighting and obstructions—a critical factor in vision-based systems. The discussion of real-world applications, such as the development of functional prosthetics or improved user interfaces, further highlights the relevance and applicability of the study.

Nevertheless, some improvements have been made:

  1. The calibration and modeling of the lateral fingers could be optimized, although it is acknowledged that these fingers contribute less to grasping and tend to show larger discrepancies.
  2. Filtering techniques, such as Kalman filters, have been implemented (line 610) to mitigate noise sensitivity in gesture detection; a minimal smoothing sketch is given after this list. Additionally, the number of participants in the experimental phase was expanded to include two adults (one male, one female) and two teenagers (male), as noted in line 597.
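The following is a minimal sketch of the kind of scalar Kalman smoothing referred to in item 2; the noise variances q and r are illustrative tuning assumptions, not the manuscript's actual parameters:

```python
import numpy as np

def kalman_smooth(z, q=1e-3, r=2.0):
    """Scalar Kalman filter over a noisy joint-angle sequence.
    q (process noise) and r (measurement noise) are illustrative values."""
    z = np.asarray(z, dtype=float)
    x, p = z[0], 1.0                  # initial state estimate and covariance
    out = np.empty_like(z)
    for k, zk in enumerate(z):
        p = p + q                     # predict: state modeled as constant
        g = p / (p + r)               # Kalman gain
        x = x + g * (zk - x)          # correct with the new measurement
        p = (1.0 - g) * p
        out[k] = x
    return out

# Example: smooth a noisy 60-sample angle trace around 45 degrees.
noisy = 45.0 + np.random.normal(0.0, 2.0, 60)
smooth = kalman_smooth(noisy)
```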

In summary, the research design is solid and appropriate, combining current technologies with a rigorous validation process to address human-machine interaction in an innovative way. However, we remain open to addressing specific suggestions related to the design that may further enhance the work.

Response 3: The reviewer suggests that the methods section "needs improvement".

The methods are described in a detailed and structured manner, covering both theoretical foundations and practical implementation. The manuscript explains the use of the pinhole camera model, the formulation of direct and inverse kinematics, the construction of the Jacobian matrix, and the integration of MediaPipe and image processing techniques for gesture recognition. Additionally, the manuscript includes equations, diagrams, and a step-by-step description of the experimental process, which enhances clarity.
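As an illustrative companion to this description (with placeholder DH parameters, not the manuscript's), a short sketch of how the individual transforms compose into the forward kinematics:

```python
import numpy as np

def dh_matrix(theta, d, a, alpha):
    """One Denavit-Hartenberg homogeneous transform A_i (standard convention)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(rows):
    """Compose T = A_1 A_2 ... A_n from (theta, d, a, alpha) rows."""
    T = np.eye(4)
    for row in rows:
        T = T @ dh_matrix(*row)
    return T

# Hypothetical 3-joint finger chain; link lengths in meters are placeholders.
T = forward_kinematics([(0.3, 0.0, 0.045, 0.0),
                        (0.5, 0.0, 0.025, 0.0),
                        (0.2, 0.0, 0.015, 0.0)])
fingertip = T[:3, 3]   # fingertip position in the base frame
```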

Nevertheless, based on the reviewer’s suggestion, some improvements have been implemented to further strengthen the methods section. These include:

  • A more detailed explanation of the calibration process, especially for lateral fingers with greater discrepancies.
  • The inclusion of filtering techniques (e.g., Kalman filters) to improve noise robustness in gesture detection.
  • A clearer description of the experimental conditions.

Response 4: The reviewer suggests that the results section "needs improvement".

The results are clearly and systematically presented. The manuscript includes detailed quantitative data using error metrics (such as RMSE, MAE, and MAPE), as well as comparative tables and graphs (Figures 10–12) that illustrate joint angle evolution for both the human and robotic models. The inclusion of diagrams and a step-by-step description of the experiments helps readers understand the system’s effectiveness in replicating hand movements.

However, we remain open to any additional suggestions to improve clarity and presentation.

Response 5: The reviewer suggests that the conclusions section "needs improvement".

In general, the conclusions are well-supported by the results, as the presented data (error metrics, graphs, and comparisons with existing methods) demonstrate the effectiveness of the proposed system. Nevertheless, we agree that the conclusion section could be strengthened by integrating the experimental findings more thoroughly and by highlighting both the achievements and the limitations of the study.

  • Quantitative Results: Specific findings such as an RMSE of 9.31° for index finger flexion and 4.9° for wrist rotation indicate a high level of precision. If the conclusions state that the system is accurate and effective, these results support such claims.
  • Orthonormal Design: The effective design and its validation through accurate results support the claim of the system’s superiority over previous methods.
  • Practical Applications: Conclusions about the system's potential impact on robotics and prosthetic design are justified, given the demonstrated ability to replicate human movement dynamics.

As with previous sections, we remain attentive to your feedback regarding our methods and experiments. Please feel free to express any concerns.

Comment on Figure 1:

The reviewer suggested adjusting the images in Stage 1 of Figure 1 for better layout, and noted that the hand poses seem unusual, possibly due to shadowed backgrounds or being 3D renders. The reviewer recommended using direct webcam images, as in Figure 5, to improve clarity and realism.

Response:

We have made the following adjustments:

  1. The images in Figure 1 have been resized and rearranged for better visual balance. It is clarified that each image corresponds to real experimental results obtained through each methodology applied by the respective authors.
  2. The hands placed over a shadowed background, which appeared to be 3D-rendered images of various hand poses, have been removed. Instead, images captured directly from the webcam have been added.

Comment 2: In Figure 3, the illustrated camera is not the Logitech C920 used, but rather an eMeet C960 or a similar model. I suggest using the same hardware images throughout all figures, as they reflect the actual devices used. The proposed system will undoubtedly work with both cameras, as they share similar technical specifications.

Response:
Corrected. Thank you for your observation.

Comment 3: In Figure 6, the technical drawing shown on the right (Uhan-UNO robotic hand) appears significantly vertically stretched. Numerous images are available online, and it is evident that the current image has been heavily scaled and distorted. Additionally, the 174 mm dimension was taken from the side view of the same hand. The image currently used, processed by the authors, does not accurately reflect the 2D technical drawing dimensions, particularly the 174 mm measurement. Please also consider including the source for any image not created by the authors.

Response:
Corrected. The image has been modified in the original document, and the source has been cited as "Courtesy of the robotic hand manufacturer" (© 2025 Hiwonder. All Rights Reserved).

We sincerely thank you for your valuable comments and suggestions, which have substantially improved the quality of our manuscript. Should you have any further observations, we remain fully open to addressing them. We remain at your disposal.

Kind regards,
Gerardo García, PhD
