A Lightweight Hand Attitude Estimation Method Based on GCN Feature Enhancement
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes a lightweight hand pose estimation method based on GCN to reduce computational complexity and achieve higher accuracy. The methodology is well-explained, and the experimental results are convincing. However, I suggest the following minor revisions to strengthen the paper:
1. The main feature of the proposed method is its lightweight design. Therefore, it would be better to conduct experiments that clearly demonstrate the effects of this design, such as improvements in computation time and storage space utilization.
2. The proposed method has shown excellent performance on the CMU-Hand dataset, but how well does the trained network perform on other datasets? The generalization performance of the method can be explored.
3. While the method shows promising results, a brief discussion on the potential limitations or challenges would provide a more balanced perspective. For example, due to the low computational intensity (the ratio of FLOPs to memory access) of depthwise convolution, it is difficult to make effective use of hardware. Besides, for these limitations, what improvements can be made to the proposed method?
4. Is the term "transposed convolution matrix" in Figure 9 incorrect? According to the description in the paper, it seems to be the "transposed input matrix."
5. There are some formatting oversights and errors that need to be corrected. For example, Figure 8 lacks the labels for subfigures a and b; on the second page, the fifth line from the bottom has two periods.
Author Response
For research article
Response to Reviewer 1 Comments
1. Summary

Thank you very much for taking the time to review this manuscript and for your constructive comments, which have helped us improve our work. We greatly appreciate your professional review of our paper. In line with your suggestions, we have made extensive and careful revisions to the previous manuscript. Please find the detailed responses below; the corresponding revisions are highlighted/shown in track changes in the re-submitted files.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: The paper proposes a lightweight hand pose estimation method based on GCN to reduce computational complexity and achieve higher accuracy. The methodology is well-explained, and the experimental results are convincing. However, I suggest the following minor revisions to strengthen the paper: 1. The main feature of the proposed method is its lightweight design. Therefore, it would be better to conduct experiments that clearly demonstrate the effects of this design, such as improvements in computation time and storage space utilization.

Response 1: Thank you very much for this valuable feedback, which provided important guidance for improving the paper. Following your suggestion, we experimentally verified the lightweight design of the proposed method and added comparative experiments on three key evaluation indicators: FPS (frames per second, reflecting computational efficiency), Params (number of parameters, measuring model complexity), and FLOPs (floating-point operations, evaluating computational complexity). The results show that our method achieves significant improvements in all three while maintaining accuracy: the higher FPS indicates higher computational efficiency, the reduction in Params confirms the decrease in model complexity, which helps reduce storage space in practical applications, and the reduction in FLOPs further validates the advantage of our method in lowering computational complexity. In addition, we have added an explanation of the computation time in the paper.
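For readers who wish to reproduce this kind of comparison, the sketch below shows one common way to count parameters and forward-pass FLOPs in PyTorch using the third-party thop package; the input resolution and the helper name are assumptions for illustration, not the code used in the paper.

```python
import torch
from thop import profile  # third-party counter; reports MACs (commonly quoted as FLOPs) and params


def complexity_stats(model, input_size=(1, 3, 256, 256)):
    """Count parameters and forward-pass FLOPs for one image (input size is an assumption)."""
    dummy = torch.randn(*input_size)
    flops, params = profile(model, inputs=(dummy,), verbose=False)
    return params / 1e6, flops / 1e9  # millions of parameters, GFLOPs
```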
Comments 2: The proposed method has shown excellent performance on the CMU-Hand dataset, but how well does the trained network perform on other datasets? The generalization performance of the method can be explored.

Response 2: Thank you for this valuable comment; we have added experiments to address it. To evaluate the generalization performance of the proposed method, we added the RHD hand dataset for training and testing, and conducted comparative experiments against one method published in 2021 and two methods published in 2023. The results show that the proposed method outperforms these comparison methods on the RHD dataset. The newly added experiments not only validate the effectiveness of the proposed method but also demonstrate its good generalization performance.
Comments 3: While the method shows promising results, a brief discussion on the potential limitations or challenges would provide a more balanced perspective. For example, due to the low computational intensity (the ratio of FLOPs to memory access) of depthwise convolution, it is difficult to make effective use of hardware. Besides, for these limitations, what improvements can be made to the proposed method?

Response 3: Thank you for pointing this out; we agree with this comment and appreciate the professional guidance. As you noted, the paper lacked a discussion of potential limitations and challenges, so we have added a discussion of limitations and future improvements in Section 3. Regarding the difficulty of utilizing hardware efficiently due to the low computational intensity (the ratio of FLOPs to memory access) of depthwise convolution, we plan to explore algorithm-hardware co-design to improve overall system performance and to investigate how the latest hardware technology can be combined with our method to achieve more efficient real-time processing.
Comments 4: Is the term "transposed convolution matrix" in Figure 9 incorrect? According to the description in the paper, it seems to be the "transposed input matrix."

Response 4: Thank you for the careful reading and valuable suggestion. Our original wording was imprecise and caused ambiguity. In the deconvolution process, the original input matrix is zero-padded and then convolved with the transposed convolution kernel matrix; this is the core step of the deconvolution process. We have rewritten this description so that it is accurate and no longer ambiguous.
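As a sanity check for this description, the sketch below (with hypothetical sizes) verifies one standard view of stride-1 deconvolution: zero-padding the input and convolving it with the spatially flipped kernel reproduces PyTorch's built-in transposed convolution. This is an illustrative equivalence, not a reproduction of Figure 9.

```python
import torch
import torch.nn.functional as F

# Hypothetical single-channel example: 3x3 input, 2x2 kernel.
x = torch.randn(1, 1, 3, 3)
w = torch.randn(1, 1, 2, 2)

# Built-in transposed convolution (stride 1, no padding): output is 4x4.
y_builtin = F.conv_transpose2d(x, w)

# Equivalent view: zero-pad the input by kernel_size - 1 on each side,
# then convolve with the spatially flipped kernel.
x_pad = F.pad(x, (1, 1, 1, 1))
w_flip = torch.flip(w, dims=[2, 3])
y_manual = F.conv2d(x_pad, w_flip)

print(torch.allclose(y_builtin, y_manual, atol=1e-6))  # True
```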
Comments 5: There are some formatting oversights and errors that need to be corrected. For example, Figure 8 lacks the labels for subfigures a and b; on the second page, the fifth line from the bottom has two periods.

Response 5: Thank you for the correction; we apologize for our carelessness. In the resubmitted revised manuscript, we have added labels to the subfigures of Figure 8 and checked and corrected the punctuation throughout the text to maintain grammatical correctness and fluency. Thank you again for your review and valuable feedback.
3. Response to Comments on the Quality of English Language

Point 1: English is fine, there are only some typos (nothing unusual). However, I would suggest to improve the flow of the paper, in order to enhance readability.

Response 1: Thank you very much for your positive assessment of our work. We have nevertheless invited MDPI native-speaker editors to proofread the paper to improve sentence structure and word choice, and we have carefully checked the spelling, grammar, and proper nouns throughout. We have also carefully revised the full paper according to the requirements of the journal Electronics.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This is the review for the manuscript titled "A Lightweight Hand Attitude Estimation Method Based on GCN Feature Enhancement".
The authors have focused on developing a lightweight method for estimating hand attitude.
There are many shortcomings in this paper, some of them have been listed below. I would request authors to address these comments first.
Section 2 is relatively lengthy. There is a lot of information that readers can read from the original literature. My suggestion to the authors would be to reduce any redundant details in the manuscript and focus more towards proposed work.
Inconsistency in the number of RexNet parameters.
The authors have mentioned the parameters of RexNet only. However, their method uses other computations as well, such as GCN. How is the complexity affected (computation and time)?
There is no information about the hardware platform. What was the inference speed? Framerate?
Can this method be used in real time?
There is no comparison with the state-of-the-art and recent published studies. e.g., Wang, Bin, Liwen Yu, and Bo Zhang. "AL-MobileNet: a novel model for 2D gesture recognition in intelligent cockpit based on multi-modal data." Artificial Intelligence Review 57.10 (2024): 282.
The results are not described properly. Only AUC has been discussed.
What about accuracy?
Please mention the limitations of this work.
There are several typos in this paper.
Several sentences are not structured properly.
Comments on the Quality of English Language
There is a need to improve the quality of English language.
Author Response
For research article
Response to Reviewer 2 Comments
1. Summary

Thank you very much for taking the time to review this manuscript and for your constructive comments, which have helped us improve our work. We greatly appreciate your professional review of our paper. In line with your suggestions, we have made extensive and careful revisions to the previous manuscript. Please find the detailed responses below; the corresponding revisions are highlighted/shown in track changes in the re-submitted files.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: This is the review for the manuscript titled "A Lightweight Hand Attitude Estimation Method Based on GCN Feature Enhancement". The authors have focused on developing a lightweight method for estimating hand attitude. There are many shortcomings in this paper, some of them have been listed below. I would request authors to address these comments first. Section 2 is relatively lengthy. There is a lot of information that readers can read from the original literature. My suggestion to the authors would be to reduce any redundant details in the manuscript and focus more towards proposed work.

Response 1: Thank you for the meticulous and thorough reading of this manuscript and for the constructive suggestions, which have helped improve its quality. Regarding the comment that Section 2 is relatively long and should be reduced to focus more on the proposed work, we have carefully considered this and made adjustments. We streamlined the content of the original Section 2, removing redundant information, and, to preserve coherence and readability, integrated the condensed description of the benchmark method into the first subsection, "Benchmark Model", of Section 2, "Methods in this Paper". This retains the necessary background while highlighting the focus and innovation of our work.
Comments 2: Inconsistency in the number of RexNet parameters.

Response 2: Thank you very much for this valuable comment. After carefully reviewing your feedback, we found that there were indeed inconsistencies in the reported RexNet parameter counts, and we apologize for this oversight. We have conducted a comprehensive review of the article and corrected the inconsistencies. Thank you again for your careful review.
Comments 3: The authors have mentioned the parameters of RexNet only. However, their method uses other computations as well, such as GCN. How is the complexity affected (computation and time)?

Response 3: Thank you very much for this suggestion. In our paper, RexNet serves as the backbone network for feature extraction, which significantly reduces the number of model parameters compared to the traditional ResNet50; this is important for keeping the hand pose estimation model lightweight. To further improve the accuracy of joint estimation, we constructed a GCN feature enhancement module that effectively enhances the final hand pose estimation results. Although introducing the GCN increases the computational cost of the model, our experiments show that the RexNet+GCN combination still has significantly lower computational complexity than the ResNet50+GCN combination. Given the strict accuracy requirements of hand pose estimation, this small increase in computation is acceptable and yields a clear performance improvement.
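To illustrate how small such a graph module can be relative to a convolutional backbone, the sketch below implements a generic graph-convolution layer over the 21 hand joints with a fixed, normalized adjacency matrix. It is not the authors' exact module; the edge list and feature dimensions are assumptions for the example.

```python
import torch
import torch.nn as nn


class HandGCNLayer(nn.Module):
    """Generic graph-convolution layer over hand joints (illustrative sketch)."""

    def __init__(self, adjacency: torch.Tensor, in_dim: int, out_dim: int):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        self.register_buffer("norm_adj", norm_adj)            # fixed graph, not learned
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                     # x: (batch, 21, in_dim)
        return torch.relu(self.norm_adj @ self.linear(x))     # (batch, 21, out_dim)


# Example: 21 hand joints, a hypothetical subset of skeleton edges, 64-d features.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = torch.zeros(21, 21)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
layer = HandGCNLayer(A, in_dim=64, out_dim=64)
out = layer(torch.randn(2, 21, 64))
print(sum(p.numel() for p in layer.parameters()))            # about 4.2k parameters
```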
Comments 4: There is no information about the hardware platform. What was the inference speed? Framerate?

Response 4: Thank you again for this valuable feedback. Hardware platform information and inference speed are indeed crucial for evaluating model performance, so we have added the relevant data. In the experimental setup (Section 2.1), we now state that training and testing were performed on a Linux operating system with an NVIDIA GeForce RTX 3080 GPU and 40 GB of memory. In the comparative analysis of experimental results (Section 2.3), we added a comparative experiment with three evaluation indicators, FPS, Params, and FLOPs, to comprehensively evaluate the performance of the proposed hand pose estimation model.
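A minimal sketch of how inference FPS is commonly measured on a GPU is given below; the input resolution, warm-up length, and number of timed runs are assumptions for illustration, not the exact protocol used in the paper.

```python
import time

import torch


def measure_fps(model, input_size=(1, 3, 256, 256), warmup=20, runs=200):
    """Average single-image inference frame rate on a CUDA device."""
    model = model.eval().cuda()
    dummy = torch.randn(*input_size, device="cuda")
    with torch.no_grad():
        for _ in range(warmup):          # warm-up to exclude one-off CUDA setup costs
            model(dummy)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    return runs / (time.perf_counter() - start)
```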
Comments 5: Can this method be used in real time?

Response 5: Thank you for the careful review. We attach great importance to this question; it is one of the reasons we replaced ResNet with RexNet as the backbone feature extraction network. RexNet is an improvement based on MobileNetV2 and can be deployed on mobile devices. We also conducted experimental verification: the results show that our model has a fast inference speed and a small number of parameters, is fully capable of real-time processing, and can meet the requirements of real-time hand pose estimation applications.
Comments 6: There is no comparison with the state-of-the-art and recent published studies. e.g., Wang, Bin, Liwen Yu, and Bo Zhang. "AL-MobileNet: a novel model for 2D gesture recognition in intelligent cockpit based on multi-modal data." Artificial Intelligence Review 57.10 (2024): 282.

Response 6: Thank you for this valuable feedback. In response to your comment on comparisons with state-of-the-art and recently published research, we have added the corresponding experiments and included the results in the paper. To comprehensively evaluate the proposed method, we added comparative experiments against two recent studies published in 2023. The results show that the method proposed in this paper outperforms both on the key indicators. These comparisons not only verify the effectiveness of the proposed method but also demonstrate its competitiveness in the current research field. Thank you again for this feedback, which has helped improve the content and quality of our paper.
Comments 7: The results are not described properly. Only AUC has been discussed.

Response 7: Thank you again for the valuable suggestions to improve the quality of our manuscript. Based on your suggestion, we have added a comprehensive analysis of additional indicators. In the revised version, we analyze evaluation indicators closely related to hand pose estimation, including the mean error (E-mean), median error (E-median), frames per second (FPS), number of model parameters (Params), and floating-point operations (FLOPs), and we discuss the relationship between these indicators and their specific impact on model performance.
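For reference, the sketch below shows how per-joint error statistics such as E-mean and E-median are typically computed from predicted and ground-truth keypoints; the array shapes and units are assumptions for illustration.

```python
import numpy as np


def keypoint_errors(pred, gt):
    """pred, gt: (N, 21, 2) arrays of predicted / ground-truth 2D joints (e.g., in pixels).

    Returns the mean (E-mean) and median (E-median) of the per-joint Euclidean errors.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)   # (N, 21) per-joint distances
    return dist.mean(), np.median(dist)


e_mean, e_median = keypoint_errors(np.random.rand(100, 21, 2), np.random.rand(100, 21, 2))
```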
Comments 8: What about accuracy?

Response 8: Thank you for your careful review. Our method effectively reduces the number of model parameters and the computational cost while preserving the accuracy of hand pose estimation. We conducted a series of detailed experiments comparing our method with both non-lightweight and lightweight methods from recent years. The results show that the accuracy of our method is on par with, or better than, recent non-lightweight methods. In particular, in the PCK-curve evaluation, our method achieves high accuracy, indicating that the predicted keypoints match the ground-truth joints closely within the set distance threshold. This shows that our method achieves a good balance between being lightweight and being accurate.
Comments 9: Please mention the limitations of this work.

Response 9: Thank you for the meticulous review and the valuable question. This paper addresses the low accuracy caused by existing hand pose estimation methods ignoring the internal relationships between hand joints; the proposed algorithm achieves a balance between lightweight design and accuracy, maintaining a low parameter count while delivering high estimation accuracy. As limitations and future work, we plan to improve the robustness of the model to occlusion and self-similarity in complex scenes, optimize device compatibility, and enhance the model's generalization ability.
Comments 10: There are several typos in this paper.

Response 10: Thank you for the careful reading and valuable suggestions, and we apologize for these oversights. We have re-examined the entire manuscript and corrected the typos one by one.
Comments 11: Several sentences are not structured properly.

Response 11: Once again, we sincerely thank the reviewer for the patient guidance. We have carefully polished the language of the paper and revised unclear sentences and improper sentence structures to make the logic stronger and the wording more accurate, without affecting the research content or the overall framework of the paper.
3. Response to Comments on the Quality of English Language

Point 1: The Language is correct, but some sentences are quite complex to read. I suggest a deeper proofreading of the paper to make it more concise, deleting long subordinate sentences.

Response 1: Thank you very much for your assessment of our work. We have invited MDPI native-speaker editors to proofread the paper to improve sentence structure and word choice, and we have carefully checked the spelling, grammar, and proper nouns throughout. We have also carefully revised the full paper according to the requirements of the journal Electronics.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Rong et al. developed a hand pose estimation method based on a series of improvements on a convolutional neural network model, including a lightweight feature extraction network, separable convolution layers, a deconvolution model, and a graph convolutional module. The proposed method was able to obtain a detection accuracy similar to the benchmark standard, however with a smaller network size.
Overall, I would recommend reconsidering the acceptance of the paper after a revision as more experimental results are needed to justify the conclusions of this paper. The results presented in this paper only included general evaluation metrics of detection accuracy and an ablation study, where the proposed method gave a slightly lower accuracy than SBL and LearnableGroups-Hand. To ground the conclusion of this paper, stats of computational complexity (e.g., number of parameters, FLOPs) as well as performance results on the devices of interest to this paper need to be presented.
Some other minor issues:
Abstract: Please specify which dataset the proposed method was tested on.
Page 1, Paragraph 3: Please only use the last name when citing the authors of an article: "Li et al. [9]". This also applies to several following citations in the paper.
Please also explain the meaning of Kinect. If it indeed refers to the motion tracking device developed by Xbox, then relevant explanation also needs to be added here.
Section 2 does not seem to discuss machine learning theory, instead it still mainly introduces the backbone models on which the proposed method is based. Therefore the section can be combined with Section 3 for clarity.
Some of the sentences in this paper look unnatural or fragmented, such as "In this paper, replace the last three layers of the RexNet network structure with three deconvolution modules, and use the simplest method to estimate heat maps from high-resolution and low-resolution feature maps."
Comments on the Quality of English Language
Some of the sentences in this paper look unnatural or fragmented, such as "In this paper, replace the last three layers of the RexNet network structure with three deconvolution modules, and use the simplest method to estimate heat maps from high-resolution and low-resolution feature maps."
Author Response
For research article
Response to Reviewer 3 Comments
1. Summary

Thank you very much for taking the time to review this manuscript and for your constructive comments, which have helped us improve our work. We greatly appreciate your professional review of our paper. In line with your suggestions, we have made extensive and careful revisions to the previous manuscript. Please find the detailed responses below; the corresponding revisions are highlighted/shown in track changes in the re-submitted files.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: Rong et al. developed a hand pose estimation method based on a series of improvements on a convolutional neural network model, including a lightweight feature extraction network, separable convolution layers, a deconvolution model, and a graph convolutional module. The proposed method was able to obtain a detection accuracy similar to the benchmark standard, however with a smaller network size. Overall, I would recommend reconsidering the acceptance of the paper after a revision as more experimental results are needed to justify the conclusions of this paper. The results presented in this paper only included general evaluation metrics of detection accuracy and an ablation study, where the proposed method gave a slightly lower accuracy than SBL and LearnableGroups-Hand. To ground the conclusion of this paper, stats of computational complexity (e.g., number of parameters, FLOPs) as well as performance results on the devices of interest to this paper need to be presented.

Response 1: Thank you very much for the detailed comments and suggestions, which we have carefully adopted. First, we added the RHD hand dataset and conducted comparative experiments on it against methods proposed in recent years, further verifying the stability and generalization ability of our method across datasets. Second, following your suggestion, we added three evaluation metrics, FPS (frames per second), Params (number of parameters), and FLOPs (floating-point operations), to comprehensively evaluate the computational complexity and efficiency of the proposed method. The experimental results show that the proposed method achieves a higher FPS, demonstrating its computational efficiency; the reduction in Params reflects the decrease in model complexity, which is important for reducing storage requirements in practical applications; and the reduction in FLOPs further validates the advantages of our method in lowering computational cost and improving speed. In addition, we added details about the experimental hardware platform and the inference speed. Through these supplements, we have comprehensively evaluated the performance of the proposed hand pose estimation model and shown that it achieves a good balance between being lightweight and being accurate. We believe that, with these improvements, the paper is more convincing and better demonstrates the innovation and practicality of the proposed method. Thank you again for your review and valuable feedback.
Comments 2: Abstract: Please specify which dataset the proposed method was tested on.

Response 2: Thank you very much for this valuable comment. We have supplemented the abstract accordingly: the revised abstract now clearly states the two datasets on which the proposed method was tested, which helps readers obtain a more complete picture of our work. Thank you again for your review and feedback.
Comments 3: Page 1, Paragraph 3: Please only use the last name when citing the authors of an article: "Li et al. [9]". This also applies to several following citations in the paper.

Response 3: Thank you for carefully reviewing our paper and providing this feedback. We have revised the citation in the third paragraph of page 1 and the other citations you pointed out according to your suggestion, and we have checked the remaining citations to ensure they are formatted correctly.
Comments 4: Please also explain the meaning of Kinect. If it indeed refers to the motion tracking device developed by Xbox, then relevant explanation also needs to be added here.

Response 4: Thank you very much for pointing this out. The Kinect mentioned in the paper does indeed refer to the Xbox 360 motion-sensing peripheral developed by Microsoft: a 3D motion-sensing camera that provides color and depth images. Kinect integrates real-time motion capture, image recognition, microphone input, speech recognition, and other functions; it can accurately capture human motion and posture information and provides a convenient, low-cost way to obtain data. Data acquired with Kinect can be used to recognize gestures and thus enable contactless control. We have added an explanation of Kinect in the paper so that readers can understand it clearly. Thank you again for your careful review and valuable feedback.
Comments 5: Section 2 does not seem to discuss machine learning theory, instead it still mainly introduces the backbone models on which the proposed method is based. Therefore the section can be combined with Section 3 for clarity.

Response 5: Thank you for this valuable suggestion. The previous Section 2 introduced the benchmark model on which our method is based. Following your suggestion, we adjusted the structure of the paper to improve its coherence and clarity by merging the benchmark model section into Section 3. During the merge, we paid particular attention to the logical order within the chapter and added a new subsection, placed before the description of the proposed method, that introduces the network model it builds on. We believe this adjustment both addresses your comment and further improves the overall quality and readability of the paper.
Comments 6: Some of the sentences in this paper look unnatural or fragmented, such as "In this paper, replace the last three layers of the RexNet network structure with three deconvolution modules, and use the simplest method to estimate heat maps from high-resolution and low-resolution feature maps."

Response 6: Thank you for pointing out the unclear expression and grammatical errors. We have carefully revised the paper to make the writing clearer, more coherent, and better organized. The sentence in question has been rewritten as follows: building on the RexNet network introduced above, this paper replaces its last three layers with three deconvolution modules, and heat maps are jointly estimated from high-resolution and low-resolution feature maps in a direct and efficient way.
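A minimal sketch of the kind of deconvolution head described here is shown below; the number of input channels, the intermediate channel width, and the use of 21 output heatmaps are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


def deconv_head(in_channels=1280, num_joints=21):
    """Three transposed-convolution stages that upsample backbone features
    into per-joint heatmaps (channel sizes are assumed for illustration)."""
    layers = []
    channels = in_channels
    for _ in range(3):
        layers += [
            nn.ConvTranspose2d(channels, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        ]
        channels = 256
    layers.append(nn.Conv2d(256, num_joints, kernel_size=1))   # one heatmap per joint
    return nn.Sequential(*layers)


# An 8x8 backbone feature map becomes 21 heatmaps of size 64x64.
heatmaps = deconv_head()(torch.randn(1, 1280, 8, 8))
print(heatmaps.shape)   # torch.Size([1, 21, 64, 64])
```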
3. Response to Comments on the Quality of English Language

Point 1: The Language is correct, but some sentences are quite complex to read. I suggest a deeper proofreading of the paper to make it more concise, deleting long subordinate sentences.

Response 1: Thank you very much for your assessment of our work. We have invited MDPI native-speaker editors to proofread the paper to improve sentence structure and word choice, and we have carefully checked the spelling, grammar, and proper nouns throughout. We have also carefully revised the full paper according to the requirements of the journal Electronics.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for addressing the comments.
There are a few issues in the updated manuscript that must be addressed before it can be accepted for publication.
Two terms SqueezeNet and Skeezenet are used in the manuscript. Is this a typo? Do they refer to same network?
Section 2.2: RexNet is used to replace original ResNet50... replace from where?
Section 2.2.3: The calculation process is shown in Eq. (4), where ? represents... Sentence incomplete
Section 2.2.3: A represents the likelihood region, and "Hk(p)".... should be H_k(p)
Section 2.2.3: Equation 4 is not defined properly.
Section 3.2 Second paragraph: where "di" represents....... should be "d_i"
"D" is the normalization.... "d" is the normalization...
Equation 8: There are two summations (Σ and σ). What is the significance of σ? Is this a function? It is not defined in the text.
Equation 8: In the division Σ_i l, is this l or 1? This is also not defined.
Section 3.3 From the figure [29], it can be seen... What are you referring to? A figure in reference [29]? Which figure?
The datasets used in this study are not properly referenced. At one instance (Section 3.3) CMU hand dataset is referenced "In this study, a test is conducted on the CMU-Hand dataset [28]...." Where [28] is not CMU hand dataset.
RHD is not cited at all.
The caption of Figures 12, 13 must be rewritten with more information about the graphs. Include information about the dataset as well.
Author Response
For research article
Response to Reviewer 2 Comments
1. Summary

Thank you very much for taking the time to review this manuscript and for your constructive comments, which have helped us improve our work. We greatly appreciate your professional review of our paper. In line with your suggestions, we have made extensive and careful revisions to the previous manuscript. Please find the detailed responses below; the corresponding revisions are highlighted/shown in track changes in the re-submitted files.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: There are a few issues in the updated manuscript that must be addressed before it can be accepted for publication. Two terms SqueezeNet and Skeezenet are used in the manuscript. Is this a typo? Do they refer to same network?

Response 1: We thank the reviewer for this correction and apologize for our carelessness. The two terms refer to the same network; "Skeezenet" was a typo and has been corrected to "SqueezeNet" throughout the latest revised manuscript. Thank you again for pointing this out.
Comments 2: Section 2.2: RexNet is used to replace original ResNet50... replace from where?

Response 2: Thank you for this valuable comment. The paper states that the lightweight feature extraction network RexNet is used to replace the original ResNet50 as the backbone network for hand feature extraction in order to reduce the number of model parameters. Specifically, RexNet replaces ResNet50 as the entire hand feature extraction module. RexNet was chosen mainly because of its lightweight structure and its ability to significantly reduce the number of parameters while maintaining accuracy. With this substitution, we successfully reduced the number of parameters of the model, and the experiments verify that RexNet offers better overall performance than ResNet50 while maintaining comparable accuracy.
Comments 3: Section 2.2.3: The calculation process is shown in Eq. (4), where ? represents... Sentence incomplete. Section 2.2.3: A represents the likelihood region, and "Hk(p)".... should be H_k(p). Section 2.2.3: Equation 4 is not defined properly.

Response 3: Many thanks for these valuable comments. We noticed that in the previously submitted manuscript the explanatory sentence for Equation (4) was indeed incomplete, most likely due to a formatting or display issue. We have checked Equation (4) and completed this explanation; the revised sentence reads: "The calculation process is shown in Eq. (4), where the left-hand term denotes the estimated position of the k-th joint, A represents the likelihood region, and H_k(p) represents the likelihood value at point p."
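As an illustration of this kind of likelihood-based joint estimate, the sketch below computes a joint position as a likelihood-weighted average over a thresholded region of the heatmap. The weighted-average form and the threshold value are assumptions for illustration; the precise definition is the one given by Eq. (4) in the paper.

```python
import torch


def joint_from_heatmap(heatmap, region_frac=0.1):
    """Estimate one joint position from its likelihood map H_k(p).

    The likelihood region A is approximated here as the set of pixels whose
    likelihood exceeds a fraction of the maximum (an assumption for illustration).
    """
    h, w = heatmap.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    mask = heatmap >= region_frac * heatmap.max()   # likelihood region A
    weights = heatmap * mask
    x = (weights * xs).sum() / weights.sum()
    y = (weights * ys).sum() / weights.sum()
    return x.item(), y.item()


x, y = joint_from_heatmap(torch.rand(64, 64))
```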
Comments 4: Section 3.2 Second paragraph: where "di" represents....... should be "d_i". "D" is the normalization.... "d" is the normalization...

Response 4: Thank you for the careful review; we apologize for these omissions. Because d appeared at the beginning of a sentence, its capitalization in the previous revision caused ambiguity, so we rephrased the passage and merged it with the preceding sentence. The revised text reads: "where n denotes the number of hand keypoints and d_i denotes the distance between the predicted value and the labeled ground-truth value of the i-th hand joint; d is the normalization factor, taken in this paper as the Euclidean distance from the center of the palm to the tip of the middle finger; T is the agreed threshold, set to 30 mm in the experiments; and the operator indicates whether the predicted keypoint falls within the threshold range."
Comments 5: Equation 8: There are two summations (Σ and σ). What is the significance of σ? Is this a function? It is not defined in the text. Equation 8: In the division Σ_i l, is this l or 1? This is also not defined.

Response 5: Once again, thank you for the patient guidance. Following your comments, we have explicitly stated the meaning of σ in the paper, defined and explained the other parameters, and corrected the formula so that it does not hinder reading and understanding. The revised explanation reads: "where n denotes the number of hand keypoints and d_i denotes the distance between the predicted value and the labeled ground-truth value of the i-th hand joint; d is the normalization factor, taken as the Euclidean distance from the center of the palm to the tip of the middle finger; T is the agreed threshold, set to 30 mm in the experiments; and the indicator operator σ(·) denotes whether the predicted keypoint falls within the threshold range."
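Based on the quantities defined in this response (per-joint distance d_i, normalization factor d, threshold T = 30 mm, and the indicator σ), a PCK-style computation can be sketched as follows. The array shapes and the choice of whether the threshold is applied to the raw or the normalized distance are assumptions for illustration; the authoritative definition is Eq. (8) in the paper.

```python
import numpy as np


def pck(pred, gt, d_norm, thresh=30.0, normalize=False):
    """Percentage of Correct Keypoints.

    pred, gt : (N, 21, 3) predicted / ground-truth joint coordinates in mm (assumed).
    d_norm   : (N,) normalization factor d (palm-center to middle-fingertip distance).
    thresh   : agreed threshold T (30 mm in the paper's experiments).
    """
    d_i = np.linalg.norm(pred - gt, axis=-1)        # per-joint distances d_i, shape (N, 21)
    if normalize:
        d_i = d_i / d_norm[:, None]                 # divide by the normalization factor d
    return (d_i <= thresh).mean()                   # mean of the indicator sigma(d_i <= T)
```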
Comments 6: Section 3.3 From the figure [29], it can be seen... What are you referring to? A figure in reference [29]? Which figure? The datasets used in this study are not properly referenced. At one instance (Section 3.3) CMU hand dataset is referenced "In this study, a test is conducted on the CMU-Hand dataset [28]...." Where [28] is not CMU hand dataset. RHD is not cited at all.

Response 6: Thank you for the careful reading and valuable suggestions; we apologize for these omissions. The figure mentioned in the text actually refers to Figure 11 of this paper (the visualization results); this has been corrected and labeled clearly in the revised manuscript. Regarding the dataset citation error, we acknowledge that the citation number for the CMU-Hand dataset was not updated after new references were added. We sincerely apologize for this; the citation for the CMU-Hand dataset has been corrected in the main text to ensure the accuracy and consistency of all citations, and a citation for the RHD dataset has been added.
Comments 7: The caption of Figures 12, 13 must be rewritten with more information about the graphs. Include information about the dataset as well.

Response 7: Thank you for the meticulous review and the valuable suggestion. We have rewritten the captions of these two figures so that they clearly state the dataset, the evaluation metric, and the experimental setting.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have provided a revised manuscript that has addressed all issues in the previous report. Therefore, I would recommend acceptance of the revised manuscript after addressing the question below:
Table 3 refers to two hand detection methods, METRO and FastMETRO, as baselines; however, no references are provided for these two methods, nor is the relationship between these methods and the previous baselines used for accuracy evaluation explained. The authors need to elaborate on the reason why METRO and FastMETRO are used as performance baselines as well as to provide relevant references.
Author Response
For research article
Response to Reviewer 3 Comments
1. Summary

Thank you very much for taking the time to review this manuscript and for your constructive comments, which have helped us improve our work. We greatly appreciate your professional review of our paper. In line with your suggestions, we have made extensive and careful revisions to the previous manuscript.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: Table 3 refers to two hand detection methods, METRO and FastMETRO, as baselines; however, no references are provided for these two methods, nor is the relationship between these methods and the previous baselines used for accuracy evaluation explained. The authors need to elaborate on the reason why METRO and FastMETRO are used as performance baselines as well as to provide relevant references.

Response 1: Thank you very much for this valuable comment, which provided important guidance for further improving the paper. Following your suggestion, we have elaborated on METRO and FastMETRO in the discussion of Table 3. We chose them as comparison methods for the following reasons: both are representative methods in hand pose estimation in recent years and are frequently used as comparison baselines; they achieve excellent estimation accuracy, and FastMETRO in addition offers high computational efficiency, striking a good balance between detection speed and accuracy. Comparing against them therefore lets us observe the performance gain over standard methods and also position our method against approaches with similar lightweight processing, so that its strengths and weaknesses can be evaluated more objectively. Following the reviewer's suggestion, we have also added the references for these two methods in the text:
[25] Lin, K.; Wang, L.; Liu, Z. End-to-End Human Pose and Mesh Reconstruction with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021; pp. 1954-1963.
[26] Cho, J.; Kim, Y.; Oh, T. Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers. In Proceedings of the European Conference on Computer Vision, 2022; pp. 342-359.
Author Response File: Author Response.pdf