Peer-Review Record

Improving Dynamic Gesture Recognition with Attention-Enhanced LSTM and Grounding SAM

Electronics 2025, 14(9), 1793; https://doi.org/10.3390/electronics14091793
by Jinlong Chen †, Fuqiang Jin †, Yingjie Jiao, Yongsong Zhan and Xingguo Qin *
Submission received: 28 March 2025 / Revised: 22 April 2025 / Accepted: 23 April 2025 / Published: 28 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Review Comments

The paper addresses the problem of dynamic gesture recognition in complex environments by proposing a hybrid model that combines Transformer and LSTM architectures. The authors should address the comments below:

  1. The authors presented the background of the study in the first paragraph; the rest focuses on challenges. Could the authors elaborate further, giving use cases of dynamic gesture detection in the introduction?
  2. The captions of Figure 1 and all other figures are long. A high-level description of Figure 1 can appear in Section 3 before the figure appears.
  3. Authors should mention the figures in the text before they appear in the manuscript.
  4. Authors should revise all table captions to conform with the journal requirements.
  5. The authors should introduce Algorithm 1 before it is presented in the manuscript.
  6. Authors should revise the first paragraph of section 4.1 as it is confusing.
  7. All abbreviations should be defined upon first use in the manuscript, with the abbreviation provided in parentheses. The same applies to the abstract. Examples: LSTM, SAM …
  8. The proposed hybrid model may have a high computational cost due to the stacked Transformer and LSTM layers. This could be a concern for deployment in real-time or resource-constrained environments. Authors should include an analysis of inference time and memory usage to strengthen the evaluation.
  9. Authors are encouraged to include an ablation study that justifies the use of both Transformer and LSTM layers. Results comparing Transformer only, LSTM only and Transformer + LSTM would clarify the value added by each component.
  10. Lines 246–247 and 248–253 communicate the same message, but the authors still do not mention Algorithm 1.
  11. What do the authors mean by “our dataset”?
  12. Authors should briefly describe the SHREC 2017 dataset.
  13. A detailed description of the models used for comparison with the proposed method is necessary for a comprehensive evaluation.
  14. The paper would benefit from a more in-depth discussion that critically reflects on how and why the proposed model outperforms existing mainstream approaches, rather than merely reiterating the results.

Author Response

Comments 1: [The authors presented the background of the study in the first paragraph; the rest focuses on challenges. Could the authors elaborate further, giving use cases of dynamic gesture detection in the introduction?]
Response  1:[Thank you for your valuable comment. In response to your suggestion, we have revised the introduction (lines 20–34) to include practical use cases of dynamic gesture detection, such as human-computer interaction, sign language recognition, virtual reality, and smart surveillance. This modification aims to better illustrate the real-world significance and broad applicability of our research.]

Comments 2: [The captions of Figure 1 and all other figures are long. A high-level description of Figure 1 can appear in Section 3 before the figure appears.]
Response 2:[Thank you for the reviewer’s suggestion. We have revised the caption of Figure 1 to make it more concise, and added a high-level description of the figure in Section 3 before it appears. These changes help improve the readability and clarity of the manuscript.]

Comments 3: [Authors should mention the figures in the text before they appear in the manuscript.]
Response 3:[Thank you for your helpful comment. We have carefully revised the manuscript to ensure that all figures are properly mentioned and briefly described in the main text before they appear. This adjustment improves the logical flow and readability of the paper.]

Comments 4:[Authors should revise all table captions to conform with the journal requirements.]
Response 4:[Thank you for your suggestion. We have revised all table captions to ensure they conform to the journal’s formatting requirements.]

Comments 5:[The authors should introduce Algorithm 1 before it is presented in the manuscript.]
Response 5:[Thank you for your valuable suggestion. We have introduced Algorithm 1 in the main text before its appearance and provided a detailed explanation after it. The relevant modifications have been made in lines 288–309 of the revised manuscript.]

Comments 6:[Authors should revise the first paragraph of section 4.1 as it is confusing.]
Response 6:[Thank you for your insightful comment. We have revised the first paragraph of Section 4.1 to improve clarity. The updated version now provides a clearer and more structured introduction to both the public datasets and our private dataset used in the experiments.]

Comments 7: [All abbreviations should be defined upon first use in the manuscript, with the abbreviation provided in parentheses. The same applies to the abstract. Examples: LSTM, SAM …]
Response 7:[Thank you for your helpful comment. We have carefully reviewed the entire manuscript and the abstract to ensure that all abbreviations, such as LSTM and SAM, are defined upon first use with the corresponding abbreviations provided in parentheses.]

Comments 8:[The proposed hybrid model may have a high computational cost due to the stacked Transformer and LSTM layers. This could be a concern for deployment in real-time or resource-constrained environments. Authors should include an analysis of inference time and memory usage to strengthen the evaluation.]
Response 8:[Thank you for your constructive comment. We fully agree that computational efficiency is an important consideration, especially for real-time or resource-constrained applications. In response, we have added an analysis of the model’s inference time and memory usage in lines 409–419 of the revised manuscript to provide a more comprehensive evaluation of the proposed method.]
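
To give a concrete picture of what such a benchmark involves, the sketch below shows a typical way to measure per-batch latency and peak GPU memory in PyTorch. The stand-in LSTM model and the input shape (32 sequences of 64 frames with 63 pose features each) are illustrative assumptions, not the setup reported in the manuscript.

```python
import time
import torch
import torch.nn as nn

def benchmark(model: nn.Module, sample: torch.Tensor, warmup: int = 10, runs: int = 100):
    """Return mean per-batch inference latency (ms) and peak GPU memory (MB)."""
    model.eval()
    device = sample.device
    with torch.no_grad():
        for _ in range(warmup):                        # warm-up excludes one-time setup costs
            model(sample)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
            torch.cuda.reset_peak_memory_stats(device)
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        if device.type == "cuda":
            torch.cuda.synchronize(device)             # flush queued kernels before stopping the clock
        elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20 if device.type == "cuda" else float("nan")
    return elapsed / runs * 1e3, peak_mb

# Illustrative call with a stand-in recurrent model; real shapes depend on the pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
net = nn.LSTM(63, 128, num_layers=2, batch_first=True).to(device)
print(benchmark(net, torch.randn(32, 64, 63, device=device)))
```

Reporting both numbers per batch size makes it straightforward to judge real-time feasibility on a given device.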

Comments 9:[Authors are encouraged to include an ablation study that justifies the use of both Transformer and LSTM layers. Results comparing Transformer only, LSTM only and Transformer + LSTM would clarify the value added by each component.]
Response 9:[Thank you for your valuable suggestion. To demonstrate the contribution of each component in our hybrid model, we have conducted an ablation study comparing the performance of Transformer only, LSTM only, and the combined Transformer + LSTM architecture. The results, presented in lines 329–349 of the revised manuscript, clearly show the effectiveness of integrating both modules.]
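
For readers who want to see the three ablation arms side by side, here is a minimal PyTorch sketch with a single `variant` switch. The dimensions (63 input features, e.g. 21 hand keypoints × 3 coordinates; 128 hidden units; 14 gesture classes, matching SHREC 2017's coarse label set) are our illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Sequence classifier over per-frame pose features; `variant` selects the ablation arm."""
    def __init__(self, feat_dim=63, hidden=128, num_classes=14, variant="hybrid"):
        super().__init__()
        self.variant = variant
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=7, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)   # attention over time
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(feat_dim if variant == "transformer" else hidden, num_classes)

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        if self.variant in ("transformer", "hybrid"):
            x = self.transformer(x)           # global temporal context via self-attention
        if self.variant in ("lstm", "hybrid"):
            x, _ = self.lstm(x)               # sequential modeling of frame order
        return self.head(x[:, -1])            # classify from the final time step

# Train and evaluate each arm on the same split to isolate each component's contribution.
for variant in ("lstm", "transformer", "hybrid"):
    logits = GestureClassifier(variant=variant)(torch.randn(4, 64, 63))
    print(variant, tuple(logits.shape))       # (4, 14) for every arm
```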

Comments 10: [Lines 246–247 and 248–253 communicate the same message, but the authors still do not mention Algorithm 1.]
Response 10:[Thank you for pointing this out. We have revised the manuscript by removing the redundant content in lines 246–253 and ensured that Algorithm 1 is properly introduced and referenced. The corresponding modifications have been made in lines 286–309 of the revised manuscript.]

Comments 11:[What do the authors mean by “our dataset”?]
Response 11:[Thank you for your comment. We have revised the manuscript to clarify the meaning of “our dataset.” A detailed explanation has been added in Section 4.1 to describe the composition, collection process, and purpose of our dataset.]

Comments 12:[Authors should briefly describe the SHREC 2017 dataset.]
Response 12:[Thank you for your suggestion. We have added a brief description of the SHREC 2017 dataset in Section 4.1, including its key characteristics and relevance to our study.]

Comments 13:[A detailed description of the models used for comparison with the proposed method is necessary for a comprehensive evaluation.]
Response 13:[Thank you for your suggestion. We have added a detailed description of the models used for comparison with the proposed method. This additional information helps provide a more comprehensive evaluation of our approach.]

Comments 14:[The paper would benefit from a more in-depth discussion that critically reflects on how and why the proposed model outperforms existing mainstream approaches, rather than merely reiterating the results.]
Response 14: [Thank you for your insightful comment. In response, we have revised the manuscript to provide a more in-depth discussion on how and why the proposed model outperforms existing mainstream approaches. This critical reflection is included in lines 395–407 of the revised manuscript, offering a deeper analysis beyond just reiterating the results.]

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript, titled "Improving Dynamic Gesture Recognition with Attention-Enhanced LSTM and Grounding SAM", proposes a method for dynamic gesture detection. The article presents a model based on an Attention-Enhanced LSTM with Grounding SAM, which combines LSTM, a multi-head attention mechanism, and Grounding SAM to enhance the ability to capture temporal dynamics in pose sequences. The model's performance on two different datasets and under different conditions demonstrates the advantages of this method in dynamic gesture recognition. Overall, the manuscript is well organized. The specific suggestions are as follows:

  1. In the article, the model combines a Transformer, a multi-head attention mechanism, and an LSTM, which results in a complex structure with high computational and time costs. How can this be addressed on resource-constrained devices?
  2. In current research, the focus is mainly on visual data. Will the fusion of multimodal information such as audio and depth affect recognition accuracy and robustness?
  3. The article mentions that attention mechanisms can improve performance. Can you explain how to further optimize them?

Author Response

Comments 1: [In the article, the model combines a Transformer, a multi-head attention mechanism, and an LSTM, which results in a complex structure with high computational and time costs. How can this be addressed on resource-constrained devices?]
Response 1: [Thank you for your insightful comment. In response to your concern regarding resource constraints, we have added a discussion on potential solutions for deploying the proposed model on resource-constrained devices. The revised explanation, which addresses computational efficiency and optimization strategies, can be found in lines 421–427 of the manuscript.]
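
One widely used strategy of this kind, shown below purely as an example, is post-training dynamic quantization, which stores LSTM and linear weights in int8 and usually shrinks the model several-fold with modest accuracy loss. The `GestureClassifier` is the hypothetical model from the ablation sketch above, not the authors' deployed network.

```python
import torch
import torch.nn as nn

model = GestureClassifier(variant="hybrid")   # hypothetical model from the earlier sketch
model.eval()

# Dynamic quantization: int8 weights, activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)
torch.save(quantized.state_dict(), "gesture_int8.pt")   # smaller checkpoint for edge devices
```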

Comments 2: [In current research, the focus is mainly on visual data. Will the fusion of multimodal information such as audio and depth affect recognition accuracy and robustness?]
Response 2: [Thank you for your insightful comment. We have addressed your concern regarding the fusion of multimodal information in the manuscript. In the revised version, we discuss the potential impact of incorporating audio and depth information on recognition accuracy and robustness. This explanation can be found in lines 428–432 of the revised manuscript.]

Comments 3: [The article mentions that attention mechanisms can improve performance. Can you explain how to further optimize them?]
Response 3: [Thank you for your thoughtful comment. In response, we have provided an explanation on how the attention mechanism can be further optimized to enhance performance. This discussion, which includes potential strategies for improving the attention mechanism's efficiency and effectiveness, is located in lines 433–436 of the revised manuscript.]

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

1. In the abstract, it is recommended to avoid using abbreviations such as "LSTM" and "SAM". Regarding the statement "Experimental results demonstrate that the proposed Attention-Enhanced LSTM with Grounding SAM method exhibits enhanced robustness and higher recognition accuracy when handling complex backgrounds and dynamic gesture changes, significantly improving the system’s efficiency and stability": to support this statement, include a summary of the findings here.
2. In the introduction, lines 58–64 should clearly highlight the key features; this passage needs to be revised and rewritten to present your contributions more clearly. The authors can remove the last part of the introduction, lines 65–73.
3. In Section 2, lines 91–92, "... traditional machine learning techniques like Hidden Markov Models (HMM) and Dynamic Time Warping (DTW)": here, the authors need to include citations for these methods.
4. In Section 3.1, 'grounding SAM data preprocessing', and in 'Figure 6. illustrates the structure of a two-layer LSTM', the titles should begin with a capital letter. Which is correct, Grounding SAM or just SAM? Use the appropriate term consistently.
5. In Section 3.1, there is a lack of explanation of Grounding DINO and Grounding SAM. Describe the working principle in more detail, including how these two approaches are integrated, and briefly describe why this approach outperforms the current one.
6. In Section 4.1, the authors should add an explanation of SHREC 2017 and their own dataset, and describe the data processing and training process with appropriate figures.
7. In the Results Analysis section, the authors should provide a graphical comparison with current methods in addition to the numerical comparison.
8. It is recommended to avoid arXiv preprint references, such as refs. [20] and [38].

Comments on the Quality of English Language

This paper needs English proofreading.

Author Response

Comments 1: [In the abstract, it is recommended to avoid using abbreviations such as "LSTM" and "SAM". Regarding the statement "Experimental results demonstrate that the proposed Attention-Enhanced LSTM with Grounding SAM method exhibits enhanced robustness and higher recognition accuracy when handling complex backgrounds and dynamic gesture changes, significantly improving the system’s efficiency and stability": to support this statement, include a summary of the findings here.]
Response 1:[Thank you for your valuable suggestion. In response, we have revised the abstract to avoid using abbreviations such as "LSTM" and "SAM." Additionally, we have included a summary of the findings to better support the statement regarding the improved robustness and recognition accuracy. The updated abstract now provides a clearer overview of the experimental results.]

Comments 2: [In the introduction, lines 58–64 should clearly highlight the key features; this passage needs to be revised and rewritten to present your contributions more clearly. The authors can remove the last part of the introduction, lines 65–73.]
Response 2: [Thank you for your constructive feedback. In response, we have revised lines 58–64 to clearly highlight the key features of our work and to present our contributions more clearly. Additionally, we have removed the last part of the introduction (lines 65–73) as suggested. The relevant changes are reflected in lines 69–78 of the revised manuscript.]

Comments 3: [In Section 2, lines 91–92, "... traditional machine learning techniques like Hidden Markov Models (HMM) and Dynamic Time Warping (DTW)": here, the authors need to include citations for these methods.]
Response 3: [Thank you for your suggestion. We have added the appropriate citations to the traditional machine learning techniques, such as Hidden Markov Models (HMM) and Dynamic Time Warping (DTW), in lines 96–97 of the revised manuscript.]

Comments 4: [In Section 3.1, 'grounding SAM data preprocessing', and in 'Figure 6. illustrates the structure of a two-layer LSTM', the titles should begin with a capital letter. Which is correct, Grounding SAM or just SAM? Use the appropriate term consistently.]
Response 4: [Thank you for your helpful comment. We have reviewed the manuscript and made the necessary corrections. The term "Grounding SAM" has been consistently used where appropriate, and all titles, such as in Section 3.1 ("Grounding SAM Data Preprocessing") and the description of Figure 6, have been revised to start with capital letters.]

Comments 5: [In Section 3.1, there is a lack of explanation of Grounding DINO and Grounding SAM. Describe the working principle in more detail, including how these two approaches are integrated, and briefly describe why this approach outperforms the current one.]
Response 5: [Thank you for your constructive feedback. In response, we have provided a more detailed explanation of the working principles of Grounding DINO and Grounding SAM in Section 3.1. We also describe how these two approaches are integrated and why this approach outperforms existing methods. The relevant changes can be found in lines 149–156 of the revised manuscript.]
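
For readers unfamiliar with the two stages, the sketch below condenses the usual Grounded-SAM flow: Grounding DINO turns a text prompt into bounding boxes, and SAM refines each box into a pixel-accurate mask. It follows the public demo code of the two repositories; the checkpoint paths, the "hand" prompt, and the thresholds are illustrative assumptions, and the authors' actual preprocessing may differ.

```python
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Stage 1: text-prompted detection with Grounding DINO (paths are placeholders).
dino = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frame.jpg")     # RGB array + normalized tensor
boxes, logits, phrases = predict(model=dino, image=image, caption="hand",
                                 box_threshold=0.35, text_threshold=0.25)

# Stage 2: box-prompted segmentation with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

h, w, _ = image_source.shape                      # DINO boxes are normalized cxcywh
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), "cxcywh", "xyxy").numpy()
masks, _, _ = predictor.predict(box=boxes_xyxy[0], multimask_output=False)
hand_only = image_source * masks[0][..., None]    # zero out the background pixels
```

Applied frame by frame, this yields background-suppressed hand regions for the temporal model to consume.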

Comments 6: [In Section 4.1, the authors should add an explanation of SHREC 2017 and their own dataset, and describe the data processing and training process with appropriate figures.]
Response 6: [Thank you for your valuable suggestion. In response, we have added an explanation about the SHREC 2017 dataset and our own dataset in Section 4.1. We have also included details on data processing and the training process, along with appropriate figures to clarify these aspects.]

Comments 7: [In the Results Analysis section, the authors should provide a graphical comparison with current methods in addition to the numerical comparison.]
Response 7: [Thank you for your valuable suggestion. In response, we have added a graphical comparison of the current methods in addition to the numerical comparison. The new graphical representation, along with the ablation study, can be found in Section 4.2, specifically in lines 328–347 of the revised manuscript.]

Comments 8: [It is recommended to avoid arXiv preprint references, such as refs. [20] and [38].]
Response 8: [Thank you for your suggestion. While we understand the recommendation to avoid arXiv preprint references, we have used these references as they are the most relevant and up-to-date sources for our study. We believe that these preprints contribute significantly to the context of our work, and we have provided proper context for their inclusion.]

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed all the comments, except that some figure captions are still lengthy.

Author Response

Comments: [The authors addressed all the comments, except that some figure captions are still lengthy.]

Response: [Thank you for your valuable feedback. We have carefully revised the figure captions to make them more concise and have completed the necessary adjustments. We hope the revised version meets your expectations.]

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

No more comments are required.

Author Response

Comments: [No more comments are required.]

Response: [Thank you very much for your detailed feedback. We have made all the necessary revisions and believe the work is now in its final form. We appreciate your suggestions.]

Author Response File: Author Response.pdf
