A 3DCNN-LSTM Multi-Class Temporal Segmentation for Hand Gesture Recognition
Round 1
Reviewer 1 Report
1. The abstract is weak and the contribution has to be stated clearly in the abstract.
2. The literature review is poor and it has to be enriched with other recent work.
3. The points between the line 102 to 112 are the objectives as stated by authors. Where are the contributions? The objectives are not the contributions.
4. In Experimental set-up, what are the hardware tools (Camera,…) for implementing real-time recognition?
5. Why did the authors focus on 12 and 22 participants? I suggest recruiting more participants.
6. The number of epochs is set to 64. It would be interesting to vary this design parameter to show its effect on classification performance.
7. The comparative study has to be conducted with other techniques to prove the effectiveness of proposed method.
8. The conclusion is missing from this article. It must be added.
9. The discussion part has to include tables for comparison.
10. Future work has to be added to extend this study, referring to other techniques that could be applied in future and compared with the proposed method. I suggest mentioning the classification techniques included in the following references: (https://doi.org/10.3390/electronics10212719), (https://doi.org/10.3390/electronics10232888)
Author Response
We sincerely thank the reviewer for their insightful feedback, which has allowed us to improve the clarity of this manuscript. Please find below the points addressed individually.
- The abstract is weak, and the contribution has to be stated clearly in the abstract.
Thank you for pointing this out. We have revised the abstract to state the contributions more clearly. Please see the amended version of the abstract for more details. The added part is also pasted below.
“The main contribution of this work includes a custom hand gesture recognition approach from monocular RGB video sequences with a simple pre-trained network that outperforms previous temporal segmentation models.”
- The literature review is poor, and it has to be enriched with other recent work.
Thank you for this suggestion; we agree that this is an important point for the clarity of the manuscript. We have added more works to the literature review, which you can find from line 95 to line 114. The added paragraphs are also pasted below for your kind attention.
“Kuehne et al. [31] proposed an end-to-end generative framework that uses a hidden Markov model for video segmentation and recognition of human activities, with the drawback of intensive processing time and an inability to run in real time. Ni et al. [32] presented an approach based on recurrent neural networks (RNNs) to perform sliding window detection and segment continuous actions. The issue with this methodology is that it identifies peripheral boundaries only, with no global overview of the temporal events.
To overcome these disadvantages, recent approaches have suggested making a distinction between gestural frames, when the action is taking place, and translation frames by merging both shape and spatiotemporal parameters. Such an approach has been presented by Wang [27]. Wang presented a segmentation method that contained both action and appearance-based information and used both RGB and depth capture modalities driven by a dual architecture for hand gesture classification and segmentation. This approach has the drawback of dual-modality acquisition, which does not leverage standard monocular RGB cameras. Similarly, most recently, Sahoo et al. [33] presented an end-to-end fine-tuning method of a pre-trained CNN for hand gesture recognition; however, their model was also driven by dual modality and multiple architectures.”
- The points between line 102 to line 112 are the objectives as stated by the authors. Where are the contributions? The objectives are not the contributions.
Thank you for the thoughtful comment. We have included a work outline and contributions to Section 1, which you can find from line 140 to line 146. Please see the added paragraph here for your reference.
“To address these objectives, this article is organized as follows. In Section 2, the experimental set-up, data collection, and pre- and post-processing steps implemented for the action recognition detector are explained. Section 3 discusses the experimental results and Section 4 summarises the main implications of these findings. Finally, Section 5 concludes the proposed work and addresses future directions. The main contributions of this work include the introduction of a methodology trained from a single acquisition modality (RGB cameras) on a small-scale dataset and on a single architecture.”
- In the Experimental setup, what are the hardware tools (Camera,…) for implementing real-time recognition?
We thank the reviewer for noting the need for additional information on the camera. We have added the camera information on line 156. The added sentence is also pasted below.
“Video data were captured using an Oqus RGB camera at a 30 Hz frame rate.”
- Why did the authors focus on 12 and 22 participants? I suggest recruiting more participants.
Thank you for your question. The study was validated on twelve participants; using a small-scale dataset, we demonstrate that the model can be fine-tuned with a limited number of participants.
- The number of epochs is set to 64. It would be interesting to vary this design parameter to show its effect on classification performance.
We show that changing the number of training batches (or epochs) has minimal effect, and we can thus postulate that model performance is not altered by further training, which instead leads to overfitting.
- The comparative study has to be conducted with other techniques to prove the effectiveness of the proposed method.
Thank you for your point. Please refer to Table 2 (line 395) for a comparison of performance with previous investigations.
- The conclusion is missing from this article. It must be added.
We thank you for identifying this missing section. Please note that this section has been included and it is now present in the manuscript from line 394 to line 407. Please find the pasted conclusion section below for your reference.
“This work offers an approach for large-scale video segmentation for hand gesture recognition. The video sequences were first segmented into single hand gesture sequences by classifying the frames into different gestures. For each of the segmented hand gesture series, the suggested technique utilized spatiotemporal information based on a three-dimensional convolutional neural network combined with a long short-term memory unit. To enhance the accuracy of the model, the training was performed on a large-scale hand dataset and fine-tuned for the relevant hand gestures. The presented model illustrated the possibility of training a model utilising a small-scale RGB-driven dataset, compared to the previously presented techniques that require vast fully labelled datasets. Furthermore, the pipeline is performed on a small-sized architecture that enables easier integration of further hand gesture classes and uses monocular cameras, with the aim of leveraging ubiquitous technologies (e.g., in smartphones/laptops) and encouraging scalability for future investigations.”
- The discussion part has to include tables for comparison.
Thank you for helping us improve benchmarking precision in the manuscript.
- Future work has to be added to extend this study, referring to other techniques that could be applied in future and compared with the proposed method. I suggest mentioning the classification techniques included in the following references: (https://doi.org/10.3390/electronics10212719), (https://doi.org/10.3390/electronics10232888)
Thank you for this note. We have added additional techniques that will be applied in the future to address this specific point. In particular, the suggested technique has been added at line 427. The entire section is also pasted below for your reference.
“A foreseen limitation of this investigation includes the absence of edge cases for the recordings captured in unconstrained scenarios. Ambiguous appearance results may lead to tracking errors. Capturing methods solely relying on two-dimensional appearance information could, in fact, struggle in scenarios where images are blurry, out-of-the-plane or rotated, distant or small. Visual tracking methods may be incorporated to consider types of interference (e.g., blurry hand gestures if the participants or the camera moves suddenly during the acquisition) with the goal of disambiguating the recognition target. Rescuing identifiable appearance cues of image interference for a real-time hand recognition model, for instance with an image blur classification and blur removal, would be an attractive research direction.
Furthermore, while the supervised-based transfer learning produced expected outcomes, the approach presented in this work could be transported to unsupervised learning and could support the automated labelling and segmentation of long video recordings, increasing the models’ generalizability. Furthermore, hybrid deep learning models, such as the work from Nasser et al. [41], which combine recurrent networks to also model the temporal dependencies in high-dimensional sequences, are an interesting area to explore further.”
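The conclusion quoted in the response above describes segmenting a video by classifying each frame and then grouping the frames into single-gesture sequences. A minimal sketch of that grouping step follows, assuming per-frame class labels are already available from some classifier. The function name `frames_to_segments` and the label strings are hypothetical and are not taken from the manuscript.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical per-frame predictions for an 8-frame clip.
labels = ["rest", "rest", "swipe", "swipe", "swipe", "rest", "pinch", "pinch"]
print(frames_to_segments(labels))
# [('rest', 0, 2), ('swipe', 2, 5), ('rest', 5, 6), ('pinch', 6, 8)]
```

Each tuple marks one contiguous run of identical labels, so a downstream recognizer could be applied to one segment at a time.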
Reviewer 2 Report
The authors introduce a multi-class hand gesture recognition model developed to identify a set of defined hand gesture sequences in two-dimensional RGB video recordings.
They present an action detection classifier that looks at both appearance and spatiotemporal parameters of consecutive frames. Their classifier utilizes a convolutional-based network combined with a long short-term memory unit. To address the need for a large-scale dataset, the model uses an available dataset and then adopts a technique known as transfer learning to fine-tune the model on the hand gestures of relevance.
The authors conclude that: (a) the presented model illustrates the possibility of training a model with a small set of data (113,410 fully labelled frames). (b) The proposed pipeline embraces a small-sized architecture that could facilitate its adoption.
The article is interesting and well written.
I have some minor comments with a pure academic spirit.
1. The abstract must better synthesize the sections of the manuscript
2. Insert the limitations in the discussion
3. Insert the conclusions.
4. The manuscript must be edited according to the MDPI standards. The text, the legends, etc. do not respect these standards
Author Response
We sincerely thank the reviewer for their insightful feedback, which has allowed us to improve the clarity of this manuscript. Please find below the points addressed individually.
- The abstract must better synthesize the sections of the manuscript
Thank you for helping us improve the clarity of the abstract. We have revised the abstract to state the contributions more clearly. Please see the amended version of the abstract for more details. The added part is also pasted below.
“The main contribution of this work includes a custom hand gesture recognition approach from monocular RGB video sequences with a simple pre-trained network that outperforms previous temporal segmentation models.”
- Insert the limitations in the discussion
Thank you for the thoughtful comment. Please note that the Discussion (Section 4) has been modified to add limitations, which you can find from line 373 to line 383. The added part is pasted below for your reference.
“To adopt and scale this application in real-world scenarios, if multiple classes are considered, future directions could include testing this approach for real-time application using a finite state machine system that can decrease the classes under inspection and increase the accuracy for real-time application. To further improve the model's performance for real-time applications, the input image size or the number of layers could be increased. On top of the 20BN Jester dataset, an additional dataset could be used to enhance the model’s performance. The Jester dataset was developed by actors and did not provide numerous occlusion cases. Regardless, in realistic circumstances, occlusion exists. A foreseen limitation of this investigation includes the absence of edge cases for the recordings captured in unconstrained scenarios. Ambiguous appearance results may lead to tracking errors. Capturing methods solely relying on two-dimensional appearance information could, in fact, struggle in scenarios where images are blurry, out-of-the-plane or rotated, distant or small. Visual tracking methods may be incorporated to consider types of interference (e.g., blurry hand gestures if the participants or the camera moves suddenly during the acquisition) with the goal of disambiguating the recognition target. Rescuing identifiable appearance cues of image interference for a real-time hand recognition model, for instance with an image blur classification and blur removal, would be an attractive research direction.”
- Insert the conclusions.
We thank you for identifying this missing section. Please note that this section (Section 5) has been included and it is now present in the manuscript from line 394 to line 407. Please find the pasted conclusion section below for your reference.
“This work offers an approach for large-scale video segmentation for hand gesture recognition. The video sequences were first segmented into single hand gesture sequences by classifying the frames into the different gestures. For each of the segmented hand gesture series, the suggested technique utilized spatiotemporal information based on a three-dimensional convolutional neural network combined with a long short-term memory unit. To enhance the accuracy of the model, the training was performed on a large-scale hand dataset and fine-tuned for the relevant hand gestures. The presented model illustrated the possibility of training a model utilising a small-scale RGB-driven dataset, compared to the previously presented techniques that require vast fully labelled datasets. Furthermore, the pipeline is performed on a small-sized architecture that enables easier integration of further hand gesture classes and uses monocular cameras, with the aim of leveraging ubiquitous technologies (e.g., in smartphones/laptops) and encouraging scalability for future investigations.”
- The manuscript must be edited according to the MDPI standards. The text, the legends, etc. do not respect these standards
Thank you very much for your kind suggestion. Following the link (https://www.mdpi.com/journal/electronics/instructions) we have included all the main requested changes following the MDPI Microsoft Word template.
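The limitations paragraph quoted in the response above mentions a finite state machine that decreases the classes under inspection at each step. A rough sketch of that idea follows, assuming a classifier that returns a score per gesture class for every frame. The state names, the transition table `ALLOWED_NEXT`, and both function names are hypothetical illustrations rather than the authors' design.

```python
# Hypothetical transition table: which gestures may follow the current state.
ALLOWED_NEXT = {
    "idle": {"idle", "grasp"},
    "grasp": {"grasp", "release"},
    "release": {"release", "idle"},
}


def constrained_predict(state, scores):
    """Pick the highest-scoring class among those reachable from `state`."""
    candidates = {c: s for c, s in scores.items() if c in ALLOWED_NEXT[state]}
    return max(candidates, key=candidates.get)


def decode(frame_scores, start="idle"):
    """Run the state machine over a sequence of per-frame class scores."""
    state = start
    path = []
    for scores in frame_scores:
        state = constrained_predict(state, scores)
        path.append(state)
    return path


# Hypothetical classifier scores for three consecutive frames.
frame_scores = [
    {"idle": 0.2, "grasp": 0.1, "release": 0.7},  # "release" is not reachable from "idle"
    {"idle": 0.1, "grasp": 0.6, "release": 0.3},
    {"idle": 0.5, "grasp": 0.1, "release": 0.4},  # "idle" is not reachable from "grasp"
]
print(decode(frame_scores))
# ['idle', 'grasp', 'release']
```

Restricting the candidates in this way trades flexibility for robustness: an implausible high score is ignored, but a gesture sequence that the table does not anticipate can never be recognized.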
Round 2
Reviewer 1 Report
1. The abstract is still weak and it has to be rewritten in compact form.
2. Please look at the phrase "Computer vision techniques rely on convolutional neural networks (CNNs) to extract two-dimensional (appearance-based) and three-dimensional (motion-based) array features". This sentence is not streamlined with the previous text. The introduction has to be rewritten in a continuous way, as some passages contain discontinuous thoughts.
3. Referring to the phrase "The key objectives of this paper include", the authors declare that the objectives are the same as the contributions. This is not true!!
4. The novelty of this study is missing, and all the objectives are merely evaluation steps, not contributions.
5. I do know the authors have used "Twenty-two volunteers", while many gestures can be performed by one person.
6. The legend of Figure (6) does not describe or define the four curves!
7. Why has the "Mean Jaccard Index" been chosen for evaluation?
8. The authors used 12 and 22 participants. The authors have to explain why these numbers specifically have been used!
9. The results have to be deeply discussed.
10. The conclusion is descriptive, and a quantitative evaluation based on percentages of improvement has to be added.
11. The future work has to be added.
Author Response
We are extremely thankful for your insightful feedback, which has permitted us to further improve the manuscript. Please find below the points addressed individually.
- The abstract has been amended, as you can see from the track changes.
- The suggested phrase has been better contextualized as suggested (L.30-34)
- The objectives have been amended into the originally intended contributions (L. 165-176)
- Novelties have been better clarified (please see point 3)
- Further clarification on data collection (12 vs 22 participants) has been added from line 372 to line 382
- The legend of Figure (6) has been updated (L.536)
- Further clarification on the "Jaccard Index" has been provided (L. 593)
- The authors have clarified the use of 12 and 22 participants (please see point 5)
- We believe that the results have been discussed extensively in the discussion section; we have also reviewed similar approaches as suggested by the reviewer in the previous revision.
- While a descriptive conclusion with a qualitative summary of all of the main points in the body text is allowed and often adopted as a strategy in Electronics MDPI papers that the authors have reviewed, we are very thankful for your suggestion and have included the percentage of improvement (L. 678)
- Please note that the future work is present in the discussion (L 619 to 656), we have also added an additional section on future directions in the conclusion section (L 682 to 687).
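For readers unfamiliar with the Mean Jaccard Index raised in the comments above, one common frame-wise formulation can be sketched as follows: for each gesture class, divide the number of frames labelled with that class in both the ground truth and the prediction by the number of frames labelled with it in either, then average over classes. This is a minimal sketch under that assumption. The manuscript's exact definition (for example, averaging per sequence) may differ, and the function name `mean_jaccard` is hypothetical.

```python
def mean_jaccard(y_true, y_pred):
    """Average per-class intersection-over-union of frame index sets."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    classes = set(y_true) | set(y_pred)
    if not classes:
        raise ValueError("label sequences must not be empty")
    scores = []
    for c in classes:
        true_frames = {i for i, y in enumerate(y_true) if y == c}
        pred_frames = {i for i, y in enumerate(y_pred) if y == c}
        # The union is never empty because c occurs in at least one sequence.
        scores.append(len(true_frames & pred_frames) / len(true_frames | pred_frames))
    return sum(scores) / len(scores)


# Hypothetical frame-wise labels for a 6-frame clip with three classes.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(mean_jaccard(y_true, y_pred))  # each class scores 0.5, so the mean is 0.5
```

Unlike plain frame accuracy, this measure penalizes both missed and spurious frames for every class, so a short gesture is not swamped by a long one.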
Author Response File:  Author Response.pdf
Round 3
Reviewer 1 Report
The reviewer thanks the authors for their patience; they have addressed all the comments. I have no further comments and the article can be accepted.
 
        

