Open Access Article

Peer-Review Record

Fusing Skeleton-Based Scene Flow for Gesture Recognition on Point Clouds

Electronics 2025, 14(3), 567; https://doi.org/10.3390/electronics14030567

by Yahui Liu and Jiajia Jiao *

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Submission received: 6 January 2025 / Revised: 25 January 2025 / Accepted: 29 January 2025 / Published: 31 January 2025

(This article belongs to the Special Issue Machine Learning and Deep Learning Based Pattern Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The main question addressed by the research is the task of gesture recognition. This is an interesting and up-to-date subject worth researching. The gesture recognition method proposed in this paper consists of two steps:
- Preprocessing: scene flow is derived from hand skeletons. Skeletons are converted into point clouds, and scene flow is estimated from different pairs of point clouds and metrics;
- Feature information extraction: static and dynamic features are extracted to identify different types of gestures.
The proposed approach differs from the most common procedures, in which the skeleton is generated from the point cloud. The authors' idea is novel and interesting. The proposed methodology appears sound. The references of the paper are up-to-date and appropriate.
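The preprocessing step summarized above can be pictured with a small sketch. This is not the authors' implementation: the bone list, the even sampling along bones, and the function names `skeleton_to_point_cloud`, `frame_pairs`, and `displacement_flow` are all assumptions made for illustration, and the index-wise displacement is only a stand-in for the learned scene flow estimator described in the paper.

```python
import numpy as np

def skeleton_to_point_cloud(joints, bones, points_per_bone=4):
    """Densify one skeleton frame (J x 3 joints) into a point cloud by
    sampling evenly spaced points along each bone (an illustrative choice)."""
    t = np.linspace(0.0, 1.0, points_per_bone)[:, None]        # shape (P, 1)
    segments = [(1.0 - t) * joints[a] + t * joints[b] for a, b in bones]
    return np.concatenate(segments, axis=0)                    # (len(bones) * P, 3)

def frame_pairs(sequence, bones, stride=1):
    """Build (source, target) point-cloud pairs from frames `stride` apart."""
    clouds = [skeleton_to_point_cloud(frame, bones) for frame in sequence]
    return [(clouds[i], clouds[i + stride]) for i in range(len(clouds) - stride)]

def displacement_flow(source, target):
    """Stand-in for a scene flow estimator (assumption): points correspond
    by index here, so the flow is simply target minus source."""
    return target - source

# Hypothetical toy data: 3 joints, 2 bones, 3 frames drifting along x.
joints0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
bones = [(0, 1), (1, 2)]
sequence = [joints0 + np.array([0.1 * k, 0.0, 0.0]) for k in range(3)]
pairs = frame_pairs(sequence, bones)
flow = displacement_flow(*pairs[0])
```

In this toy case every point moves by the same offset, so each row of `flow` is the per-frame drift. A real estimator would have to recover the motion without relying on index correspondence.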

I have the following comments that might improve the paper:

1) Please add more details about the implementation of the proposed method.
2) Please explain the hyperparameters setting in Section 4.1.2, especially of kf.
3) Add the name of the horizontal axis in Figure 7.
4) Discuss the limitations of the proposed method.
5) Please publish the source code of your method - without it, results are virtually impossible to reproduce.

Author Response

Thank you for your helpful comments and constructive suggestions. We have added implementation details of the proposed method covering four aspects. We have also added an explanation of the hyperparameter raised in comment 2 and of the other hyperparameters used in training FSS-GR. Moreover, we have added the name of the horizontal axis to Figure 5 and discussed the limitations of the proposed method in terms of FLOPs and the low efficiency of the scene flow branch. We will release our source code soon at https://github.com/shawn-fei/fss-gr.git.

The detailed responses are attached in the response letter, and the related modifications (added notes, highlighted in red and green) have been made in the new version of the submission. Note that the response letter uses the same figure and table numbering as the revised paper.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

1. This paper attempts to improve the ability to capture fine-grained dynamic features and improve the relevance of point clouds and gestures.

2. The authors propose a novel method of Fusing Skeleton-based Scene Flow for Gesture Recognition (FSS-GR) for higher recognition accuracy.

3. The comprehensive experiments and ablation study demonstrate that FSS-GR achieves higher accuracy than state-of-the-art works, up to 95.2%, with only 0.02% to 1.9% extra computation cost.

4. In Fig 1, the definitions of Neighbourhood 1 and Neighbourhood 2 are not clear enough. The use of notation

5. It would be better if you use it if it is appropriate.

6. Page 6, Line 216, means (according to Line 215) . Mathematically it is not valid. Because ( may not be an integer. Generally, . In my opinion, the frame number should be an integer.

7. Unless you settle the issue raised in comment 6, the remaining part of the whole section 3 may not be logical. Please check the logical consistency of definitions and uses of notations.

8. Apart from these the paper is well organized.

9. I would like to advise the authors to make a comparison with other approaches.
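Comment 6 concerns a frame number that may not be an integer. The symbols involved did not survive in this record, so the following is only a generic sketch of the underlying issue, not the paper's definition: when frames are picked at evenly spaced positions in a sequence, the positions are generally fractional and must be rounded before they can serve as frame indices. The function name `sample_frame_indices` and its parameters are hypothetical.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples):
    """Return `num_samples` integer frame indices spread evenly over a
    sequence of `num_frames` frames (indices run from 0 to num_frames - 1)."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    # Evenly spaced positions are fractional in general, e.g. 4.5 ...
    positions = np.linspace(0, num_frames - 1, num_samples)
    # ... so round them to the nearest integer before indexing frames.
    return np.rint(positions).astype(int)

indices = sample_frame_indices(10, 4)
```

Rounding (rather than using the raw quotient) keeps every index a valid integer position in the sequence, which is the property the reviewer asks the notation to guarantee.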

Author Response

Thank you for your helpful comments and constructive suggestions. We have revised the definitions of Neighborhood 1 and Neighborhood 2. To clarify that the frame number used in Section 3.1.1 is an integer, we have added the explanation of and each kind of point cloud sequence. To keep the definitions and their uses in Section 3.1.1 logically consistent, we have updated the notation. Specifically, we have modified pairs of point clouds to . We compare FSS-GR with recent advanced GCN-based approaches, as shown in Table 4.

The detailed responses are attached in the response letter, and the related modifications (added notes, highlighted in blue and green) have been made in the new version of the submission. Note that the response letter uses the same figure and table numbering as the revised paper.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a novel method for gesture recognition by fusing skeleton-based scene flow with point cloud data. This innovative approach addresses existing challenges in dynamic gesture recognition effectively.

The experiments demonstrate robust performance, with the proposed method achieving higher accuracy compared to state-of-the-art techniques. The results are well-supported by comprehensive evaluations on multiple datasets.

The paper provides a clear and well-structured methodology, detailing the process of converting skeletons into point clouds and the subsequent scene flow estimation. This clarity enhances reproducibility.

The authors acknowledge the limitations of their approach and suggest possible improvements, showcasing a reflective understanding of their work's scope and impact.

Weaknesses

1. The quality of writing needs improvement. There are numerous grammatical errors and awkward phrases that hinder the overall readability of the paper.

2. While the paper mentions computation cost, it does not provide a thorough analysis of the trade-offs between efficiency and recognition performance. This is crucial for understanding the practical applicability of the method.

Comments on the Quality of English Language

The quality of writing needs improvement.

Author Response

Dear Reviewer,

Thank you for your helpful comments and constructive suggestions. We have corrected several typos and grammatical errors in the manuscript. The cost of recent works (MAE, FPPR-PCD, PointLSTM-late) has been added to Table 8. Moreover, we have analyzed the computational cost in terms of Params and FLOPs, as well as the trade-offs between efficiency and recognition performance.

The detailed responses are attached in the response letter, and the related modifications (highlighted in green and added notes) have been made in the new version of the submission. Note that the response letter uses the same figure and table numbering as the revised paper.
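The efficiency discussion in this exchange rests on relative figures for Params and FLOPs. As a minimal sketch of how such an "extra cost" percentage is commonly computed, assuming it is measured against a baseline model, the helper below may be useful. The function name `relative_overhead` and the numbers in the usage line are hypothetical placeholders, not values from the paper.

```python
def relative_overhead(baseline, proposed):
    """Extra cost of `proposed` over `baseline`, as a percentage of the
    baseline. Works for any cost measure (Params, FLOPs, latency)."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return 100.0 * (proposed - baseline) / baseline

# Placeholder values only: a baseline of 200 units and a variant of 210 units.
extra_percent = relative_overhead(200.0, 210.0)
```

Reporting this percentage next to the accuracy difference between the two models is one simple way to present the trade-off the reviewer asks about.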

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

This manuscript proposes a Fusing Skeleton-Based Scene Flow for Gesture Recognition (FSS-GR) framework for gesture recognition tasks. Specifically, dynamic gesture skeletons are first converted into pairs of point clouds and then fed into self-supervised estimators to generate fine-grained scene flows. This fine-grained motion feature is integrated with coarse-grained features from depth images to improve recognition accuracy. The framework was evaluated on the SHREC’17 dataset. While the performance of the proposed method is promising, the current manuscript still has some issues that need to be addressed:

1. Recent research using GNN for skeleton-based gesture recognition achieved promising performance. What's the purpose of converting the skeleton to scene flow for the task? In addition, this preprocessing conversion might increase computational and memory requirements. A detailed computational cost analysis is needed to validate the effectiveness of the proposed method.

2. This study only tested the proposed method on the SHREC’17 dataset. More experimental results on various datasets are needed to demonstrate the generality of the proposed method. In addition, the baselines selected in this study are outdated and more recent GCN baselines are needed for fair comparison.

3. In Table 1, it is unfair to compare the proposed method with baselines that take only skeletons or only point clouds as input, as the proposed method leverages both types of input data simultaneously. More comparisons with multimodal baselines that use both types of input data are highly encouraged.

4. There are several typos and grammatical errors in the current manuscript, such as ‘point cloud-based gesture recognitions’, ‘FSS-GR achieves the best performance of 95.2% accuracy on SHREC’17’, etc. In addition, please merge Figures 1 and 2 to make the overall framework easier to understand.

Author Response

Dear Reviewer,

Thank you for your helpful comments and constructive suggestions. We have analyzed the computational cost in terms of Params and FLOPs, as well as the trade-offs between efficiency and recognition performance. Experiments on the DHG dataset have been conducted. For a fair comparison, we have analyzed the characteristics of recent GCN-based approaches. Moreover, we compare FSS-GR with recent advanced GCN-based approaches, as shown in Table 4.

The detailed responses are attached in the response letter, and the related modifications (added notes, highlighted in yellow and green) have been made in the new version of the submission. Note that the response letter uses the same figure and table numbering as the revised paper.
