Lightweight Three-Dimensional Pose and Joint Center Estimation Model for Rehabilitation Therapy
Round 1
Reviewer 1 Report
This paper proposes a novel transformer-based model with independent tokens for estimating three-dimensional (3D) human pose and shape from monocular videos, with a specific focus on its application in rehabilitation therapy. Two tokens are introduced to encode the 3D rotation of human joints and the camera parameters for SMPL-based human mesh reconstruction. To verify the performance of the proposed method, experiments are conducted on four datasets. The proposed method outperforms state-of-the-art methods on the 3DPW and RICH benchmarks and achieves comparable results on Human3.6M and MOYO. The method appears reasonable, but some issues need to be clarified and modified. The main issues are as follows.
Major concerns:
1. In Section 1, it is necessary to highlight the innovation of this paper. For example, the contributions of this paper should be listed.
2. Figure 3 is too blurry. Please redraw it.
3. Section 2 should be a taxonomic summary of existing methods, not a detailed description of the three methods. Please revise this section.
4. In Section 3, the proposed method should be described in detail, such as Joint rotation tokens, Camera tokens, Transformer, Loss, and so on. Please supplement.
5. Implementation details should be added to Section 4 to describe the parameter settings during training, such as the learning rate, the batch size and so on.
6. What is the difference between the solid and dashed circles in Figure 9? And there is no significant difference in the performance shown in the images of “INT Model” and “Our Model”.
7. In Table 2, the proposed method should be compared with more state-of-the-art methods. Please supplement.
There are also some typos in the writing.
(1) On page 8, line 248, “farm” should be corrected to “frame”.
(2) On page 8, line 277, “Our” should be corrected to “our”.
(3) On page 9, line 297, “four” should be corrected to “two”.
Author Response
- In Section 1, it is necessary to highlight the innovation of this paper. For example, the contributions of this paper should be listed.
-> Before we respond, we would like to thank you very much for your thoughtful review comments.
To emphasize the innovation of this paper, we added supplementary explanations in lines 76 to 80 that present the work as a technology for rehabilitation treatment.
- Figure 3 is too blurry. Please redraw it.
-> We replaced the figure at line 152 with a high-definition version.
- Section 2 should be a taxonomic summary of existing methods, not a detailed description of the three methods. Please revise this section.
-> We added a taxonomic summary of the existing methods as Table 1 at line 94.
While we respect your opinion, our research project necessarily requires a comparative analysis of these methods, so this part of the paper remains long.
- In Section 3, the proposed method should be described in detail, such as Joint rotation tokens, Camera tokens, Transformer, Loss, and so on. Please supplement.
-> We added the details you requested in lines 303 to 318.
- Implementation details should be added to Section 4 to describe the parameter settings during training, such as the learning rate, the batch size and so on.
-> We added the details you requested in lines 319 to 330.
- What is the difference between the solid and dashed circles in Figure 9? And there is no significant difference in the performance shown in the images of “INT Model” and “Our Model”.
-> We added the details you requested in lines 350 to 358.
- In Table 2, the proposed method should be compared with more state-of-the-art methods. Please supplement.
-> Among existing video-based human pose estimation models, the most recent and reputable is the INT model introduced at ICLR 2023. We therefore selected this model as our main reference for comparison and based our paper on it.
Comments on the Quality of English Language
There are also some typos in the writing.
(1) On page 8, line 248, “farm” should be corrected to “frame”.
-> Corrected (now line 255).
(2) On page 8, line 277, “Our” should be corrected to “our”.
-> Corrected (now line 287).
(3) On page 9, line 297, “four” should be corrected to “two”.
-> Corrected (now line 334).
In addition, we corrected the other misspellings in the manuscript.
We marked the modifications as memos.
Author Response File: Author Response.pdf
Reviewer 2 Report
Review for the paper “Lightweight Three-Dimensional Pose and Joint Center Estimation Model for Rehabilitation Therapy”
The authors have shown good knowledge of this topic and have done a good literature review. However, their method, that is, its technical details, is only weakly highlighted. It would be good for the authors to clearly point out what is new in their approach. Some points are also ambiguous to me:
Line 263
The authors said: “The INT model uses a time model to focus on capturing rotation time information for each joint.”
Could you explain how your algorithm solves the rotation of each joint? If your algorithm differs from INT, please note the differences.
Line 294
The authors said: “Training data and model setup: In this experiment, the model was trained by mixing 3D videos, 2D videos, and 2D image data”
Could you explain in detail how you combined these three resources?
Line 313
The authors said: “Conversely, our model produces an accurate pixel alignment mesh for joint prediction”
Could you explain in detail how your algorithm does this?
I think the paper needs some clarification, especially a more detailed description of the novelties in the algorithm, and the Methods chapter should be reworked. The work has potential, but some corrections are needed.
Comments for author File: Comments.pdf
Author Response
Line 263
The authors said: “The INT model uses a time model to focus on capturing rotation time information for each joint.”
Could you explain how your algorithm solves the rotation of each joint? If your algorithm differs from INT, please note the differences.
-> Before we respond, we would like to thank you very much for your thoughtful review comments.
We revised lines 280 to 282 to provide the details you requested.
In addition, we wrote lines 303 to 330 drawing on the contents of review 3.
Line 294
The authors said: “Training data and model setup: In this experiment, the model was trained by mixing 3D videos, 2D videos, and 2D image data”
Could you explain in detail how you combined these three resources?
-> Our original statement was incorrect. We treated the dataset as a control variable in order to evaluate the 3D human pose estimation performance of other models and of our proposed model. We therefore selected two datasets (3DPW and Human3.6M) and described the experiment accordingly. We revised lines 332 to 334 to provide the details you requested.
Line 313
The authors said: “Conversely, our model produces an accurate pixel alignment mesh for joint prediction”
Could you explain in detail how your algorithm does this?
-> We revised lines 303 to 330 to provide the details you requested.
We added implementation details to Section 3, explaining parameter settings such as the batch size.
We also found some vocabulary errors and corrected them throughout the manuscript.
Author Response File: Author Response.pdf
Reviewer 3 Report
The paper is good and easy to follow, but it contains many spelling mistakes. Overall, I think this paper needs substantial revision. I suggest the following improvements:
1. Figure 3 affects reading because it is very blurry, so it is recommended to redraw it.
2. This paper cites many framework figures for other methods, which I don't think is necessary.
3. The content on the left side of Figure 7 and Figure 1 is the same.
4. The experiments include no ablation study measuring the impact of using different tokens on the results.
5. The paper spends a lot of space to introduce related work, and the introduction of your own methods and experiments is not detailed.
There are a lot of mistakes in the paper, please check carefully, such as:
a. 24 line, in-formation->information
b. 34 line, animation->, animation
c. 50 line, double 'model' words
d. 248 line, single-farm->single-frame
e. 277 line, Our model ->our model
Author Response
- Figure 3 affects reading because it is very blurry, so it is recommended to redraw it.
-> Before we respond, we would like to thank you very much for your thoughtful review comments.
We replaced the figure at line 152 with a high-definition version.
- This paper cites many framework figures for other methods, which I don't think is necessary.
-> We added a taxonomic summary of the existing methods as Table 1 at line 94. While we respect your opinion, our research project necessarily requires a comparative analysis of these methods, so this part of the paper remains long.
- The content on the left side of Figure 7 and Figure 1 is same.
-> The left-hand plot is an existing diagram that mixes temporal information from past and future frames to regress SMPL parameters from the time-enhanced features of that frame. To introduce our new way of using tokens, we placed the two figures side by side for comparison. The plot is kept for comparison because it shows how each joint's motion is captured individually using the shared time encoder.
- There is no ablation experiment in the experiment to measure the impact of using different tokens on the experimental results.
-> The INT model uses Cam_Token and focuses on Shape_Token in order to adjust the posture model, whereas this study was conducted using Joint_Token and Cam_Token.
The model that uses the other tokens is the INT model. This point was not included in the paper, so we have now added it.
We added the details you requested in lines 280 to 282.
- The paper spends a lot of space to introduce related work, and the introduction of your own methods and experiments is not detailed.
-> We acknowledge that the experiments were not explained in enough detail. We therefore added the details you requested in lines 350 to 358.
Comments on the Quality of English Language
There are a lot of mistakes in the paper, please check carefully, such as:
- 24 line, in-formation->information
- 34 line, animation->, animation
- 50 line, double 'model' words
- 248 line, single-farm->single-frame
- 277 line, Our model ->our model
-> We apologize for these mistakes. We reviewed the wording throughout the manuscript and marked the changes as memos.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The response is incorrect. Please provide a point-by-point response and highlight the changes in the revised manuscript.
Extensive editing of English language required.
Author Response
We are deeply grateful for your review.
We revised the English throughout the manuscript.
We kindly ask you to review the revised contents.
Please feel free to let us know if you have any further comments.
We will address them carefully.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors answered my questions and gave additional explanations, and I now have no further objections.
Author Response
We are deeply grateful for your review.
We revised the English throughout the manuscript.
We kindly ask you to review the revised contents.
Please feel free to let us know if you have any further comments.
We will address them carefully.
Author Response File: Author Response.pdf
Reviewer 3 Report
1. In the experimental section, the comparison methods presented in Tab2 and Tab3 are too limited.
2. In the manuscript, "rehabilitation therapy" is mentioned multiple times in both the abstract and introduction sections. However, there are no corresponding demonstrations or examples provided in the subsequent sections.
3. The manuscript lacks ablation studies.
There are a lot of mistakes in the paper, please check carefully, such as:
1. 187 line, Integrate-> Integrating
2. 189 line, temporary-> temporal
3. 288 line, "The following Figure 8 is our model." This sentence lacks context. Maybe "The model is illustrated in Figure 8 below." would be clearer.
Author Response
We are deeply grateful for your review.
We revised the English throughout the manuscript.
We kindly ask you to review the revised contents.
Please feel free to let us know if you have any further comments.
We will address them carefully.
1. In the experimental section, the comparison methods presented in Tab2 and Tab3 are too limited.
->
We conducted our experiments using the open-source code provided with the INT paper, together with the evaluation metrics used in that paper.
MPJPE, a metric used in other joint-estimation papers, is calculated by averaging the distance between the estimated and ground-truth coordinates over all joints.
Since this study is joint-centered, we selected MPJPE as the main evaluation metric.
Other earlier papers used two datasets.
In our paper, experiments were conducted on four datasets.
Please review this.
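For readers of this exchange, the MPJPE definition given in the response above can be illustrated with a minimal sketch. It assumes each joint is an (x, y, z) tuple and that predicted and ground-truth joints are listed in the same order and unit; the function name `mpjpe` and the sample coordinates are hypothetical and are not taken from the manuscript or from the INT code.

```python
import math
from typing import Sequence

Joint = Sequence[float]  # one joint as (x, y, z); the layout is an assumption


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean per-joint position error: the Euclidean distance between each
    predicted joint and its ground-truth joint, averaged over all joints."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Hypothetical two-joint example: the per-joint errors are 3 and 4.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred, gt)  # (3 + 4) / 2
```

For a video, the same average would typically be taken over all joints of all frames; any alignment step applied before the distance is computed is outside the scope of this sketch.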
2. In the manuscript, "rehabilitation therapy" is mentioned multiple times in both the abstract and introduction sections. However, there are no corresponding demonstrations or examples provided in the subsequent sections.
->
Thank you very much for your review comments.
We revised the manuscript to incorporate your comment. The locations of the added passages are listed below.
Chapter 3
Lines 268, 278
Chapter 4
Lines 336, 347
Lines 353 to 357
Chapter 5
Lines 393, 396
3. The manuscript lacks ablation studies.
->
Thank you very much for raising the point about an ablation study.
We chose the INT model because we wanted to see how the proposed model affects the experiment. With the existing INT model, we experimented by assigning a token value for a particular shape. We then carried out the experiment by giving more weight to the joint token. These experiments allow us to observe causality within the system.
This is restated in lines 388 to 395 of the conclusion.
Author Response File: Author Response.pdf
Round 3
Reviewer 1 Report
I agree to accept this paper.
None.
Author Response
We corrected the English throughout the paper.
Following Reviewer 3's comment, we introduced and discussed, in lines 40 to 51, two additional papers from which this technology originates.
Please review the manuscript with these changes in mind.
Author Response File: Author Response.pdf
Reviewer 3 Report
Most of my concerns are addressed.
The only further comment is that a few related works are not reviewed, including but not limited to:
a. Deep dual consecutive network for human pose estimation, CVPR 2021
b. Human pose estimation via deep neural networks, CVPR 2014
I would also recommend carefully proofreading the paper.
Author Response
We corrected the English throughout the paper.
Following your comment, we introduced and discussed, in lines 40 to 51, two additional papers from which this technology originates.
Please review the manuscript with these changes in mind.
Author Response File: Author Response.pdf