Article
Peer-Review Record

Towards Single 2D Image-Level Self-Supervision for 3D Human Pose and Shape Estimation

Appl. Sci. 2021, 11(20), 9724; https://doi.org/10.3390/app11209724
by Junuk Cha 1, Muhammad Saqlain 1, Changhwa Lee 2, Seongyeong Lee 2, Seungeun Lee 2, Donguk Kim 1, Won-Hee Park 3,* and Seungryul Baek 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 31 August 2021 / Revised: 11 October 2021 / Accepted: 12 October 2021 / Published: 18 October 2021

Round 1

Reviewer 1 Report

The authors propose a self-supervised framework for 3D human pose and shape estimation using one 2D image as input. The novelty of the paper is clearly expressed and sufficient for publication. The paper is well written, the results are clearly presented, and a comprehensive ablation study supports the initial thesis, which is confirmed by the obtained results.

Some minor suggestions:

Lines 27-29: The unclear sentence probably just needs the removal of the word "of" before citation [5].

Regarding the choice of the 0, 90, 180, and 270 degree rotation angles: since you obtained a meaningful accuracy improvement with these simple angles, it is possible that finding the optimal angles would yield even larger improvements. This may be a good direction for future work.

There is a possible typo in the names of the datasets in Table 7. Is there a difference between the Full and Full-Y datasets? Generally, it is hard to understand exactly which datasets are used. Consider

I wonder why you use the Latin term "ablative" study. Consider using the common English term in the AI field, "ablation study."

It would be very useful for the scientific community if the code for the experiments in this paper were made publicly available. This would help other scientists reproduce the results of the paper and accelerate further research.

Author Response

Response to the Reviewer Comments

 

Manuscript ID: applsci-1382845

Title: Towards Single 2D Image-level Self-supervision for 3D Human Pose and Shape Estimation

 

Response to the Reviewer:

Many thanks to the Reviewer for the appreciation of our work and for the minor comments to further improve our paper. We have tried our best to address these important points and improve the presentation quality of the paper. Below are our responses to your comments; the corresponding modifications are highlighted in yellow in the draft.

 

Reviewer comments: The authors propose a self-supervised framework for 3D human pose and shape estimation using one 2D image as input. The novelty of the paper is clearly expressed and sufficient for publication. The paper is well written, the results are clearly presented, and a comprehensive ablation study supports the initial thesis, which is confirmed by the obtained results.

Some minor suggestions:

  1. The unclear sentence probably just needs the removal of the word "of" before citation [5].

Response: Thank you for the clarification. The changes have been made in Lines 27-29 and highlighted in yellow.

 

  2. Regarding the choice of the 0, 90, 180, and 270 degree rotation angles: since you obtained a meaningful accuracy improvement with these simple angles, it is possible that finding the optimal angles would yield even larger improvements. This may be a good direction for future work.

Response: Thank you for your suggestion. Yes, there might be further improvement if we introduce a way to find the optimal angles. As mentioned, we are considering this as a future study and have added a discussion of this future work to the conclusion section.
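For readers unfamiliar with the rotation pretext task under discussion, a minimal illustrative sketch follows. This is not the authors' code; the function names are hypothetical. It shows why the four angles 0, 90, 180, and 270 degrees are the natural starting choice: rotations by multiples of 90 degrees are lossless on a pixel grid, so no interpolation artifacts can leak the rotation label to the network.

```python
import random

# The four lossless rotation classes used as self-supervision labels.
ROTATIONS = [0, 90, 180, 270]

def rot90(grid):
    """Rotate a 2D grid (list of rows) 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]

def make_rotation_sample(grid, rng):
    """Return (rotated grid, label k) for a random k * 90-degree rotation.

    A rotation-prediction head would be trained to recover k from the
    rotated image alone, providing a free self-supervision signal.
    """
    k = rng.randrange(len(ROTATIONS))
    out = grid
    for _ in range(k):
        out = rot90(out)
    return out, k

rng = random.Random(0)
img = [[1, 2, 3], [4, 5, 6]]          # toy 2x3 "image"
rot, label = make_rotation_sample(img, rng)
assert label in range(4)
if label % 2 == 1:                    # 90/270 rotations swap height and width
    assert len(rot) == 3 and len(rot[0]) == 2
else:
    assert len(rot) == 2 and len(rot[0]) == 3
```

Extending this to arbitrary angles, as the reviewer suggests, would require interpolation and padding, which is precisely what makes the search for optimal angles a non-trivial future direction.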

 

  3. There is a possible typo in the names of the datasets in Table 7. Is there a difference between the Full and Full-Y datasets? Generally, it is hard to understand exactly which datasets are used. Consider

Response: Thank you for your comment. The difference between the 'Full' and 'Full-Y' datasets lies in the use of our own collection of in-the-wild YouTube data. This was explained in Lines 304-308 and Lines 378-381; however, we agree that it was not sufficiently explained and have tried to improve it per Reviewer 1's comment. The wild YouTube data collection is also described in Lines 328-331.

 

We have improved the explanation in three ways:

  • First, we changed the notation 'Full-Y' to 'Full w/o Y' so that readers do not mistake it for a typo.
  • Second, we improved the caption of Table 7 and the text in Lines 304-307 and Lines 378-382 to explain the difference between the 'Full' and 'Full w/o Y' datasets more clearly.
  • Third, we made the collected YouTube dataset publicly available in our GitHub repository and noted this in Line 332.

 

All changes have been highlighted in yellow.

 

  4. I wonder why you use the Latin term "ablative" study. Consider using the common English term in the AI field, "ablation study."

Response: Thank you for the correction. We have changed the term to 'ablation' in the revised file. The changes have been highlighted in yellow.

 

  5. It would be very useful for the scientific community if the code for the experiments in this paper were made publicly available. This would help other scientists reproduce the results of the paper and accelerate further research.

Response: Thank you for your suggestion. We have provided a GitHub link to our experimental code and data in the paper (Line 70). The changes have been highlighted in yellow.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors propose a deep learning architecture for 3D human pose and shape estimation. The key aspect of the proposed work is its ability to achieve good performance using self-, weak, and semi-supervision during training. The authors combine approaches already existing in the literature (neural rendering) with additional loss functions that take into account spatial relations (jigsaw and rot) and pixel values (inpaint). The results show promising performance, and the architectural elements are correctly explained, even though reproducing the methods may not be straightforward for the community. I suggest that the authors share their implementation in an online repository.

Author Response

Response to the Reviewer Comments

 

Manuscript ID: applsci-1382845

Title: Towards Single 2D Image-level Self-supervision for 3D Human Pose and Shape Estimation

 

Response to the Reviewer

Many thanks to the Reviewer for the appreciation of our work. We have revised the paper to further improve the quality of the manuscript. Below is our response to your comment; the corresponding modification is highlighted in yellow in the draft.

 

Reviewer comments: The authors propose a deep learning architecture for 3D human pose and shape estimation. The key aspect of the proposed work is its ability to achieve good performance using self-, weak, and semi-supervision during training. The authors combine approaches already existing in the literature (neural rendering) with additional loss functions that take into account spatial relations (jigsaw and rot) and pixel values (inpaint). The results show promising performance, and the architectural elements are correctly explained, even though reproducing the methods may not be straightforward for the community. I suggest that the authors share their implementation in an online repository.

Response: Thank you for your suggestion. We have provided a GitHub link to our experimental code and data in the paper (Line 70) to help readers reproduce our results. The changes have been highlighted in yellow.

Author Response File: Author Response.pdf
