3D Human Reconstruction from Monocular Vision Based on Neural Fields and Explicit Mesh Optimization
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes an Instant Human Model (IHM) generation method for monocular vision-based 3D human reconstruction, which innovatively integrates the flexibility of Neural Radiance Fields (NeRF) with the detailed expression advantages of explicit mesh modeling. It provides a new solution to the problems of low accuracy and high reliance on high-cost data supervision in existing monocular dynamic human reconstruction methods. The paper provides comprehensive experiments that demonstrate the effectiveness of the proposed method. However, the work has some problems in its writing, contribution, and experiments.
- The specific calculation logic of the voxel attention mechanism is not clarified. Although the formula is provided, the basis for assigning attention weights (e.g., whether voxel density, spatial position, or semantic information is incorporated) is not explained. Moreover, key parameters such as the number of iterations and learning rate for "updating voxel occupancy values via moving average" are not mentioned, which affects the reproducibility of the method.
- The paper does not discuss the performance of its proposed method in several key challenging scenarios of monocular 3D human reconstruction. These scenarios include severe occlusions, extreme poses and complex lighting—all of which are widely recognized as core obstacles in this field. Without an analysis of how the method performs under such conditions, the existing experimental results fail to fully demonstrate the method’s robustness, as they cannot reflect its adaptability and stability when facing real-world reconstruction challenges beyond the tested dataset scenarios.
- Some sections are overly technical, such as those involving loss function derivations, and lack sufficient intuitive explanation. Additionally, the paper fails to provide a clearer motivation for each component—for instance, the reason why voxel attention performs better than simple occupancy grids—and supplementing this kind of motivation would greatly improve the readability of the paper.
- The discussion section only emphasizes the method's advantages and lacks an objective analysis of limitations. It is suggested that the authors add a dedicated "Method Limitations" section to objectively analyze the impact of voxel resolution, SMPL model assumptions, and data preprocessing errors on reconstruction results. This can be achieved by comparing experimental results with different voxel resolutions or quantifying the correlation between SMPL parameter errors and reconstruction errors to clarify the method's applicable scope and future improvement directions.
- Some important related works are missing from the discussion and analysis. Specifically, high-fidelity hand reconstruction techniques (e.g., [R1]) and multi-person reconstruction in the wild ([R2]) are highly relevant for positioning the contributions and scope of the proposed IHM method.
[R1] Consistent 3D Hand Reconstruction in Video via Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), 9469–9485, 2023.
[R2] MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 109–118, 2024.
Comments for author File:
Comments.pdf
The writing should be improved.
Author Response
Comments 1: The specific calculation logic of the voxel attention mechanism is not clarified. Although the formula is provided, the basis for assigning attention weights (e.g., whether voxel density, spatial position, or semantic information is incorporated) is not explained. Moreover, key parameters such as the number of iterations and learning rate for "updating voxel occupancy values via moving average" are not mentioned, affecting the reproducibility of the method.
Response 1: Thanks for your comments. We have amended Section 2.2 immediately after Eq. (5) by inserting two sentences that explicitly list the three inputs used for the attention weight (density, distance to the SMPL surface, and temporal frequency) and by stating the moving-average decay (0.95) and learning rate (1×10⁻⁴) used for 5,000 iterations.
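For intuition, the moving-average occupancy update described in this response can be sketched as follows. This is a minimal illustration using only the decay value stated above (0.95); the function name, array names, and pruning threshold are hypothetical, not the paper's actual implementation:

```python
import numpy as np

def update_occupancy(occupancy, density, decay=0.95, threshold=0.01):
    """Exponential-moving-average update of voxel occupancy values.

    occupancy: current per-voxel occupancy estimates (any shape)
    density:   freshly queried per-voxel densities from the radiance field
    decay:     moving-average decay (0.95, as stated in the revision)
    """
    # Blend the old estimate with the new observation.
    occupancy = decay * occupancy + (1.0 - decay) * density
    # Voxels whose smoothed occupancy stays below the threshold can be
    # skipped during ray marching.
    active = occupancy > threshold
    return occupancy, active

# Toy usage: a 4x4x4 grid repeatedly observing a constant density field.
occ = np.zeros((4, 4, 4))
dens = np.full((4, 4, 4), 0.5)
for _ in range(100):
    occ, active = update_occupancy(occ, dens)
# After many iterations the estimate converges toward the observed density.
```

The decay of 0.95 means each new density observation contributes 5% per update, so stale occupancy values fade out smoothly rather than flickering between frames.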
Comments 2: The paper does not discuss the performance of its proposed method in several key challenging scenarios of monocular 3D human reconstruction. These scenarios include severe occlusions, extreme poses and complex lighting—all of which are widely recognized as core obstacles in this field. Without an analysis of how the method performs under such conditions, the existing experimental results fail to fully demonstrate the method’s robustness, as they cannot reflect its adaptability and stability when facing real-world reconstruction challenges beyond the tested dataset scenarios.
Response 2: Thanks for your comments. Three additional paragraphs have been added at the end of Section 3.3. They report qualitative findings on (i) severe occlusion, (ii) extreme poses, and (iii) complex lighting using the NeuMan "parking-lot", "jogging", and "bike" clips.
Comments 3: Some sections are overly technical, such as those involving loss function derivations, and lack sufficient intuitive explanation. Additionally, the paper fails to provide a clearer motivation for each component—for instance, the reason why voxel attention performs better than simple occupancy grids—and supplementing this kind of motivation would greatly improve the readability of the paper.
Response 3: Thanks for your comments. To improve readability we have inserted a plain-English motivation sentence at every loss paragraph in Section 2.4. Each sentence briefly states the artifact the loss suppresses before presenting the mathematical form.
Comments 4: The discussion section only emphasizes the method's advantages and lacks an objective analysis of limitations. It is suggested that the authors add a dedicated "Method Limitations" section to objectively analyze the impact of voxel resolution, SMPL model assumptions, and data preprocessing errors on reconstruction results. This can be achieved by comparing experimental results with different voxel resolutions or quantifying the correlation between SMPL parameter errors and reconstruction errors to clarify the method's applicable scope and future improvement directions.
Response 4: Thanks for your comments. To illustrate the advantages and disadvantages of this method, we added Table 5 in the Discussion and accompanied it with explanatory text.
Comments 5: Some important related works are missed to discuss and analysis. Specifically, the high-fidelity hand reconstruction techniques (e.g., [R1]) and multi-person reconstruction in the wild ([R2]) are highly relevant for positioning the contributions and scope of the proposed IHM method.
Response 5: Thanks for your comments. We have cited the suggested methods, added several of the latest methods from 2025, and discussed them in the Introduction.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper tackles the important task of 3D human reconstruction from monocular vision. To solve this problem, the authors have proposed the Instant Human Model (IHM) generation method. The proposed method is interesting and technically sound. However, there are two major problems with this paper.
- Many related works published this year are missing.
All the papers cited and discussed in this manuscript are from 2024 or earlier. The field of 3D human reconstruction is evolving rapidly, and the manuscript is missing crucial developments from 2025.
- The authors have compared their method with the following existing methods:
a) Anim-NeRF, published in 2021
b) Neural body, published in 2021
c) InstantAvatar, published in 2023
These methods were published in 2023 or earlier and are not state-of-the-art.
Many papers were published in 2025; the authors should consider discussing these recent papers and comparing with them, for example:
- [1] Guo, Chen, et al. "Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
- [2] Zhi, Yihao, et al. "StruGauAvatar: Learning Structured 3D Gaussians for Animatable Avatars from Monocular Videos." IEEE Transactions on Visualization and Computer Graphics (2025).
- [3] Tan, Jeff, et al. "Dressrecon: Freeform 4d human reconstruction from monocular video." 2025 International Conference on 3D Vision (3DV). IEEE, 2025.
- [4] Zhao, Yiqun, et al. "Surfel-based Gaussian Inverse Rendering for Fast and Relightable Dynamic Human Reconstruction from Monocular Videos." IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
Typos:
- line 26, [5-8] should be placed before the period, i.e., "[5-8]." instead of ".[5-8]".
- line 148, notation sigma and c should be subscripted.
Author Response
Comments 1: Missing lots of related works published in this year.
Response 1: Thanks for your comments. We have cited the suggested methods and discussed them in the Introduction.
Comments 2: These methods were published before 2023, and are not state-of-the-art methods.
Response 2: Thanks for your comments. We excluded Vid2Avatar-Pro and Surfel-GIR from the main benchmark for two practical reasons that directly affect deployability. Vid2Avatar-Pro demands a 30-minute per-subject optimization phase (even after the heavy universal-prior pre-training) and peaks at 18 GB of GPU memory, making it impossible to run on the laptop RTX 4090 setup used for all other methods. Surfel-GIR, although fast and relightable, produces a discrete surfel cloud whose normals are inconsistent and which contains 5–7% non-manifold edges after Poisson reconstruction; this violates the watertight, animation-ready mesh requirement of our evaluation protocol.
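For context on the criterion invoked above: an edge of a triangle mesh is non-manifold when more than two faces share it, which is what breaks the watertight, animation-ready property. A minimal, library-free way to quantify the fraction of such edges might look like the sketch below (the face list is a toy example, not Surfel-GIR output):

```python
from collections import Counter

def non_manifold_edge_ratio(faces):
    """Fraction of mesh edges shared by more than two triangles."""
    edge_counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store edges with sorted vertex indices so (u, v) == (v, u).
            edge_counts[tuple(sorted((u, v)))] += 1
    n_edges = len(edge_counts)
    n_bad = sum(1 for n in edge_counts.values() if n > 2)
    return n_bad / n_edges if n_edges else 0.0

# Three triangles fanning around the shared edge (0, 1): that edge is
# used by three faces, so it is non-manifold.
faces = [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
ratio = non_manifold_edge_ratio(faces)  # 1 bad edge out of 7 edges
```

A fully closed (watertight) mesh, such as a tetrahedron, has every edge shared by exactly two faces and would score 0.0 under this measure.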
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript, titled "3D Human Reconstruction from Monocular Vision Based on Neural Fields and Explicit Mesh Optimization," addresses a relevant and timely topic in the field of computer vision and 3D modeling. The article presents an approach called Instant Human Model (IHM), combining neural radiance fields (NeRF) with explicit mesh optimization. This combination aims to improve the accuracy and geometric consistency of single-camera 3D reconstruction, compared to existing methods such as InstantAvatar and HumanNeRF. The work is interesting and well-organized, but several aspects require clarification and improvement before it can be considered for publication.
The abstract is informative but too dense. It would benefit from being streamlined and better highlighting the main contributions of the article. It would also be desirable to explain more explicitly what differentiates the proposed method from existing approaches. The reported performance gain in SSIM (0.1%) seems small; it would be useful to discuss its real-world significance and practical implications. The introduction is well documented, but it contains redundancies and overly descriptive passages. It should place greater emphasis on the specific scientific problem and the novelty of the contribution. The reader does not always clearly perceive the fundamental difference between IHM and previous models. It would be desirable to explain the methodological choice of using InstantAvatar as a basis, and to justify how this choice offers a scientific or technical advantage.
The methodology is detailed and technically sound, but it could benefit from greater clarity. Some equations are presented without definitions of the notations, which makes it difficult to read. A summary diagram of the overall pipeline would provide a better understanding of how the proposed model works. It would also be useful to provide an intuitive explanation of the main modules before presenting the mathematical formulations. The author should clarify the computational complexity added by the voxel attention module, as well as how the model learns geometry and texture (simultaneously or sequentially). These clarifications would help to better understand the actual technical contribution of the method. The experimental results are well presented, but the improvements achieved remain modest. The tables show limited gains in PSNR and SSIM, which do not appear significant given the model's complexity. It would be important to include a visual comparison of the 3D reconstructions to illustrate the qualitative differences. Training and inference times should also be specified to assess the model's computational efficiency. The experimental study lacks details on the test conditions: hardware used, datasets, and training parameters. A discussion of failure cases or observed limitations would strengthen the credibility of the demonstration.
The discussion is relevant but overly descriptive. It should include a critical analysis of the method's weaknesses and possible avenues for improvement. It would be useful to assess the model's computational robustness (size, number of parameters, computational time) and to explore application prospects, including generalization to multiple subjects, complex scenes, or integration into augmented reality environments. This reflection would give the paper greater scientific depth. The conclusion summarizes the results well but remains too general. It should better highlight the specific contribution of this work compared to existing methods and open up concrete perspectives, such as optimizing the model for real-time applications or generalizing it to dynamic scenes.
Formally, the manuscript is well written overall, but some sentences are too long and could be simplified for greater clarity. The figures sometimes lack explanatory captions, and some references are redundant. The title is clear but could be slightly abbreviated to gain impact, for example: “Instant Human Model: Monocular 3D Human Reconstruction via Neural Fields and Mesh Optimization.”
From a scientific perspective, the work is promising but requires additional clarification to properly situate its contribution in relation to the state of the art. It would be useful to evaluate the model's ability to handle scenes with multiple individuals, occlusions, or rapid movements. The authors could also compare their approach to more recent methods such as Gaussian Splatting. A discussion of the dataset size required for a stable reconstruction would strengthen the experimental evaluation.
Author Response
Comments 1: The abstract and introduction lack conciseness and fail to clearly differentiate the proposed IHM method's core novelty and significant advantage over existing approaches.
Response 1: Thanks for your comments. Our IHM model specifically targets a niche scenario for lightweight AR/VR applications, namely generating a drivable avatar within 10 minutes from a mere 30-second mobile selfie video, and prioritizes user experience through significant speedup. The achieved 36× acceleration ratio, which drastically reduces user waiting time, holds greater product value in this context than a marginal 0.1% improvement in SSIM (a common image-quality metric). We have also added a brief description of this point in the penultimate paragraph of the Introduction.
Comments 2: Methodology Clarity & Justification: The technical description needs improved clarity.
Response 2: Thanks for your comments. To improve readability we have inserted a plain-English motivation sentence at every loss paragraph in Section 2.4. Each sentence briefly states the artifact the loss suppresses before presenting the mathematical form.
Comments 3: Insufficient Experimental Validation & Analysis: The experimental section has significant gaps.
Response 3: Thanks for your comments. Three additional paragraphs have been added at the end of Section 3.3. They report qualitative findings on (i) severe occlusion, (ii) extreme poses, and (iii) complex lighting using the NeuMan "parking-lot", "jogging", and "bike" clips.
We have also amended Section 2.2 immediately after Eq. (5) by inserting two sentences that explicitly list the three inputs used for the attention weight (density, distance to the SMPL surface, and temporal frequency) and by stating the moving-average decay (0.95) and learning rate (1×10⁻⁴) used for 5,000 iterations.
Comments 4: Lack of Critical Discussion & Future Scope: The discussion is overly descriptive.
Response 4: Thanks for your comments. To illustrate the advantages and disadvantages of this method, we added Table 5 in the Discussion, along with explanatory text.
Comments 5: Title: Could be slightly shortened for impact.
Response 5: Thanks for your comments. Keyword completeness: the 20-word string contains all critical search terms used in SCOPUS/Web of Science ("monocular", "3D human reconstruction", "neural fields", "mesh"), which significantly improves discoverability. We have nevertheless shortened the subtitle by removing "Optimization" and "Explicit" to trim four words while preserving informativeness. If the reviewer insists on a headline of fewer than 15 words, we will happily condense it to "Instant Human Model: Fast Monocular 3D Reconstruction with NeRF and Mesh".
Reviewer 4 Report
Comments and Suggestions for Authors
Dear authors,
Although interesting, your paper has some issues that should be resolved:
- Discussion and Conclusion sections are weakly written and should be enriched.
- A table with the advantages and disadvantages of the proposed method would be welcome.
- You haven't mentioned a reference for NeuMan in line 248.
- Table 2: Anim-NeRF seems better than the proposed method. You haven't addressed this issue.
- Why would training time play a vital role in choosing the proposed method when the results are not better?
- Can you point to some niche where it is better to use the proposed method rather than Anim-NeRF?
Kind regards
Comments on the Quality of English Language
Line 212: "This paper trains" should be rephrased. By "this" do you mean the proposed method?
Line 215: "this paper calculates". Again, does "this" mean your proposal? Also, a paper cannot calculate or train.
Check the entire text for such errors.
Line 232: "Additionally, this chapter regularizes the overall mesh...". Here "chapter" is the wrong subject; a chapter is not a living being and cannot perform an action.
Check the entire manuscript for such mistakes.
Furthermore, there are a lot of spaces where there shouldn't be and missing spaces where there should be, e.g.:
- line 24: "technology[1–4] , 3D"
- line 26: "entertainment. [5–8]3D"
- line 31: "by [9] , the"
Check the entire manuscript for such errors.
Author Response
Comments 1: Discussion and Conclusion sections are weakly written and should be enriched.
Response 1: Thanks for your comments. We have divided the Discussion section into two subsections and added content to expand it.
Comments 2: Some table with advantages and disadvantages of the proposed would be welcome.
Response 2: Thanks for your comments. To illustrate the advantages and disadvantages of this method, we added Table 5 in the Discussion, along with explanatory text.
Comments 3: You haven't mention reference for NeuMan in line 248.
Response 3: Thanks for your comments. We have added the relevant reference and a description of the dataset after this sentence.
Comments 4: Table 2: Anim-NeRF seems better than the proposed. You haven't addressed this issue.
Response 4: Thanks for your comments. On Female-3-casual, Anim-NeRF has a slight lead (0.0006), but our method is many times faster. Anim-NeRF requires test-time pose optimization and takes 180 seconds per frame, whereas IHM's inference time is 5 seconds per frame with poses frozen, significantly lower. Relevant descriptions have been added in the Model Evaluation Metrics subsection of the Results.
Comments 5: Why would training time play vital role to choose the proposed when the results are not better?
Response 5: Thanks for your comments. Our IHM model specifically targets a niche scenario for lightweight AR/VR applications, namely generating a drivable avatar within 10 minutes from a mere 30-second mobile selfie video, and prioritizes user experience through significant speedup. The achieved 36× acceleration ratio, which drastically reduces user waiting time, holds greater product value in this context than a marginal 0.1% improvement in SSIM (a common image-quality metric). We have also added a brief description of this point in the penultimate paragraph of the Introduction.
Comments 6: Can you point to some niche where it is better to use the proposed vs Anim-NeRF?
Response 6: Thanks for your comments. IHM has the advantage of being fast and efficient. The following two scenarios are examples:
Example 1: Real-time live streaming. "Sports stream (fast martial arts + single camera): IHM renders at 5 s/frame, enabling near-real-time virtual avatar deployment. Anim-NeRF requires 180 s/frame, making live deployment impractical."
Example 2: Mobile social AR onboarding. "Mobile social AR (30 s selfie): IHM reconstructs an animatable model in under 10 minutes. Anim-NeRF needs more than 3 hours on a desktop GPU, losing critical social timeliness."
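As a sanity check, the 36× acceleration ratio quoted in this review round is consistent with the per-frame timings stated in Response 4, assuming the ratio is computed from per-frame inference times (the numbers below are taken from the responses, not newly measured):

```python
# Per-frame inference times reported in Response 4 (seconds).
anim_nerf_s = 180.0  # Anim-NeRF, with test-time pose optimization
ihm_s = 5.0          # IHM, with poses frozen

speedup = anim_nerf_s / ihm_s
print(f"IHM is {speedup:.0f}x faster per frame")  # prints "IHM is 36x faster per frame"
```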
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed the reviewers' concerns; I have no further questions.
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks for the efforts and improvements made by the authors.
Reviewer 4 Report
Comments and Suggestions for Authors
I hope that my comments resulted in a better manuscript.
