3D Human Reconstruction from Monocular Vision Based on Neural Fields and Explicit Mesh Optimization
Abstract
1. Introduction
2. Materials and Methods
2.1. Effective Dynamic Neural Radiance Fields
2.2. Voxel Attention Mechanism
2.3. Mesh Adaptive Optimization and Texture Generation
2.4. Training Loss
3. Results
3.1. Datasets and Baseline Methods
3.2. Experimental Setup and Training Details
3.3. Model Evaluation Metrics
3.4. Ablation Experiment
4. Discussion
4.1. Metrics
4.2. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
| Sequence | Train Start | Train End | Train Skip | Test Start | Test End | Test Skip |
|---|---|---|---|---|---|---|
| Male-3-casual | 0 | 455 | 4 | 456 | 675 | 4 |
| Male-4-casual | 0 | 659 | 6 | 660 | 872 | 6 |
| Female-3-casual | 0 | 445 | 4 | 446 | 647 | 4 |
| Female-4-casual | 0 | 335 | 4 | 335 | 523 | 4 |
| bike | 0 | 104 | 1 | 103 | 104 | 4 |
| seattle | 0 | 37 | 1 | 36 | 37 | 4 |
| lab | 0 | 102 | 1 | 101 | 102 | 4 |
| citron | 0 | 103 | 1 | 102 | 103 | 4 |
| jogging | 0 | 42 | 1 | 41 | 42 | 4 |
| parkinglot | 0 | 41 | 1 | 40 | 41 | 4 |
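Assuming the Start/End bounds in the split table are inclusive and Skip is a frame stride (an interpretation of the table, not stated explicitly in the source), the frame selection can be sketched as follows; the function name is illustrative:

```python
def frame_indices(start: int, end: int, skip: int) -> list[int]:
    """Sample every `skip`-th frame in the inclusive range [start, end]."""
    return list(range(start, end + 1, skip))

# Male-3-casual row of the split table: train on frames 0..455 (stride 4),
# test on frames 456..675 (stride 4).
train = frame_indices(0, 455, 4)
test = frame_indices(456, 675, 4)
print(len(train), len(test))  # 114 55
```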
| Method | Male-3-Casual PSNR↑ | SSIM↑ | Male-4-Casual PSNR↑ | SSIM↑ | Female-3-Casual PSNR↑ | SSIM↑ | Female-4-Casual PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|---|---|---|
| Neural Body | 24.94 | 0.9428 | 24.71 | 0.9469 | 23.87 | 0.9504 | 24.37 | 0.9451 |
| Anim-NeRF | 29.37 | 0.9703 | 28.37 | 0.9605 | 28.91 | 0.9743 | 28.90 | 0.9678 |
| InstantAvatar | 29.65 | 0.9730 | 27.97 | 0.9649 | 27.90 | 0.9722 | 28.92 | 0.9692 |
| Ours | 29.90 | 0.9740 | 28.19 | 0.9661 | 28.70 | 0.9737 | 29.57 | 0.9715 |
| Sequence | InstantAvatar PSNR↑ | InstantAvatar SSIM↑ | Ours PSNR↑ | Ours SSIM↑ |
|---|---|---|---|---|
| bike | 24.18 | 0.9498 | 24.78 | 0.9515 |
| seattle | 26.57 | 0.9674 | 26.74 | 0.9685 |
| lab | 27.34 | 0.9731 | 27.41 | 0.9732 |
| citron | 24.95 | 0.9491 | 25.01 | 0.9526 |
| jogging | 23.40 | 0.9338 | 23.74 | 0.9368 |
| parkinglot | 24.14 | 0.9520 | 24.13 | 0.9547 |
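For reference, the metrics reported in the tables above can be computed as follows. PSNR is fully determined by the mean squared error; SSIM as usually reported averages a local statistic over Gaussian windows (e.g. via `skimage.metrics.structural_similarity`), so the single-window NumPy version below is only a sketch for intuition, not the exact evaluation protocol:

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(max_val**2 / mse)

def ssim_global(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """SSIM computed over the whole image as a single window (simplified)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

# A uniform 0.1 offset gives MSE = 0.01, hence PSNR = 20 dB.
print(psnr(np.zeros((8, 8)), np.full((8, 8), 0.1)))  # 20.0
```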
| | Triangle Faces | Vertices | Disk Usage |
|---|---|---|---|
| Before mesh optimization | 785,896 | 400,355 | 14.3 MB |
| After mesh optimization | 21,035 | 63,105 | 2.07 MB |
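The pre-optimization disk figure is roughly consistent with raw binary geometry storage (3 × float32 coordinates per vertex, 3 × int32 indices per face); the post-optimization file is larger than this bound, presumably because additional attributes such as texture coordinates are stored (an assumption, not stated in the table):

```python
def raw_mesh_megabytes(n_vertices: int, n_faces: int) -> float:
    """Raw binary size: 3 float32 coords per vertex + 3 int32 indices per face."""
    return (n_vertices * 3 * 4 + n_faces * 3 * 4) / 1e6

before = raw_mesh_megabytes(400_355, 785_896)
after = raw_mesh_megabytes(63_105, 21_035)
print(f"{before:.1f} MB, {after:.1f} MB")  # 14.2 MB, 1.0 MB
```

The ~14.2 MB estimate matches the table's 14.3 MB before optimization; the ~1.0 MB estimate falls short of the reported 2.07 MB after optimization, consistent with extra per-vertex data in the exported mesh.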
| Module | Advantage | Disadvantage |
|---|---|---|
| Voxel attention | Captures fine detail such as finger gaps and cloth wrinkles | GPU memory +15% |
| Mesh iteration | ~70% fewer faces; ~10× storage savings | May produce non-manifold edges |
| Depth-free | Works from a selfie video alone; no extra sensor needed | Loose skirts can stick to the legs |
| Hash-NGP backbone | 5 s/frame inference on an RTX 4090 | Marginal SSIM gain (+0.1%) |
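The paper's voxel attention module is not reproduced here; as a rough sketch of the general idea only, scaled dot-product attention over a flattened voxel feature grid can be written as follows (all names and shapes are illustrative assumptions; full attention is O(N²) in the number of voxels, so a practical module would restrict attention to local neighborhoods):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def voxel_attention(feat: np.ndarray, w_q: np.ndarray,
                    w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """feat: (D, H, W, C) feature grid; w_*: (C, C) learned projections.

    Flattens the grid to N = D*H*W tokens and applies scaled dot-product
    attention across all voxels, then restores the grid shape.
    """
    d, h, w, c = feat.shape
    tokens = feat.reshape(-1, c)                    # (N, C) voxel tokens
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)   # (N, N) attention weights
    return (attn @ v).reshape(d, h, w, c)
```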
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, K.; Xie, X.; Li, W.; Liu, J.; Wang, Z. 3D Human Reconstruction from Monocular Vision Based on Neural Fields and Explicit Mesh Optimization. Electronics 2025, 14, 4512. https://doi.org/10.3390/electronics14224512

