Article

3D Human Reconstruction from Monocular Vision Based on Neural Fields and Explicit Mesh Optimization

1 School of Software, Nanchang University, No. 235 Nanjing East Road, Nanchang 330096, China
2 Tongji University, Shanghai 200070, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2025, 14(22), 4512; https://doi.org/10.3390/electronics14224512
Submission received: 12 September 2025 / Revised: 4 November 2025 / Accepted: 15 November 2025 / Published: 18 November 2025
(This article belongs to the Special Issue 3D Computer Vision and 3D Reconstruction)

Abstract

Three-dimensional human reconstruction from monocular vision is a key technology for virtual reality and digital humans. It aims to recover the 3D structure and pose of the human body from 2D images or video. Current methods for dynamic 3D human reconstruction from monocular views still achieve limited accuracy, and the task remains challenging. This paper proposes a fast reconstruction method based on Instant Human Model (IHM) generation, which achieves highly realistic 3D reconstruction of the human body in arbitrary poses. First, the efficient dynamic human reconstruction method InstantAvatar is used to learn the shape and appearance of the human body in different poses. However, because it directly uses low-resolution voxels as the canonical-space human representation, it cannot achieve satisfactory reconstruction results on a wide range of datasets. Next, a voxel occupancy grid is initialized in the A-pose, and a voxel attention module is constructed to enhance reconstruction quality. Finally, the IHM method defines continuous fields on the surface, enabling highly realistic dynamic 3D human reconstruction. Experimental results show that, compared to the representative InstantAvatar method, IHM achieves a 0.1% improvement in SSIM and a 2% improvement in PSNR on the PeopleSnapshot benchmark, improving both reconstruction quality and detail. Specifically, through the voxel attention mechanism and adaptive iterative mesh optimization, IHM produces highly realistic 3D mesh models of human bodies in various poses while maintaining efficiency.

1. Introduction

With the rapid development of Virtual Reality (VR) technology [1,2,3,4], 3D human reconstruction holds significant application value in digital industries such as model assets, film production, the military, and entertainment. Three-dimensional human reconstruction [5,6,7,8] is a crucial task in computer vision and graphics, with important research value in areas such as human behavior recognition, pose estimation and motion capture, facial expression rendering, and clothing simulation.
Significant progress has been made in the field of 3D human reconstruction. The POP3D framework proposed in [9] effectively addresses the longstanding challenges of generalization and fidelity when reconstructing from a single RGB image. However, its training time grows significantly with the number of camera parameters; the framework currently requires approximately 7 h to reconstruct a single object on a single RTX 3090 GPU. There is also a global optimization method [10] that jointly learns skinning fields and surface normals in a canonical representation. This method effectively recovers surface details and supports animating human avatars in novel, unseen poses. However, to achieve high-quality textured 3D human model reconstruction, it relies on supervision from depth maps or scanned model assets.
Another novel approach is HumanNeRF [11], which delivers state-of-the-art results for free-viewpoint rendering of moving humans from monocular videos. By accurately modeling body poses and motion alongside a canonical optimization process, the method achieves high-fidelity results in this challenging scenario. Nevertheless, HumanNeRF models human shape and appearance in a pose-agnostic canonical space. To reconstruct models from images of humans in different poses, it incorporates human prior templates (e.g., animation skinning) and differentiable deformation and rendering algorithms to map the model to the posed space.
Similarly, InstantAvatar [12] still needs improvement in terms of detail capture and reconstruction accuracy in complex scenes. Human as Points represents the human body as explicit point clouds and reconstructs 3D human shapes from single-view RGB images. While a point-based representation offers great flexibility, it may fall short of the fine detail resolution attainable by other methods, affecting the realism and fidelity of the reconstruction. Additionally, PIFu [13] and its extension PIFuHD [14] are restricted in widespread application by high computational costs and input data quality requirements. HaMeR [15] introduces a self-supervised video framework that reduces hand-reconstruction error to 0.6 mm without requiring calibrated multi-view rigs, suggesting that voxel-attention-style occupancy refinement can be ported directly to finger-level detail. MultiPly [16] jointly optimizes multi-person SMPL-X poses and shared neural radiance fields, enabling unconstrained outdoor scenes but at the cost of 4× inference time; extending IHM’s voxel grid to instance-encoded channels offers a lightweight route to the same multi-subject capability. Vid2Avatar-Pro [17] delivers high-fidelity avatars but needs 30 min and a heavy pretrained prior; StruGauAvatar [18] renders at 60 fps with structured Gaussians yet yields bumpy, non-manifold surfaces; DressRecon [19] handles loose garments in 4D yet requires eight views; Surfel-GIR [20] offers relightable surfels but no watertight mesh. IHM keeps single-view input and outputs animation-ready meshes in 10 min, without universal prior pretraining.
From the analysis of these methods, we can see that although the combination of 3D models and images continues to improve in terms of accuracy and robustness, many methods still require high-performance hardware, and current technologies still perform inadequately in complex scenes or heavily occluded conditions. To reconstruct highly realistic 3D human models with arbitrary poses from monocular video, including realistic texture details such as clothing wrinkles in different poses, this paper proposes a novel Instant Human Model (IHM) generation method, combining the powerful flexibility of Neural Radiance Fields (NeRF) with the intuitive and controllable nature of explicit mesh modeling.
First, to more efficiently and accurately reconstruct and represent the 3D shape and texture of dynamic humans, the InstantAvatar method is used to learn the human body’s shape and appearance in the canonical space. Then, to optimize the original method of maintaining a voxel occupancy grid to skip sampling points in empty spaces, this paper introduces a voxel attention mechanism to enhance the importance of voxels during the reconstruction process, thereby improving reconstruction accuracy.
Next, the Marching Cubes algorithm [21] is used to extract a rough human mesh model from the enhanced voxel representation. At the same time, rendered images of the human body from different viewpoints are used to align the implicit and mesh models through model view projection (MVP) and mesh translation fields, further achieving adaptive iterative mesh optimization and learning high-quality texture maps. Given a canonical shape represented by an implicit function, we use IHM to define continuous fields on the surface, enabling detailed modeling of dynamic humans. By controlling SMPL parameters, arbitrary human pose meshes are quickly reconstructed, and specular and texture maps of the corresponding mesh model are learned.
Compared to methods relying solely on single images or depth map sequences, IHM combines the flexibility of NeRF in shape representation with the advantage of explicit mesh modeling in detail expression, achieving high-quality textured 3D human model reconstruction without relying on depth maps or scanned model assets for supervision. IHM targets AR/VR scenarios in which a 30 s mobile-phone selfie video can drive an avatar within 10 min; in such use cases, the reduction in user waiting time brought by a 36× speedup is more valuable than a 0.1% gain in SSIM.
Experimental results show that IHM outperforms baseline methods across multiple datasets, particularly in applications such as shape reconstruction, animation generation, and texture transfer. The following sections will discuss these findings in four parts.

2. Materials and Methods

Neural Radiance Fields (NeRF) [22] is a method for representing the 3D geometry and appearance of scenes using neural networks. By taking multiple 2D images from different viewpoints as input, NeRF can synthesize images from novel perspectives. Training NeRF typically requires a large amount of paired data, including multi-view images and their corresponding camera parameters [23,24]. Research has shown that data diversity is crucial for model learning; a lack of sufficient training samples can lead to poor reconstruction results, especially in complex scenes. Recent works have aimed at improving NeRF [25,26,27,28,29]. For example, NeRF++ attempts to address some challenges posed by complex environments (such as outdoor scenes) but still assumes that multiple high-quality views of the scene are available.
With the development of Virtual Reality (VR) and Augmented Reality (AR) technologies, the demand for high-quality 3D human models has been increasing. Traditional 3D human reconstruction techniques [30] typically rely on explicit geometric modeling or multi-view stereo methods to reconstruct 3D models through images from multiple viewpoints. For example, photometric stereo methods [31] infer an object’s 3D structure by analyzing variations in illumination. While these methods excel at reconstructing static objects, they often require very precise camera parameters and complex post-processing for dynamic targets like humans.
In recent years, deep learning-based methods have gained widespread application in the 3D reconstruction field [32,33], including techniques based on Convolutional Neural Networks (CNNs) [34] and Generative Adversarial Networks (GANs) [35]. These methods typically learn an end-to-end mapping from images to models, generating 3D shapes directly from 2D images without explicit geometric modeling. However, they often require a large amount of labeled data for supervised training and may produce 3D models that are blurry or lack detail.
Compared to traditional methods, a significant advantage of the proposed Instant Human Model (IHM) generation method is that it can generate 3D models from 2D video data without supervision. This flexibility allows it to operate effectively without a large amount of paired data, adapting to the difficulty of data collection in real-world scenarios. In particular, by being unsupervised and exploiting multi-view implicit rendering, IHM provides a more flexible and efficient solution for 3D human reconstruction, with a notable improvement in rendering quality, especially when handling dynamic poses and generating high-quality textured mesh models.
A flowchart of the instant human rendering method is shown in Figure 1. Given the frames of a monocular video sequence and the corresponding SMPL model parameters, IHM uses linear blend skinning in the first stage to transform sampled points from the canonical space to the deformed space. These sampled points are fed into a neural radiance field based on multi-resolution hash encoding to learn their volume density and color. The volume density is further passed to a voxel attention module that optimizes the density grid in canonical space, accelerating convergence and improving rendering. In the second stage, a coarse mesh model in a specified pose is exported using the Marching Cubes method. By employing model-view projection and introducing a translation vector, the rendered image set and the mesh model in the corresponding pose are aligned to obtain the mesh point cloud under a specified view. This point cloud is then fed into the hash-encoded neural radiance field to render a projected image, and the difference between the first-stage rendered pixels and the current rendered pixels is minimized. By translating mesh vertices and decimating or subdividing triangular faces, the projection error of the mesh model is reduced, so the mesh becomes progressively more refined over the iterations. After convergence, a high-fidelity texture map is generated.

2.1. Effective Dynamic Neural Radiance Fields

In the first stage, the human body shape and appearance in canonical space are modeled using a radiance field $F_{\sigma,c}$, which predicts the density $\sigma$ and color $c$ for each 3D sampling point in canonical space:
$$F_{\sigma,c}: \mathbf{x} \mapsto (\sigma, c)$$
Here, Instant-NGP is used to parameterize $F_{\sigma,c}$, achieving fast training and inference by storing feature grids at different coarse scales in a hash table. To predict the texture and geometric properties of a query point in space, features at the adjacent grid points are read and trilinearly interpolated, and the interpolated features from different levels are concatenated. The concatenated features are finally decoded by a shallow MLP. To create animations and learn from posed images using an articulated radiance field, a deformed radiance field $F'_{\sigma,c}$ needs to be generated in the target pose:
$$F'_{\sigma,c}: \mathbf{x}' \mapsto (\sigma, c)$$
It outputs the color and density for each point in pose space. In this work, a skinning weight field $w = (w_1, \ldots, w_{n_b})$ is defined in canonical space to model articulation, where $n_b$ is the number of bones in the skeleton. To reduce computational cost [36,37], a low-resolution voxel grid is used to represent the skinning weight field; the value of each grid point is set to the skinning weight of its nearest vertex on the SMPL model. With this canonical skinning weight field and the target bone transformation matrices $B = \{B_1, \ldots, B_{n_b}\}$, a point $\mathbf{x}$ in canonical space is transformed into the point $\mathbf{x}'$ in deformed space through linear blend skinning, as follows:
$$\mathbf{x}' = \sum_{i=1}^{n_b} w_i B_i \mathbf{x}$$
The canonical-space point $\mathbf{x}^*$ corresponding to a point $\mathbf{x}'$ in pose space is obtained by inverting this linear blend skinning equation, establishing the mapping $F'_{\sigma,c}(\mathbf{x}') = F_{\sigma,c}(\mathbf{x}^*)$.
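To make the warping step concrete, the following is a minimal PyTorch sketch of linear blend skinning as described above; the tensor shapes and the helper name linear_blend_skinning are illustrative assumptions, not the authors’ released code.

```python
import torch

def linear_blend_skinning(x_canonical, skin_weights, bone_transforms):
    """Warp canonical-space points into deformed (posed) space via LBS.

    x_canonical:     (N, 3) points in canonical space
    skin_weights:    (N, n_b) per-point skinning weights (each row sums to 1),
                     e.g. read from the low-resolution canonical weight grid
    bone_transforms: (n_b, 4, 4) target bone transformation matrices B_i
    """
    n = x_canonical.shape[0]
    ones = torch.ones(n, 1, device=x_canonical.device, dtype=x_canonical.dtype)
    x_h = torch.cat([x_canonical, ones], dim=-1)                          # homogeneous coords (N, 4)
    blended = torch.einsum('nb,bij->nij', skin_weights, bone_transforms)  # per-point blended transform
    x_deformed_h = torch.einsum('nij,nj->ni', blended, x_h)               # x' = (sum_i w_i B_i) x
    return x_deformed_h[:, :3]
```

The inverse mapping used at render time can be obtained by solving this warp for the canonical point, as in the deformers of [36,37].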

2.2. Voxel Attention Mechanism

Unlike InstantAvatar, which uses the voxel density grid directly, our method introduces a voxel attention mechanism to improve the optimization of the SMPL voxel density grid in canonical space. Following the empty-space skipping strategy [12], the cost of rebuilding an occupancy grid in every training iteration is no longer negligible. To avoid this overhead, a single occupancy grid is constructed for the entire sequence by recording the union of the occupied regions of all individual frames. Specifically, at the beginning of training, a 64 × 64 × 64 voxel occupancy grid in the A-pose is initialized in normalized space. Its density is queried from the posed radiance field and fed into the voxel attention module to obtain new voxel values, which are updated once per iteration by taking the moving average of the current occupancy value and the density queried from the posed radiance field in that iteration. In this space, the global orientation and translation are factored out of the voxel occupancy grid so that the union of occupied space is as tight as possible, further reducing unnecessary queries.
$$v_p' = V_{att}\big(F'_{\sigma,c}(v_p)\big)$$
The posed radiance field $F'_{\sigma,c}$ predicts the density and color of the sampled points, which are rendered into a new view through volume rendering. Given a pixel, a ray $\mathbf{r} = \mathbf{o} + t\mathbf{d}$ is cast, where $\mathbf{o}$ is the camera center and $\mathbf{d}$ is the ray direction. Points $\{\mathbf{x}_i\}_{i=1}^{N}$ are sampled along the ray at $N$ intervals between the near and far planes. Each sample is mapped back to canonical space, and its color and density are queried from the corresponding canonical NeRF model $F_{\sigma,c}$ through the posed radiance field $F'_{\sigma,c}$. The queried colors and densities are then accumulated along the ray to obtain the pixel color $C$:
$$C = \sum_{i=1}^{N} \Big( \prod_{j<i} (1-\alpha_j) \Big)\, \alpha_i c_i, \quad \text{with} \ \alpha_i = 1 - \exp(-\sigma_i \delta_i)$$
where $\delta_i = \lVert \mathbf{x}_{i+1} - \mathbf{x}_i \rVert$ is the distance between adjacent sampled points. Each attention weight $\alpha_v$ is computed by a 2-layer MLP that takes three inputs: (i) the current voxel density $\sigma_v$; (ii) the Euclidean distance from the voxel center to the nearest SMPL surface; (iii) the frequency of occupancy across the temporal window. The MLP outputs a scalar in $[0, 1]$, which is used to update the voxel occupancy via an exponential moving average with decay 0.95 and a learning rate of $1 \times 10^{-4}$ for 5000 iterations.
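As a concrete illustration, the snippet below sketches how such an attention-gated moving-average update of the occupancy grid could look in PyTorch; the module name VoxelAttention, the hidden width, and the flattened tensor shapes are assumptions made for the example, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class VoxelAttention(nn.Module):
    """Attention-gated update of the 64^3 occupancy grid (illustrative sketch)."""

    def __init__(self, hidden=16, decay=0.95):
        super().__init__()
        self.decay = decay
        self.mlp = nn.Sequential(            # the 2-layer MLP described above
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, occupancy, density, dist_to_smpl, occ_freq):
        # All inputs are flattened per-voxel tensors of shape (64**3,).
        feats = torch.stack([density, dist_to_smpl, occ_freq], dim=-1)
        attn = self.mlp(feats).squeeze(-1)                 # attention weight in [0, 1]
        # Exponential moving average between old occupancy and attended density.
        return self.decay * occupancy + (1.0 - self.decay) * attn * density
```

In training, this update would run once per iteration on the full grid, mirroring the moving-average rule described above.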

2.3. Mesh Adaptive Optimization and Texture Generation

As shown in Figure 2, to render an image, the mesh is rasterized, 3D positions are interpolated into image space, and joint optimization continues under a pixel color loss. The coarse mesh extracted from the density field by Marching Cubes often has defects: inaccurate vertex positions and densely, uniformly distributed triangular faces, which lead to large disk storage and slow rendering. The goal of this stage is to recover a fine mesh, comparable to an artist-made mesh, by refining vertex positions and face density.
Given an initial coarse mesh $M_{\text{coarse}} = (V, F)$, a trainable offset $\Delta v_i$ is assigned to each vertex $v_i \in V$. We use differentiable rendering [38] to optimize these offsets by backpropagating [39] the image-space loss gradient. Additionally, during experimentation we found a consistent displacement between the implicit NeRF model and the mesh model. We therefore introduce an overall mesh offset $\Delta \bar{V}$, defined as the average of all vertex offsets $\Delta v_i$, to align the implicit model with the explicit model.
During training, we reproject the 2D pixel rendering errors onto the corresponding mesh faces and accumulate per-face errors. After a certain number of iterations, all face errors $E_{\text{face}}$ are sorted to determine two thresholds:
$$e_{\text{subdivide}} = \operatorname{percentile}(E_{\text{face}}, 95), \qquad e_{\text{decimate}} = \operatorname{percentile}(E_{\text{face}}, 50)$$
Triangular faces with errors above $e_{\text{subdivide}}$ undergo midpoint subdivision [40] to increase triangle density, while those with errors below $e_{\text{decimate}}$ are decimated and remeshed to reduce triangle density. After the mesh is updated, the vertex offsets and face errors are reinitialized and training continues. This process is repeated several times until the second phase is complete.
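The error-driven schedule above can be summarized in a few lines of NumPy; this is a simplified sketch that assumes the per-face errors have already been accumulated into a 1-D array, and the function name refine_schedule is hypothetical.

```python
import numpy as np

def refine_schedule(face_errors):
    """Split faces into subdivide / decimate sets from accumulated errors."""
    e_subdivide = np.percentile(face_errors, 95)   # top 5% of reprojection error
    e_decimate = np.percentile(face_errors, 50)    # bottom half of the error range
    subdivide_ids = np.where(face_errors > e_subdivide)[0]   # midpoint subdivision
    decimate_ids = np.where(face_errors < e_decimate)[0]     # decimate and remesh
    return subdivide_ids, decimate_ids
```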
To export a textured surface mesh compatible with commonly used 3D hardware and software, the refined surface mesh $M_{\text{fine}}$ is obtained after the second phase of training, but its appearance is still encoded in the 3D color field. To extract the appearance as texture images, UV coordinates for $M_{\text{fine}}$ are first unwrapped using XAtlas [41]. Subsequently, the diffuse color $c_d$ and specular features $f_s$ of the surface are baked into separate images: a diffuse map $I_d$ (RGB texture) and a specular map $I_s$ [42], as shown in Figure 2. The two MLPs are consistent with those in Instant-NGP.

2.4. Training Loss

This term minimizes the pixel-wise color error between the rendered and the ground-truth image. We train the model by minimizing the error between the rendered pixel colors $C$ and the corresponding ground-truth pixel colors $C_{gt}$, using the more robust Huber loss $\rho$:
$$L_{rgb} = \rho(C - C_{gt})$$
To suppress floating, semi-transparent artifacts in space, we penalize deviations of the rendered silhouette from the foreground mask: we compute the loss between the accumulated opacity $\alpha$ of the rendered 2D image and the opacity $\alpha_{gt}$ of the human body mask in the dataset.
$$L_{\alpha} = \lvert \alpha - \alpha_{gt} \rvert$$
We encourage the density field to converge to a binary inside/outside decision, thereby eliminating ghost voxels. Following LOLNeRF [43], we add a further regularizer that encourages the NeRF model to predict solid surfaces:
$$L_{\text{hard}} = -\log\big( \exp(-\lvert \alpha \rvert) + \exp(-\lvert \alpha - 1 \rvert) \big) + \text{const}$$
where const is a constant to ensure that the loss value is non-negative. Encouraging solid surfaces helps speed up rendering by terminating rays early when the cumulative opacity reaches 1.
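For illustration, the sketch below composites one ray according to the volume rendering equation above and stops as soon as the accumulated opacity saturates, which is the early-termination behavior that the hard-surface loss encourages; the function name and the termination threshold are assumptions made for the example.

```python
import torch

def composite_ray(sigmas, deltas, colors, eps=1e-3):
    """Alpha-composite the samples of one ray, stopping once it is nearly opaque.

    sigmas, deltas: (N,) densities and inter-sample distances
    colors:         (N, 3) per-sample radiance
    """
    pixel = torch.zeros(3)
    transmittance = torch.tensor(1.0)
    for sigma_i, delta_i, c_i in zip(sigmas, deltas, colors):
        alpha_i = 1.0 - torch.exp(-sigma_i * delta_i)
        pixel = pixel + transmittance * alpha_i * c_i
        transmittance = transmittance * (1.0 - alpha_i)
        if transmittance < eps:          # accumulated opacity ~ 1: terminate early
            break
    return pixel
```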
Previous methods for learning human avatars [38,39] typically use the SMPL body model as a regularizer, encouraging the model to predict zero density for points outside the surface and solid density for points inside, which reduces artifacts near the body surface. However, this regularization makes strict assumptions about body shape and does not generalize well to loose clothing, since voxels outside the SMPL hull are forced to zero density to prevent shape leakage. We instead use the voxel attention mechanism to optimize the predicted occupancy grid, which helps better estimate human and clothing shapes, and define an additional loss $L_{\text{density}}$ that encourages zero density within blank cells of the occupancy grid:
$$L_{\text{density}} = \begin{cases} \lvert \sigma(\mathbf{x}) \rvert, & \mathbf{x} \ \text{lies in a blank cell of the occupancy grid} \\ 0, & \text{otherwise} \end{cases}$$
To prevent abrupt changes in geometry, a standard Laplacian smoothing loss $L_{\text{smooth}}$ [44] is applied to the mesh during the second training stage to keep the surface smooth. Additionally, we regularize the overall mesh offset and the independent vertex offsets using L2 losses:
$$L_{\text{offset1}} = \lVert \Delta \bar{V} \rVert^2, \qquad L_{\text{offset2}} = \sum_i \lVert \Delta v_i \rVert^2$$
where $L_{\text{offset1}}$ aligns the NeRF model with the mesh, and $L_{\text{offset2}}$ ensures that mesh vertices do not deviate excessively from their original positions.
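Putting the terms of this section together, the sketch below shows one plausible way to assemble the training objective in PyTorch; the loss weights, the value of the constant, and the omission of the Laplacian term are assumptions made for brevity, not values reported here.

```python
import math
import torch
import torch.nn.functional as F

def ihm_losses(rgb, rgb_gt, alpha, alpha_gt, sigma_blank,
               mesh_offset_global, vertex_offsets):
    """Assemble the Section 2.4 loss terms (Laplacian smoothing omitted for brevity)."""
    l_rgb = F.huber_loss(rgb, rgb_gt)                      # robust color loss
    l_alpha = (alpha - alpha_gt).abs().mean()              # silhouette / opacity loss
    # LOLNeRF-style hard-surface loss; the constant keeps the minimum at zero
    const = math.log(1.0 + math.exp(-1.0))
    l_hard = (-torch.log(torch.exp(-alpha.abs())
                         + torch.exp(-(alpha - 1.0).abs()))).mean() + const
    l_density = sigma_blank.abs().mean()                   # zero density in blank grid cells
    l_off1 = (mesh_offset_global ** 2).sum()               # global mesh translation offset
    l_off2 = (vertex_offsets ** 2).sum(dim=-1).mean()      # per-vertex offsets
    # Illustrative weights only.
    return l_rgb + 0.1 * l_alpha + 0.1 * l_hard + 0.1 * l_density + l_off1 + 0.01 * l_off2
```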

3. Results

To validate the effectiveness of the proposed method, we conducted experiments on multiple datasets, comparing the synthesis effects of new perspectives and poses of the human body. Experimental results show that our model outperforms baseline models on most tasks.

3.1. Datasets and Baseline Methods

PeopleSnapshot. We conducted experiments on the PeopleSnapshot [45] dataset, which contains videos of humans rotating in front of a camera, following the evaluation protocol defined in Anim-NeRF [46]. Since the pose parameters provided with this dataset are imperfect and do not always align with the images, and Anim-NeRF optimizes the human poses of the training and test frames, we use its optimized pose parameters to train our model and freeze them during training for a fair comparison.
NeuMan. Because the pose variation in PeopleSnapshot is limited to self-rotation, we also train on the real-world NeuMan dataset [47] to evaluate performance on more challenging test poses. NeuMan provides large pose sequences in real scenes, which fully exposes the influence of SMPL pose errors on reconstruction and is therefore used to evaluate the robustness of IHM. During training, we use the same partitioning strategy for the NeuMan dataset to ensure a fair comparison between our method and the baselines.
Our method also supports creating custom datasets [47] for human reconstruction through the following steps. First, convert the captured real videos into collections of image frames, then use Detectron2 for image segmentation to obtain masks. Next, perform sparse 3D scene reconstruction with COLMAP to obtain sparse point clouds, and establish dense correspondences between pixels and 3D body geometry using densepose_rcnn. Use OpenPose for human keypoint detection and BoostingMonocularDepth for monocular depth estimation. Finally, use ROMP for SMPL body model parameter estimation, SMPL alignment, and contour optimization.
Meanwhile, the following methods are considered as baselines:
Anim-NeRF [46]. This baseline uses an MLP-based NeRF to model human shape and appearance in canonical space. Given a pose, it first generates an SMPL body in the target pose. Then, for each query point in deformation space, the corresponding skinning weight is defined as the weighted average of the skinning weights of its K nearest vertices on the posed SMPL mesh. Finally, using these skinning weights, query points are transformed into canonical space via inverse LBS.
Neural Body [48]. This baseline learns a set of latent codes anchored to a deformable SMPL mesh. These latent codes deform with the SMPL mesh and are decoded into radiance fields for different poses.
InstantAvatar [12]. This baseline uses a multi-resolution hash-encoded neural radiance field to learn human appearance, combined with an SMPL voxelized density grid and an empty-space skipping strategy to accelerate rendering.

3.2. Experimental Setup and Training Details

The proposed method was trained on a single NVIDIA A6000 GPU. To ensure fair comparisons, the data processing scripts provided by InstantAvatar were used to preprocess the original PeopleSnapshot and NeuMan monocular 2D video datasets for training, and the same parameters were applied across the different methods to set up the training and testing splits. Because video lengths vary across sub-datasets, the split parameters also differ, as detailed in Table 1.

3.3. Model Evaluation Metrics

To evaluate the appearance quality of the reconstructed avatar, we animate and render the reconstructed model using test-frame poses from the PeopleSnapshot and NeuMan datasets, respectively. We compute the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) between the rendered images and the real images to quantify reconstruction quality. As shown in Table 2 and Table 3, the images generated by our method are significantly better than those of Neural Body [48] and achieve comparable quality to InstantAvatar.
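As an illustrative sketch of how these two metrics can be computed per frame (using scikit-image, which is an assumption; the evaluation toolkit is not stated here):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(rendered, ground_truth):
    """Per-frame PSNR/SSIM for float images in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, rendered, data_range=1.0)
    ssim = structural_similarity(ground_truth, rendered,
                                 data_range=1.0, channel_axis=-1)
    return psnr, ssim
```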
The training duration and computational resource cost of our method are comparable to those of InstantAvatar. High-quality rendering results can be generated on a single A6000 in just 1 min, significantly faster than the multi-hour training requirements of other methods like Anim-NeRF [46] and Neural Body [48]. Meanwhile, our method can export high-fidelity textured mesh models in arbitrary poses within 30 min. Anim-NeRF achieves a marginally higher SSIM (0.0006) on Female-3-casual. However, IHM maintains an advantage for real-time applications due to its significantly lower inference time (5 s/frame with poses frozen) compared to Anim-NeRF’s 180 s/frame, which requires test-time pose optimization. Due to the lack of real images and mesh data for new poses, this section only provides a qualitative evaluation of new pose synthesis and mesh generation. Figure 3 illustrates the process from inputting a monocular 2D video to implicitly rendering and generating a textured, high-quality human mesh model.
To probe robustness beyond the clean studio sequences, we additionally evaluated IHM on three challenging scenarios from the NeuMan dataset: (i) severe occlusion (parking lot scene with pedestrians); (ii) extreme pose (jogging with 90° hip flexion); (iii) night-time complex lighting (bike scene with flickering street-lamp). While IHM preserves global body shape, we observe slight texture blurring under low-light and minor finger artifacts under heavy occlusion, indicating room for future improvement.
Figure 4 demonstrates a comparison of the rendering effects between the current method and InstantAvatar. Compared to InstantAvatar, our method can effectively reduce floating artifacts and improve detail performance, such as the details of the palm in Figure 4. This benefit is attributed to the further application of the voxel attention mechanism, which optimizes the SMPL voxel occupancy grid in canonical space and improves the accuracy of the transformation from canonical space to pose space, thereby enhancing image quality and strengthening robustness and rendering effects in complex real-world scenarios.

3.4. Ablation Experiment

This section conducts an ablation study on the geometric optimization stage. Specifically, we compare the complete model with variations that exclude mesh translation regularization or the iterative mesh refinement process. The results indicate that:
After removing mesh translation regularization, due to significant translational errors between the mesh projection and the image under the supervised camera pose, the resulting surface mesh exhibits poor rendering quality. Irregularities and severe vertex deviations make the model very rough. Additionally, because the iterative mesh refinement process cannot adequately handle such irregular surfaces, the mesh size increases. These irregular surfaces also lead to chaotic textures.
When iterative mesh refinement is removed, the triangle density tends to be uniform, resulting in increased mesh size and slightly poorer rendering quality. This is because the facet density cannot adaptively adjust based on the reprojection rendering error.
However, by utilizing translation vectors, our current method can learn a smooth and closed mesh model. Table 4 and Figure 5 demonstrate the effects of our method before and after mesh optimization. There is a significant reduction in mesh vertices and triangles, along with a notable decrease in disk memory usage.

4. Discussion

4.1. Metrics

This paper proposes a novel 3D human reconstruction method called the Instant Human Model (IHM) generation method. By integrating 2D video data into the canonical space, it generates a rough, pose-controllable 3D human model without any supervision. The method then aligns the rough mesh using multi-view implicit rendering images and learns texture maps while adaptively iterating to optimize mesh vertices and triangles, enabling the export of human models in arbitrary poses. Compared to baseline methods, this algorithm not only enhances rendering quality but also achieves high-quality textured mesh model generation.

4.2. Limitations

Table 5 summarizes the principal strengths and limitations of the Instant Human Model (IHM) as observed in our experiments. Each row pairs a key component with its main benefit and its most relevant drawback. Overall, the table shows that IHM is best suited for scenarios where fast, single-view acquisition and compact, animation-ready meshes are required, and where the listed drawbacks (e.g., extra GPU memory or occasional topological artifacts) are acceptable.

5. Conclusions

In contrast to traditional photography methods that rely on heavy equipment, the technique described here reduces equipment dependency and lowers the cost and complexity of data acquisition. This advantage allows the method to quickly construct personalized 3D models under low-cost conditions, potentially broadening the application scope of this technology significantly.

Author Contributions

K.W.: literature search, study design, data interpretation, writing. X.X.: literature search, figures, study design, data analysis, data interpretation, writing. W.L.: literature search, data collection, data analysis. J.L.: data collection, data analysis. Z.W.: data collection, data analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62361041 and the Natural Science Foundation of Jiangxi Province under Grant 20242BAB25053.

Data Availability Statement

The data presented in this study are available in Neuman Dataset at https://doi.org/10.48550/arXiv.2203.12575, reference number [47]. These data were derived from the following resources available in the public domain: https://github.com/apple/ml-neuman; and People Snapshot Dataset at https://doi.org/10.1109/CVPR.2018.00875, reference number [5]. These data were derived from the following resources available in the public domain: https://graphics.tubs.de/people-snapshot.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

  1. Liu, L.; Zhao, K. Report on Methods and Applications for Crafting 3D Humans. arXiv 2024, arXiv:2406.01223. [Google Scholar] [CrossRef]
  2. Pavlakos, G.; Weber, E.; Tancik, M.; Kanazawa, A. The One Where They Reconstructed 3D Humans and Environments in TV Shows. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  3. Behrad, A.; Roodsarabi, N. 3D Human Motion Tracking and Reconstruction Using DCT Matrix Descriptor. ISRN Mach. Vis. 2012, 2012, 235396. [Google Scholar] [CrossRef]
  4. Wang, J.; Yoon, J.S.; Wang, T.Y.; Singh, K.K.; Neumann, U. Complete 3D Human Reconstruction from a Single Incomplete Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8748–8758. [Google Scholar]
  5. Alldieck, T.; Magnor, M.; Xu, W.; Theobalt, C.; Pons-Moll, G. Video based reconstruction of 3d people models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8387–8397. [Google Scholar]
  6. Jiang, L.; Li, M.; Zhang, J.; Wang, C.; Ye, J.; Liu, X.; Chai, J. Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images. arXiv 2021, arXiv:2106.11536. [Google Scholar]
  7. Retsinas, G.; Filntisis, P.P.; Danecek, R.; Abrevaya, V.F.; Roussos, A.; Bolkart, T.; Maragos, P. 3D Facial Expressions through Analysis-by-Neural-Synthesis. arXiv 2024, arXiv:2404.04104. [Google Scholar] [CrossRef]
  8. Pumarola, A.; Sanchez-Riera, J.; Choi, G.; Sanfeliu, A.; Moreno-Noguer, F. 3dpeople: Modeling the geometry of dressed humans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2242–2251. [Google Scholar]
  9. Ryu, N.; Gong, M.; Kim, G.; Lee, J.H.; Cho, S. 360° Reconstruction From a Single Image Using Space Carved Outpainting. arXiv 2023, arXiv:2309.10279. [Google Scholar]
  10. Dong, Z.; Guo, C.; Song, J.; Chen, X.; Geiger, A.; Hilliges, O. PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence. arXiv 2022, arXiv:2203.01754. [Google Scholar] [CrossRef]
  11. Weng, C.Y.; Curless, B.; Srinivasan, P.P.; Barron, J.T.; Kemelmacher-Shlizerman, I. HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video. arXiv 2022, arXiv:2201.04127. [Google Scholar]
  12. Jiang, T.; Chen, X.; Song, J.; Hilliges, O. Instantavatar: Learning avatars from monocular video in 60 seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16922–16932. [Google Scholar]
  13. Saito, S.; Huang, Z.; Natsume, R.; Morishima, S.; Kanazawa, A.; Li, H. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2304–2314. [Google Scholar]
  14. Saito, S.; Simon, T.; Saragih, J.; Joo, H. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. arXiv 2020, arXiv:2004.00452. [Google Scholar]
  15. Tu, Z.; Huang, Z.; Chen, Y.; Kang, D.; Bao, L.; Yang, B.; Yuan, J. Consistent 3d hand reconstruction in video via self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9469–9485. [Google Scholar] [CrossRef]
  16. Jiang, Z.; Guo, C.; Kaufmann, M.; Jiang, T.; Valentin, J.; Hilliges, O.; Song, J. Multiply: Reconstruction of multiple people from monocular video in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 109–118. [Google Scholar]
  17. Guo, C.; Li, J.; Kant, Y.; Sheikh, Y.; Saito, S.; Cao, C. Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5559–5570. [Google Scholar]
  18. Zhi, Y.; Sun, W.; Chang, J.; Ye, C.; Feng, W.; Han, X. StruGauAvatar: Learning Structured 3D Gaussians for Animatable Avatars from Monocular Videos. IEEE Trans. Vis. Comput. Graph. 2025, 31, 7820–7833. [Google Scholar] [CrossRef]
  19. Tan, J.; Xiang, D.; Tulsiani, S.; Ramanan, D.; Yang, G. Dressrecon: Freeform 4d human reconstruction from monocular video. In Proceedings of the 2025 International Conference on 3D Vision (3DV), Singapore, 25–28 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 250–260. [Google Scholar]
  20. Zhao, Y.; Wu, C.; Huang, B.; Zhi, Y.; Zhao, C.; Wang, J.; Gao, S. Surfel-based Gaussian Inverse Rendering for Fast and Relightable Dynamic Human Reconstruction from Monocular Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 11502–11518. [Google Scholar] [CrossRef]
  21. Lorensen, W.; Cline, H. Marching cubes: A high resolution 3D surface construction algorithm. In Seminal Graphics: Pioneering Efforts That Shaped the Field; Association for Computing Machinery: New York, NY, USA, 1998. [Google Scholar]
  22. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv 2020, arXiv:2003.08934. [Google Scholar] [CrossRef]
  23. Martin-Brualla, R.; Radwan, N.; Sajjadi, M.S.M.; Barron, J.T.; Dosovitskiy, A.; Duckworth, D. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. arXiv 2021, arXiv:2008.02268. [Google Scholar] [CrossRef]
  24. Deng, K.; Liu, A.; Zhu, J.Y.; Ramanan, D. Depth-supervised NeRF: Fewer Views and Faster Training for Free. arXiv 2024, arXiv:2107.02791. [Google Scholar]
  25. Zhang, K.; Riegler, G.; Snavely, N.; Koltun, V. NeRF++: Analyzing and Improving Neural Radiance Fields. arXiv 2020, arXiv:2010.07492. [Google Scholar]
  26. Barron, J.T.; Mildenhall, B.; Verbin, D.; Srinivasan, P.P.; Hedman, P. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. arXiv 2022, arXiv:2111.12077. [Google Scholar]
  27. Xing, Y.; Yang, Q.; Yang, K.; Xu, Y.; Li, Z. Explicit-NeRF-QA: A Quality Assessment Database for Explicit NeRF Model Compression. arXiv 2024, arXiv:2407.08165. [Google Scholar]
  28. Kouros, G.; Wu, M.; Shrivastava, S.; Nagesh, S.; Chakravarty, P.; Tuytelaars, T. Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction. arXiv 2023, arXiv:2308.08530. [Google Scholar]
  29. Billouard, C.; Derksen, D.; Sarrazin, E.; Vallet, B. SAT-NGP: Unleashing Neural Graphics Primitives for Fast Relightable Transient-Free 3D Reconstruction From Satellite Imagery. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 8749–8753. [Google Scholar] [CrossRef]
  30. Wang, Z. 3D Representation Methods: A Survey. arXiv 2024, arXiv:2410.06475. [Google Scholar] [CrossRef]
  31. Wang, X.; Guo, Y.; Deng, B.; Zhang, J. Lightweight Photometric Stereo for Facial Details Recovery. arXiv 2020, arXiv:2003.12307. [Google Scholar] [CrossRef]
  32. Bai, Y.; Wong, L.; Twan, T. Survey on Fundamental Deep Learning 3D Reconstruction Techniques. arXiv 2024, arXiv:2407.08137. [Google Scholar] [CrossRef]
  33. Vinodkumar, P.K.; Karabulut, D.; Avots, E.; Ozcinar, C.; Anbarjafari, G. Deep Learning for 3D Reconstruction, Augmentation, and Registration: A Review Paper. Entropy 2024, 26, 235. [Google Scholar] [CrossRef]
  34. Zheng, Z.; Yu, T.; Wei, Y.; Dai, Q.; Liu, Y. DeepHuman: 3D Human Reconstruction from a Single Image. arXiv 2019, arXiv:1903.06473. [Google Scholar] [CrossRef]
  35. Xiong, Z.; Kang, D.; Jin, D.; Chen, W.; Bao, L.; Cui, S.; Han, X. Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using Pixel-aligned Reconstruction Priors. arXiv 2023, arXiv:2302.01162. [Google Scholar]
  36. Chen, X.; Zheng, Y.; Black, M.J.; Hilliges, O.; Geiger, A. SNARF: Differentiable Forward Skinning for Animating Non-Rigid Neural Implicit Shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  37. Chen, X.; Jiang, T.; Song, J.; Rietmann, M.; Geiger, A.; Black, M.J.; Hilliges, O. Fast-SNARF: A fast deformer for articulated neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11796–11809. [Google Scholar] [CrossRef]
  38. Laine, S.; Hellsten, J.; Karras, T.; Seol, Y.; Lehtinen, J.; Aila, T. Modular Primitives for High-Performance Differentiable Rendering. ACM Trans. Graph. (ToG) 2020, 39, 1–14. [Google Scholar] [CrossRef]
  39. Munkberg, J.; Hasselgren, J.; Shen, T.; Gao, J.; Chen, W.; Evans, A.; Müller, T.; Fidler, S. Extracting triangular 3d models, materials, and lighting from images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 8280–8290. [Google Scholar]
  40. Cignoni, P.; Callieri, M.; Corsini, M.; Dellepiane, M.; Ganovelli, F.; Ranzuglia, G. Meshlab: An open-source mesh processing tool. In Proceedings of the Eurographics Italian Chapter Conference, Salerno, Italy, 2–4 July 2008; Volume 2008, pp. 129–136. [Google Scholar]
  41. Tang, J.; Zhou, H.; Chen, X.; Hu, T.; Ding, E.; Wang, J.; Zeng, G. Delicate textured mesh recovery from nerf via adaptive surface refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17739–17749. [Google Scholar]
  42. Chen, Z.; Funkhouser, T.; Hedman, P.; Tagliasacchi, A. Mobilenerf: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16569–16578. [Google Scholar]
  43. Rebain, D.; Matthews, M.; Yi, K.M.; Lagun, D.; Tagliasacchi, A. Lolnerf: Learn from one look. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1558–1567. [Google Scholar]
  44. Nealen, A.; Igarashi, T.; Sorkine, O.; Alexa, M. Laplacian mesh optimization. In Proceedings of the 4th International Conference on Computer Graphics and Interactive Techniques in Australasia and Southeast Asia, Kuala Lumpur, Malaysia, 29 November–2 December 2006; ACM: New York, NY, USA, 2006. [Google Scholar]
  45. Yang, Z.; Chen, W.; Wang, F.; Xu, B. Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018. [Google Scholar]
  46. Chen, J.; Zhang, Y.; Kang, D.; Zhe, X.; Lu, H. Animatable Neural Radiance Fields from Monocular RGB Video. arXiv 2021, arXiv:2106.13629. [Google Scholar] [CrossRef]
  47. Jiang, W.; Yi, K.M.; Samei, G.; Tuzel, O.; Ranjan, A. NeuMan: Neural Human Radiance Field from a Single Video. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
  48. Peng, S.; Zhang, Y.; Xu, Y.; Wang, Q.; Shuai, Q.; Bao, H.; Zhou, X. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9054–9063. [Google Scholar]
Figure 1. Flowchart for the instant human rendering method.
Figure 2. Mesh adaptive optimization and texture generation flowchart.
Figure 3. Rendering effects of the current method compared to InstantAvatar on the Female-4-Casual subset of the PeopleSnapshot dataset and the Citron, Parking Lot, and Lab scenes in the NeuMan dataset.
Figure 4. Effects of the current method on new pose and new view synthesis, and textured mesh generation across different sub-datasets.
Figure 5. Using the Male-3-Casual subject as an example, the refined mesh has significantly fewer triangular faces compared to the original.
Table 1. The division parameters.

                     Train                    Test
                     Start    End    Skip     Start    End    Skip
Male-3-casual        0        455    4        456      675    4
Male-4-casual        0        659    6        660      872    6
Female-3-casual      0        445    4        446      647    4
Female-4-casual      0        335    4        335      523    4
bike                 0        104    1        103      104    4
seattle              0        37     1        36       37     4
lab                  0        102    1        101      102    4
citron               0        103    1        102      103    4
jogging              0        42     1        41       42     4
parkinglot           0        41     1        40       41     4
Table 2. Comparison of results on the representative PeopleSnapshot sub-datasets.

                   Male-3-Casual        Male-4-Casual        Female-3-Casual      Female-4-Casual
                   PSNR↑     SSIM↑      PSNR↑     SSIM↑      PSNR↑     SSIM↑      PSNR↑     SSIM↑
Neural Body        24.94     0.9428     24.71     0.9469     23.87     0.9504     24.37     0.9451
Anim-NeRF          29.37     0.9703     28.37     0.9605     28.91     0.9743     28.90     0.9678
InstantAvatar      29.65     0.9730     27.97     0.9649     27.90     0.9722     28.92     0.9692
Ours               29.90     0.9740     28.19     0.9661     28.70     0.9737     29.57     0.9715
Table 3. Comparison of results on the NeuMan dataset.

                InstantAvatar           Ours
                PSNR↑      SSIM↑        PSNR↑      SSIM↑
bike            24.18      0.9498       24.78      0.9515
seattle         26.57      0.9674       26.74      0.9685
lab             27.34      0.9731       27.41      0.9732
citron          24.95      0.9491       25.01      0.9526
jogging         23.40      0.9338       23.74      0.9368
parkinglot      24.14      0.9520       24.13      0.9547
Table 4. Statistics of the mesh for Male-3-Casual before and after optimization.

                           Triangular Faces    Vertices    Disk Usage
Before mesh optimization   785,896             400,355     14.3 MB
After mesh optimization    21,035              63,105      2.07 MB
Table 5. Pros and cons of the proposed Instant Human Model (IHM).

Module               Advantage                               Disadvantage
Voxel Attention      Captures finger-gap/cloth wrinkles      GPU memory +15%
Mesh Iteration       70% fewer faces, 10× storage saving     May produce non-manifold edges
Depth-free           Selfie video only, no extra sensor      Loose skirts can stick to legs
Hash-NGP backbone    5 s/frame inference on RTX 4090         SSIM gain marginal (+0.1%)

