Taxonomy and Survey of Current 3D Photorealistic Human Body Modelling and Reconstruction Techniques for Holographic-Type Communication

The continuous evolution of video technologies is now primarily focused on enhancing 3D video paradigms and consistently improving their quality, realism, and level of immersion. Both the research community and industry are working to improve 3D content representation, compression, and transmission. Their collective efforts culminate in the striving for real-time transfer of volumetric data between distant locations, laying the foundation for holographic-type communication (HTC). However, to truly enable a realistic holographic experience, the 3D representation of the HTC participants must accurately convey the real individuals' appearance, emotions, and interactions by creating authentic and animatable 3D human models. In this regard, our paper examines the most recent and widely acknowledged works in the realm of 3D human body modelling and reconstruction. In addition, we provide insights into the datasets and the 3D parametric body models utilized by the examined approaches, along with the employed evaluation metrics. Our contribution involves organizing the examined techniques, comparing them based on various criteria, and creating a taxonomy rooted in the nature of the input data. Furthermore, we discuss the assessed approaches with respect to different indicators and their applicability to HTC.


Introduction
Technological advancements have initiated the era of HTC. As explained in [1], HTC involves the transition from one person's actual location to another without the need for a physical traversal of the intervening space. However, the actual enablement of a truly immersive holographic experience necessitates the creation of convincing Mixed Reality (MR) environments, incorporating virtual elements and lifelike human avatars. The support of natural interactions between the virtual participants (the avatars), manipulated by real individuals, is one of the greatest aspects distinguishing future HTC from conventional voice- and video-based modes of communication. Moreover, besides HTC, many other applications in the field of healthcare, such as remote consultations, surgical training, remote collaboration, remote rehabilitation, etc. [2][3][4][5][6][7][8]; education, including remote collaborative learning, remote guest speakers, anatomy education, etc. [9][10][11][12][13]; entertainment, such as interactive storytelling, Augmented Reality (AR)/Virtual Reality (VR) gaming, live concerts, etc. 
[14][15][16][17][18]; and e-commerce, in particular virtual try-on [19][20][21], will benefit from a visually authentic and interactive human appearance. Both academia and industry attempt to automate the detailed acquisition of 3D human pose and shape. The availability of sophisticated 3D acquisition equipment and powerful reconstruction algorithms has made realistic avatar generation possible. In fact, significant advancements have been made in this field as digital avatars progressively acquire greater lifelike qualities, leading to increased trust among individuals. However, replicating real social interactions, including eye contact, body language, and conveying emotions through nonverbal cues (such as touch) and social signals (such as coexistence, closure, and intimacy), remains a formidable challenge, even with the current state of technological advancements. Therefore, the creation of lifelike human avatars that accurately represent both human appearance and behavior is a prominent subject within the realm of holographic experiences.
So far, a multitude of distinct methods for 3D human modelling and reconstruction have been developed, and tremendous efforts are still ongoing in this direction. However, the existing methods exhibit significant diversity in terms of whether they employ a parametric model, the chosen reconstruction approach, the dataset utilized, and, most notably, the input data type. Previous surveys have placed emphasis on variations in the parametric modelling of the 3D human body shape [22], as well as on the types of reconstruction approaches, such as traditional, regression-based, or optimization-based methods, among others [23][24][25][26]. In contrast, beyond reviewing parametric models, datasets, and evaluation metrics, our work endeavors to provide a clear distinction among diverse 3D modelling approaches based on the type of input data. To this end, we survey existing 3D human body modelling and reconstruction techniques and establish a taxonomy categorizing the methods into image-based, video-based, and depth-based approaches.

Parametric Human Body Models
Building lifelike human avatars and their subsequent animation is one of the main challenges facing HTC. There is a need for creating accurate representations of HTC participants, which necessitates the detailed reconstruction of 3D digital human models. The generation of such models requires individual- or population-based anthropometric data. Anthropometric measurements are used to describe a person's physical appearance. They are estimates of the distances (both linear and curved) between anatomical landmarks or circumferences at specific human body regions of interest. Height (stature), weight (body mass), upright sitting height, triceps skinfold, arm circumference (upper arm girth), abdominal circumference (waist circumference), calf circumference, knee height, and elbow breadth are all common anthropometric measurements [28]. An anthropometric database must be extremely thorough in order to be credible for a specific group and to account for multivariate coherences.
Constructing an accurate human body model from various types of input data, such as single images, multi-view images, videos, or depth maps, is a great challenge. Existing methods for fitting a pose to the input data typically rely on parametric, statistical human body models. Such an approach usually requires the indication of body joints, which is mostly carried out manually, but automatic and semi-automatic methods [29] also exist. Further, deep neural networks have recently been used to compute statistical models' parameters [30]. These types of modelling techniques have become an integral component of recent methods for 3D human body reconstruction and animation. Here, we present two of the most popular statistical body models and one that is a promising improvement of the second.

SCAPE
The authors of [31] introduced SCAPE (Shape Completion and Animation of People). It presents a data-driven approach for creating a human body model that considers variations in both shape and posture. Their methodology involves the development of two distinct models for body deformation: one that accounts for deformations resulting from changes in an individual's pose and another that captures deformations across various body shapes among different individuals. To accomplish this, a specific dataset of human body scans was collected. It comprises a pose dataset containing multiple pose scans of a specific individual and a body shape dataset containing scans of multiple individuals in similar poses.
SCAPE considers the pose and shape deformations over each of the mesh triangles, p_k, with triangle points x_{k,1}, x_{k,2}, and x_{k,3}. In particular, deformations are applied over the triangle edges v_{k,j} = x_{k,j} − x_{k,1}, where j = 2, 3. A specific triangle's deformation is given in Equation (1).
where v′_{k,j} corresponds to the edges of the transformed triangle, R_{l[k]} is the rigid rotation matrix shared by all triangles in the mesh that belong to the specific body part l[k], and Q_k is a linear transformation matrix that is associated with the non-rigid pose-induced deformations and is specific to each triangle. S_k is another linear transformation matrix that corresponds to body-shape-induced deformations. Both deformation matrices, Q_k and S_k, are not known but can be obtained by using the preliminary model-learned parameters {a_k} and {U, µ}. Here, R corresponds to the joint angles representing the relative rotations of the two rigid parts adjacent to each joint, and β corresponds to the body shape parameters. Both the joint rotations and the body shape parameters are provided by the user. On the other hand, {a_k} corresponds to the learned SCAPE model parameters that are related to the pose-induced body deformations, and {U, µ} are the learned PCA parameters that capture the space of model shape deformations. Finally, given all the transformation matrices, R, Q, and S, associated with a specific pose and body shape, a completely new body mesh, Y, can be synthesized according to Equation (2), where y_{j,k} corresponds to the specific triangle points of the generated model.
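Based on the definitions above, Equations (1) and (2) can be sketched as follows. This is our reconstruction of the standard SCAPE formulation from the surrounding text; the exact notation in the original paper may differ slightly:

```latex
% Eq. (1): deformation of a single triangle edge
v'_{k,j} = R_{l[k]}\, S_k\, Q_k\, v_{k,j}, \qquad j = 2, 3
% Eq. (2): least-squares synthesis of the new mesh Y from the
% deformed edges, solved over the unknown vertex positions y
Y = \arg\min_{\{y_1, \dots, y_N\}} \sum_{k} \sum_{j = 2, 3}
    \left\| R_{l[k]} S_k Q_k v_{k,j} - \left( y_{j,k} - y_{1,k} \right) \right\|^2
```

The least-squares form reflects that the deformed edges alone do not fix absolute vertex positions, so the mesh is recovered as the best-fitting consistent surface.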
Figure 3 illustrates a block diagram detailing the SCAPE body model generation process. It delineates three separate blocks representing the input parameters, namely the SCAPE mean template shape; the user-defined parameters, R and β; and the learned SCAPE parameters, {a_k} and {U, µ}. The last two sets of parameters are essential for constructing the pose-induced (Q_k) and the shape-induced (S_k) deformation matrices. Within the SCAPE model generation block, these matrices, along with R_{l[k]}, modify the SCAPE template mesh and thus generate the new body model.
Although SCAPE has high fidelity, it lacks the ability to capture a strong correlation between body shape and muscle deformation, for which a more expressive model is needed. This may be due to the fact that the model is learned separately for pose and shape variations. However, developing a method that simultaneously uses scans from different people in different poses would require a different approach.

SMPL
More recent methods for 3D human body reconstruction use the SMPL (Skinned Multi-Person Linear) model [32]. Similar to SCAPE, the body model is examined in two different aspects: identity-dependent shape and non-rigid pose-dependent shape. In contrast to SCAPE, where mesh triangles are primarily utilized, SMPL adopts a vertex-based skinning approach. A mean template mesh, T ∈ R^{3N}, in the zero pose θ* underlies the model, where N is the total number of vertices. The model is also defined by the following functions. A blend shape function, B_S(β, S) : R^{|β|} → R^{3N}, takes as input the shape parameters, β, and a set of learned body shape parameters, S, and outputs a blend shape sculpting the subject identity according to Equation (3). The function J(β) : R^{|β|} → R^{3K} predicts the locations of the K skeletal joints with respect to the subject-specific body shape according to Equation (4), where V_shaped = T + B_S(β; S). A pose-dependent blend shape function, B_P(θ, P) : R^{|θ|} → R^{3N}, takes as input the pose parameters, θ, and a set of learned body pose parameters, P, and outputs blend shapes affected by pose-dependent deformations, as given in Equation (5).
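For reference, Equations (3)-(5) can be sketched as below. This is our reconstruction from the definitions above, following the standard SMPL formulation; R_n(θ) denotes the elements of the concatenated part rotation matrices, so the pose blend shape is measured relative to the zero pose θ*:

```latex
% Eq. (3): identity-dependent shape blend shape
B_S(\beta; S) = \sum_{n=1}^{|\beta|} \beta_n S_n
% Eq. (4): joint locations regressed from the shaped template
J(\beta) = \mathcal{J}\left( \bar{T} + B_S(\beta; S) \right)
% Eq. (5): pose-dependent blend shape, relative to the zero pose
B_P(\theta; P) = \sum_{n=1}^{9K} \left( R_n(\theta) - R_n(\theta^*) \right) P_n
```

Subtracting R_n(θ*) ensures the pose blend shape vanishes in the rest pose, so the template is reproduced exactly when θ = θ*.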
Then, a blend skinning function, W(·), rotates the mesh vertices around the estimated joint centers with respect to the set of learned blend weights W ∈ R^{3N×K}. The resulting model is described by M(β, θ, T, S, J, P, W) : R^{|β|×|θ|} → R^{3N} and is finally defined in Equation (6), where {T, W, S, J, P} is the full set of SMPL model parameters. Except for the mean template model, T, the rest are the learnable model pose (W, J, P) and model shape (S) parameters obtained during training. In contrast, β and θ are passed by the user and control the learned parameters, generating a completely new body model. T_P(β, θ, T, S, P) accounts for the offset from the template model caused by identity-dependent and pose-dependent shape deformations, Equation (7).
T_P(β, θ, T, S, P) = T + B_S(β, S) + B_P(θ, P). Figure 4 visualizes a block diagram of the SMPL human body model generation process. The figure defines three separate blocks for the input parameters, namely the SMPL mean template shape, T; the user parameters, β and θ; and the learned SMPL parameters, W, S, J, P. The user parameters and the learned SMPL parameters are used to generate the shape blend shapes, the joint locations of the new body shape, and the pose blend shapes. In the shape and pose correction block, the obtained shape and pose blend shapes are added to the SMPL template mesh in order to create a template offset, T_P. The generation of a new body model is indicated in the SMPL model generation block, where the offset template mesh, the predicted joint locations, the pose parameters, and the blend weights are passed to a standard blend skinning function, W(·).
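The blend skinning step W(·) can be illustrated with a minimal NumPy sketch. This is a simplified illustration, not the SMPL reference implementation: it assumes the per-joint world rotations have already been composed along the kinematic tree, whereas SMPL chains relative rotations internally.

```python
import numpy as np

def blend_skinning(vertices, joints, rotations, weights):
    """Linear blend skinning: every vertex is moved by a weighted
    combination of per-joint rigid transforms (rotation about the
    joint centre). Shapes: vertices (N, 3), joints (K, 3),
    rotations (K, 3, 3), weights (N, K) with rows summing to 1."""
    out = np.zeros_like(vertices)
    for k in range(joints.shape[0]):
        # Rigid transform of joint k applied to all vertices
        moved = (vertices - joints[k]) @ rotations[k].T + joints[k]
        out += weights[:, k:k + 1] * moved
    return out
```

With identity rotations the function returns the input mesh unchanged; rotating a single joint bends only the vertices weighted to it, which is exactly the behavior the learned blend weights W control in SMPL.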
Since its creation in 2015, the SMPL model has been extensively utilized in various reconstruction algorithms due to its open-source nature, compatibility with diverse datasets, and widespread popularity, making it a cornerstone in 3D human body research.

STAR
Although the SMPL model is widely adopted due to its intuitive parametrization, it suffers from several drawbacks, as indicated in [33]. The first limitation is the huge number of parameters resulting from the use of global blend shapes. Since each vertex in the mesh is related to every joint in the kinematic tree, the pose-corrective offsets may capture spurious long-range correlations, resulting in less realistically generated models. The authors of [33] define STAR (Sparse Trained Articulated Human Body Regressor), in which subsets of mesh vertices influenced by specific joint movements are learned. This is reflected by applying per-joint pose correctives, obtaining better results in terms of deformation realism with a reduced number of model parameters. A second limitation of the SMPL model is the separate treatment of pose-dependent deformations and body shape. The authors of STAR argue for the simultaneous consideration of both body pose and BMI (Body Mass Index) by learning shape-dependent pose-corrective blend shapes. Third, the SMPL training dataset comprises relatively few body scans, limiting the shape space. In contrast, the STAR model is trained with additional body scans, resulting in better model generalization.
Similar to the SMPL model, STAR is a vertex-based model that also factors the body shape into the subject's identity shape and pose-dependent deformations. However, contrary to the SMPL model, the authors assume that the pose-corrective deformation is a function of both body pose, θ ∈ R^{|θ|}, and shape, β ∈ R^{|β|}. Additionally, during training, a subset of vertices relevant to a specific joint, j, is learned, and the pose-corrective blend shape function is applied to it. A template model, T ∈ R^{3N}, where N is the total number of vertices, is subject to deformation by a shape-corrective blend shape function, B_S, expressing the subject's identity, and a pose-corrective blend shape function, B_P, expressing the subject's pose with realistic shape deformation.
The shape-corrective blend shape function B_S(β; S) : R^{|β|} → R^{3N} is defined in Equation (8), where β are the shape coefficients and S is a set of learned parameters that express the principal components capturing the shape variability space. Further, the pose-corrective blend shape function with respect to the subject's pose, B_P(q, β_2; K, A) : R^{|q|×1} → R^{3N}, and BMI is defined in Equation (9). In this case, a pose-corrective function is applied for each joint, j, in the kinematic tree independently by B^j_P(q_ne(j), β_2; K_j, A_j), where K is the total number of joints (the root joint is not considered), q_ne(j) ⊂ q is a subset that contains a single joint, j, and its direct neighbors in the kinematic tree, β_2 corresponds to the PCA coefficient of the second principal component, which is highly related to the BMI, K_j ∈ R^{3N×|q_ne(j)|+1} is a linear regressor weight matrix, and A_j corresponds to the activation weights for each vertex. The last two terms are learned during training.
The template mesh with an added pose- and shape-corrective offset, T_P, is defined in Equation (10). Finally, a standard skinning function, W, is applied, considering the transformed mesh, T_P; the full set of predicted body joints, J(β, J, T, S) = J(T + B_S(β; S)), J ∈ R^{3K}; and a learned set of blend weight parameters, W. The STAR model is defined in Equation (11). A block diagram of the STAR human body model generation process is given in Figure 5. Since the STAR model builds on the SMPL model, both block diagrams look quite similar. However, the STAR modifications are highlighted in the light red blocks. First, the input learned parameters include K and A instead of P. Then, the pose blend shape function takes different input parameters and is applied for each joint, j, in the kinematic tree. The generation of the template offset and the new STAR model in the respective blocks is similar to that of the SMPL model, with a difference in the function parameters. The remaining blocks are the same as those of the SMPL model.
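From the description above, Equations (10) and (11) can be sketched as follows. This is our reconstruction from the definitions in the text; the notation mirrors the SMPL equations, with the pose-corrective term now depending on both pose and the BMI-related shape coefficient β_2:

```latex
% Eq. (10): template mesh with pose- and shape-corrective offsets
T_P(\beta, \theta) = \bar{T} + B_S(\beta; S) + B_P(q, \beta_2; K, A)
% Eq. (11): final STAR model via standard blend skinning
\mathrm{STAR}(\beta, \theta) =
    W\!\left( T_P(\beta, \theta),\; J(\beta),\; \theta,\; \mathcal{W} \right)
```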

Human Body Datasets
Human bodies are flexible, moving in various ways and deforming their clothing and muscles. Complications such as the occlusion of different body parts during movement may necessitate comprehensive scene modelling in addition to modelling the people in the scene. Such image understanding scenarios test, for example, an avatar body animation system's ability to use prior knowledge and structural correlations by constraining estimates of unseen body components using the limited visible information. Insufficient data coverage is one of the most significant issues for trainable systems. Consequently, many researchers have concentrated their efforts on creating publicly available datasets that can be used to build operational systems for realistic scenarios.
One of the largest motion capture datasets is the Human 3.6 M dataset [34]. It consists of 3.6 million fully visible human poses and corresponding images, all captured by a high-speed motion capture system. The recording setup consists of 15 sensors (4 calibrated high-resolution progressive scan cameras that acquire video data at 50 Hz, 1 time-of-flight sensor, and 10 motion cameras), using hardware and software synchronization, which allows for accurate capture and synchronization. The dataset contains activities performed by 11 professional actors (6 male, 5 female) in 17 scenarios: taking photos, discussing, smoking, talking on the phone, etc. Accurate 3D joint positions and joint angles from the high-speed motion capture system are also provided. Other useful additions are 3D laser scans of the actors, high-resolution videos, and accurate background subtraction.
The MPI-INF-3DHP [35] is a 3D human body pose estimation dataset. It consists of both constrained indoor and complex outdoor scenes. The dataset comprises eight actors (4 male, 4 female) enacting eight distinct activity sets, each lasting about a minute. With a diverse camera setup of 14 cameras in total, over 1.3 M frames have been obtained, with 500 k originating from cameras at chest height. The dataset provides genuine 3D annotations and a skeleton compatible with the "universal" skeleton of Human 3.6 M. To bridge the gap between studio and real-world conditions, chroma-key masks are available, facilitating extensive scene augmentation. The test set, enriched with various motions, camera viewpoints, clothing varieties, and outdoor settings, aims to challenge and benchmark pose estimation algorithms.
The Synthetic Humans for Real Tasks (SURREAL) dataset [36] contains 6.5 M frames of synthetic humans, organized into 67,582 continuous sequences. The SMPL [32] body model is employed to generate these synthetic bodies, with body deformations distinguished by pose and intrinsic shape. Created in 2017, it is the first large-scale person dataset to provide depth, body parts, optical flow, 2D/3D pose, surface normals, and ground truth for Red Green Blue (RGB) video input. The provided images are photorealistic renderings of people in different shapes, textures, viewpoints, and poses.
Dynamic Fine Alignment Using Scan Texture (DFAUST) [37] is considered a 4D dataset. It consists of high-resolution 3D scans of moving non-rigid objects, captured at 60 fps. A new mesh registration method is proposed that uses both 3D geometry and texture information to register all scans in a sequence to a common reference topology. The method exploits texture constancy across short and long time intervals and deals with temporal offsets in shape and texture.
Microsoft Common Objects in Context (MS COCO) [38] is a large-scale object detection, segmentation, and captioning dataset. Although it covers many other object categories, it also includes humans and photos of humans. The dataset offers recognition in context, superpixel stuff segmentation, and 250,000 people with keypoints.
Leeds Sports Pose (LSP) [39] and its extended version, LSPe [40], are human body joint detection datasets. The LSPe dataset contains 10,000 images gathered from Flickr searches for the tags 'parkour', 'gymnastics', and 'athletics' and features poses that are challenging to estimate. Each image has a corresponding annotation that might not be highly accurate, as it was gathered via Amazon Mechanical Turk. Each image is annotated with up to 14 visible joint locations.
The Bodies Under Flowing Fashion (BUFF) dataset, as delineated by the authors of [41], offers over 11,000 3D human body models engaged in complex movements.It is distinctive in its inclusion of videos featuring individuals in clothing paired with 3D models devoid of clothing textures.This dataset emerges from a multi-camera active stereo system, utilizing 22 pairs of stereo cameras, color cameras, speckle projectors, and white-light LED panels operating at varied frame rates.This system outputs 3D meshes averaging around 150 K vertices, capturing subjects in two distinct clothing styles.Of the initial six subjects, the data from one were withheld, resulting in a public release of 11,054 scans.To derive a semblance of "ground truth", the subjects were captured in minimal attire, with the dataset's accuracy showcased by the proximity of more than half of the scan points to the mean of the estimates.The BUFF dataset efficiently captures detailed aspects of human movement while also considering the impact of different clothing on a body's shape and motion.
The HumanEva datasets [42], comprising HumanEva-I and HumanEva-II, offer a blend of video recordings and motion capture data from subjects performing predefined actions.HumanEva-I encompasses data from four subjects executing six distinct actions, each with synchronized video and motion capture, and one with only motion capture.This dataset leverages seven synchronized cameras, utilizing multi-view video data coupled with pose annotations.On the other hand, HumanEva-II focuses on two subjects, both of whom are also present in HumanEva-I, performing an extended "Combo" sequence, resulting in roughly 2500 synchronized frames.The data, collected under controlled indoor conditions, capture the intricacies of natural movement, albeit with challenges posed by illumination and grayscale imagery.
The UCLA Human-Human-Object Interaction (HHOI) dataset [43] is a novel RGB-D (Red Green Blue-Depth) video collection detailing both human-human and human-object-human interactions, captured using a Microsoft Kinect v2 sensor. Comprising three human-human interactions (hand shakes, high-fives, and pull-ups) and two human-object-human interactions (throw and catch, and hand over a cup), the dataset features an average of 23.6 instances per interaction. These instances are performed by eight actors, recorded from multiple angles, and span 2-7 s at a frame rate of 10-15 fps. While objects within the dataset are discerned using background subtraction on both RGB and depth images, the Microsoft Kinect v2's skeleton estimation is also utilized. The dataset, divided into four distinct folds for training and testing, ensures no overlap of actor combinations between the sets. The training algorithm demonstrates robust convergence within 100 iterations, operating on a standard 8-core 3.6 GHz machine and yielding an average synthesis speed of 5 fps using unoptimized Matlab code.
To address prevalent challenges in viewpoint-invariant pose estimation, a novel technical solution has been presented in [44]. It integrates local pose details into a learned, viewpoint-invariant feature space. This approach enhances the iterative error feedback model to incorporate higher-order temporal dependencies and adeptly manage occlusions via a multi-task learning methodology. Complementing this endeavor is the introduction of the Invariant Top View (ITOP) dataset, a comprehensive collection of 100 K depth images capturing 20 individuals across 15 diverse actions, encompassing a wide range of views (front, top, and side), inclusive of occluded body segments. Each image in the ITOP dataset is meticulously labeled with precise 3D joint coordinates relative to the camera's perspective. With its unique blend of front/side and top views, the latter captured from ceiling-mounted cameras, the ITOP dataset stands as a significant resource for benchmarking and furthering advancements in viewpoint-independent pose estimation.
The 3D Human Body Model dataset established by the authors of [45] is a synthetic dataset that consists of 20,000 three-dimensional models of human bodies in static poses with an equal gender distribution. It is generated with the STAR parametric model [33]. While generating the models, two primary considerations were maintained: the natural Range of Motion (ROM) for each joint and the prevention of self-intersections in the 3D mesh. Existing research on the human ROM was referenced to define the limitations of joint rotations. Despite adhering to ROM constraints, certain non-idealities sometimes result in self-intersections in areas like the pelvic region, knees, and elbows. To address this, each vertex of the mesh is associated with a specific bone group. Self-intersections between non-adjacent bone groups are considered forbidden, and an algorithm flags such meshes as invalid.
The 3D Poses in the Wild Dataset (3DPW) [46] offers a unique perspective by capturing scenarios in challenging outdoor environments.This extensive dataset encompasses more than 51,000 frames featuring seven different actors donning 18 distinct clothing styles.The data collection process involves the use of a handheld smartphone camera to record the actions of one or two actors.Notably, 3DPW enhances its utility by providing highly accurate mesh ground truth annotations.These annotations are generated by fitting the SMPL model to the raw ground truth markers.
The Max Planck Institute for Informatics (MPII) dataset serves to evaluate the accuracy of articulated human pose estimation.This dataset comprises approximately 25,000 images, featuring annotations for over 40,000 individuals, including their body joints.These images were systematically compiled, capturing a wide array of everyday human activities.In total, the dataset encompasses 410 different human activities, with each image labeled according to the specific activity depicted.The images are extracted from YouTube videos.
In addition to the well-known datasets mentioned earlier, several others are employed in the reviewed 3D reconstruction methods. A comprehensive list of these datasets, including the ones previously mentioned, is presented in Table 1. These datasets are compared based on the availability of various types of data, such as RGB images, frame sequences, depth maps, multi-view perspectives, 2D poses, 3D poses, and 3D body meshes. The respective papers that utilize each dataset are also referenced. A dash indicates missing or unspecified types of data. In instances where RGB frame sequences are provided, it is automatically assumed that RGB images are also accessible.

Evaluation Metrics
Applying evaluation metrics is crucial for quantitatively assessing the reconstruction quality of generated models in the field of 3D human body reconstruction. Here, we briefly introduce some of the most commonly utilized metrics for this purpose. These metrics provide a standardized and objective means of evaluating the accuracy and fidelity of reconstructed 3D human body models, ensuring a reliable assessment of the reconstruction process.

• Mean Per-Joint Position Error (MPJPE)
The MPJPE [58] is a common metric that evaluates the performance of human pose estimation algorithms. It measures the mean distance in mm between the skeleton joints of the ground truth 3D pose and the joints of the estimated pose. The formulation is provided in Equation (12):

MPJPE = (1/N_S) Σ_{i=1}^{N_S} || m_{f,S}(i) − m^{(f)}_{gt,S}(i) ||_2,

where N_S corresponds to the total number of skeleton joints, m_{f,S}(i) is a function that returns the coordinates of the i-th joint of skeleton S in frame f, and m^{(f)}_{gt,S}(i) is a function that refers to the i-th joint of the skeleton in the ground truth frame. A commonly used modification of the MPJPE is the Procrustes-aligned MPJPE (PA-MPJPE), which is calculated in the same way except that the reconstructed model and the ground truth one are first aligned using the Procrustes algorithm.
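A minimal NumPy sketch of the MPJPE and its Procrustes-aligned variant may look as follows. The alignment here is a simplified similarity-Procrustes solve (rotation, translation, scale via SVD); evaluation code in the surveyed papers may differ in detail.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth 3D joints, both of shape (N_S, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align pred to gt with a
    similarity transform before measuring the joint error."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)   # SVD of the cross-covariance
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against reflections
        Vt[-1] *= -1
        s[-1] *= -1
        R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + mu_g, gt)
```

A prediction that differs from the ground truth only by a rigid transform therefore yields a PA-MPJPE of zero while its plain MPJPE can be large, which is why both numbers are usually reported together.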

• Mean Average Vertex Error (MAVE)
The MAVE [56] is used to find the average distance between the vertices of the reconstructed 3D human model and the vertices of the ground truth data. It is defined by Equation (13):

MAVE = (1/N) Σ_{i=1}^{N} || ϑ_i − θ̂_i ||_2,

where N is the total number of vertices of the 3D model, ϑ_i is a vertex from the predicted 3D human body model, and θ̂_i is a vertex from the corresponding ground truth data.

• Chamfer Distance
The symmetric point-to-point Chamfer distance measures the similarity between two point clouds, P and Q. A common formulation is given in Equation (14):

d_CD(P, Q) = (1/|P|) Σ_{x∈P} min_{y∈Q} ||x − y||_2 + (1/|Q|) Σ_{y∈Q} min_{x∈P} ||x − y||_2,

where x and y are points from P and Q, respectively, and |P| and |Q| are the total numbers of points in P and Q. The use of the min function refers to measuring the distance of a point from one point cloud to its nearest neighbor in the other point cloud. A modified version of the Chamfer distance, where the sum is replaced by a max function, is provided in Equation (15).

• Vertex-to-Surface Distance (VSD)
The VSD metric quantifies the average distance between the vertices of a point cloud and their corresponding points on the surface of a triangular mesh. These surface points can either be the vertices of the mesh or points that reside on its faces or edges. The authors of [94] incorporate this metric into their 3D model-fitting algorithm, employing a lifted optimization technique. Points on the surface are defined via surface coordinates, represented as u = {p, v, w}. Here, p ∈ N denotes the index of the triangle where the point is situated, and v ∈ [0, 1], w ∈ [0, 1 − v] are the coordinates within the unit triangle. Therefore, the 3D coordinates of a point on the surface can be defined as shown in Equation (16), where v_1, v_2, and v_3 are the vertices of the p-th triangle.
The distance between a point from a point cloud and its corresponding point on the surface of the mesh can be further calculated as described in Equation (17), where D is the number of points x_i in the point cloud and U = {u_i}_{i=1}^{D} are the surface coordinates of those points.
Within the confines of the algorithm described in [94], the mesh under consideration is parametric. This implies that its vertex coordinates are contingent on the parameter vector, θ. Given that the algorithm adopts lifted optimization, both θ and the surface coordinates, U, are optimized concurrently.
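Returning to the symmetric point-to-point Chamfer distance, a brute-force NumPy sketch is given below. This is adequate for small point clouds; practical evaluation code typically accelerates the nearest-neighbor search with a KD-tree.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric point-to-point Chamfer distance between point clouds
    P (|P|, 3) and Q (|Q|, 3): for every point, take the distance to
    its nearest neighbour in the other cloud, and sum the two
    directional averages."""
    # Pairwise Euclidean distance matrix of shape (|P|, |Q|)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because both directions are averaged, the metric penalizes a reconstruction that misses regions of the ground truth as well as one that adds spurious geometry.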

Taxonomy of Existing 3D Human Body Modelling and Reconstruction Techniques
In this section, we identify three distinct types of visual data utilized as input for 3D human body modeling and reconstruction: single or multiple images, video data, and depth map data. Correspondingly, Figure 6 presents a taxonomy of existing 3D human body modeling and reconstruction techniques based on the input data. In essence, 3D human body modeling is a task in computer vision and computer graphics aimed at generating a 3D photorealistic representation of the human body. This task often involves, but is not limited to, processes such as data acquisition; 2D pose estimation through the detection of 2D body joints; camera calibration and triangulation in the case of multiple views; 3D pose estimation to derive the pose of the future 3D model; model shape optimization when using parametric body models; surface reconstruction and texturing for obtaining a detailed 3D representation of the model's surface; and post-processing and refinement to increase the quality of the reconstructed 3D model. However, it is challenging to compose a concrete framework of operations that applies universally across different input data types and reconstruction approaches. Nevertheless, several key steps, which are visualized in Figure 7, are commonly accomplished in most of the examined methods, including 3D pose estimation, coarse 3D shape estimation, 3D shape refinement, and texture recovery. The blocks surrounded with dashed lines indicate that the utilization of a parametric body model and the application of 3D pose estimation are sometimes omitted by some of the examined algorithms, most of which are deep learning based. The figure also illustrates the different types of input data.


Image Data
In this subsection, we will explore significant works focused on 3D model reconstruction from 2D images. Initially, we will delve into 3D reconstruction techniques that leverage a single image as input, followed by methods based on multiple images.

Single Image
The authors of [95] focus on estimating both the shape and the pose of a person from a single image, with only a rough estimate of the height. They use a database of over 2400 subjects and utilize SCAPE to build their own parametric human body model. Further, the authors apply "Silhouettes, Edges, and Shading" to create an output 3D model that reflects both the pose and the shape of the human from the input 2D image. The authors conclude that shading cues may significantly improve the estimation of the human's exact measurements.
In [47], an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image is developed. The authors utilize the SMPL model to encode the mesh of a 3D human body. An accent is placed on the 3D body representation, the exploitation of iterative 3D regression with feedback, and a factorized adversarial prior. Many 2D (LSP, LSPe, MPII, and MS COCO) and 3D (MPI-INF-3DHP and Human3.6M) datasets are utilized. The authors report strong results, achieving some of the lowest MPJPE values.
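The core idea of encoding a body with SMPL parameters can be illustrated by the shape blend-shape step alone: the mesh is a mean template plus a linear combination of learned per-vertex displacement directions. The sketch below uses tiny random stand-in arrays (a real SMPL model has 6890 vertices and 10 shape coefficients, plus pose blend shapes and linear blend skinning, all omitted here):

```python
import numpy as np

# Tiny stand-in dimensions for illustration only; a real SMPL model has
# V = 6890 vertices and K = 10 shape coefficients.
V, K = 100, 10
rng = np.random.default_rng(0)
template = rng.standard_normal((V, 3))       # mean template shape (rest pose)
shape_dirs = rng.standard_normal((V, 3, K))  # learned shape blend-shape basis

def shaped_vertices(betas):
    """Shape blend-shape step: vertices = template + sum_k beta_k * S_k.
    Pose blend shapes and linear blend skinning are omitted."""
    return template + shape_dirs @ betas
```

Because the mapping is linear in the shape coefficients, regressing a small vector of betas (plus pose parameters) is enough to recover a full mesh, which is what makes SMPL-based regression networks tractable.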
BodyNet [60] is an end-to-end trainable neural network that generates a 3D body shape from a single RGB image. The input 2D images are taken from the SURREAL and Unite the People datasets. All of the selected images depict people in different clothing and are captured from different camera viewpoints. There are four main aspects in the proposed workflow: volumetric inference for 3D human shape, a multi-view re-projection loss on the silhouette, multi-task learning with intermediate supervision, and fitting a parametric body model. The SMPL model is fit to the output of BodyNet for evaluation purposes. The approach yields cutting-edge results, and there is a strong belief that it can serve as a trainable building block for upcoming techniques utilizing 3D body data. Surface error and voxel and silhouette Intersection over Union (IoU) are employed as evaluation metrics.
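Voxel and silhouette IoU both reduce to the same ratio of intersection to union over binary grids; a minimal numpy sketch (illustrative, not BodyNet's evaluation code) could look like:

```python
import numpy as np

def voxel_iou(pred, gt):
    """Intersection over Union between two binary voxel occupancy grids."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # two empty grids agree perfectly
    return np.logical_and(pred, gt).sum() / union

def silhouette_iou(pred_mask, gt_mask):
    """The same measure applied to 2D binary silhouette masks."""
    return voxel_iou(pred_mask, gt_mask)
```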
The authors of [48] recover detailed human body shape models that go beyond skinned parametric representations from a single image in a coarse-to-fine manner. They combine the robustness of parametric body models with the flexibility of free-form deformations by proposing a novel learning-based framework. Specifically, the SMPL model is utilized for obtaining an initial parametric mesh model whose surface is further refined by performing non-rigid 3D deformation on the mesh. A deep learning approach is applied at each stage of the proposed network. Initially, an SMPL mesh is estimated from the input image. Then, all other stages serve as refinement phases that predict the mesh deformation to finally produce a detailed human shape. The authors use three datasets to conduct their experiments: the WILD dataset, used for training and testing and constructed from five free pre-existing human datasets (Human3.6M, MS COCO, LSP, LSPe, MPII, MPI-INF-3DHP, and the Unite the People dataset); the RECON dataset, used for evaluation and constructed by the authors through traditional multi-view 3D reconstruction techniques; and the SYN dataset, also used for evaluation and constructed by the authors by rendering synthetic human mesh models from the PVHM dataset [96]. The authors report results that outperform other SMPL-based approaches on their three custom datasets. However, further improvements are needed to reduce errors in the depth direction.
In [61], a method for 3D human shape reconstruction from a polarization image is proposed. The method is based on a dedicated deep learning approach called Structure from Polarization. It consists of two main stages. The first stage estimates the surface normal from a single polarization image. The second stage estimates the human body shape and pose using the already available surface normal and the raw polarization image. Body shape refinement is also considered. For the body shape and pose estimation, the SMPL model is utilized. The authors use the synthetic SURREAL dataset, as well as one real-world dataset, the Polarization Human Shape and Pose Dataset (PHSPD). Empirical results are derived by using the Mean Angle Error (MAE) for the normal estimation evaluation and the MPJPE metric.
The authors of [49] introduce a method for recovering a complete 3D mesh of a human body from a single image. They develop a deep learning approach based on a generative adversarial network, which consists of a specially designed shape-pose-based generator and a multi-source discriminator. The SMPL model is an important part of the shape-pose-based generator that outputs the generated human body mesh. The training is performed on multiple different datasets, namely LSP, LSPe, MS COCO, MPI-INF-3DHP, Motion and Shape capture (MoSh), SURREAL, and Human3.6M. The proposed method is evaluated through pose and segmentation evaluation metrics. Specifically, for the pose evaluation, the MPJPE is utilized.
In [50], the issues related to the absence of high-resolution images for the task of 3D human model reconstruction are addressed by developing the RSC-Net (resolution-aware network with a self-supervision loss and a contrastive learning scheme), a deep resolution-aware network that is able to handle images with arbitrary resolution. An accent is placed on the 3D human pose and shape representation, as SMPL is utilized. The authors also present a temporal recurrent module that is able to extend the single-image model to low-resolution videos. The model is trained using multiple datasets, such as the Human3.6M, MPI-INF-3DHP, LSP, LSPe, MPII, and MS COCO datasets. The evaluation results obtained using the MPJPE and PA-MPJPE demonstrate better performance than competing algorithms.
The work of [51] proposes a pose grammar for recovering a 3D human body model in a natural way. The method takes an estimated 2D pose as an input and learns a generalized 2D-3D mapping function for 3D pose estimation. The proposed deep grammar network consists of two important components: a base 3D pose network that encodes appearance and geometry features from the input image and the detected 2D pose, and a 3D pose grammar network, based on a bi-directional recurrent neural network, that encodes human body dependencies and relations. The authors use the Human3.6M, HumanEva [42], and HHOI databases. In order to generate additional training samples, they also utilize a novel Pose Sample Simulator. The results are given as a comparison between the estimated pose and the ground truth in mm through the Average Euclidean Distance.
In [72], a novel 3D object representation named the Fourier Occupancy Field (FOF), specifically aimed at enhancing the efficiency and accuracy of monocular real-time 3D human reconstruction, is introduced. Specifically, the FOF represents a 3D object as a 2D field, expanding the occupancy field along the z-axis into a Fourier series and retaining only the initial few terms. It demonstrates the capability to represent high-quality 3D human geometries using a 2D map aligned with the image, bridging the gap between 2D image data and 3D geometries. Experimental validation was conducted using both publicly available datasets (THuman2.0 and Twindom) and real-world captured data. The VSD and Chamfer distance are used for evaluating the results.
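The core idea of the FOF, expanding the per-pixel occupancy along z into a truncated Fourier series, can be sketched in 1D with numpy. The resolution, number of terms, and slab geometry below are arbitrary illustrative choices, not the authors' settings:

```python
import numpy as np

def fof_encode(occ, n_terms=16):
    """Truncated Fourier series of a 1D occupancy function along z in [0, 1)."""
    z = np.linspace(0.0, 1.0, len(occ), endpoint=False)
    a0 = occ.mean()
    a = np.array([2 * (occ * np.cos(2 * np.pi * k * z)).mean()
                  for k in range(1, n_terms)])
    b = np.array([2 * (occ * np.sin(2 * np.pi * k * z)).mean()
                  for k in range(1, n_terms)])
    return a0, a, b          # 2*n_terms - 1 coefficients per ray

def fof_decode(a0, a, b, n_z):
    """Reconstruct the occupancy function from its Fourier coefficients."""
    z = np.linspace(0.0, 1.0, n_z, endpoint=False)
    occ = np.full(n_z, a0)
    for k in range(1, len(a) + 1):
        occ += (a[k - 1] * np.cos(2 * np.pi * k * z)
                + b[k - 1] * np.sin(2 * np.pi * k * z))
    return occ

# A solid slab occupied between z = 0.3 and z = 0.6 along one camera ray.
z = np.linspace(0.0, 1.0, 256, endpoint=False)
occ = ((z >= 0.3) & (z < 0.6)).astype(float)
rec = fof_decode(*fof_encode(occ), 256)
```

Thresholding the reconstruction at 0.5 recovers the slab up to Gibbs ringing at the edges; storing only the coefficient maps is what lets the FOF represent 3D geometry as a pixel-aligned 2D field.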
The authors of [52] introduce POCO (pose and shape estimation with confidence), a novel framework aimed at addressing the challenge of 3D human pose and shape estimation from 2D images while also providing a measure of uncertainty in its estimations. The model infers both the body parameters, specifically leveraging SMPL parameters, and the accompanying regression uncertainty in a single feed-forward network pass. POCO is based on a dual conditioning strategy that includes an image-conditioned base density function and a pose-conditioned scale. The model is trained on the MS COCO, Human3.6M, MPI-INF-3DHP, MPII, and LSPe datasets. An evaluation is conducted on the 3DPW, 3D Occlusion Human (3DOH), and 3DPW-OCC datasets. The MPJPE, PA-MPJPE, and per-vertex error (PVE) are used as evaluation metrics.
In [75], a methodology for estimating whole-body human parameters from a single image, addressing the challenges of monocular human body estimation in wild conditions, is presented. The method, called "KBody", employs a predict-and-optimize approach that seeks to balance three traits, pose, shape, and pixel alignment, while also effectively managing partial images. KBody's methodology aims to improve fitting quality via the introduction of virtual joints, which are tailored to fit estimated data and facilitate a harmonious interaction with silhouette constraints. Further, to manage images with missing information, the method utilizes an appearance-prior approach, completing them in a structurally plausible way. A variant of SMPL, SMPL-X [74], is utilized. Performance is assessed via the Procrustes-aligned vertex-to-vertex error (PA-V2V), the scale-corrected per-vertex Euclidean error in a neutral pose (PVE-T-SC) [76], and the IoU. The Expressive Hands and Faces (EHF) and Sports Shape and Pose 3D (SSP3D) datasets are used for conducting the experiments.
Table 2 summarizes the examined papers for 3D human body modeling and reconstruction based on single-image input data and compares them by assets, constraints, the utilization of parametric models, datasets, and evaluation metrics.

Multiple Images
In [97], a CNN-based approach for accurate 3D human body reconstruction from silhouettes is proposed. The authors contribute with the creation of extensive, realistic synthetic data at a larger scale; the adoption of a multi-task learning strategy for the prediction of multiple outputs, including shape, 3D joint positions, pose angles, and body volume; and the introduction of a novel network architecture that incorporates known body measurements (e.g., height) and per-pixel segmentation confidence as additional inputs. The SMPL parametric model is utilized. The authors conduct their experiments on the CAESAR dataset [98] and assess the achieved results by leveraging the mean distance as an evaluation metric.
An approach using multiple images and angles for 3D human body modelling is developed in [99], where a method for automatic 3D character reconstruction from frontal and lateral monocular 2D RGB views is proposed. The template mesh of the SMPL model is used in the first stage for obtaining a body model from the frontal view. Then, this modified SMPL model is inputted into a second stage, where it is further refined using the lateral view. The method focuses on two main aspects: the shape and the texture of the model. The authors' custom dataset consists of front-view and side-view photos of people.
In the work presented in [78], the authors suggest a method for reconstructing 3D human body models from multiple images. This approach involves learning an implicit function for representing 3D shapes, relying on multi-scale features derived from multi-stage end-to-end neural networks. Since the approach excludes the use of a geometric prior derived from parametric human body models, it is considered model-free. The experiments are conducted over two datasets, the Articulated dataset and the Clothed Auto Person Encoding (CAPE) dataset [79]. A quantitative evaluation is performed on both datasets using the VSD, Chamfer distance, and IoU.
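Since the Chamfer distance recurs as an evaluation metric throughout this survey, a minimal brute-force numpy version is sketched below (exact definitions vary between papers; some average squared distances or report the two directions separately):

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (N, 3) and B (M, 3):
    the mean nearest-neighbour distance from A to B plus that from B to A.
    (Brute force; practical evaluations use a KD-tree for large point clouds.)"""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```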
The authors of [53] introduce a multi-view human body mesh translator (MMT) model for 3D human body mesh estimation. Specifically, it is a non-parametric deep-learning-based model that leverages a vision transformer. It performs feature-level fusion, which combines multi-view features to generate contextualized embeddings for the purpose of decoding the output mesh representation. Consequently, the MMT takes multiple images as input and fuses their features in both the encoding and the decoding stages. As a result, a representation embedded with global information for the human body model is composed. Experiments are conducted on the Human3.6M and Human Multiview Behavioral Imaging (HUMBI) datasets, and the model performance is assessed by utilizing the MPJPE, PA-MPJPE, and MPVE evaluation metrics.
In [66], a novel meta-optimization technique is introduced that is specifically designed to navigate scenarios wherein accurate initial guesses (e.g., certain poses and shapes at specific camera angles) are not available for rendering and reconstructing 3D human figures. The covariance matrix adaptation annealing method is utilized, allowing for the easy incorporation of domain knowledge of hierarchical human anatomy. The SMPL model is used. The authors conducted their experiments on the People Snapshot Dataset [62] and Human3.6M. Further, reprojection errors and the MPJPE are employed for the result evaluation.
In [81], the authors focus on 3D clothed human body reconstruction based on multiple views and poses. They benefit from the geometry prior provided by the SMPL-X model in order to learn the latent codes of a posed mesh by taking multiple images as input. The WCPA dataset is used for training and testing, and a quantitative evaluation is performed by calculating the Chamfer distance of different strategies on the test dataset.
Table 3 summarizes the examined papers for 3D human body modeling and reconstruction based on multiple-image input data and compares them by assets, constraints, the utilization of parametric models, datasets, and evaluation metrics.

Video Data
3D human body reconstruction from video data involves generating a 3D model of a person's body by analyzing a video sequence. This technique leverages multiple frames of a person's movements to create an accurate and detailed representation of their body shape and posture.
The authors of [54] implement a fully convolutional model for 3D pose estimation from sequences of 2D key points. Exploiting temporal convolutions is very important for modelling temporal dependencies within the sequence. The authors employ dilated convolutions that facilitate the modelling of long-term temporal dependencies while maintaining efficiency. Further, a semi-supervised training approach is introduced for settings where labeled 3D ground-truth pose data are not available. Two video datasets are utilized for training the model: Human3.6M and HumanEva-I. The evaluation of the proposed method is performed mainly by computing the MPJPE and its variants.
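The efficiency of dilated temporal convolutions comes from spacing the kernel taps apart, so that stacked layers cover a long window of frames without extra parameters per layer. A minimal numpy sketch (toy data and kernels, not the authors' architecture):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid-mode dilated 1D convolution (cross-correlation, as in deep
    learning frameworks): the kernel taps are spaced `dilation` samples apart."""
    k = len(w)
    span = (k - 1) * dilation + 1          # temporal extent of the kernel
    return np.array([sum(w[j] * x[i + j * dilation] for j in range(k))
                     for i in range(len(x) - span + 1)])

# Stacking kernel-size-3 layers with dilations 1, 3, 9 yields a receptive
# field of 1 + 2*(1 + 3 + 9) = 27 frames at constant per-layer cost.
x = np.arange(100, dtype=float)            # a toy 1D signal over 100 frames
y = x
for d in (1, 3, 9):
    y = dilated_conv1d(y, np.ones(3) / 3.0, d)
```

After the three layers, each output sample depends on 27 input frames, which is how such models capture long-term motion context from 2D key-point sequences.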
In [62], a method for reconstructing a textured 3D human body model from a single monocular video of a moving person is introduced. The goal of the authors is to generate a personalized human avatar of the captured subject that correctly reflects their body shape, hair, and clothes. Building a texture map and an underlying skeleton rigged to the surface is also considered. The method combines three important steps: pose reconstruction using the SMPL model, consensus shape estimation via transforming the collection of dynamic body poses into a common reference frame, and frame refinement and texture map generation. The experiments are performed on the DFAUST and BUFF datasets. Since these datasets consist of 3D scans of moving people, a virtual camera rendering 2D video sequences is implemented. The VSD is used as a quantitative evaluation metric.
The approach for 3D human modelling that the authors of [55] propose combines the accuracy of optimization-based methods with the speed of deep regression methods. They introduce SPIN: SMPL optimization in the loop. This approach utilizes a deep neural network to initialize an iterative optimization routine for fitting the SMPL parametric model to 2D joints within the training loop. Already-fitted estimates of the model are subsequently used to supervise the network. The experiments are conducted on multiple datasets, such as Human3.6M, MPI-INF-3DHP, LSP, LSPe, 3DPW, MPII, and MS COCO. The mean reconstruction error, MPJPE, Area Under the Curve (AUC), and Percentage of Correct Keypoints (PCK) evaluation metrics are exploited.
MotioNet [56] is a deep neural network that performs 3D human motion reconstruction from a monocular video. The authors claim that this method is the first data-driven approach that directly outputs a kinematic skeleton, which can be used for motion representation. The motion datasets that are used in this work are the Carnegie Mellon University (CMU) dataset (containing 2605 captured elementary actions and dance moves performed by 144 subjects), Human3.6M, and HumanEva. The results are organized by the specific motion, and the current approach reports one of the lowest MPJPEs.
In [57], a 3D uplifting model for the purpose of virtually trying on clothes in real time is introduced. This functionality is applicable to e-commerce and other fashion-related purposes. To keep the system universal, the authors developed it to be compatible with conventional devices like smartphones and tablets. This implies that their approach is limited to monocular cameras or RGB video streams only. The framework consists of the following steps: skeleton reconstruction and pose estimation, human body recovery and adjustment to the estimated pose, garment mapping and reshaping, the projection of the result onto a real-time image, and pose refinement and model alignment for proper body overlay. The datasets used during the implementation are MS COCO and Human3.6M. The achieved quality and speed of the results depend on the visual characteristics of the input data, such as the image contrast and the color palette.
Implementing methods for 3D human body model reconstruction from low-resolution video is valuable. The authors of [50] upgrade their method for reconstructing models from a single image of arbitrary resolution to reconstruction from video, again of arbitrary resolution. The RSC-Net, used for single-image inputs, is extended by incorporating a temporal post-processing step in order to handle video inputs. The 3DPW, Human3.6M, InstaVariety [100], and MPI-INF-3DHP datasets of video sequences are utilized for conducting the experiments. The quantitative evaluation is performed using the MPJPE, PA-MPJPE, and acceleration error.
The authors of [85] propose a methodology for modelling animatable human avatars with dynamic garments. Their method is implemented by applying a Neural Radiance Field (NeRF)-based representation and managing cloth deformations at multiple hierarchical levels, all while utilizing a conditional variational auto-encoder to discern node-related variables for facilitating realistic and dynamic animation. The model is trained end-to-end using only RGB videos. The SMPL model is utilized. The datasets that are used are Dynacap, DeepCap, and ZJU-MoCap. The Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are used for the quantitative evaluation.
In [67], a method for human reconstruction and synthesis from monocular RGB videos is explored, a task that is challenging due to issues like clothing texture, occlusions, and pose changes. The authors counter the common use of NeRFs and implicit methods, which are often chosen for their ability to represent clothed humans. Their approach is based on the optimization of an SMPL+D mesh and the utilization of a multi-resolution texture representation, using only RGB images, binary silhouettes, and sparse 2D key points as inputs. The method demonstrates an enhanced capability in capturing geometric details compared to traditional visual hull mesh-based methods. It also shows notable improvements and speedups in novel pose synthesis compared to NeRF-based methods, without the latter's typical, unwanted artifacts. Experiments are conducted on the ZJU-MoCap, People-Snapshot, and SelfRecon datasets. For the geometry reconstruction evaluation, the Chamfer distance and VSD metric are utilized.
The authors of [59] employ PoseBERT, a transformer-based module for temporal 3D human modelling using monocular RGB videos. The SMPL parametric model is utilized. The AMASS dataset [101] is used for training the model. For evaluation purposes, the 3DPW, MPI-INF-3DHP, Multiperson Pose Test Set in 3D (MuPoTS-3D), and Advanced Industrial Science and Technology (AIST) datasets are employed. The MPJPE, PA-MPJPE, and MPVE are the main metrics used for assessing the achieved results.
Table 4 summarizes the examined papers for 3D human body modeling and reconstruction based on video input data and compares them by assets, constraints, the utilization of parametric models, datasets, and evaluation metrics.

Depth Map Data
For the purposes of 3D human body reconstruction, some approaches [58,[63][64][65]92] exploit depth maps that are generated by specific systems. These kinds of systems use structured light or the Time-of-Flight principle to measure the depth of an object, i.e., how far the object is from the system. Released in 2010 for gaming purposes, Microsoft Kinect v1 has led a large number of academics to explore its possibilities beyond the video gaming experience. It uses a structured light method, in which a known speckle pattern is projected onto the scene. In contrast, Microsoft Kinect v2 utilizes the Time-of-Flight method, in which the entire scene is flooded with light, and the depth is determined by the time it takes each photon to return to the sensor [102].
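The Time-of-Flight principle described above reduces to a one-line relation: the measured round-trip time t of the emitted light gives the depth d = c·t/2. A minimal sketch:

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def tof_depth(round_trip_time_s):
    """Depth from the Time-of-Flight principle: the light travels to the
    object and back, so depth = c * t / 2."""
    return C * round_trip_time_s / 2.0

# A photon returning after 20 ns corresponds to a depth of roughly 3 m.
d = tof_depth(20e-9)
```

The nanosecond-scale timing this requires is why ToF sensors rely on modulated illumination and phase measurement in practice rather than timing individual photons directly.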
In [63], a deep-learning-based approach for human body reconstruction from a single RGB image in a calibration-free context is proposed. The novelty of the method lies in the way the system is trained, which benefits from a multi-view analysis of depth images.

Discussion
Several significant conclusions can be drawn from a comprehensive analysis of the reviewed papers. Based on the nature of the input data, i.e., single or multiple images, video data, or depth maps, we can categorize three primary approaches to 3D human body reconstruction, which form the foundation of our taxonomy. While there may be variations within these approaches, and even among different methods within a specific approach, they share some essential elements in the processing context. These common features include data acquisition and the necessity of employing datasets, 3D pose estimation, 3D shape estimation, and possibly texture recovery, regardless of whether a parametric body model serves as a geometric prior. The evaluation of the proposed methods and the resulting outcomes are another common aspect among the examined works.
We can also categorize the presented methods based on the resulting 3D-type representation of the human body model. While the majority of works focus on reconstructing a 3D body mesh, there are alternative representations that rely on implicit neural representations [81,85]. Additionally, the authors of [72] introduce a unique 3D geometry representation termed the Fourier Occupancy Field.
Further, our investigation revealed the existence of around 35 different datasets currently utilized for the task of 3D human body reconstruction. Notably, some well-known datasets, such as Human3.6M, MPI-INF-3DHP, MS COCO, LSP, LSPe, 3DPW, MPII, and SURREAL, are frequently employed in the reviewed papers. Additionally, many of the methods make use of multiple datasets to ensure more generalized results. However, some works advocate for the expansion of data usage to enhance method performance [51,61,78,97,99].
Regarding the use of quantitative evaluation metrics, it was observed that the most commonly used metrics for pose estimation evaluations are the MPJPE and PA-MPJPE. Additionally, the VSD and Chamfer distance are frequently exploited for comparing a reconstructed 3D human model with the ground truth.
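For reference, the two dominant pose metrics can be sketched in a few lines of numpy: the MPJPE averages per-joint Euclidean errors, while the PA-MPJPE first removes the global rotation, scale, and translation via a Procrustes (similarity) alignment of the prediction to the ground truth:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth 3D joints, arrays of shape (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: fits a similarity transform
    (rotation, uniform scale, translation) of pred onto gt before measuring."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)               # 3x3 cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # reflection correction
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                              # optimal rotation
    s = (S * [1.0, 1.0, d]).sum() / (P ** 2).sum()  # optimal uniform scale
    aligned = s * P @ R.T + mu_g
    return mpjpe(aligned, gt)
```

By construction, a prediction that differs from the ground truth only by a similarity transform scores a PA-MPJPE of (numerically) zero while its raw MPJPE can be large, which is why papers report both.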
Regarding the quality of the avatar's appearance, recovering color, texture, hair, and garments is essential. However, only a few works attempt to recover these aspects. Specifically, there are limited works addressing the recovery of hair [62], color and texture [50,63,67], and clothes [62,81]. Conversely, some works acknowledge these challenges and declare the impossibility of recovering hair [47,49,50,57,60,95] and clothing [57]. Nevertheless, the challenge of accurately capturing real individuals' interactions remains largely unresolved, as the majority of works primarily focus on achieving realistic avatar appearances.
Reconstructing a complete body model also involves addressing issues related to missing information and self-occlusions. A subset of papers [56,64,75,92,99] confront these challenges, achieving promising results. Another challenge arises from the limitations imposed by lighting conditions [95,99], which can hinder the reconstruction process. Consequently, certain works either do not succeed in reconstructing the full body shape [64] or do not prioritize body shape recovery at all [51,52,54,56,65]. However, their contributions to body pose estimation are included in this review, as they represent a crucial step in the 3D reconstruction process.
Finally, the real-time implementation of algorithms for human body reconstruction presents a notable challenge among the reviewed papers. Only two of the examined methods declare real-time implementation [49,57]. Conversely, other works report extended execution times [53,55,66] and cite a substantial processing overhead [47,50,58]. This high computational demand poses a significant obstacle, especially in applications like HTC. An approach such as the one in [103] may be applicable, but it is not included in this review because it is based on a conceptual idea that is still not implemented in practice.

Conclusions
In this paper, we conducted a comprehensive review of existing methodologies for 3D human body recovery. Our approach to establishing a taxonomy for 3D reconstruction techniques was primarily based on the nature of the input data. Specifically, we categorized 3D human body reconstruction into three distinct categories: those reliant on single or multiple image data, video data, and depth map data. Each of the methods that we examined was thoroughly assessed in the context of the datasets employed, the utilization of geometric priors through parametric body models, and the evaluation metrics applied. Additionally, we provided insights into the strengths and limitations associated with each approach. Subsequently, we performed an in-depth analysis of the reviewed methods.
In conclusion, while considerable progress has been made in the field of 3D human body recovery and reconstruction, it is yet to be fully optimized for applications like HTC. Achieving realism in avatars must extend beyond merely replicating the appearance of real individuals. There is a need for further research and development to enable avatars to effectively convey authentic emotions and interactions, all of which must occur in real time.

Figure 3. Block diagram of the SCAPE body model generation process.

Figure 6. Taxonomy of 3D human body modelling and reconstruction techniques based on input data.
Figure 7.

Figure 4. Block diagram of the SMPL body model generation process (mean template shape, shape blend shape function, joint location prediction function, pose blend shape function, shape and pose corrections, user parameters).

Table 1 .
Datasets and reviewed papers that have referred to them.

Table 2 .
Comparison of 3D modelling approaches based on single-image input.

Table 3 .
Comparison of 3D modelling approaches based on multiple-image input.

Table 4 .
Comparison of 3D modelling approaches based on video input.

Table 5 .
Comparison of 3D modelling approaches based on depth map data.