Article

UBSP-Net: Underclothing Body Shape Perception Network for Parametric 3D Human Reconstruction

1 College of Mechanical and Automotive Engineering, Ningbo University of Technology, Ningbo 315336, China
2 Fenghua Research Institute, Ningbo University of Technology, Ningbo 315211, China
3 Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, Shanghai University, Shanghai 201900, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(17), 3522; https://doi.org/10.3390/electronics14173522
Submission received: 1 August 2025 / Revised: 27 August 2025 / Accepted: 2 September 2025 / Published: 3 September 2025

Abstract

This paper introduces a novel Underclothing Body Shape Perception Network (UBSP-Net) for reconstructing parametric 3D human models from clothed full-body 3D scans, addressing the challenge of estimating body shape and pose beneath clothing. Our approach simultaneously predicts both the internal body point cloud and a reference point cloud for the SMPL model, with point-to-point correspondence, leveraging the external scan as an initial approximation to enhance the model’s stability and computational efficiency. By learning point offsets and incorporating body part label probabilities, the network achieves accurate internal body shape inference, enabling reliable Skinned Multi-Person Linear (SMPL) human body model registration. Furthermore, we optimize the SMPL+D human model parameters to reconstruct the clothed human model, accommodating common clothing types, such as T-shirts, shirts, and pants. Evaluated on the CAPE dataset, our method outperforms mainstream approaches, achieving significantly lower Chamfer distance errors and faster inference times. The proposed automated pipeline ensures accurate and efficient reconstruction, even with sparse or incomplete scans, and demonstrates robustness on real-world THuman2.0 dataset scans. This work advances parametric human modeling by providing a scalable and privacy-preserving solution for applications in 3D shape analysis, virtual try-ons, and animation.

1. Introduction

With the development of non-contact 3D scanning technologies such as LiDAR and structured light, it has become possible to capture 3D point cloud data for objects with unprecedented accuracy [1]. This provides a foundation for extracting fine models from point clouds, playing a crucial role in tasks such as 3D shape analysis [2], 3D content generation [3], and mesh registration [4]. Although direct scanning of naked bodies can accurately obtain human geometric shapes, this process is inconvenient for most people and also violates their privacy [5]. The same issue arises when scanning with tight-fitting clothing. Therefore, it is highly necessary to predict the shape and pose of a body under clothing during body scanning. However, this task is extremely challenging. Firstly, the complex non-rigid deformation of clothing affects the pose and shape space of the subject, which brings uncertainty into the estimation of body shape under clothing [6]. In addition, the method of data collection and the design of the inference model for body shape under clothing also have a significant impact on the final reconstruction accuracy.
Based on deep learning methods, predicting the shape of the body under clothing and registering it using a parametric model have become the mainstream approach to parametrically reconstructing the body under clothing [7]. Among them, methods based on learning implicit functions [8,9] exhibit significant advantages in processing sparse or incomplete body scan point clouds, as they can utilize prior knowledge to fill in missing information and predict the shape of the body under clothing. However, to generate high-precision human models, these methods must sample 3D points at a high resolution and rely on the Marching Cubes [10] algorithm to extract meshes or surfaces, which incurs a significant memory overhead. In addition, directly extracting features from scan point clouds to learn and predict the global latent parameters of a parametric body model (such as an SMPL model [11]) is challenging [12], and the process of initializing the registration to a standard model based on this also lacks reliability.
Fitting a parametric model to the predicted internal body shape typically relies on marker points to achieve high-precision fitting [13,14]. However, it is difficult to obtain marker points. In addition, when the number of marker points is limited, the fitting process easily falls into the locally optimal solution when the pose, shape, and clothing change significantly [15]. Some methods [8,9] based on implicit representation have attempted to improve the fitting accuracy by predicting the correspondence between body parts and the parametric model. However, this correspondence often lacks clarity and has difficulty accurately handling joint rotation problems, especially when the predicted internal body shape is relatively coarse. In contrast, learning to predict the point-to-point correspondence between the internal body and the parametric model can more effectively enhance the fitting accuracy and robustness [7].
This paper focuses on reconstructing, from a full-body human point cloud acquired with a general 3D body scanning device, both a parametric 3D model of the body and a parametric 3D model containing clothing information, so that the models can subsequently be controlled and varied freely. Specifically, we propose the Underclothing Body Shape Perception Network (UBSP-Net), a body shape perception network for 3D parametric reconstruction of the body which can simultaneously predict the internal body point cloud and a parametric model reference point cloud with point-to-point correspondence. We treat the internal body point cloud as an offset of the external scan sampling points: the external scan point cloud serves as an initial value close to the internal body’s shape, and learning this point offset relationship is more effective and stable. The key insight for SMPL model registration is that there is a point-to-point correspondence between the predicted parametric model reference point cloud and the internal body point cloud. Therefore, by designing a reasonable optimization function, the registration process can be made more efficient and reliable. To process external scan point clouds, this paper first fits the SMPL model to the internal body point cloud and then optimizes the parameters of the human displacement model (SMPL+D [11]) to reconstruct a parametric 3D human body model, including clothing details. It should be noted that wearing overly loose or atypical clothing may exceed the distribution range of the dataset and introduce significant uncertainty, so the scope of this study mainly focuses on common clothing types, including short T-shirts, long T-shirts, long gowns, long-sleeved shirts, polo shirts, suit jackets, shorts, and pants.
The rest of this paper is organized as follows. Section 2 reviews related work on body shape and pose estimation under clothing. Section 3 introduces the basic mathematical formulations of the SMPL model and its variants. Section 4 presents the architecture of the proposed UBSP-Net and its key methodologies. Section 5 discusses the dataset generation process. Section 6 provides the experimental results and performance comparisons. Finally, Section 7 concludes this paper and outlines future directions. The main contributions of this paper can be summarized as follows:
  • We propose UBSP-Net, a novel deep learning network that simultaneously infers the internal body point cloud and a VSMPL reference point cloud with point-to-point correspondence from a clothed 3D scan, enabling robust and efficient parametric model registration. By modeling the internal body point cloud as coordinate offsets from the external scan and leveraging body part label probabilities, UBSP-Net achieves a superior accuracy in capturing detailed body shapes, even with sparse or incomplete scans, outperforming existing methods like IP-Net and PTF in terms of the Chamfer distance errors and inference speed.
  • We introduce an automated pipeline that seamlessly integrates internal body shape inference with SMPL and SMPL+D model registration, providing a scalable and privacy-preserving solution for parametric 3D human reconstruction. This pipeline optimizes the SMPL parameters for the internal body and extends to SMPL+D for clothed models, incorporating a local shield term to prevent irrational deformations in non-clothed regions, thus ensuring high-fidelity reconstruction across various common clothing types and enabling applications to 3D shape analysis, virtual try-ons, and animation.

2. Related Work

2.1. Body Shape and Pose Under Clothing

Two-dimensional image-based reconstruction methods for the pose and shape of the naked body are very popular [16]. These methods typically regress [17] or fit [18,19] a parametric body model onto the joints and contours of an image of a clothed human. For example, Wehrbein et al. [20] propose a probabilistic approach based on normalizing flows, which supervises the estimation of 3D human meshes by utilizing the distribution of heatmaps generated by a 2D pose detector. Chen et al. [21] propose an improved method for 3D human pose and shape estimation which enhances the limb reconstruction accuracy and pose estimation by introducing a left–right limb appearance consistency module, full perspective projection, and global position supervision. However, due to the ambiguity inherent in the 2D-to-3D conversion process, these methods are unable to accurately predict the body shape and pose of the human body in space. Moreover, these methods heavily rely on accurate 2D joint detection, which can be problematic for complex poses or occlusions, leading to a significant loss of accuracy in body shape prediction. In comparison, it is more reliable to infer the shape of the body under clothing from a 3D body scan point cloud. Yu et al. [22] combine volumetric dynamic reconstruction and a parametric body template to simultaneously reconstruct the detailed geometry, non-rigid motion, and internal body shape from a single depth camera. Zhang et al. [23] utilize 4D data from multi-view images and visual shell sequences to estimate body shape while wearing clothes. Lazova et al. [24] non-rigidly align SMPL models with 3D scans through multi-view rendering and minimization of the reprojection errors at the joint points. IP-Net [8] and PTF [9] utilize a hierarchical representation learning approach to predicting the internal body’s surface and external garments’ surfaces from a point cloud of a clothed human body and then reconstruct the internal body and garment human models through registration. Body PointNet [6] directly outputs dense mesh vertices with the same encoding as the SMPL model via a Multilayer Perceptron (MLP), which utilizes the mesh encoding of the SMPL model to generate mesh faces. However, these methods still struggle to handle large-scale clothing deformations and may not generalize well to different clothing types.

2.2. Human Body Model Registration

Classical body registration techniques often employ a variation of the Iterative Closest Point (ICP) algorithm [13,25], which aligns the template mesh by either deforming it or optimizing the parameters of a parametric model, frequently incorporating additional data like color patterns [26] or markers [13,27]. While these optimization-driven approaches can yield precise results for complex poses if initialized correctly, they are vulnerable to local minima and lack full automation without proper initialization. Additionally, these methods require manual input or pre-computed markers, making them unsuitable for fully automated, real-time applications. PTF [9] applies the least squares approach to matching rigid-body transformations and determining the relative rotations of the joints, which is used to initialize the pose parameters of the SMPL model. Some techniques, such as [4,14], use pre-computed 3D joint coordinates of the model as the starting values. In these methods, the input model is rendered from different viewpoints, and 2D joint locations are detected via OpenPose [28]. These 2D joint positions are then converted into 3D by minimizing the 3D reprojection loss relative to the 2D coordinates. Initially, the pose of the SMPL model is optimized to fit the input model, followed by subsequent optimization of the shape, pose, and global translation. However, this approach generally requires a mesh model as the input, making it inefficient for point cloud inputs and unsuitable when the point cloud is sparse. Moreover, when the point cloud is noisy or incomplete, the optimization can lead to incorrect registration, further decreasing the accuracy. To improve the registration efficiency, some methods attempt to directly regress the parameterized human body model parameters from point clouds, such as ArtEq [12], which trains a part detection network by utilizing local SO(3) invariance and performs shape and pose regression using articulated SE(3) shape-invariant and pose-equivariant networks. FPCR-Net [29] uses back point cloud prediction, SO(3) equivariant feature extraction, and self-attention-based soft aggregation to directly regress SMPL pose and shape parameters from a single front human body point cloud. However, these methods still face challenges in terms of scalability and generalization across different body types and clothing styles. They typically require large datasets and substantial computational resources for training, and exhibit limited robustness to noise.

3. Fundamentals

SMPL [11] employs two low-dimensional parameter spaces to represent the shape and pose of the human body. The body shape parameter, $\beta \in \mathbb{R}^{10}$, is derived through a principal component analysis (PCA) of a dataset of human bodies that share the same pose but vary in shape. The pose parameter, $\theta \in \mathbb{R}^{72}$, corresponds to the parametric axial angles for global rotation and 23 joint rotations. Starting from a mean body shape template $P_T \in \mathbb{R}^{3 \times 6890}$, SMPL translates the body shape parameter $\beta$ and the pose parameter $\theta$ into vertex displacements, which are then applied to $P_T$. First, the body shape blend function $B_S(\beta): \mathbb{R}^{|\beta|} \rightarrow \mathbb{R}^{3 \times 6890}$ computes vertex displacements that represent various body shapes. Then, the pose blend function $B_P(\theta): \mathbb{R}^{|\theta|} \rightarrow \mathbb{R}^{3 \times 6890}$ generates vertex displacements reflecting the deformation of the soft tissues during motion. These two displacement parts are superimposed onto $P_T$ to construct the resting body mesh $T_P(\beta, \theta) = P_T + B_S(\beta) + B_P(\theta)$, which incorporates both the body’s shape and pose. The joint positions $J(\beta): \mathbb{R}^{|\beta|} \rightarrow \mathbb{R}^{72}$ at rest are regressed from $\beta$. Then, $\theta$ is used to create the rigid-body transformations $B_b = \{B_1, \ldots, B_{24}\}$, which are applied to $T_P(\beta, \theta)$ using the standard linear blend skinning (LBS) function $W(\cdot)$ with the skinning weights $\mathcal{W} \in \mathbb{R}^{6890 \times 24}$ to generate a dynamic human body mesh. Finally, a global translation $t \in \mathbb{R}^3$ is added to finalize the mesh model. In conclusion, the SMPL model defines a function $M(\beta, \theta, t)$ that takes $\beta$, $\theta$, and $t$ as the inputs and outputs a human body mesh.
$$M(\beta, \theta, t) = W\big(P_T + B_S(\beta) + B_P(\theta);\ \mathcal{W}, B_b\big) + t \quad (1)$$
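For readers who wish to experiment with Equation (1), the publicly available smplx package provides a reference implementation of the SMPL function. The following minimal sketch is not part of the proposed method; the model path and the neutral parameter values are illustrative, and the SMPL model files must be obtained separately.

```python
# Minimal sketch: evaluating Equation (1) with the public smplx package.
# "models/" is an illustrative path to the downloaded SMPL model files.
import torch
import smplx

model = smplx.create("models/", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # body shape parameters (beta)
body_pose = torch.zeros(1, 69)     # 23 joint axis-angle rotations (part of theta)
global_orient = torch.zeros(1, 3)  # global rotation (part of theta)
transl = torch.zeros(1, 3)         # global translation t

output = model(betas=betas, body_pose=body_pose,
               global_orient=global_orient, transl=transl,
               return_verts=True)
vertices = output.vertices         # [1, 6890, 3] posed mesh vertices
faces = model.faces                # fixed SMPL mesh topology
```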
To implement SMPL registration, an alternative construction of SMPL is also utilized. In [15], the SMPL function is extended from the surface to the 3D volume, referred to here as VSMPL for clarity and to distinguish it from traditional SMPL. VSMPL propagates the original functions $B_S(\beta)$ and $B_P(\theta)$ and the skinning weights $\mathcal{W}$ into the 3D space, resulting in the functions $g_{B_S}$, $g_{B_P}$, and $g_W$. Additionally, the function $g_I$ is introduced to map each reference point $r \in P_R$ (where $P_R$ represents the reference point cloud in VSMPL) to the nearest mesh point on the standard template. In conclusion, VSMPL defines a function $g_M(r, \beta, \theta, t)$ that takes $r$, $\beta$, $\theta$, and $t$ as the inputs and outputs the corresponding body point for $r$.
$$g_M(r, \beta, \theta, t) = \sum_{k=1}^{K} g_W^{k}(r)\, G_k(\theta, \beta)\, \big(g_I(r) + g_{B_P}(r)\,\theta + g_{B_S}(r)\,\beta\big) + t \quad (2)$$
where $G_k(\theta, \beta) \in SE(3)$ is the $4 \times 4$ transformation matrix of the $k$-th body part; further details can be found in [15]. To simplify the presentation, we use $G_M(P_R, \beta, \theta, t)$ to represent the human body point cloud derived from the VSMPL reference point cloud $P_R$ after applying the parameter adjustments.

4. Method

4.1. UBSP-Net

Our UBSP-Net takes a 3D human body scan point cloud $P_S \in \mathbb{R}^{N \times 3}$ ($N$ represents the number of sampling points) with an arbitrary pose, body shape, and clothing as the input and outputs an internal body point cloud $P_I \in \mathbb{R}^{N \times 3}$ and a VSMPL reference point cloud $P_R \in \mathbb{R}^{N \times 3}$ with point-to-point correspondence. Additionally, it predicts the part label probability $I_S \in \mathbb{R}^{N \times Z}$ (we divide the human body into 14 semantic parts, $Z = 14$) for the sampling points, which is used to assist in the inference of $P_I$ and $P_R$. The structure of UBSP-Net is shown in Figure 1. Recently, Point Transformer [30] has achieved impressive results in many 3D point cloud tasks. We borrow the idea of its encoder to extract the input point cloud features $F_S \in \mathbb{R}^{N \times 48}$. The design ideas for each part of the network are as follows.
Part label probability prediction: Although the point part labels are not directly used for the registration of the SMPL model, they are beneficial for helping the network understand the structure of the human point cloud. The prediction method is similar to traditional point cloud segmentation tasks, consisting of an MLP decoder composed of two fully connected layers. The feature size of the hidden layer is $[B, 128, N]$, where $B$ represents the batch size. Finally, the body part label probability $I_S$ of each point is output through a Softmax layer.
Internal body point cloud prediction: Based on the fact that external clothing covers the body’s surface, we directly derive the corresponding internal body point cloud from the external scan, providing a robust initial estimate for subsequent prediction. We further assume that the internal point cloud can be represented as offsets from the scan points, such that each input scan point $s \in P_S$ is regressed to its corresponding internal point $b \in P_I$. The internal body point cloud $P_I$ is represented by learning the offset of each $b$ relative to $s$, as shown in Figure 2. This significantly reduces the complexity of the network and only requires computations on the sampling point cloud while maintaining point-to-point correspondence with the sampling point cloud. For different body parts, the degree of offset of the clothing relative to the human body typically varies. For example, the coverage of clothing on the torso, arms, and legs can differ significantly. In such cases, directly computing the coordinate offsets for each point (e.g., using a hard aggregation method) may result in overly rigid mappings, especially in transition areas between body parts (e.g., the shoulders and waist). To address this issue, we adopt a soft aggregation approach, which weights the coordinate offset features of the sampled points according to the part label probabilities, thereby producing an offset matrix:
$$M_S = \sum_{i}^{N} I_S(x_i, p_z)\, \Omega(x_i) \quad (3)$$
where $M_S \in \mathbb{R}^{N \times 3}$ is the coordinate offset matrix, $\Omega(x_i)$ represents the sampling point coordinate offset features of the point $x_i \in P_S$, and $I_S(x_i, p_z)$ represents the probability that point $x_i$ belongs to body part $p_z$.
In the implementation, an MLP decoder consisting of two fully connected layers, where the hidden layer has a feature size of $[B, 128 \times 14, N]$, outputs a sampling point coordinate offset feature $\Omega$ of size $[B, 3 \times 14, N]$, which is converted into $[B, 3, 14, N]$ by a Reshape operation. Subsequently, the reshaped offset feature $\Omega$ is weighted and summed with the part label probability $I_S$ through Equation (3) to obtain the coordinate offset matrix $M_S$.
VSMPL reference point cloud prediction: The predicted internal body point cloud has a point-to-point correspondence with the scanned sample point cloud. Therefore, if the VSMPL reference point cloud also has a point-to-point correspondence with the internal body point cloud, this correspondence can be used to initialize the registration.
We regress the scan point cloud samples to their intermediate representation on VSMPL (the reference point cloud), i.e., each input scan-sampled point $s \in P_S$ regresses to its corresponding reference point $r \in P_R$ on VSMPL. Note that the correspondence mapping is discontinuous when there are occlusions and contacts in the scan. For example, when different parts are in proximity or in contact, a nearby scan sampling point $s$ needs to be mapped to the coordinates of a distant $r = (x, y, z)$, which is difficult. Inspired by [7,8,15], continuous $x, y, z$ correspondences are regressed only within the same part, as shown in Figure 3. However, hard classification of labels based on the argmax operation is not trivial, and the transition points between two neighboring parts are ambiguous. In this case, soft aggregation methods can effectively solve this problem. By weighting the coordinate offset features of each sampling point based on the part label probabilities, smooth and continuous mapping between parts can be achieved. This not only avoids the ambiguity of hard aggregation methods in transition regions but also effectively handles interference caused by contact between different parts. Therefore, the soft aggregation method is also adopted here, with part label probabilities used to weight the output of the VSMPL reference point cloud $P_R$:
$$P_R = \sum_{i}^{N} I_S(x_i, p_z)\, D(x_i) \quad (4)$$
where $D(x_i)$ denotes the VSMPL reference point cloud feature of point $x_i \in P_S$, and $I_S(x_i, p_z)$ denotes the probability that point $x_i$ belongs to body part $p_z$.
In the implementation, firstly, an MLP decoder composed of two fully connected layers is utilized, where the feature size of the hidden layer is $[B, 128 \times 14, N]$ and the output size is $[B, 3 \times 14, N]$ for the VSMPL reference point cloud feature $D$. Then, it is converted into $[B, 3, 14, N]$ through a Reshape operation. Finally, the reshaped VSMPL reference point cloud feature $D$ is weighted and summed with the part label probability $I_S$ using Equation (4) to obtain the VSMPL reference point cloud $P_R$.
It should be noted that we employ 1D convolution with a kernel size of 1 to replace the traditional fully connected layer. This makes it possible to use ready-made group convolution [31] in PyTorch (version 2.0.0) to build VSMPL reference point cloud prediction and internal body point cloud prediction decoders, which reduces the computational complexity, decreases the number of parameters, and increases the efficiency of parallel computation.
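To illustrate how the grouped 1D convolutions and the soft aggregation of Equations (3) and (4) fit together, the following sketch shows one possible PyTorch realization of such a per-part decoder. The feature sizes follow the text above; replicating the encoder features across the 14 groups and the activation choice are assumptions made for self-containment, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class PartwiseOffsetDecoder(nn.Module):
    """Sketch of a per-part decoder: two 1x1 grouped Conv1d layers (one group
    per body part) produce a candidate 3D prediction for every part, which is
    then softly aggregated with the part label probabilities."""
    def __init__(self, feat_dim=48, num_parts=14):
        super().__init__()
        self.num_parts = num_parts
        self.conv1 = nn.Conv1d(feat_dim * num_parts, 128 * num_parts,
                               kernel_size=1, groups=num_parts)
        self.conv2 = nn.Conv1d(128 * num_parts, 3 * num_parts,
                               kernel_size=1, groups=num_parts)
        self.act = nn.ReLU()

    def forward(self, feats, part_prob):
        # feats: [B, 48, N] point features; part_prob: [B, 14, N] Softmax output
        B, _, N = feats.shape
        x = feats.repeat(1, self.num_parts, 1)           # one feature copy per group
        x = self.conv2(self.act(self.conv1(x)))          # [B, 3*14, N]
        x = x.view(B, self.num_parts, 3, N)              # per-part 3D predictions
        # soft aggregation: weight each part's prediction by its probability
        return (x * part_prob.unsqueeze(2)).sum(dim=1)   # [B, 3, N]

# usage sketch: offsets for the internal body branch, coordinates for the
# VSMPL reference branch
decoder = PartwiseOffsetDecoder()
feats = torch.randn(2, 48, 1000)
probs = torch.softmax(torch.randn(2, 14, 1000), dim=1)
offsets = decoder(feats, probs)                          # [2, 3, 1000]
```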

4.2. The Loss Function

Our network is trained using supervised learning, and the following loss terms are included in the training process.
For the part classification of points, similar to general point cloud segmentation tasks, the cross-entropy loss $L_{CE}$ between the predicted body part label probability $I_S(x_i, p_k)$ and the ground truth body part label $\tilde{I}_S(x_i, p_k)$ is used:
$$L_{CE} = \mathrm{CrossEntropy}\big(I_S(x_i, p_k),\ \tilde{I}_S(x_i, p_k)\big) \quad (5)$$
For the internal body point cloud $P_I$, the MSE loss $L_{IE}$ between the predicted internal body point cloud $P_I$ and the ground truth internal body point cloud $\tilde{P}_I$ is used:
$$L_{IE} = \big\lVert \tilde{P}_I - P_I \big\rVert^2 \quad (6)$$
For the VSMPL reference point cloud $P_R$, the MSE loss $L_{RE}$ between the predicted VSMPL reference point cloud $P_R$ and the ground truth VSMPL reference point cloud $\tilde{P}_R$ is used:
$$L_{RE} = \big\lVert \tilde{P}_R - P_R \big\rVert^2 \quad (7)$$
Thus, the total loss function can be defined as
$$L = \omega_{CE} L_{CE} + \omega_{IE} L_{IE} + \omega_{RE} L_{RE} \quad (8)$$
where $\omega_{CE}$, $\omega_{IE}$, and $\omega_{RE}$ are the weights that control each contribution, respectively. In our implementation, $\omega_{CE} = 1$, $\omega_{IE} = 100$, and $\omega_{RE} = 100$. The larger values of $\omega_{IE}$ and $\omega_{RE}$ reflect the higher accuracy requirements for predicting the internal body point cloud and the reference point cloud, as these two components play a crucial role in 3D reconstruction. In contrast, the smaller value of $\omega_{CE}$ indicates that although part classification is essential for the optimization process, its impact on the overall training is relatively minor. While this combination of weights may not be optimal, it yields satisfactory results.
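A minimal sketch of how the total loss in Equation (8) can be assembled is given below. The tensor layouts and the use of raw logits before Softmax are assumptions made for illustration; the weight values are those stated above.

```python
import torch
import torch.nn.functional as F

def ubsp_loss(part_logits, part_labels, pred_inner, gt_inner, pred_ref, gt_ref,
              w_ce=1.0, w_ie=100.0, w_re=100.0):
    """Sketch of Equation (8): weighted sum of the part-label cross-entropy
    (Equation (5)) and the MSE losses on the internal body and VSMPL reference
    point clouds (Equations (6) and (7))."""
    # part_logits: [B, Z, N] raw scores; part_labels: [B, N] integer labels
    l_ce = F.cross_entropy(part_logits, part_labels)
    # point clouds: [B, N, 3], compared point-to-point
    l_ie = F.mse_loss(pred_inner, gt_inner)
    l_re = F.mse_loss(pred_ref, gt_ref)
    return w_ce * l_ce + w_ie * l_ie + w_re * l_re
```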

4.3. SMPL/SMPL+D Model Registration

Predicted internal body point clouds can be used directly for some basic applications, such as body shape estimation and dimensional measurements. However, they lack parametric control and have no mesh surface, while the input external body scan may suffer from similar problems. These problems can be effectively solved by introducing SMPL/SMPL+D models for registration, which enables parametric control and generation of human body models with mesh faces to improve the usefulness of the models.
SMPL model registration of the internal body point cloud: There is a point-to-point correspondence between the internal body point cloud $P_I$ predicted by our network and the VSMPL reference point cloud $P_R$, and SMPL and VSMPL share model parameters, so initialization of the alignment process can be performed directly at the point level. The objective function for fitting the SMPL model to the internal body point cloud is defined as
$$E_{\mathrm{SMPL}}(P_R, \beta, \theta, t) = w_{\mathrm{corr}} E_{\mathrm{corr}} + w_{\mathrm{dist}} E_{\mathrm{dist}} + w_{\mathrm{reg}} E_{\mathrm{reg}} \quad (9)$$
where $E_{\mathrm{corr}}$ is the distance between the corresponding points, $E_{\mathrm{dist}}$ is the overall distance between the target point cloud and the generated model point cloud, and $E_{\mathrm{reg}}$ is the regularization term. $w_{\mathrm{corr}}$, $w_{\mathrm{dist}}$, and $w_{\mathrm{reg}}$ are the weights. The strategy for adjusting the various weights during the optimization process can be found in reference [15].
For the corresponding point term $E_{\mathrm{corr}}$, an absolute error loss is used as the measure:
$$E_{\mathrm{corr}}(P_R, \beta, \theta, t) = \frac{1}{|P_I|} \big\lVert G_M(P_R, \beta, \theta, t) - P_I \big\rVert_1 \quad (10)$$
where $\lVert \cdot \rVert_1$ denotes the $L_1$ norm.
For the distance term $E_{\mathrm{dist}}$, the symmetric Chamfer distance error $E_{\mathrm{scd}}(P_R)$ between the VSMPL reference point cloud $P_R$ and the standard SMPL human body template point cloud $P_T$ is established. In addition, the symmetric Chamfer distance error $E_{\mathrm{scd}}(P_R, \beta, \theta, t)$ between the VSMPL-generated human body point cloud and the predicted internal body point cloud $P_I$ is established.
$$E_{\mathrm{dist}}(P_R, \beta, \theta, t) = E_{\mathrm{scd}}(P_R) + E_{\mathrm{scd}}(P_R, \beta, \theta, t) \quad (11)$$
$$E_{\mathrm{scd}}(P_R) = \frac{1}{|P_R|} \sum_{x \in P_R} \min_{y \in P_T} \lVert x - y \rVert_2 + \frac{1}{|P_T|} \sum_{x \in P_T} \min_{y \in P_R} \lVert x - y \rVert_2 \quad (12)$$
$$E_{\mathrm{scd}}(P_R, \beta, \theta, t) = \frac{1}{|G_M(P_R, \beta, \theta, t)|} \sum_{x \in G_M(P_R, \beta, \theta, t)} \min_{y \in P_I} \lVert x - y \rVert_2 + \frac{1}{|P_I|} \sum_{x \in P_I} \min_{y \in G_M(P_R, \beta, \theta, t)} \lVert x - y \rVert_2 \quad (13)$$
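The symmetric Chamfer distance used in Equations (12), (13), and (16) can be computed with a dense pairwise distance matrix for point clouds of the sizes used here; the following sketch is one straightforward realization, not the authors' implementation.

```python
import torch

def symmetric_chamfer(a, b):
    """Sketch of the symmetric Chamfer distance: mean nearest-neighbour
    distance from a to b plus from b to a. a: [Na, 3], b: [Nb, 3]."""
    d = torch.cdist(a, b)                 # [Na, Nb] pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```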
The regularization term $E_{\mathrm{reg}}$ contains a pose prior $E_{\mathrm{prior}}$ from [32] and a constraint term $E_{\mathrm{cons}}$ that limits excessive changes in $\beta$.
$$E_{\mathrm{reg}}(\beta, \theta) = E_{\mathrm{prior}}(\theta) + E_{\mathrm{cons}}(\beta) = E_{\mathrm{prior}}(\theta) + \frac{1}{|\beta|} \lVert \beta \rVert^2 \quad (14)$$
Since SMPL shares the body shape $\beta$, pose $\theta$, and global translation $t$ parameters with VSMPL, the SMPL model with a mesh can be obtained by inputting the above optimized parameters into Equation (1).
SMPL+D model registration of the external scan point cloud: Based on the SMPL model parameters obtained above, the parameters of SMPL+D are optimized to fit the external body scan. The fitting in non-clothing regions (e.g., the hands) requires attention, as optimizing the vertex displacement $D$ for the full body may lead to irrational deformations in these body regions. We add a local shield term to limit the displacement optimization of the local vertices. The objective function for fitting the SMPL+D model to the external scan point cloud is defined as
$$E_{\mathrm{SMPLD}}(\beta, \theta, t, D) = \omega_{\mathrm{ddist}} E_{\mathrm{ddist}} + \omega_{\mathrm{dreg}} E_{\mathrm{dreg}} + \omega_{\mathrm{shield}} E_{\mathrm{shield}} \quad (15)$$
where $E_{\mathrm{ddist}}$ is the overall distance between the target point cloud and the generated model point cloud, $E_{\mathrm{dreg}}$ is the regularization term, and $E_{\mathrm{shield}}$ is the local shield term. $\omega_{\mathrm{ddist}}$, $\omega_{\mathrm{dreg}}$, and $\omega_{\mathrm{shield}}$ are the weights. The strategy for adjusting the various weights during the optimization process can be found in reference [15].
For the distance term $E_{\mathrm{ddist}}$, since the input external scan does not necessarily include mesh faces, the symmetric Chamfer distance is used here to measure the distance between the SMPL+D model vertices and the external scan point cloud:
$$E_{\mathrm{ddist}}(\beta, \theta, t, D) = \frac{1}{|M(\beta, \theta, t, D)|} \sum_{x \in M(\beta, \theta, t, D)} \min_{y \in P_S} \lVert x - y \rVert_2 + \frac{1}{|P_S|} \sum_{x \in P_S} \min_{y \in M(\beta, \theta, t, D)} \lVert x - y \rVert_2 \quad (16)$$
For the regularization term $E_{\mathrm{dreg}}$, a Laplacian loss $E_{\mathrm{lap}}$ between the initial mesh $M_{\mathrm{initial}}$ (the SMPL model mesh fitted to the predicted internal point cloud) and the SMPL+D mesh being optimized is used to constrain excessive deformation of the mesh. In addition, a loss term $E_D$ for the vertex displacement $D$ is introduced to limit excessive displacement of the vertices.
$$E_{\mathrm{dreg}}(\beta, \theta, t, D) = \omega_{\mathrm{lap}} E_{\mathrm{lap}}(\beta, \theta, t, D) + \omega_{D} E_{D}(D) = \omega_{\mathrm{lap}} \frac{1}{|M_{\mathrm{initial}}|} \big\lVert \mathrm{lap}(M_{\mathrm{initial}}) - \mathrm{lap}(M(\beta, \theta, t, D)) \big\rVert_2 + \omega_{D} \frac{1}{|D|} \lVert D \rVert_2 \quad (17)$$
For the local shield term $E_{\mathrm{shield}}$, the region to be shielded in the SMPL model is extracted, and then the vertex indices of the shielded part are determined by the nearest point query algorithm. For the vertex displacements $D' \subseteq D$ within these indices, the loss term that limits the optimization is as follows:
$$E_{\mathrm{shield}}(D') = \frac{1}{|D'|} \lVert D' \rVert_2 \quad (18)$$
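A possible realization of the nearest-point query and the shield penalty of Equation (18) is sketched below. The function and argument names (e.g., shield_region_points, tol) are illustrative, and interpreting Equation (18) as the mean displacement magnitude of the shielded vertices is an assumption.

```python
import torch

def shielded_vertex_indices(smpl_vertices, shield_region_points, tol=1e-3):
    """Sketch of the nearest-point query described above: SMPL vertices whose
    nearest point in the pre-extracted shield region (e.g., the hands) lies
    within `tol` are treated as shielded."""
    d = torch.cdist(smpl_vertices, shield_region_points)     # [6890, M]
    return (d.min(dim=1).values < tol).nonzero(as_tuple=True)[0]

def shield_loss(displacements, shield_idx):
    """Sketch of Equation (18): penalise the displacements D' of shielded
    vertices so that non-clothing regions stay close to the fitted SMPL body."""
    d_shield = displacements[shield_idx]                      # [K, 3] offsets D'
    return d_shield.norm(dim=1).mean()                        # mean magnitude of D'
```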

5. Dataset Generation

In order to train our UBSP-Net, in addition to the external scan point cloud samples $P_S$ as inputs, the corresponding internal body point cloud $P_I$, the reference point cloud $P_R$ of VSMPL, and the part label $I \in \mathbb{R}^{Z}$ at each point are required as supervised data. We generate training and test data based on the CAPE dataset [33], which contains 148,584 aligned dressed/minimally dressed SMPL meshes covering 8 common clothing types for 10 male and 5 female subjects. All of the aligned meshes have the SMPL topology, as shown in Figure 4. The extraction method for each type of supervised data is explained in detail below.
For the part label $I$ of the input scan sampling points, first, the SMPL model is pre-partitioned into 14 semantic body parts based on the vertex indices, and the outer surface model is randomly sampled to obtain the scan sampling point cloud $P_S$. Then, the nearest point query algorithm is used to find the nearest points of the scan point cloud $P_S$ to the vertices of the dressed SMPL model, thus transferring the body part labels from the dressed SMPL model to these scan sampling points:
$$I(o) = \arg\min_{m \in M} \lVert o - m \rVert \quad (19)$$
where $o \in P_S$ is one of the scan sampling points, $I(o)$ denotes the body part label of point $o$, and $m$ is a vertex of the dressed SMPL model mesh $M$ with a known part label.
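In practice, the nearest-vertex query of Equation (19) can be implemented efficiently with a KD-tree; the following sketch uses SciPy for illustration and is not tied to the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_part_labels(scan_points, dressed_vertices, vertex_part_labels):
    """Sketch of Equation (19): each scan sampling point o inherits the part
    label of its nearest vertex m on the dressed SMPL mesh."""
    tree = cKDTree(dressed_vertices)               # KD-tree over mesh vertices
    _, nearest_idx = tree.query(scan_points)       # index of argmin ||o - m||
    return np.asarray(vertex_part_labels)[nearest_idx]
```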
For the internal body point cloud $P_I$, through the nearest point query algorithm, we find the nearest point $o'$ to each point $o$ in $P_S$ on the mesh surface $M'$ of the minimally dressed SMPL model and take these points as the internal body point cloud:
$$P_I = \Big\{\, o' \in M' \;\Big|\; o' = \arg\min_{c \in M'} \lVert o - c \rVert \,\Big\} \quad (20)$$
For the VSMPL reference point cloud $P_R$, let $o \in P_S$ be one of the scanned points, and let $o'$ be the nearest point to $o$ on the minimally dressed SMPL model. $A$, $B$, and $C$ are the vertices of the mesh face on which $o'$ lies, and $A'$, $B'$, and $C'$ are the mesh vertices on the original SMPL T-pose template corresponding to $A$, $B$, and $C$. The coordinates of $o'$ can be expressed as $o' = aA + bB + cC$ ($a + b + c = 1$). Then, the corresponding point $v \in P_R$ on the SMPL model T-pose template can be calculated as $v = aA' + bB' + cC'$. The above construction method is shown in Figure 5. The point-to-point mapping from $P_S$ to $P_I$ to $P_R$ can be realized by applying the above method to each scan sampling point.
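The projection and barycentric transfer described above can be sketched with the trimesh library as follows; the function name and the exact input layout are illustrative, assuming the minimally dressed mesh and the T-pose template share the SMPL topology.

```python
import trimesh

def vsmpl_reference_points(scan_points, minimal_mesh, template_vertices):
    """Sketch of the P_S -> P_I -> P_R construction: project each scan point
    onto the minimally dressed SMPL mesh (giving P_I), express the projection
    in barycentric coordinates of its triangle, and evaluate the same
    coordinates on the T-pose template (giving P_R)."""
    closest, _, tri_ids = trimesh.proximity.closest_point(minimal_mesh, scan_points)
    p_inner = closest                                       # internal body points P_I
    faces = minimal_mesh.faces[tri_ids]                     # [N, 3] vertex indices
    tri_minimal = minimal_mesh.vertices[faces]              # [N, 3, 3] triangles
    bary = trimesh.triangles.points_to_barycentric(tri_minimal, closest)
    tri_template = template_vertices[faces]                 # same faces on the T-pose template
    p_ref = trimesh.triangles.barycentric_to_points(tri_template, bary)
    return p_inner, p_ref                                   # P_I and P_R
```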
Through the above steps, it is ensured that the points in the scan sampling point cloud $P_S$, the internal body point cloud $P_I$, and the VSMPL reference point cloud $P_R$ have a strict point-to-point correspondence; the sample dataset is shown in Figure 6. Therefore, the network can learn this correspondence to estimate the internal body shape and initialize the registration process.

6. Experimentation and Analysis

In this section, we first introduce the experimental implementation details, including the dataset division and the training setup. Subsequently, ablation experiments are conducted to examine and analyze the impact of fused part label probability features on the prediction results. Additionally, qualitative and quantitative comparative analyses are performed against related methods in terms of internal body shape prediction and parametric model registration. Finally, the practicality of this method is verified through real-world scanning experiments. All experiments were conducted on a graphics workstation equipped with a 12th Gen Intel(R) Core(TM) i9-12900K processor (Intel Corporation, Santa Clara, CA, USA), 64 GB of memory, and an NVIDIA GeForce RTX 3060 graphics card (NVIDIA Corporation, Santa Clara, CA, USA).
Dataset: In this paper, the proposed method is used to construct a dataset, generating approximately 10,000 samples through random sampling, with a fixed number of 10,000 sampling points per input point cloud. We use the data for 12 subjects for model training and the data for 3 subjects for testing, which is the same training/testing division as used in PTF [9].
Training: All input scan sampling point clouds are scaled to a height of 1.5 m and moved to the center position. Training is performed using the Adam optimizer with a learning rate of 0.001 and a batch size of 8, with the other parameters set to their defaults. Training takes about 3 days.
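For concreteness, the preprocessing and optimizer setup described above can be sketched as follows; the Y-up axis convention assumed for the height computation is not stated in the paper.

```python
import torch

def normalize_scan(points, target_height=1.5):
    """Sketch of the preprocessing described above: scale an [N, 3] scan so
    that its height is 1.5 m and centre it at the origin (a Y-up axis
    convention is assumed here)."""
    height = points[:, 1].max() - points[:, 1].min()
    points = points * (target_height / height)
    return points - points.mean(dim=0, keepdim=True)

# training configuration stated above (network definition omitted):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # batch size 8
```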

6.1. The Influence of the Sampling Point Quantity

The influence of different sampling point quantities on the prediction accuracy of the three branches is shown in Table 1. In this section, the mean point-to-point Euclidean distance error is used to measure the accuracy of the VSMPL reference point cloud and the internal body point cloud, with the units in mm. The percentage of correctly classified point clouds is used to assess the accuracy of the part label predictions. For fairness of comparison, the model was retrained on our dataset for each sampling point quantity.
It can be observed that the increase in the number of sampling points has little effect on the accuracy of the part label predictions. This is because the distribution of the part labels is relatively uniform, and a small number of point clouds are sufficient to represent the features of each part. However, as the number of sampling points increases, the prediction accuracy of both the VSMPL reference point cloud and the internal body point cloud improves. This may be because more sampling points provide more detailed and richer information, which captures the complex geometric shapes and details better. Considering the trade-off between accuracy and the computational cost, 10,000 points are chosen as the default for subsequent experiments, as they achieve a near-optimal precision while avoiding the higher overhead associated with 12,000 points.

6.2. The Influence of the Part Label Probability

In order to evaluate the effect of the part label probability on the internal body point cloud and the VSMPL reference point cloud, we design an alternative network structure. Namely, the same encoder as in the proposed network is used, but the predicted body part label probabilities are not used in the decoding stage to weight the predicted internal body point cloud and the VSMPL reference point cloud. This means that the model does not take into account the variability between different body parts when generating the final point cloud prediction. The average point-to-point Euclidean error is used here to measure the accuracy of the VSMPL reference point cloud and internal body point cloud predictions. The unit of measurement is mm. Table 2 demonstrates the influence of part label probability weighting on the accuracy of the internal body point cloud and VSMPL reference point cloud predictions.
By comparing the differences in the prediction accuracy of the VSMPL reference point cloud, it can be seen that the part label probability weighted approach provides a significant improvement in the prediction accuracy. This improvement is mainly attributed to the fact that the network is able to capture the variability between different body parts better and fine-tune the predictions of the point cloud accordingly. It is worth noting that although the part label probability weighting strategy also helps with the prediction of the internal body point cloud to a certain extent, its enhancement effect is more limited compared to that for the VSMPL reference point cloud. It may be that the prediction of internal body point clouds relies more on the overall body structure than on single body part features. However, even so, part label probability weighting is still a useful strategy.

6.3. Comparison of Internal Body Shape Predictions

Internal body shapes are typically estimated using implicit representation-based methods such as IP-Net [8] and PTF [9], which classify spatial points and extract surfaces via the Marching Cubes [10] algorithm. We primarily compare our method with these baselines. Unlike IP-Net and PTF, which represent the results as meshes, our method outputs a point cloud (Figure 7). To validate the effectiveness of removing clothing, for our method we also show a view of the predicted internal body point cloud overlaid with the scanned mesh. It can be seen that the internal body shape predicted using our method clearly maintains detailed features such as the hands, whereas these features are more blurred with the other reference methods. In addition, where different parts are close to each other, the implicit-representation-based methods produce shapes that are far away from the body, as shown by the dashed circles in Figure 7, while our method is almost unaffected.
For quantitative comparisons, 10,000 points are resampled on the surface of the target model as the ground truth to ensure the fairness of the comparison. For IP-Net and PTF, 10,000 points are sampled on the predicted internal body mesh for comparison. A widely used evaluation metric, the Chamfer distance, is used here to describe the average Euclidean distance between the reconstructed model and the ground truth. The unit of measurement is mm. The Chamfer distance errors of the different methods are shown in Table 3, where it can be seen that our internal body shape prediction accuracy is higher than that of the competing methods, and the error distribution is more concentrated. Table 4 reports the number of model parameters and the forward inference time for each method. Our network model has a smaller number of parameters. In addition, compared to IP-Net and PTF, which need to query the classification of each point in space and extract the mesh surface, our method only needs to compute the offsets of the scan sampled points. The forward inference time is improved by about two orders of magnitude.

6.4. Parametric Model Reconstruction Evaluation

The minimally dressed and dressed mesh models in the CAPE dataset are the SMPL and SMPL+D models, respectively. Therefore, the average Euclidean error between the vertices of the reconstructed model can be used to measure the registration accuracy. The unit of measurement is mm. IP-Net [8] predicts the correspondence with 14 parts of the SMPL mesh and then uses a method similar to the ICP to recover the SMPL shape and pose parameters. LoopReg [15] extends this idea by incorporating an SMPL optimization step during the network training process. Since it does not predict internal body shapes, only the external SMPL+D registration results are compared here. PTF [9] segments the body into 24 parts for local transformation regression, initializes the pose parameters through least squares fitting, and then adopts the same registration strategy as IP-Net.
To provide a comprehensive evaluation of the proposed method, two parametric human reconstruction approaches based on single frontal 2D images (PostureHMR [34] and ReFit [18]) are introduced, and quantitative comparisons with our method are presented in Table 5. The experimental results show that the proposed method achieves lower vertex errors in both the internal SMPL and external SMPL+D reconstructions. Additionally, the reconstruction efficiency of the internal SMPL reconstruction is approximately five times higher than that of IP-Net and PTF, primarily due to a more efficient internal body shape estimation process and the use of point-to-point distance calculations during registration, which are computationally less intensive than the mesh-to-mesh distance calculations employed by conventional methods. While parametric human reconstruction methods based on 2D images offer high convenience, their reconstruction results generally exhibit larger vertex errors and do not meet the precision requirements of high-accuracy application scenarios.
The qualitative comparison of the internal SMPL reconstruction is shown in Figure 8. In order to simulate the problem of there being holes in the point cloud acquired by the actual 3D body scanning system, some of the input point clouds are randomly cropped to varying degrees in the experiments. The figure also shows the internal body shape predicted by each method. IP-Net and PTF divide the predicted internal body into multiple parts, and the alignment is realized using the part-level correspondence between SMPL and the predicted internal body mesh. However, this correspondence is actually ambiguous for joint rotations, even though PTF divides the body into 24 parts. In addition, these implicit-representation-based methods suffer from more severe breakage when reconstructing the mesh surface of self-intersecting parts of the body (e.g., the hand–waist intersection), further exacerbating the ambiguity of the joint rotations. In contrast, we use corresponding points sampled from the scans to lightly represent the internal body shape and use the point-to-point relationship between the predicted VSMPL reference point cloud and the internal body point cloud to align the SMPL model, which greatly mitigates the joint rotation ambiguity problem. Although inputting broken body scans results in outputting broken internal body point clouds, robust registration can still be achieved due to the point-to-point correspondence between the VSMPL reference point cloud and the internal body point cloud.
Registration of the external SMPL+D model is performed based on the internal SMPL model parameters, jointly optimizing the shape, pose, displacement, and per-vertex offsets. A qualitative comparison with the competing methods is shown in Figure 9. Implicit-representation-based approaches convert the input scans into voxel representations and simultaneously extract dual-layer surfaces for the body and clothing, which in practice leads to a significant loss of detailed information, such as clothing wrinkles, in human scans. As observed, SMPL+D models reconstructed by IP-Net and PTF tend to be overly smooth. Moreover, these competing methods do not account for deformations in non-clothing regions, resulting in distortion of the local shapes of the body. In contrast, our method does not transform the scan point clouds into other canonical forms, effectively preserving the original clothing wrinkles. Additionally, we define local masking regions (e.g., the head and hands in the first row of Figure 9 and only the hands in the second and third rows) to ensure that non-clothing parts remain free from distortion.

6.5. The Real Scan Reconstruction Experiment

The reconstruction results for real scanned samples from the THuman2.0 dataset [35], which contains 500 high-quality human scans captured using a dense DSLR camera rig, are shown in Figure 10, including the internal body point cloud, the SMPL registration model, and the SMPL+D registration model; none of these poses are present in the training set. To demonstrate the effectiveness of removing clothing, the internal body point cloud and the SMPL registration model are shown overlaid with the scanned body mesh so that the predicted gap between the internal body and clothing can be clearly seen.
In addition to the qualitative visualization, we conduct a quantitative evaluation using the Chamfer distance and point-to-surface error between the reconstructed SMPL model and the scanned mesh. The average Chamfer distance is 84.4 mm, and the point-to-surface error is 30.7 mm, indicating that the reconstructed internal body is geometrically close to the actual scan while maintaining the predicted gap introduced by clothing.
It should be noted that since the SMPL model has only two hand joints, it is not possible to register more complex hand poses such as clenched fists, resulting in less effective hand registration. Most of the reconstructed SMPL models are located inside the clothes, with a few interpenetrations occurring, mainly in the shoulders, chest, and upper thighs. The reason for this is that the fabric pressure in these regions is high, causing the clothing to fit closely to the body, which is consistent with the actual situation.

7. Conclusions

This paper presents UBSP-Net, a novel framework for reconstructing parametric 3D human models from clothed full-body scans, addressing the fundamental challenge of accurately estimating body shape and pose beneath clothing. The proposed framework employs a dual-prediction strategy to concurrently predict the internal body point cloud and a VSMPL reference point cloud with explicit point-to-point correspondence. External scan points are utilized as the initialization to enhance the stability and computational efficiency. In addition, the integration of body part label probabilities and a soft aggregation mechanism helps to accurately predict internal body point clouds and VSMPL reference point clouds, reducing the ambiguity caused by joint rotation.
Comprehensive evaluations on the CAPE dataset demonstrate that UBSP-Net outperforms mainstream methods such as IP-Net and PTF. For internal body shape prediction, it reduces the Chamfer distance errors by approximately 24% compared with those for IP-Net and 41% compared with those for PTF while offering inference speeds 77 times faster than IP-Net and 43 times faster than PTF. For parametric model registration, UBSP-Net achieves vertex error reductions of 32%/43% compared with those for IP-Net and 20%/34% compared with those for PTF for internal SMPL and external SMPL+D models, respectively, with a significantly lower registration latency. Real-world validation using the THuman2.0 dataset further corroborates its robustness and practical applicability across various clothing types.
The proposed framework offers a scalable, efficient, and privacy-preserving solution for parametric human body modeling, with potential utility in 3D shape analysis, virtual garment fitting, and digital animation. Future research directions include extending the method to accommodate loose-fitting and multi-layered garments; improving the hand pose registration to capture fine-grained gestures; and leveraging temporal cues from 4D scan sequences to enhance dynamic pose estimation and expand its range of applications.

Author Contributions

Conceptualization: X.L. and M.L.; methodology: X.L.; software: X.L.; validation: X.C., F.C., and M.L.; formal analysis: X.C. and F.S.; investigation: M.L. and F.S.; resources: X.C.; data curation: X.L. and F.S.; writing—original draft preparation: X.L.; writing—review and editing: X.L.; visualization: F.C.; supervision: X.C.; project administration: X.C.; funding acquisition: X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available on request from the authors.

Acknowledgments

We gratefully acknowledge the computing resource support provided by the Fenghua Research Institute of Ningbo University of Technology.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Haleem, A.; Gupta, P.; Bahl, S.; Javaid, M.; Kumar, L. 3D scanning of a carburetor body using COMET 3D scanner supported by COLIN 3D software: Issues and solutions. Mater. Today Proc. 2021, 39, 331–337. [Google Scholar] [CrossRef] [PubMed]
  2. Chen, H.; Liu, S.; Chen, W.; Li, H.; Hill, R. Equivariant point network for 3D point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14514–14523. [Google Scholar]
  3. Lin, C.H.; Gao, J.; Tang, L.; Takikawa, T.; Zeng, X.; Huang, X.; Kreis, K.; Fidler, S.; Liu, M.Y.; Lin, T.Y. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 300–309. [Google Scholar]
  4. Li, X.; Li, G.; Li, T.; Mitrouchev, P. Human body construction based on combination of parametric and nonparametric reconstruction methods. Vis. Comput. 2024, 40, 5557–5573. [Google Scholar] [CrossRef]
  5. Li, X.; Li, G.; Li, T.; Lv, J.; Mitrouchev, P. Design of a multi-sensor information acquisition system for mannequin reconstruction and human body size measurement under clothes. Text. Res. J. 2022, 92, 3750–3765. [Google Scholar] [CrossRef]
  6. Hu, P.; Kaashki, N.N.; Dadarlat, V.; Munteanu, A. Learning to estimate the body shape under clothing from a single 3-D scan. IEEE Trans. Ind. Inform. 2020, 17, 3793–3802. [Google Scholar] [CrossRef]
  7. Li, X.; Li, G.; Li, M.; Song, H. Parametric body reconstruction based on a single front scan point cloud. IEEE Trans. Vis. Comput. Graph. 2024, 31, 5816–5828. [Google Scholar] [CrossRef] [PubMed]
  8. Bhatnagar, B.L.; Sminchisescu, C.; Theobalt, C.; Pons-Moll, G. Combining implicit function learning and parametric models for 3D human reconstruction. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Cham, Switzerland, 2020; pp. 311–329. [Google Scholar]
  9. Wang, S.; Geiger, A.; Tang, S. Locally aware piecewise transformation fields for 3D human mesh registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7639–7648. [Google Scholar]
  10. Lorensen, W.E.; Cline, H.E. Marching cubes: A high resolution 3D surface construction algorithm. In Seminal Graphics: Pioneering Efforts That Shaped the Field; Association for Computing Machinery: New York, NY, USA, 1998; pp. 347–353. [Google Scholar]
  11. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2; Association for Computing Machinery: New York, NY, USA, 2023; pp. 851–866. [Google Scholar]
  12. Feng, H.; Kulits, P.; Liu, S.; Black, M.J.; Abrevaya, V.F. Generalizing neural human fitting to unseen poses with articulated SE(3) equivariance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7977–7988. [Google Scholar]
  13. Xie, H.; Zhong, Y. Structure-consistent customized virtual mannequin reconstruction from 3D scans based on optimization. Text. Res. J. 2020, 90, 937–950. [Google Scholar] [CrossRef]
  14. Zheng, Z.; Yu, T.; Liu, Y.; Dai, Q. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3170–3184. [Google Scholar] [CrossRef] [PubMed]
  15. Bhatnagar, B.L.; Sminchisescu, C.; Theobalt, C.; Pons-Moll, G. Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3D human mesh registration. Adv. Neural Inf. Process. Syst. 2020, 33, 12909–12922. [Google Scholar]
  16. Chen, D.; Song, Y.; Liang, F.; Ma, T.; Zhu, X.; Jia, T. 3D human body reconstruction based on SMPL model. Vis. Comput. 2023, 39, 1893–1906. [Google Scholar] [CrossRef]
  17. Choutas, V.; Müller, L.; Huang, C.H.P.; Tang, S.; Tzionas, D.; Black, M.J. Accurate 3D body shape regression using metric and semantic attributes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 2718–2728. [Google Scholar]
  18. Wang, Y.; Daniilidis, K. Refit: Recurrent fitting network for 3D human recovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 14644–14654. [Google Scholar]
  19. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985. [Google Scholar]
  20. Wehrbein, T.; Rudolph, M.; Rosenhahn, B.; Wandt, B. Utilizing uncertainty in 2D pose detectors for probabilistic 3D human mesh recovery. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 5852–5862. [Google Scholar]
  21. Chen, S.; He, Y. Knowledge-embedded Transformer for 3D Human Pose Estimation. IEEE Trans. Instrum. Meas. 2025, 74, 5031811. [Google Scholar]
  22. Yu, T.; Zheng, Z.; Guo, K.; Zhao, J.; Dai, Q.; Li, H.; Pons-Moll, G.; Liu, Y. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7287–7296. [Google Scholar]
  23. Zhang, C.; Pujades, S.; Black, M.J.; Pons-Moll, G. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4191–4200. [Google Scholar]
  24. Lazova, V.; Insafutdinov, E.; Pons-Moll, G. 360-degree textures of people in clothing from a single image. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Québec City, QC, Canada, 16–19 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 643–653. [Google Scholar]
  25. Li, J.; Hu, Q.; Zhang, Y.; Ai, M. Robust symmetric iterative closest point. ISPRS J. Photogramm. Remote Sens. 2022, 185, 219–231. [Google Scholar] [CrossRef]
  26. Bogo, F.; Romero, J.; Pons-Moll, G.; Black, M.J. Dynamic FAUST: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6233–6242. [Google Scholar]
  27. Yao, Y.; Deng, B.; Xu, W.; Zhang, J. Quasi-newton solver for robust non-rigid registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7600–7609. [Google Scholar]
  28. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  29. Li, X.; Cheng, X.; Chen, F.; Shi, F.; Li, M. FPCR-Net: Front Point Cloud Regression Network for End-to-End SMPL Parameter Estimation. Sensors 2025, 25, 4808. [Google Scholar] [CrossRef] [PubMed]
  30. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16259–16268. [Google Scholar]
  31. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
  32. Pons-Moll, G.; Pujades, S.; Hu, S.; Black, M.J. ClothCap: Seamless 4D clothing capture and retargeting. ACM Trans. Graph. (ToG) 2017, 36, 1–15. [Google Scholar] [CrossRef]
  33. Ma, Q.; Yang, J.; Ranjan, A.; Pujades, S.; Pons-Moll, G.; Tang, S.; Black, M.J. Learning to dress 3D people in generative clothing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6469–6478. [Google Scholar]
  34. Song, Y.P.; Wu, X.; Yuan, Z.; Qiao, J.J.; Peng, Q. PostureHMR: Posture transformation for 3D human mesh recovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9732–9741. [Google Scholar]
  35. Yu, T.; Zheng, Z.; Guo, K.; Liu, P.; Dai, Q.; Liu, Y. Function4D: Real-time human volumetric capture from very sparse consumer RGBD sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5746–5756. [Google Scholar]
Figure 1. A structural diagram of UBSP-Net. The dots indicate the corresponding points.
Figure 2. Schematic diagram of corresponding point offset.
Figure 3. A schematic diagram of the mapping of corresponding points within the same body part.
Figure 4. A schematic diagram of the CAPE dataset. The dressed body model and the corresponding minimally dressed body model are displayed adjacent to each other, with all meshes maintaining the topological structure of the SMPL model.
Figure 5. A schematic diagram of VSMPL reference point cloud extraction. The dots represent the corresponding points, and the dashed circles mark enlarged views of the mesh. Here, o and v are corresponding points with the same barycentric coordinates in the triangle with the same index.
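To make the correspondence in Figure 5 concrete: given the index of the triangle containing a point o and its barycentric coordinates inside that triangle, the reference point v is obtained by evaluating the same barycentric coordinates on the SMPL triangle with the same index (the scan and the SMPL mesh share one topology). The sketch below is a minimal illustration of this step, not the authors' code; the array names, shapes, and use of NumPy are assumptions.

```python
import numpy as np

def extract_reference_points(smpl_vertices, faces, tri_idx, bary):
    """Evaluate barycentric coordinates on same-index SMPL triangles.

    smpl_vertices: (V, 3) vertex positions of the SMPL mesh
    faces:         (F, 3) triangle vertex indices (shared topology)
    tri_idx:       (N,)   index of the triangle containing each point o
    bary:          (N, 3) barycentric coordinates of o in that triangle
    Returns the (N, 3) reference points v corresponding to o.
    """
    tris = smpl_vertices[faces[tri_idx]]           # (N, 3, 3) triangle corners
    return np.einsum('nk,nkd->nd', bary, tris)     # barycentric interpolation
```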
Figure 6. A dataset sample. First, the CAPE human model is segmented into 14 body parts using a predefined segmentation template. Then, the sampled scan point cloud P_S, the internal body point cloud P_I, and the VSMPL reference point cloud P_R required for network training are obtained sequentially using the method described above. The dots represent the corresponding points.
Figure 7. Comparison of different methods for internal body shape prediction.
Figure 8. Qualitative comparison of different methods for predicting internal body shape and SMPL registration. The SMPL registration model is colored by the vertex-to-vertex error with respect to the ground-truth model.
Figure 9. Comparison of SMPL+D model reconstruction.
Figure 10. Real scan sample reconstruction.
Table 1. The influence of different numbers of sampling points on the predictions. The results are presented as (mean, standard deviation). The best results are bolded.

Sampling Points | Part Labels (%) | VSMPL Reference Point Cloud | Internal Body Point Cloud
2000   | 96.7 | (100.4, 59.7) | (7.6, 1.5)
4000   | 97.2 | (97.8, 59.9)  | (7.2, 1.6)
6000   | 97.4 | (96.1, 56.3)  | (7.1, 1.4)
8000   | 97.6 | (93.9, 56.3)  | (6.7, 1.5)
10,000 | 97.3 | (88.7, 56.9)  | (6.4, 1.4)
12,000 | 97.1 | (89.0, 55.4)  | (6.4, 1.5)
Table 2. Influence of part label probability weighting on prediction. Results are presented as (mean, standard deviation). Best results are bolded.

Method | VSMPL Reference Point Cloud | Internal Body Point Cloud
Unweighted labeling probability | (127.0, 61.5) | (7.8, 1.5)
Weighted labeling probability   | (88.7, 56.9)  | (6.4, 1.4)
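One plausible reading of the "weighted labeling probability" variant in Table 2 is that per-part predictions are blended by the softmax part-label probabilities instead of being selected by a hard argmax. The sketch below contrasts the two options under that assumption; the tensor names and shapes are illustrative and not the paper's implementation.

```python
import torch

def fuse_part_predictions(part_logits, part_offsets, weighted=True):
    """Combine per-part offset predictions into one offset per point.

    part_logits:  (N, 14) unnormalized part-label scores per point
    part_offsets: (N, 14, 3) offset predicted for each of the 14 parts
    weighted=True  -> soft blend by label probabilities (weighted variant)
    weighted=False -> take the offset of the most likely part (unweighted)
    """
    probs = torch.softmax(part_logits, dim=-1)                   # (N, 14)
    if weighted:
        return (probs.unsqueeze(-1) * part_offsets).sum(dim=1)   # (N, 3)
    idx = probs.argmax(dim=-1)                                    # (N,)
    return part_offsets[torch.arange(part_offsets.shape[0]), idx]
```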
Table 3. Chamfer distance errors in predicting internal body shapes using each method. Best results are bolded.

Method     | Mean  | Standard Deviation
IP-Net [8] | 238.3 | 278.0
PTF [9]    | 306.2 | 288.6
Ours       | 180.9 | 44.5
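For context, the Chamfer distance reported in Table 3 is a symmetric nearest-neighbour distance between the predicted and ground-truth point sets. A generic formulation is sketched below; the exact scaling and units of the reported numbers are not restated here, so this is only an illustration of the metric, not the evaluation script.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point sets pred (N, 3) and gt (M, 3)."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)   # nearest GT point for each prediction
    d_gt_to_pred, _ = cKDTree(pred).query(gt)   # nearest prediction for each GT point
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()
```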
Table 4. Number of parameters and forward inference time for each method. Best results are bolded.

Method     | Parameters | Time
IP-Net [8] | 35.0 M     | 54.2 s
PTF [9]    | 34.1 M     | 30.4 s
Ours       | 7.8 M      | 0.7 s
Table 5. Model registration and running time comparison. Best results are in bold.

Method           | SMPL  | Time    | SMPL+D | Time
IP-Net [8]       | 25.0  | 256.5 s | 24.8   | 60.4 s
PTF [9]          | 21.1  | 261.2 s | 21.4   | 60.1 s
LoopReg [15]     | --    | --      | 16.0   | 24.2 s
PostureHMR [34]  | 113.7 | 1.9 s   | --     | --
ReFit [18]       | 96.4  | 45.1 s  | --     | --
Ours             | 16.9  | 52.0 s  | 14.1   | 23.6 s