
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Non-contact human body measurement plays an important role in surveillance, physical healthcare, on-line business and virtual fitting. Current methods for measuring the human body without physical contact usually cannot handle humans wearing clothes, which limits their applicability in public environments. In this paper, we propose an effective solution that can measure accurate parameters of a human body undergoing large-scale motion from a Kinect sensor, even when the person is wearing clothes. Because motion drives the clothes to attach to the human body loosely or tightly at different moments, we adopt a space-time analysis to mine information across the posture variations. Using this information, we recover the human body regardless of the effect of clothes and measure the human body parameters accurately. Experimental results show that our system performs more accurate parameter estimation on the human body than state-of-the-art methods.

Non-contact human body measurement plays an important role in surveillance, physical healthcare, on-line business and virtual fitting. Usually, we must acquire human body models before biometric measurements. Laser range scanners can provide human body reconstruction, which can be used for accurate biometric measurements. However, laser range scanners cost from $40,000 to $500,000 and require people to wear tight clothing or almost no clothes. Therefore, laser range scanners cannot be an everyday choice for human body measurement in the short term. Recently, marker-less multi-view systems [

When measuring the human body with large-scale motion, the first priority is recovering accurate pose parameters. Recently, the technology of motion capture [

In addition to the large-scale motion, the effect of clothes is another challenge for accurate body measurement. The KinectAvatar [

In this paper, we present a novel approach to measuring the human body with large-scale motion from a single Kinect sensor, regardless of whether people wear clothes or not. In our approach, we combine pose detection with pose tracking as a multi-layer filter to estimate accurate pose parameters from the monocular Kinect sensor. Then, we estimate a consistent model of people who are engaged in large-scale motion. Afterward, we mitigate the effect of clothes through space-time analysis, and we measure the body parameters accurately from the human model.

In summary, our contributions are: (1) A multi-layer framework for accurate human motion capture in the monocular, noisy and low-resolution condition. The combination of pose detection, pose tracking and failure detection achieves a fully automatic process of human motion estimation. (2) The application of a space-time analysis to mitigate the effect of clothes, which makes it possible to apply our system non-intrusively in public environments.

In this paper, we present a system for measuring humans wearing clothes with large-scale motion. At first, a video sequence of people acting in diverse poses is captured by a monocular Kinect sensor (

After the motion capture step, we estimate the models according to the depth maps for different poses. Because we have recovered accurate pose parameters for every frame, we can transform all of the models of the different poses into a standard pose, and then, a spatial-temporal average model can be reconstructed (Section 4.2,

In our overall framework, we solve human pose parameters and shape parameters separately. To do so, we adopt the SCAPE model, which is a parametric method of modeling human bodies that factors the complex non-rigid deformations induced by both pose and shape variation and is learned from a database of several hundred laser scans. The database that we have used to train the SCAPE model is from [

In the following parts of this section, we represent the pose parameters estimated for the human body as a vector $\theta \in \mathbb{R}^{36}$, where the first six degrees of freedom represent the absolute root position and orientation, and the remaining degrees of freedom represent the relative joint rotations. These joints are the neck (3 DoF), upper back (1 DoF), waist (2 DoF), and the left and right shoulder (3 DoF), elbow (2 DoF), wrist (1 DoF), hip (3 DoF), knee (2 DoF) and ankle (1 DoF). Additionally, we denote
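As a sanity check, the degrees of freedom enumerated above can be tallied programmatically. The following sketch is illustrative (only the DoF counts come from the text; the dictionary layout is not from the paper) and confirms the 36-dimensional parameterization:

```python
# Degrees of freedom per joint, as enumerated in the text.
# Paired joints (shoulder, elbow, wrist, hip, knee, ankle) occur on both sides.
GLOBAL_DOF = 6  # root position (3) + root orientation (3)

SINGLE_JOINT_DOF = {
    "neck": 3,
    "upper_back": 1,
    "waist": 2,
}
PAIRED_JOINT_DOF = {
    "shoulder": 3,
    "elbow": 2,
    "wrist": 1,
    "hip": 3,
    "knee": 2,
    "ankle": 1,
}

def total_pose_dof():
    # Paired joints contribute twice: once for the left, once for the right side.
    return (GLOBAL_DOF
            + sum(SINGLE_JOINT_DOF.values())
            + 2 * sum(PAIRED_JOINT_DOF.values()))
```
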

The core of our pose-tracking module follows the model registration algorithm, in which the pose parameter, $\theta \in \mathbb{R}^{36}$, can be solved as a MAP problem:

$$\theta^{*} = \arg\max_{\theta} P(\theta \mid D_{i}) \propto P(D_{i} \mid \theta)\, P(\theta),$$

where $D_{i}$ is the observed depth map at frame $i$.

We formulate the likelihood term and the prior term similar to Wei

Depth maps can hardly reveal roll motion, so we attempt to find cues from the RGB images. Let $C_{i}$ and $D_{i}$ be the observed RGB image and depth map for the current frame, and let $C_{i-1}$ and $D_{i-1}$ be the observed RGB image and depth map for the last frame. First, we find and match the key points on $C_{i}$ and $C_{i-1}$ using the ORB algorithm [. The matched key points define the RGB image term, $E_{rgb}$.
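Once key points are matched between consecutive RGB frames, the in-plane (roll) component of the motion can be estimated from the 2D correspondences alone. The sketch below is illustrative rather than the paper's exact formulation: the `estimate_roll` helper fits a least-squares 2D rotation between the centered point sets.

```python
import math

def estimate_roll(pts_prev, pts_cur):
    """Least-squares 2D rotation angle (radians) aligning pts_prev to pts_cur.

    Both inputs are lists of (x, y) tuples of matched key points.
    """
    n = len(pts_prev)
    cx_p = sum(x for x, _ in pts_prev) / n
    cy_p = sum(y for _, y in pts_prev) / n
    cx_c = sum(x for x, _ in pts_cur) / n
    cy_c = sum(y for _, y in pts_cur) / n
    # Accumulate the 2D cross- and dot-products of the centered correspondences;
    # atan2 of their sums gives the optimal rotation angle.
    s_cross = s_dot = 0.0
    for (xp, yp), (xc, yc) in zip(pts_prev, pts_cur):
        ax, ay = xp - cx_p, yp - cy_p
        bx, by = xc - cx_c, yc - cy_c
        s_cross += ax * by - ay * bx
        s_dot += ax * bx + ay * by
    return math.atan2(s_cross, s_dot)
```
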

Because the RGB image term has the same form as the extra term in the MAP formula, we can optimize this term in the same way that we optimize the extra term in

Because the details of the SCAPE model, especially the silhouette, might not match the observed depth perfectly, we cannot evaluate the silhouette term in the same way as the depth image term. Instead, we must find a robust way to build correspondences, explicitly or implicitly, between the rendered silhouette image, $S_{render}$, and the observed one. In our practice, we adopt the Coherent Point Drift (CPD) algorithm [
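CPD builds the correspondence implicitly through soft Gaussian assignments. The following sketch shows only the assignment (E) step for 2D silhouette points; a full CPD implementation would also update the transformation and the variance in an M-step, and the outlier constant is simplified here:

```python
import math

def soft_correspondences(model_pts, observed_pts, sigma, w_outlier=0.1):
    """Posterior P(model point m | observed point n) under a Gaussian mixture,
    as in the assignment step of Coherent Point Drift (2D (x, y) tuples).

    Note: the uniform-outlier constant below is a simplification of CPD's
    exact normalization.
    """
    M, N = len(model_pts), len(observed_pts)
    c = w_outlier / (1.0 - w_outlier) * M / N  # simplified outlier constant
    P = []
    for xn, yn in observed_pts:
        col = [math.exp(-((xn - xm) ** 2 + (yn - ym) ** 2) / (2 * sigma ** 2))
               for xm, ym in model_pts]
        denom = sum(col) + c
        P.append([v / denom for v in col])
    return P  # P[n][m] = responsibility of model point m for observed point n
```
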

After adding the RGB image term and the silhouette term, we can describe the pose-tracking problem as:

In the above equations, $E_{depth}$, $E_{extra}$, $E_{silhouette}$ and $E_{rgb}$ denote the depth image term, the extra term, the silhouette term and the RGB image term, respectively; $S_{render}$ denotes the rendered silhouette image, and the $\lambda$ coefficients ($\lambda_{depth}$, $\lambda_{extra}$, $\lambda_{silhouette}$, $\lambda_{rgb}$ and $\lambda_{s}$) weight the corresponding terms.
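Schematically, the tracking objective is a weighted sum of the individual energy terms. The sketch below uses stand-in callables and placeholder weights; only the weighted-sum structure reflects the formulation above:

```python
def total_energy(theta, terms, weights):
    """Weighted sum of energy terms: E(theta) = sum_i lambda_i * E_i(theta).

    `terms` is a list of callables, each mapping a pose vector to a scalar;
    `weights` holds the corresponding lambda values (placeholders here).
    """
    return sum(w * t(theta) for t, w in zip(terms, weights))

# Illustrative stand-ins for the depth, extra, silhouette and RGB terms.
e_depth = lambda th: sum(x * x for x in th)
e_extra = lambda th: abs(th[0])
e_silhouette = lambda th: sum(abs(x) for x in th)
e_rgb = lambda th: th[-1] ** 2

energy = total_energy([0.5, -0.25],
                      [e_depth, e_extra, e_silhouette, e_rgb],
                      [1.0, 0.1, 0.5, 0.2])
```
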

Another problem in the optimization process is calculating the derivative of the objective function with respect to the pose parameter, $\theta = (t_{x}, t_{y}, t_{z}, \theta_{0}, \theta_{1}, \ldots, \theta_{n})$.

In the above equation, $t_{g} = (t_{x}, t_{y}, t_{z})$ denotes the global translation, $\theta_{m}$ denotes the rotation parameters of the $m$-th joint, and $\theta_{0}$ denotes the root orientation.

Because the pose parameter,

Then, the rotation matrix mapping in

Consider a vertex, $v_{i}$, whose position is controlled by the rotation matrix, $R_{m}$, of the $m$-th joint.

Then, the derivative can be represented as:

In the above equation,
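In practice, an analytic derivative like the one above is commonly verified against finite differences. Below is a central-difference gradient sketch over a toy pose objective; the quadratic `toy_energy` is illustrative and not from the paper:

```python
def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference gradient of a scalar objective f at pose vector theta."""
    grad = []
    for i in range(len(theta)):
        hi = list(theta); hi[i] += eps
        lo = list(theta); lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

# Toy energy with the known analytic gradient 2 * theta, used for checking.
toy_energy = lambda th: sum(x * x for x in th)
g = numerical_gradient(toy_energy, [1.0, -2.0, 0.5])
```
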

After the pose-tracking module, we build a failure-detection module to automatically detect failed pose-tracking results. The failure-detection module judges a pose-tracking instance to have failed by using the proportion of the unexplained area to the correctly matched area on the depth map. We project the model rendered from the tracked pose parameters onto the observed depth map, and we define the correctly matched area as the overlapping area where the difference between the rendered pixel and the observed pixel is no more than 6. The unexplained area consists of the following pixels:

Pixels that belong to the observed depth map, but do not belong to the rendered depth map;

Pixels that belong to the rendered depth map, but do not belong to the observed depth map;

Overlapping pixels where the difference between the observed map and the rendered map is more than 6.

When the proportion is more than 15%, we consider the pose-tracking result to be a failure.
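The failure-detection rule above can be sketched directly on a pair of depth maps. In the sketch, 0 encodes "no depth" and the thresholds mirror the text; the map encoding itself is an assumption:

```python
def is_tracking_failure(observed, rendered, depth_thresh=6, max_ratio=0.15):
    """Flag a failed pose tracking from the unexplained-to-matched area ratio.

    `observed` and `rendered` are 2D lists of depth values; 0 means no depth.
    """
    matched = unexplained = 0
    for row_o, row_r in zip(observed, rendered):
        for d_o, d_r in zip(row_o, row_r):
            if d_o == 0 and d_r == 0:
                continue  # background in both maps
            if d_o == 0 or d_r == 0:
                unexplained += 1  # pixel present in only one of the two maps
            elif abs(d_o - d_r) > depth_thresh:
                unexplained += 1  # overlapping, but inconsistent depth
            else:
                matched += 1  # correctly matched pixel
    return matched == 0 or unexplained / matched > max_ratio
```
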

In this section, we first use the first five frames to initialize a rough shape parameter for the SCAPE model. Although the initial shape parameter may not be very accurate, it serves as a baseline for the subsequent steps. Afterward, for each frame, a SCAPE model is optimized using the depth map at that time. Then, we transform all of these models into the template pose (the T-pose in

Before accurately reconstructing the human model, we must estimate a rough shape parameter as the baseline. In addition, the shape parameter is used to initialize a SCAPE model for tracking the human motion. In our system, we use the first five frames of the sequence to solve the shape parameter. The process of generating a SCAPE model can be described as a problem of minimizing the objective function as in [. In the standard SCAPE formulation, this objective is

$$\min_{\{y_{j,k}\}} \sum_{k} \sum_{j=2,3} \left\| R_{k} S_{k} Q_{k} \hat{v}_{j,k} - \left( y_{j,k} - y_{1,k} \right) \right\|^{2},$$

where, for each triangle $k$, $R_{k}$, $S_{k}$ and $Q_{k}$ are the pose rotation, shape deformation and pose-induced deformation matrices, $\hat{v}_{j,k}$ are the edge vectors of the template mesh and $y_{j,k}$ are the reconstructed vertex positions.

When we estimate the shape parameter, the pose parameter can be seen as a constant vector. Therefore, the process of generating the SCAPE model can be reformulated as:

In the above equation,

Because the Kinect sensor captures a partial view of an object, every frame provides only 2.5D information. To recover the complete 3D information, we synthesize the 2.5D information from different views along the time axis. At every frame, we attach the 2.5D information to the SCAPE model using uniform Laplacian coordinates [

In the above equation,

When we obtain the optimized models at every frame, we transform them into the template pose using the inverse LBS (linear blend skinning) model. Then, we weight every vertex in these models to obtain a spatial-temporal average model. To choose the weight function, the behavior of the Kinect z-direction error must be understood. In real applications, we find that the z-direction measurement error of a Kinect sensor increases as the regularized dot product of the vertex normal and the sensor projection normal decreases (

In the above equation,

In the above equation,
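The vertex weighting described above can be sketched per vertex: each frame contributes in proportion to how frontally the sensor saw the vertex. The clamped dot-product weight below is a plausible choice consistent with the text, not necessarily the paper's exact weight function:

```python
def weighted_average_vertex(positions, normals, view_dir=(0.0, 0.0, -1.0)):
    """Average one vertex's per-frame 3D positions, weighting each frame by the
    clamped dot product of the unit vertex normal with the unit view direction.

    `positions` and `normals` are lists of (x, y, z) tuples, one per frame.
    """
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    acc = [0.0, 0.0, 0.0]
    total_w = 0.0
    for p, n in zip(positions, normals):
        w = max(dot(n, view_dir), 0.0)  # frontal views get higher weight
        for k in range(3):
            acc[k] += w * p[k]
        total_w += w
    if total_w == 0.0:
        raise ValueError("vertex never faces the sensor")
    return tuple(a / total_w for a in acc)
```
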

Imagine the following situations: when a person extends his body, his clothes tend to be far away from his body; in contrast, when a person huddles up, his clothes tend to be close to his body. The spatial-temporal average model that we obtain in Section 4.2 from multiple poses is equivalent to the intermediate state between the above two situations. To mitigate the effect of clothes on the spatial-temporal average model, we conduct a space-time analysis for every point across the frames in which the point is in front of the Kinect sensor. For a specific point on the spatial-temporal average model (the blue rectangle in

In the above equation, $p_{aver}$ is the same point on the spatial-temporal average model, $p_{origin}$ is the same point on the model after shape parameter initialization, $\langle \cdot , \cdot \rangle$ represents the dot product of two vectors, and $p_{aver}$
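One way to read this space-time analysis, sketched per point with illustrative names: project each frame's offset from the baseline point onto the deformation direction implied by the average model, and keep the frame in which the clothes lie closest to the body (the smallest projection). The exact selection rule in the paper may differ:

```python
import math

def mitigate_point(p_origin, p_aver, frame_points):
    """Pick the point closest to the body along the deformation direction.

    p_origin: the point on the model after shape initialization (baseline).
    p_aver:   the same point on the spatial-temporal average model.
    frame_points: the same point on the per-frame optimized T-pose models.
    """
    d = [pa - po for pa, po in zip(p_aver, p_origin)]
    norm = math.sqrt(sum(c * c for c in d))
    if norm == 0.0:
        return p_origin  # no deformation direction to project onto
    d = [c / norm for c in d]  # unit deformation direction
    # Signed projection of each frame's offset onto the deformation direction.
    proj = [sum(dc * (pf - po) for dc, pf, po in zip(d, fp, p_origin))
            for fp in frame_points]
    t = min(proj)  # smallest offset: the clothes lie closest to the body
    return tuple(po + t * dc for po, dc in zip(p_origin, d))
```
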

For measuring the arm length, neck-to-hip length and leg length, we specify in advance the indices of the start and end points according to the bone segments, and the system automatically measures these parameters from an obtained model. For measuring the chest girth, waist girth and hip girth, we specify in advance the indices of a circle of points around the corresponding location according to the standard definition; at runtime, our system constructs a convex hull from these points and automatically measures the parameter as the girth of the convex hull (
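The girth computation can be sketched with a standard 2D convex hull (Andrew's monotone chain) over the circle of points, assuming they have already been projected onto the measurement plane:

```python
def convex_hull(points):
    """Andrew's monotone chain; points are (x, y) tuples; returns the hull CCW."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def girth(points):
    """Perimeter of the convex hull of the measured circle of points."""
    hull = convex_hull(points)
    return sum(((hull[i][0] - hull[i - 1][0]) ** 2 +
                (hull[i][1] - hull[i - 1][1]) ** 2) ** 0.5
               for i in range(len(hull)))
```

Taking the convex hull before measuring the perimeter makes the girth robust to concavities caused by wrinkles in the clothes.
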

In our experiments, we tested a total of 55 video sequences. A total of 25 men and 10 women were tested (some people were tested more than once). The people measured in our experiments are 20 to 45 years old, their weights range from 40 kg to 88 kg and their heights range from 1.55 m to 1.90 m.

In the remaining part of this section, we will compare our method with the state-of-the-art methods [

As a result, our proposed method can accurately measure the body parameters of dressed humans with large-scale motions. In other words, our proposed method can be easily applied in public situations, such as shopping malls, police stations and hospitals. Additionally, because the total price of our system is approximately $150, it can be widely used in home situations.

First, we compare our method to [

As can be seen in

Cui

In

Before evaluating our method's effectiveness, a parameter for measuring the tightness of the clothes should be defined. Of course, for different parts of the human body, the tightness of the same clothes could be different. Thus, we should evaluate the tightness of different parts of the clothes separately. Here, we use the variance of a human body parameter in different poses to evaluate the tightness of the clothes at the location related to the parameter:

In the above equation, $m_{i}$ is the measurement of the body parameter in the $i$-th pose, and $\bar{m}$ is the mean of the measurements over all of the poses.
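The tightness measure, i.e., the variance of one body parameter across poses, is straightforward to compute; whether the paper uses the population or the sample variance is not stated, so the population form is used here:

```python
def tightness(measurements):
    """Population variance of one body parameter measured in different poses.

    A looser garment lets the parameter vary more across poses, so a larger
    variance indicates looser clothes at the corresponding body location.
    """
    n = len(measurements)
    mean = sum(measurements) / n
    return sum((m - mean) ** 2 for m in measurements) / n
```
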

We measure the relative errors of the chest girth and waist girth from the spatial-temporal average model and the model after the mitigation of the clothes effect, in different situations of tightness, as illustrated in

In this paper, we present a novel approach that can accurately measure the human body in clothes with large-scale motion. The key contribution of our paper is to mine the cues from different poses in the temporal domain and the information from the spatial depth map to mitigate the effect of the clothes. Additionally, our reconstruction of the average model provides a robust estimate of the deformation direction from the original model to the model that is closest to the real human body. Another contribution of our paper is extending a motion capture framework from the cylinder-like model to the SCAPE model by using cues from the RGB images and the silhouette. Quantitative evaluations show that our solution for measuring the parameters of the human body is more accurate than existing methods. Additionally, a comparison to the average model with large-scale motion shows that our method of mitigating the clothes effect is effective.

In the future, we will attempt to follow the core idea of Non-Rigid Structure from Motion and find a solution for measuring people with large-scale motion by using only an RGB camera.

This work was partially supported by Grants No. 61100111, 61201425, 61271231 and 61300157 from the Natural Science Foundation of China, Grant No. BK2011563 from the Natural Science Foundation of Jiangsu Province, China, and Grant No. BE2011169 from the Scientific and Technical Supporting Programs of Jiangsu Province, China.

The authors declare no conflict of interest.

Pipeline of our approach: (

(

Impact of our RGB term: (

Impact of our silhouette term: (

Necessity of our constraint of consistency: (

The relationship between z-direction relative measurement error and the regularized dot product of the vertex normal and sensor projection normal.

Time-space analysis to mitigate the effect of clothes: the red star represents a point on the model after shape parameter initialization (baseline). The blue rectangle represents the same point on the spatial-temporal average model. The pink, green and purple points are the same points on the optimized models of the T-pose across the frames. As described in

Results for the mitigation of the clothes: the green domains are the models projected on the RGB images. (

Automatic Measurement: (

Results of the motion capture module: Column 1 shows the RGB images. Column 2 shows our results in the front view. Column 3 shows different viewpoints of column 2.

Results of the model reconstruction: Row 1 shows the RGB images. Row 2 shows the results of average models. Row 3 shows the results after mitigating the clothes effect. Row 4 shows a different viewpoint of row 3.

The statistics for the error of the proposed method.

Comparison of the accuracy of measuring a human wearing clothes with KinectAvatar [

Comparison of the model after the mitigation of the clothes effect with the spatial-temporal average model.

Average computational time statistics for our system: the running time statistics were gathered from testing our implementation on a dual core 2.33 GHz Intel processor.

Procedure | Time Consumed |
---|---|
Pose Recovery | 6.32 s per frame |
Shape Parameter Recovery | 55.2 s |
Weighted-Average Model Recovery | 5.45 s per frame |
Mitigation of the Effect of Clothes | 7.94 s per frame |

Comparison of the accuracy of almost bare human measurement with home 3D body scans [

 | Arm Length | Chest Girth | Neck to Hip Distance | Hip Girth | Thigh Girth |
---|---|---|---|---|---|
Error of [ | 2.6 | 3.5 | 1.7 | | |
Error of ours (cm) | 1.6 | 2.6 | | | |

Comparison of the accuracy of measuring a human wearing clothes with home 3D body scans [

 | Arm Length | Chest Girth | Neck to Hip Distance | Hip Girth | Thigh Girth |
---|---|---|---|---|---|
Error of [ | 2.0 | 8.5 | 6.3 | 4.7 | |
Error of ours (cm) | 2.4 | | | | |