
This article is an Open Access article distributed under the terms and conditions of the Creative Commons Attribution license

This paper addresses the problem of accurate and robust tracking of 3D human body pose from depth image sequences. Recovering the large number of degrees of freedom in human body movements from a depth image sequence is challenging due to the need to resolve the depth ambiguity caused by self-occlusions and the difficulty of recovering from tracking failure. Human body poses can be estimated through model fitting using dense correspondences between depth data and an articulated human model (local optimization method). Although this usually achieves high accuracy due to the dense correspondences, it may fail to recover from tracking failure. Alternatively, human pose may be reconstructed by detecting and tracking human body anatomical landmarks (key-points) based on low-level depth image analysis. While this method (key-point based method) is robust and recovers from tracking failure, its pose estimation accuracy depends solely on the image-based localization accuracy of key-points. To address these limitations, we present a flexible Bayesian framework for integrating pose estimation results obtained by methods based on key-points and local optimization. Experimental results and a performance comparison are presented to demonstrate the effectiveness of the proposed approach.

Over the past decades, human body pose tracking from video inputs has been an active research field motivated by various applications including human-computer interaction, motion capture systems, and gesture recognition. The major challenges of recovering the large number of degrees of freedom in human body movements are the difficulty of resolving various ambiguities in the projection of human motion onto the image plane and the diversity of visual appearance caused by clothing and varying illumination.

Existing approaches for human pose tracking include methods based on single cameras, multiple cameras, and sensors beyond the visible spectrum. Time-of-flight (TOF) based imaging devices have attracted researchers' attention due to their potential to resolve depth ambiguity [

Most existing approaches to track human body pose from depth sequences [

Recovering from pose tracking failure is indeed an important component for a robust pose tracking algorithm. Considering example postures shown in

For many existing pose tracking methods, tracking long sequences results in tracking failures that cannot be easily recovered from. This paper presents a key-point based method to reconstruct poses from anatomical landmarks detected and tracked through depth image analysis. The key-point based method is robust and can recover from tracking failure when a body part is re-detected and tracked. However, its pose estimation accuracy depends solely on the image-based localization accuracy of key-points. To address these limitations, we present a Bayesian framework to integrate pose estimation results from methods using local optimization and key-point detection. The contribution of this work is the integration of pose estimation results from multiple methods. In particular, we use results obtained using key-points and local optimization and show that accuracy improves compared with either method alone.

The rest of the paper is organized as follows. Section 2 introduces the human model used in this paper, and the background on pose estimation with constrained inverse kinematics. Our Bayesian method for accurate and robust pose tracking is presented in Section 3. Methods using key-points and local optimization are described in Subsections 3.1 and 3.2, respectively. Experimental results are shown in Section 4. Section 5 concludes the paper.

The human body model is represented as a hierarchical joint-link model with a skin mesh attached to it, as in Lewis

Let q_0 be the initial model pose. Weighting terms w_1 and w_2 are defined for singularity avoidance and joint limit avoidance, respectively. This type of inverse-kinematics formulation is often used to derive a manipulator's orientation at each joint, given a desired position of the end-effector. See Zhu
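To make the inverse-kinematics step concrete, the sketch below shows a damped least-squares update with a soft pull toward joint mid-range as a simple stand-in for a joint-limit term. It is illustrative only, not the paper's exact formulation: the `fk`/`jacobian` functions model a made-up 2-link planar arm, not the 28-dof body model.

```python
import numpy as np

def dls_ik_step(q, target, fk, jacobian, damping=0.1,
                limit_gain=0.05, q_min=None, q_max=None):
    """One damped least-squares IK update with a soft joint-limit pull.

    The damping term keeps the update stable near singularities; the
    limit term nudges joints toward the middle of their allowed range
    (a simple stand-in for a proper joint-limit constraint).
    """
    e = target - fk(q)                       # end-effector position error
    J = jacobian(q)
    # dq = J^T (J J^T + lambda^2 I)^-1 e
    dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(e.size), e)
    if q_min is not None and q_max is not None:
        dq += limit_gain * (0.5 * (q_min + q_max) - q)
    return q + dq

# Illustrative 2-link planar arm with unit link lengths (not the paper's model).
def fk(q):
    return np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                     np.sin(q[0]) + np.sin(q[0] + q[1])])

def jacobian(q):
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-s1 - s12, -s12],
                     [ c1 + c12,  c12]])

q = np.array([0.3, 0.5])
target = np.array([1.2, 0.8])
for _ in range(200):
    q = dls_ik_step(q, target, fk, jacobian)
```

The damping factor trades convergence speed for stability near singular configurations, which is the role of the singularity-avoidance term above.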

Our model marker points (for key-point detection) include the set of model vertices as shown in

The main idea of tracking is illustrated in

Let q_t denote the human pose at time t, and let p(q_t | I_1, I_2, · · ·, I_t) denote the posterior probability distribution of the pose given the depth image observations I_1, I_2, · · ·, I_t.

Let us assume that we can approximate the observation distribution as a mixture of Gaussians:

p(I_t | q_t) ≈ Σ_i w_i N(q_t; μ_i, Σ_i),

where w_i, μ_i, and Σ_i denote the weight, mean, and covariance of the i-th mixture component.

Let the human dynamics be modeled as zero-velocity motion with additive Gaussian noise, i.e., p(q_t | q_{t−1}) = N(q_t; q_{t−1}, Σ_d).

Using the above Bayesian tracking equation, we can represent the posterior probability distribution recursively as:

p(q_t | I_1, · · ·, I_t) ∝ p(I_t | q_t) ∫ p(q_t | q_{t−1}) p(q_{t−1} | I_1, · · ·, I_{t−1}) dq_{t−1}.

Since we represent the posterior probability distribution as a sum of Gaussians, several methods are available for density approximation. One simple approach is to keep only the dominant modes of the posterior probability distribution. Researchers [
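Keeping the dominant modes of a Gaussian-mixture posterior can be sketched as follows; the function name is our own, and the mixture-reduction schemes in the literature are more elaborate than this simple truncation:

```python
import numpy as np

def prune_modes(weights, means, covs, max_modes=6):
    """Keep the `max_modes` highest-weight components of a Gaussian
    mixture and renormalize the remaining weights."""
    order = np.argsort(weights)[::-1][:max_modes]
    w = np.asarray(weights, dtype=float)[order]
    return w / w.sum(), [means[i] for i in order], [covs[i] for i in order]
```

Truncation keeps the representation compact across frames at the cost of discarding low-weight modes that might later become relevant.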

The detailed illustration of this Bayesian inference method to pose tracking is shown in

In order to have a robust pose tracker, one of the crucial processing steps is to localize each visible limb. We present a method to detect, label and track body parts using depth images as shown in

Once the head, neck, and trunk are detected, limbs (two arms and two legs) are to be detected as shown in

After the limbs are detected, we perform a labeling step to differentiate the left and right limbs and to determine the limb occlusion status. We use the following steps to label detected arms (the same steps apply to leg labeling), based on the arm occlusion status in the last frame. For image frames where both arms were visible in the previous frame, let H_LA and H_RA be the histograms of depth values for the left and right arms, respectively; each pixel x in the current frame is assigned to the left arm if H_LA(d(x)) > H_RA(d(x)) and to the right arm otherwise, where d(x) is the depth value at pixel x.

When only one arm is visible from the last frame, we compute the geometric distance from the detected arm pixels to the tracked arm, and decide the label based on the maximal arm movement distance between successive frames. When both arms are not visible from the last frame, we label the detected arm based on its spatial distribution relative to the torso center line, where the left arm is located to the left of torso center line.
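The histogram-based assignment for the both-arms-visible case can be sketched as below. The names `hist_left`/`hist_right` are our own; they stand for normalized depth histograms built from the previous frame's labeled arm pixels:

```python
import numpy as np

def label_arm_pixels(depths, hist_left, hist_right, bin_edges):
    """Assign candidate arm pixels to the left or right arm by comparing
    depth-histogram likelihoods from the previous frame's labeled arms.
    Returns a boolean array: True = left arm."""
    bins = np.clip(np.digitize(depths, bin_edges) - 1, 0, len(hist_left) - 1)
    return hist_left[bins] > hist_right[bins]
```

This works because the two arms usually occupy different depth ranges; when the histograms overlap heavily, the geometric fallback rules described above take over.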

Finally, when the observed number of pixels for a limb is less than a threshold, we declare the limb occluded. For each visible limb, we perform a local optimization to align the 2-D scaled prismatic model [
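As a rough illustration of correspondence-based local alignment, the following sketches a basic 2-D rigid ICP loop (brute-force nearest neighbors alternated with a closed-form Procrustes fit). This is a generic sketch under our own simplifications, not the paper's scaled-prismatic-model optimization:

```python
import numpy as np

def icp_rigid_2d(src, dst, iters=20):
    """Rigid 2-D ICP: alternate brute-force nearest-neighbor matching with
    a closed-form (Procrustes/Kabsch) rotation-translation fit.
    Returns R, t such that src @ R.T + t approximates dst."""
    R, t = np.eye(2), np.zeros(2)
    for _ in range(iters):
        moved = src @ R.T + t
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        corr = dst[d2.argmin(axis=1)]        # closest dst point for each src
        mu_m, mu_c = moved.mean(0), corr.mean(0)
        H = (moved - mu_m).T @ (corr - mu_c)
        U, _, Vt = np.linalg.svd(H)
        Rd = Vt.T @ U.T
        if np.linalg.det(Rd) < 0:            # guard against reflections
            Vt[-1] *= -1
            Rd = Vt.T @ U.T
        R, t = Rd @ R, Rd @ t + (mu_c - Rd @ mu_m)
    return R, t
```

Like all ICP variants, this converges only to a local optimum, which is why the overall framework still needs the key-point hypotheses for recovery.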

Key-points corresponding to the human anatomical landmarks as in

Referring to the tracking framework, the pose is initialized to q_0. Let q_{t−1} be the optimal pose estimate from the last frame, and let N_1 be the number of pose hypotheses generated by the key-point based method. We set N_1 = 3, with the hypotheses defined as follows:

H1: the pose estimated from q_{t−1} and all detected feature points.

H2:

H3: the pose estimated from q_{t−1} without using the extracted elbow feature points. This hypothesis guards against large errors in elbow detection and extraction.

Since the motion to be tracked in this study is general and has high uncertainty, a common approach is to model the human pose temporal dynamics as zero velocity with Gaussian noise.

Density sampling can be performed from this temporal prediction prior probability distribution, as it is a standard Gaussian mixture distribution.
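Sampling from a Gaussian-mixture prediction prior is standard ancestral sampling: pick a component according to its weight, then draw from that component's Gaussian. A minimal sketch:

```python
import numpy as np

def sample_mixture(weights, means, covs, n_samples, rng=None):
    """Ancestral sampling from a Gaussian mixture: choose a component by
    weight, then draw from that component's multivariate Gaussian."""
    rng = np.random.default_rng(0) if rng is None else rng
    comp = rng.choice(len(weights), size=n_samples, p=weights)
    return np.stack([rng.multivariate_normal(means[c], covs[c]) for c in comp])
```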

Let

To evaluate tracking quality, we use a tracking error measurement function that is based on the sum of the distances from sampled depth points to their corresponding closest model vertices. Without loss of generality, let us use
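The tracking-error measure described above, the sum of distances from sampled depth points to their closest model vertices, can be computed with a brute-force nearest-neighbor search (in practice a k-d tree would scale better):

```python
import numpy as np

def tracking_error(depth_points, model_vertices):
    """Sum of distances from each sampled depth point to its closest
    model vertex (brute-force nearest neighbor over all vertices)."""
    d = np.linalg.norm(depth_points[:, None, :] - model_vertices[None, :, :],
                       axis=-1)
    return d.min(axis=1).sum()
```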

Given the observation distribution p(I_t | q_t) and the temporal prediction prior p(q_t | I_1, I_2, · · ·, I_{t−1}), we compute the posterior p(q_t | I_1, I_2, · · ·, I_t) as their normalized product.

At any frame, the optimal pose estimate is exported as the mode of the posterior probability distribution: q̂_t = argmax_{q_t} p(q_t | I_1, I_2, · · ·, I_t).
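One simple way to extract the mode of a Gaussian-mixture posterior is to evaluate the mixture density at each component mean and return the best one. This is an approximation of our own choosing, since the true mode of a mixture need not coincide with a component mean:

```python
import numpy as np

def map_pose(weights, means, covs):
    """Approximate the mode of a Gaussian mixture by evaluating the
    mixture density at each component mean and returning the best."""
    def density(x):
        total = 0.0
        for w, m, S in zip(weights, means, covs):
            d = x - m
            k = m.size
            norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(S))
            total += w * np.exp(-0.5 * d @ np.linalg.solve(S, d)) / norm
        return total
    return means[int(np.argmax([density(m) for m in means]))]
```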

The Bayesian pose tracking algorithm is implemented and tested on a set of upper and whole body sequences captured from a single time-of-flight (TOF) range sensor [

The proposed Bayesian framework is able to track robustly and recover from tracking failure by integrating low-level key-point detection from depth image analysis;

The proposed Bayesian framework is able to achieve higher accuracy by taking advantage of ICP to refine the alignment between the 3D model and the point cloud;

Our current implementation works well for body twists up to 40 degrees of rotation on either side of a front-facing posture. Large twists and severe interaction between upper and lower body limbs remain a challenge in the current implementation. Example upper-body and whole-body tracking results are shown in the result figures. The total number of pose hypotheses is N_1 + N_2 = 3 + 3 = 6, where N_1 = 3 hypotheses come from the key-point based method as explained in Subsection 3.1, and N_2 = 3 hypotheses come from the ICP-based local optimization.

We summarize the performance of the proposed method and compare it with the ICP-based method and the key-point based method in the tables.

We have presented a Bayesian framework for human pose tracking from depth image sequences. Human pose tracking remains a challenging problem, primarily due to occlusion, fast movements, and ambiguity. Generating multiple pose hypotheses for a single image is at times necessary to arrive at a correct solution. The proposed method demonstrates the potential of integrating pose estimation results from different modalities to improve robustness and accuracy. We believe the parallel nature of the hypothesis evaluation permits a faster implementation with the latest parallel programming techniques.

Depth data (a) Example upper body postures; (b) Example whole body postures.

Human body model (a) Hierarchical joint link model with 28 dofs; (b) Elbow joint limit constraints for natural pose tracking.

Model marker points (a) from key-point detection; (b) from dense ICP correspondences (each yellow vector represents a correspondence pair).

Robust pose estimation with Bayesian tracking framework.

Body part detection, labeling and tracking.

HNT template localization (shown in red) and limb detection: (a) Open arm detection; (b) Looped arm detection; (c) Arm detection that is in front of torso; (d) Lower limb detection.

Upper body pose tracking for violin playing motion. Rows 1 and 3: depth image sequence with the detected body parts. Rows 2 and 4: corresponding reconstructed pose.

Upper body pose tracking for frisbee throwing motion. Rows 1 and 3: depth image sequence with the detected body parts. Rows 2 and 4: corresponding reconstructed pose.

Whole body pose tracking with self occlusions during leg crossing. Rows 1 and 3: depth image sequence with the detected body parts. Rows 2 and 4: corresponding reconstructed pose.

Whole body pose tracking during a dancing sequence. Rows 1 and 3: depth image sequence with the detected body parts. Rows 2 and 4: corresponding reconstructed pose.

Comparison between various human pose tracking approaches.

Methods | Tracking through occlusion | Error-recovery | Tracking with missing key-points | Integration with other information | Speed
---|---|---|---|---|---
ICP based method | No | No | Yes | No | 5∼9 Hz
Key-point based method | Yes | Yes | No | No | 3∼6 Hz
Bayesian-based method | Yes | Yes | Yes | Yes | 0.1 Hz

A comparison of overall trajectory accuracy between key-point based method and Bayesian-based method.

Methods | X trajectory accuracy | Y trajectory accuracy | Z trajectory accuracy
---|---|---|---
Key-point based method | 80 mm | 84 mm | 93 mm
Bayesian-based method | 73 mm | 78 mm | 87 mm