Dynamic Pose Estimation Using Multiple RGB-D Cameras

Human poses are difficult to estimate due to the complicated body structure and the self-occlusion problem. In this paper, we introduce a marker-less system for human pose estimation by detecting and tracking key body parts, namely the head, hands, and feet. Given color and depth images captured by multiple red, green, blue, and depth (RGB-D) cameras, our system constructs a graph model with segmented regions from each camera and detects the key body parts as a set of extreme points based on accumulative geodesic distances in the graph. During the search process, local detection using a supervised learning model is utilized to match local body features. A final set of extreme points is selected with a voting scheme and tracked with physical constraints from the unified data received from the multiple cameras. During the tracking process, a Kalman filter-based method is introduced to reduce positional noises and to recover from a failure of tracking extremes. Our system shows an average of 87% accuracy against the commercial system, which outperforms the previous multi-Kinects system, and can be applied to recognize a human action or to synthesize a motion sequence from a few key poses using a small set of extremes as input data.


Introduction
The detection of human body parts has been widely studied in the computer vision and pattern recognition fields. Accurate detection of body parts is important in human pose estimation for activity recognition, which is utilized by various smart systems including human-computer interaction (HCI), surveillance, healthcare, and entertainment. Recently, it has converged with virtual reality (VR) and augmented reality (AR) techniques in the training field [1].
Early approaches using a single camera tried to detect the region of interest by extracting features from illumination, color, and edge information on 2D images. In these approaches, machine learning algorithms such as adaptive boosting (AdaBoost), support vector machine (SVM), and Gaussian mixture model (GMM) are used to extract key body features such as the face, torso, hands, and feet from a large data set. However, reliable detection of such features is difficult to achieve due to background noise and illumination changes in the images. The recent availability of red, green, blue, and depth (RGB-D) cameras, such as Microsoft Kinect [2] and Intel RealSense [3], provides depth data and suggests a more reliable way to detect the features. Using depth information retrieved from an infrared sensor, the region of interest on the human body can be segmented more precisely without background ambiguities.
The joints of the human body can provide useful information for motion analysis. Using a single RGB-D camera, the approach introduced by Shotton et al. [4] has been widely used to detect human
The rest of this paper is organized as follows. Previous approaches for human pose estimation are reviewed in Section 2. The detection of key body parts from each camera is described in Section 3, while tracking them from the unified data received from multiple cameras is detailed in Section 4. The experimental results for tracking accuracy, action recognition, and motion synthesis are demonstrated in Section 5. We conclude this paper with a discussion of potential improvements in Section 6.

Related Work
Detecting and tracking human body parts from sensor data has been actively studied in the computer vision and recognition fields. Using single or multiple RGB-D cameras, the majority of the detection approaches fall into three categories: generative (also known as top-down), discriminative (also known as bottom-up), and hybrid.
The generative approaches [12][13][14] rely on a template model of the human body and try to estimate the model parameters that best describe the pose in an input image. Using the iterative closest point (ICP) algorithm, Grest et al. [12] defined a nonlinear optimization function and estimated a human pose by applying the analytically simplified Jacobian. Based on the probabilistic inferencing algorithm, Zhu et al. [13] performed feature detection on depth images and estimated relatively simple poses from the detected features. Ganapathi et al. [14] performed real-time detection from a sequence of depth images based on the probabilistic temporal model. In their approach, a set of physical and free space constraints are derived to deform the template model. Shuai et al. [15] used multiple depth cameras to minimize occlusion and designed an ellipsoid-based skeleton model to capture the geometry detail of a tracked object. In these ICP-based approaches, the external template model and its parameters need to be configured in advance to initiate the tracking process, which is computationally expensive for a complicated model such as a human body. On the other hand, our system uses no template model and its parameters to track human poses.
The discriminative approaches [4][16][17][18][19] try to detect the body parts directly from the observed data without using an initialization process with a template model. Shotton et al. [4] estimated a list of 3D joint positions on a single depth image by performing per-pixel classification, which uses a randomized decision tree trained with a large image set. Their approach was further exploited by Girshick et al. [16], who anticipated the occluded joint positions using a regression forest with relative 3D offset information. Shen et al. [17] introduced an example-based approach which corrects occluded body parts, such as those in a side view. In their approach, a regression forest is learned based on the differences between the motion capture and Kinect data. Recently, Jung et al. [18] improved the performance of joint estimation using a random tree walk, while Shafaei and Little [19] improved the joint estimation accuracy by applying a convolutional neural network (CNN) based pixel classification. Using the discriminative approaches, the joint positions can be estimated in real time given a large set of high-quality data for the training process. For example, Shotton et al. [4] trained each tree with 300,000 images while Shafaei and Little [19] collected a data set of six million samples for their classification. In contrast, our approach searches for a set of key body parts from the hierarchical graph structure using a much smaller set of samples (i.e., fewer than 1000).
Given a database of human motion, the hybrid approaches [20][21][22][23] try to improve the tracking accuracy by combining the generative and discriminative methods (i.e., solving the optimization problems with the database reference). Ganapathi et al. [20] demonstrated an interactive system that estimates body parts in a kinematic chain structure using the hill-climbing method. In their method, a local detector for body parts (i.e., a discriminative model) is used to reinitialize the tracker after a tracking failure. Ye et al. [21] stored a set of 3D skeletons and their mesh data in a database and obtained the optimal pose by matching the point cloud data through a shape refinement process. Baak et al. [22] showed a method of comparing the joint positions at previous frames and the salient body parts extracted from the depth information to search for similar poses. Later, Helten et al. [23] presented a similar approach with a personalized body tracker that improves the lookup accuracy from the regenerated database. Like the discriminative approaches, most hybrid approaches need an extensive set of samples to be prepared in advance to estimate accurate human poses. For example, Ye et al. [21] captured 19,000 samples from a motion capture system, and Baak et al. [22] selected about 25,000 samples. Furthermore, these approaches are sensitive to the physical properties of a user, such as body size, and require an additional fitting process to track poses from unknown users, making the approaches less applicable to general users. Using multiple cameras, our approach does not require prior body information to track poses from unknown users; hence, it is more applicable to real-time human action recognition and motion synthesis for unspecified individuals.
Recently, multiple depth cameras [5][6][7][8][9][10][11] have been adopted to overcome the joint occlusion problems by using the unified data captured from different viewpoints. Zhang et al. [5] used multiple Kinects for non-skeletal motion data while Kaenchan et al. [6] applied them to a walking analysis. Kitsikidis et al. [7] adopted a hidden conditional random fields (HCRF) classifier to detect patterns in dance motion. With two synchronized cameras, Michel et al. [8] solved an optimization problem using stochastic optimization techniques to track a human body from the depth volume. Moon et al. [9] adopted a Kalman filter framework to combine the multiple depth data and mitigate the occlusion problem. Recently, Kim et al. [10,11] demonstrated a large-scale multi-Kinects system to capture dynamic motion in dance and martial arts. Most of the time, these approaches rely on the Kinect method [4] to configure the skeleton structure in an articulated model, which often requires an expensive and complicated post-process to enhance the naturalness of an output pose. On the other hand, our system detects a small set of key body parts and uses them as inputs to refer to the existing motion data for estimating a dynamic human pose.

Single Camera Process
Our system acquires a continuous sequence of RGB-D images from multiple cameras. For each camera, major body parts are detected through three steps: background subtraction, quadtree-based graph construction, and body part detection using accumulative geodesic distances and a local detector. The details of the steps are described in the subsequent sections.

Background Subtraction
In our system, the RGB-D cameras provide a continuous sequence of color and depth images with the same resolution, and both images are calibrated. Given a sequence of color and depth images streamed from a single RGB-D camera, the background information is subtracted from the images to isolate a human object based on the depth information, as it is robust to illumination changes. We captured the first frame of the depth sequence, where no human is visible, and subtracted it from subsequent frames. For the depth images with a human object, a threshold value (i.e., the minimum depth value for each pixel) is used to distinguish between the background and foreground objects. This simple method is sensitive to background noise and can generate false positives, especially at the edges of the human object [24]. As a post-processing step, a morphological erode operation and Sobel kernels are applied to reduce such false positive areas. We compute approximations of the vertical and horizontal derivatives using Sobel kernels and remove noise from the depth images based on the gradient magnitudes. From a filtered depth image, $I_D$, a corresponding color image, $I_C$, can be obtained. We refer to the filtered image pair as $I = \{I_D, I_C\}$. This simple technique works well for a low-resolution depth image. However, a more sophisticated method for background subtraction could be used with hardware acceleration [25].
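The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `subtract_background`, the parameters `min_diff` and `max_grad`, their default values, and the use of a depth-difference threshold against the empty first frame are all assumptions made for illustration.

```python
import numpy as np

def subtract_background(depth, background, min_diff=50.0, max_grad=200.0):
    """Return a boolean foreground mask for one depth frame (values in mm).

    min_diff and max_grad are hypothetical thresholds, not values from the paper.
    """
    depth = depth.astype(np.float64)
    background = background.astype(np.float64)
    # Foreground: valid pixels closer to the camera than the empty-scene frame.
    mask = (depth > 0) & ((background - depth) > min_diff)

    # 3x3 Sobel derivatives computed from shifted views of an edge-padded image.
    p = np.pad(depth, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    # Drop pixels that sit on steep depth edges (large gradient magnitude).
    mask &= np.hypot(gx, gy) < max_grad

    # 3x3 binary erosion: keep a pixel only if its whole neighbourhood is foreground.
    m = np.pad(mask, 1, mode="constant", constant_values=False)
    eroded = np.ones_like(mask)
    for dy in range(3):
        for dx in range(3):
            eroded &= m[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return eroded
```

The returned mask can then be applied to both the depth and the color image to obtain the filtered pair.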

Graph Construction
Inspired by the work identifying geodesic extreme positions using a graph structure [26,27], our detection method assumes an invariant body structure such that the accumulative geodesic distances between the center of the body and the key body parts, namely the head (H), right hand (RH), left hand (LH), right foot (RF), and left foot (LF), do not change regardless of the human pose, as shown in Figure 2. For example, let $P_C$ be the center position of the human object averaged from $I_D$; the extreme points, $P_i$, on the key body parts are then located farthest from $P_C$, where $P_C, P_i \in \mathbb{R}^3$ and $i \in \{H, RH, LH, RF, LF\}$. Based on this geodesic characteristic, both $I_D$ and $I_C$ can be represented as a graph model, $G$. As shown in Algorithm 1, a quadtree-based segmentation is used to group neighboring data efficiently, and each node has a representative value with the center position.
Algorithm 1 (surviving fragment): if ($t_D \geq t_{max}$) then ... return node($I$)
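Because only a fragment of Algorithm 1 is available here, the following is a hedged sketch of one way a quadtree-based segmentation with a depth limit could look. The split criterion (depth and color ranges compared against `delta_d` and `delta_c`), the single-channel color image, and the leaf tuple layout are assumptions; only the names $t_D$, $t_{max}$, $\delta_D$, and $\delta_C$ come from the text.

```python
import numpy as np

def build_quadtree(depth, color, x0, y0, w, h, t_d=0, t_max=8,
                   delta_d=8.0, delta_c=5.0, leaves=None):
    """Recursively split a region until it is homogeneous or t_max is reached.

    depth and color are 2D arrays of equal shape (color is assumed grayscale).
    Returns a list of leaves: (center_x, center_y, mean_depth, width, height).
    """
    if leaves is None:
        leaves = []
    d = depth[y0:y0 + h, x0:x0 + w]
    c = color[y0:y0 + h, x0:x0 + w]
    # Assumed homogeneity test: small value range in both depth and color.
    homogeneous = (d.max() - d.min() <= delta_d) and (c.max() - c.min() <= delta_c)
    if t_d >= t_max or homogeneous or w <= 1 or h <= 1:
        # Leaf node: representative value stored with the center position.
        leaves.append((x0 + w / 2.0, y0 + h / 2.0, float(d.mean()), w, h))
        return leaves
    hw, hh = w // 2, h // 2
    quadrants = ((x0, y0, hw, hh), (x0 + hw, y0, w - hw, hh),
                 (x0, y0 + hh, hw, h - hh), (x0 + hw, y0 + hh, w - hw, h - hh))
    for nx, ny, nw, nh in quadrants:
        build_quadtree(depth, color, nx, ny, nw, nh, t_d + 1, t_max,
                       delta_d, delta_c, leaves)
    return leaves
```

Each leaf can serve as a graph vertex, with edges added between leaves whose regions touch.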

Body Parts Detection
Algorithm 1 generates an undirected graph $G$ with vertices $v_j \in I_{sub}$, each of which is a leaf node of the quadtree, and weighted edges that connect $v_j$ to its neighboring vertices. The weight value can be estimated from the Euclidean distance between $v_j$ and its neighboring vertices.
Let $N_k$ be the number of candidate extreme points and $D_k$ represent the shortest paths from the $k$th starting vertex, $s_k$, where $k \in [1, \ldots, N_k]$, to the other nodes in $G$. Using Dijkstra's algorithm, a set of candidate extreme points, $\acute{P}_k$, can be searched from $G$ in an iterative way as follows:
Initial search:
(1) Set $P_C$ as a start vertex, $s_k$, and search $G$.
(2) Save the accumulative geodesic distances of (1) to $D_k$.
(3) Set $\acute{P}_k$ to the longest accumulative geodesic end point of $D_k$.
Iterative search:
(1) Set $s_k$ as a start vertex and partially search $G$ such that $v_j$ is nearer to $s_k$ than to $s_{k-1}$.
(2) Update $D_{k-1}$ to $D_k$ using the result of (1).
(3) Set $\acute{P}_k$ to the longest accumulative geodesic end point of $D_k$.
(4) Update $\acute{P}_k$ to $s_{k+1}$.
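The iterative search above can be sketched in plain Python with a binary heap. This is a minimal illustration, assuming the graph is given as a dictionary that maps each vertex to a list of `(neighbor, weight)` pairs; the function names are hypothetical.

```python
import heapq

def dijkstra_update(graph, source, dist):
    """Relax distances from `source`, visiting only vertices that get closer."""
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:  # v is nearer to this source than to earlier ones
                dist[v] = nd
                heapq.heappush(heap, (nd, v))

def geodesic_extremes(graph, center, n_k):
    """Return up to n_k candidate extreme vertices, farthest first."""
    dist = {v: float("inf") for v in graph}
    dijkstra_update(graph, center, dist)  # initial full search from the center
    extremes = []
    for _ in range(n_k):
        reachable = [v for v in dist if dist[v] < float("inf")]
        far = max(reachable, key=lambda v: dist[v])
        if dist[far] == 0.0:
            break  # every reachable vertex is already a start vertex
        extremes.append(far)
        dijkstra_update(graph, far, dist)  # the extreme becomes the next start vertex
    return extremes
```

Restarting the search from each found extreme with its distance set to zero keeps every later search partial, since only vertices closer to the new start vertex are revisited.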
Given $\acute{P}_k$, $P_i$ for the key body parts can be classified by matching local features. A supervised learning model such as an SVM requires a relatively small amount of sample data and is well suited for the detection of specific human parts [28,29]. For the $P_i$ classification, image patches of the major joints are collected from $I_C$, and data augmentation is used to increase the number of patches. The histograms of gradients for the patches are arranged into a 1D feature vector and used to train the SVM [30]. During the test process, $P_i$ is classified within the region of interest for $\acute{P}_k$ (i.e., 80 by 80 pixels) by applying a sliding window (i.e., 5 to 20 pixels) at multiple scales for scale-invariant detection. Figure 3 shows the results of each step with $t_{max} = 8$, $\delta_D = 8$, $\delta_C = 5$, and $N_k = 10$ to specify $P_i$.

Multi-Camera Process
Our system combines the sets of key body parts detected from each camera into a single space and tracks the body parts with minimum errors. This multi-camera process consists of four steps: data unification in a single coordinate system, body part tracking based on a voting scheme, noise removal and failure recovery with a Kalman filter-based method, and body orientation estimation using principal component analysis (PCA). The details of each step are described in the subsequent sections.

Data Unification
Using multiple RGB-D cameras, a set of extreme points, $P^c_i$, detected from the $c$th camera can be unified into the same coordinate system by using a rigid transformation, $T$. If one of the cameras is selected as the reference coordinate system, $T$ can be estimated with the ICP method [31] by minimizing the error function $E(T) = E(R_T, L_T)$, where $R_T$ and $L_T$ are the rotation and translation of the camera data to the reference system, respectively. This convergence method is capable of online performance with the input data obtained from multiple cameras; however, the unified result may be erroneous without enough matching points, $M_t$, where $M_t \in \mathbb{R}^3$ and $t \in [1, \ldots, N_t]$. As shown in Figure 4, $P_{RH}$ of the calibration pose is traced from the reference camera, $M^R_t$, and the $c$th camera, $M^c_t$, respectively. Given $M_t$, $E(R_T, L_T)$ is evaluated as the alignment error between $M^R_t$ and the transformed $M^c_t$. Given a correlation matrix, $W$, between $M^R_t$ and $M^c_t$, the optimal solution for $E(R_T, L_T)$ is $R_T = UV^{\top}$ with $W = UCV^{\top}$ derived from a singular value decomposition (SVD). Furthermore, $R_B$ is the body rotation between the depth images from the reference camera, $I^R_D$, and the $c$th camera, $I^c_D$. It is estimated from a plane defined by the upper three extremes, namely $P_H$, $P_{LH}$, and $P_{RH}$, and is used to enhance the ICP performance. In our system, $M_t$ is collected every 33 ms until $N_t = 500$.
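The SVD-based solution for the rotation and translation can be sketched as follows. This is the standard closed-form least-squares alignment of two matched point sets, shown under the assumption that the error is the sum of squared distances between $M^R_t$ and $R_T M^c_t + L_T$; the body rotation $R_B$ is not modeled, and the function name and the reflection guard are additions for illustration.

```python
import numpy as np

def estimate_rigid_transform(m_c, m_r):
    """Least-squares R_T, L_T such that m_r ~ R_T @ m_c + L_T.

    m_c, m_r: (N_t, 3) arrays of matched points from the c-th and the
    reference camera (the row order defines the correspondence).
    """
    mu_c, mu_r = m_c.mean(axis=0), m_r.mean(axis=0)
    # Correlation matrix W between the centered reference and camera points.
    w = (m_r - mu_r).T @ (m_c - mu_c)
    u, _, vt = np.linalg.svd(w)  # W = U C V^T
    # Guard against a reflection (det = -1), which is not a valid rotation.
    d = np.sign(np.linalg.det(u @ vt))
    r_t = u @ np.diag([1.0, 1.0, d]) @ vt
    l_t = mu_r - r_t @ mu_c
    return r_t, l_t
```

With noise-free correspondences the function recovers the transform exactly; with noisy traces of the right hand, more matching points generally give a more stable estimate.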

Body Parts Tracking
Given a set of extreme points for each body part in a single coordinate system, our tracking method uses a voting scheme to set a priority for each point. At first, the set of $P^c_i$ for each body part $i$ from the $c$th camera forms a candidate group. Within this group, points $P^c_i$ whose mutual distance is below a threshold value (150 mm) are regarded as the same point, and each such match is counted as a vote, $\nu^D_i$. Next, the characteristics of human physical constraints are considered using the accumulative geodesic distance for the vote count. For example, starting from $P_C$, the end joints such as the head, hands, and feet are generally located farther away than the internal joints such as the neck, elbows, and knees. Similarly, the accumulative geodesic distances from $P_C$ to the end joints are longer than those to the internal joints. Another vote, $\nu^R_i$, is counted for the point if its distance from $P_C$ is larger than another threshold value (a quarter of the user's height). A total vote, $\nu_i$, for $P^c_i$ is counted as follows:

$$\nu_i = w^D_i \nu^D_i + w^R_i \nu^R_i.$$

Here, $\nu^D_i$ is the vote count from the distance measure and $\nu^R_i \in [0, 1]$ is the vote count from the range measure. Furthermore, $w^D_i$ and $w^R_i$ are the weight values for $\nu^D_i$ and $\nu^R_i$, respectively. In our system, $w^D_i$ and $w^R_i$ are set to 2 and 1, respectively, to emphasize the importance of the neighboring factor, $\nu^D_i$.
Once $\nu_i$ is counted for $P^c_i$, $P_{i,t}$ is tracked based on the minimum Euclidean distance between the tracked point at the previous frame, $P_{i,t-1}$, and the candidate points at the current frame, $P^c_{i,t}$, by maximizing $\nu_i$ as follows:

$$P_{i,t} = \arg\min_{P^{c}_{i,t,\nu_i}} \lVert P_{i,t-1} - P^{c}_{i,t,\nu_i} \rVert,$$

where $P^{c}_{i,t,\nu_i}$ is the extreme point from the $c$th camera at frame $t$ with $\nu_i$ votes. Here, $P^c_{i,t}$ is compared to $P_{i,t-1}$ in order from the largest $\nu_i$ to the smallest one. If the maximum of $\nu_i$ is 0, the tracking attempt fails and enters the recovery process described in the following section.
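The voting and selection steps can be illustrated with a short sketch. It simplifies the description above in two labeled ways: the range vote uses the Euclidean distance from the body center rather than a geodesic distance, and ties in the vote count are resolved by the nearest candidate to the previous position. The function names and argument layout are hypothetical.

```python
import numpy as np

def count_votes(candidates, center, user_height, w_d=2, w_r=1, same_point_mm=150.0):
    """Weighted vote for each camera's candidate of one body part (units: mm)."""
    votes = []
    for a, p in enumerate(candidates):
        # Distance vote: other cameras reporting a point within 150 mm.
        v_d = sum(1 for b, q in enumerate(candidates)
                  if b != a and np.linalg.norm(p - q) < same_point_mm)
        # Range vote: the candidate lies far enough from the body center.
        v_r = 1 if np.linalg.norm(p - center) > 0.25 * user_height else 0
        votes.append(w_d * v_d + w_r * v_r)
    return votes

def track_part(prev_point, candidates, votes):
    """Pick the candidate with the most votes; ties go to the one nearest prev_point."""
    if not candidates or max(votes) == 0:
        return None  # tracking failed; the caller falls back to the recovery step
    best = max(votes)
    tied = [p for p, v in zip(candidates, votes) if v == best]
    return min(tied, key=lambda p: np.linalg.norm(p - prev_point))
```

A `None` result corresponds to the failure case in which the maximum vote is 0.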

Noise Removal and Failure Recovery
Whenever the tracking process fails or produces positional noise such as jerky movements in the trajectory of $P_{i,t}$, our system applies a Kalman filter-based method to correct the erroneous $P_{i,t}$. Assuming a linear system for the state-space model of the Kalman filter, the system state model and its measurement model can be defined as follows:

$$x_t = A x_{t-1} + w_t, \qquad z_t = H x_t + v_t,$$

where $t$ is the time index, $A$ is the state transition model, $x_t$ is the state vector, $w_t$ is the state noise vector, $H$ is the measurement matrix, $z_t$ is the measurement vector, and $v_t$ is the measurement noise vector. Here, $w_t$ and $v_t$ are considered to be white noise following a zero-mean Gaussian distribution with covariance matrices $Q = ww^{\top}$ and $R = vv^{\top}$, respectively. As input arguments to $x_t$, the position and velocity of $P_{i,t}$ are used, and $z_t$ returns a corrected position of $P_{i,t}$. In our system, $\sigma^2$ in $Q$ and $R$ is set to 0.01 and 1.0, respectively. Given the state-space model, the Kalman filter estimates a predicted position, $\hat{P}_{i,t}$, from the prediction and correction steps with $P_{i,t}$. For example, the prediction step estimates $\hat{P}_{i,t}$ while the correction step removes the noise in $P_{i,t}$. During the prediction step, a predicted state vector, $\hat{x}_t$, and a predicted covariance matrix, $\hat{P}_t$, are estimated from the posteriori estimates at $t-1$ as follows:

$$\hat{x}_t = A x_{t-1}, \qquad \hat{P}_t = A P_{t-1} A^{\top} + Q,$$

where $x_{t-1}$ and $P_{t-1}$ are the posteriori state estimate and the posteriori error covariance matrix at time $t-1$, respectively. Here, $\hat{x}_t$ replaces $\hat{P}_{i,t}$, which failed to be located during the tracking process. During the correction step, $\hat{x}_t$ and the Kalman gain matrix, $K_t$, are used to update $x_t$ as follows:

$$K_t = \hat{P}_t H^{\top} \left( H \hat{P}_t H^{\top} + R \right)^{-1}, \qquad \dot{x}_t = \hat{x}_t + K_t \left( z_t - H \hat{x}_t \right).$$

Here, $\dot{x}_t$ is the updated state vector, which removes the noise and sets a corrected position of $P_{i,t}$. Finally, the posteriori error covariance matrix at time $t$, $P_t$, is estimated as follows:

$$P_t = (I - K_t H) \hat{P}_t,$$

which will be used during the prediction step at $t+1$.
To summarize, $\hat{x}_t$ from the prediction step determines a $P_{i,t}$ that failed to be tracked, while $\dot{x}_t$ from the correction step determines a $P_{i,t}$ whose position needs to be corrected. The result of this process for $P_{LF,t}$ is shown in Figure 5.
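As an illustration of the prediction and correction steps, the following NumPy sketch implements a standard linear Kalman filter for one 3D extreme point. The constant-velocity transition model, the frame interval of 1/30 s, the diagonal form of $Q$ and $R$, and the identity initial covariance are assumptions; only the variances 0.01 and 1.0 are taken from the text.

```python
import numpy as np

class ExtremeKalmanFilter:
    """Constant-velocity Kalman filter for one 3D extreme point (a sketch)."""

    def __init__(self, p0, dt=1.0 / 30.0, q_var=0.01, r_var=1.0):
        self.A = np.eye(6)
        self.A[:3, 3:] = dt * np.eye(3)  # position += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # only position is measured
        self.Q = q_var * np.eye(6)
        self.R = r_var * np.eye(3)
        self.x = np.concatenate([np.asarray(p0, dtype=float), np.zeros(3)])
        self.P = np.eye(6)

    def predict(self):
        """Prediction step; the returned position can stand in for a lost point."""
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:3]

    def correct(self, z):
        """Correction step; returns the noise-reduced position."""
        s = self.H @ self.P @ self.H.T + self.R
        k = self.P @ self.H.T @ np.linalg.inv(s)  # Kalman gain
        self.x = self.x + k @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(6) - k @ self.H) @ self.P
        return self.x[:3]
```

In use, `predict()` would be called once per frame, followed by `correct(z)` whenever a tracked position `z` is available; when tracking fails, the predicted position alone is kept.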

Body Orientation Estimation
The body orientation of each pose from $I_D$ serves as a useful parameter for motion analysis. In our system, the normal vector at $P_C$, $n_C$, is estimated using PCA, which finds the best-fitting plane from the point locations in $I_D$. When PCA is applied to the selected locations in $I_D$, the first two eigenvectors define the plane. For example, when a covariance matrix (i.e., a size of 3 by 3) is estimated for the matrix of coordinates from $I_D$ (i.e., a size of $N_s$ by 3), where $N_s$ is the number of points to be fit, it can be decomposed into a set of eigenvectors and eigenvalues. Here, the first two eigenvectors with the largest eigenvalues define a plane; thus, the cross product of these two eigenvectors defines a normal vector (i.e., body orientation), $n_D$, on the plane. Figure 6 shows a set of tracked $P_i$ and a body orientation represented by $n_C$. In our system, $n_C$ is placed 300 mm higher than $P_C$ for better recognition.

Experimental Results
Figure 7 shows the prototype of our system, which tracks various dynamic movements of Taekwondo from general users. In this system, two RGB-D (Kinect v2 [2]) cameras (Microsoft, Redmond, WA, USA) are placed in front of the user with two displays. From the cameras, only RGB and depth data are retrieved to track a set of extreme points on key body parts such as the head, hands, and feet, and the body orientation. A custom-built smart sandbag, equipped with pressure sensors, is used to detect hitting moments such as punches and kicks from the input motion. Our system is best understood through examples of its uses, as described in the following sections, and the accompanying video (located at https://drive.google.com/open?id=1IJjt0TTs0TimcEcsTSoIbWrxZiiNYv9g).

Tracking Accuracy
To evaluate the tracking accuracy between different systems, the ground-truth data are captured by the Xsens system [32]. As shown in Figure 7, this commercial system uses a set of inertial sensors embedded in a wearable suit such that user motions can be tracked by both the camera-based and Xsens systems at the same time. Owing to the differences in the sensing locations on key body parts between the two systems, a set of joint vectors, $j$, is defined from the center of the body to each of the end-effector joints (i.e., hands and feet) and used to compare the angular differences between the two outputs, where $j_C$ and $j_X$ are measured over a total of $N_a$ frames from the camera-based and Xsens systems, respectively. For this and the following comparison, a total of 3845 frames are collected from the six Taekwondo actions in Figure 8. As shown in Table 1, it is notable that the head part, which is visible most of the time during the tracking process, shows the highest accuracy. However, the hand parts are less accurate than the head and foot parts. This is mainly because larger noise arises in the hand areas whenever a user takes poses with the hands on the torso areas. These poses are frequent in the Taekwondo actions in Figure 8 and make the local detection ambiguous between the hand and torso areas. Next, our system is compared against the multi-Kinects system [11]. In this comparison, the body parts tracked by our system are compared against the major joints (i.e., HEAD, HAND_RIGHT, HAND_LEFT, ANKLE_RIGHT, ANKLE_LEFT) recovered from the Kinect skeletal data. As shown in Table 2, our system with two Kinect cameras outperforms the multi-Kinects system with two or four cameras for all of the tracked body parts. As expected, the head part shows the highest accuracy as it is visible most of the time from all cameras. However, for the other parts, the multi-Kinects system suffers from erroneous skeleton reconstruction, especially in the foot areas, and shows relatively lower accuracy.
It is noteworthy that using four cameras with our system shows negligible improvement in tracking accuracy because frontal movements make up the majority of the collected data.
Table 2. Tracking accuracy comparison between our system and multi-Kinects system [10].
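The angular comparison of joint vectors used for the accuracy evaluation can be illustrated with a small sketch. It computes only the mean angle between paired vectors $j_C$ and $j_X$ over $N_a$ frames; how the paper maps angular differences to an accuracy percentage is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def mean_angular_difference(j_c, j_x):
    """Mean angle in degrees between paired joint vectors of shape (N_a, 3)."""
    j_c = j_c / np.linalg.norm(j_c, axis=1, keepdims=True)
    j_x = j_x / np.linalg.norm(j_x, axis=1, keepdims=True)
    # Clip to guard arccos against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(j_c * j_x, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

Because both vectors start at the body center and are normalized, the measure is insensitive to differences in limb length between the two sensing systems.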

Action Recognition
As shown in Figure 8, our system recognizes various kick and punch motions from a user through three phases: input motion segmentation, feature vector extraction, and motion type recognition. First, an input motion is segmented by detecting the starting and ending moments of key poses. The starting moment is determined based on the speed and position of the hands and feet. For example, an input motion starts at the moment when the speed of the hands and feet is under a threshold value (1 m/s) while the positions of both feet stay under a height threshold value (10% of the user's height). The motion ends at the moment when the sandbag system detects a hit from the user. To recognize a motion segment by comparing it to a reference, the motions should be aligned in the time-space domain, as the segmented motions differ in temporal length and in the user's body size. Using a number of samples (10 to 15 depending on the complexity of the input poses), a sample set, $P^S_{i,t}$ and $n^S_{C,t}$, is linearly interpolated from the trajectories of $P_{i,t}$ and $n_{C,t}$ in the segmented motion and defines the feature vector. To normalize the feature vector, $P^S_{i,t}$ is translated to an origin defined by the average of the initial foot positions, $P^S_{i,0}$, and divided by the user's height. In addition, $n^S_{C,t}$ is normalized by 360 degrees. Given the feature vector for each input motion, an SVM is utilized to recognize different motion types. Table 3 shows the training and test sets (a total of 67,925 frames from 1465 training and 392 test samples) used for recognizing 12 types of Taekwondo motions in an offline manner. For the data set, 10 general users performed each motion type. Table 4 shows that our system is capable of recognizing the motion types with an average accuracy of over 96% on the test set. However, some of the similar motions are incorrectly tracked and misclassified in the test set.
For example, about 9% of front punches with the right hand were recognized as front punches with the left hand due to the fast exchanges of the left and right hands in the user input motions.
Table 3. Training and test sets used for action recognition in Taekwondo. The first column represents the action type and motion side to be recognized. A total of 1465 and 392 samples are collected from 10 general users to train (the second column) and to test (the third column) the action recognition, respectively. Here, the total frames for each action (the fourth column) combine the frames used in the training and test samples while each action has a different temporal length (the last column).
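The feature vector extraction described for action recognition can be sketched with NumPy. This is an illustration under stated assumptions: the body orientation is represented as a single angle in degrees per frame, `foot_origin` is the average of the initial foot positions computed by the caller, and the function names are hypothetical. The resulting vectors would then be passed to an SVM classifier.

```python
import numpy as np

def resample_trajectory(traj, n_samples):
    """Linearly interpolate a (T, D) trajectory to n_samples evenly spaced frames."""
    traj = np.asarray(traj, dtype=float)
    src = np.linspace(0.0, 1.0, len(traj))
    dst = np.linspace(0.0, 1.0, n_samples)
    return np.column_stack([np.interp(dst, src, traj[:, k])
                            for k in range(traj.shape[1])])

def motion_feature(parts, orientation_deg, foot_origin, user_height, n_samples=10):
    """Build one feature vector from extreme trajectories and body orientation."""
    feats = []
    for traj in parts:  # each traj: (T, 3) positions of one body part
        p = resample_trajectory(traj, n_samples)
        # Translate to the foot-based origin and scale by the user's height.
        feats.append(((p - foot_origin) / user_height).ravel())
    angles = np.asarray(orientation_deg, dtype=float).reshape(-1, 1)
    feats.append((resample_trajectory(angles, n_samples) / 360.0).ravel())
    return np.concatenate(feats)
```

Resampling every segmented motion to the same number of samples gives feature vectors of equal length regardless of how long each motion took.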

Motion Synthesis
As shown in Figure 9, our system is tested for synthesizing a sequence of dynamic movements from a few key poses. The key poses are captured from the Xsens system. Provided with a set of key poses and input parameters (a set of extremes tracked by our system), a sequence of in-between poses between the keys can be generated with the motion blending technique, with the weight values estimated from multi-dimensional scattered interpolation [33]. Figure 10 shows an instance of synthesizing three Taekwondo motions, where each of them is generated using a set of tracked extremes as input parameters and blending five key poses from example motions. As demonstrated in the results, the synthesized motions are comparable to the input motions, exhibiting the key movements of each Taekwondo motion type. It is noteworthy that some movement details, such as relative hand and foot positions, are not synthesized in the output motions due to the small number of key poses used to generate the in-between poses. In these results, it took about 3.43 s, 2.98 s, and 5.78 s to synthesize 1406, 1335, and 2485 frames, respectively. Thus, our system can produce over 400 frames per second, showing real-time performance for motion synthesis.
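The throughput figure follows directly from the reported timings; as a worked check, dividing each frame count by its synthesis time gives:

```latex
\begin{align*}
\frac{1406~\text{frames}}{3.43~\text{s}} &\approx 409.9~\text{frames/s}, \\
\frac{1335~\text{frames}}{2.98~\text{s}} &\approx 448.0~\text{frames/s}, \\
\frac{2485~\text{frames}}{5.78~\text{s}} &\approx 429.9~\text{frames/s}.
\end{align*}
```

All three rates exceed 400 frames per second, which is well above the roughly 30 frames per second implied by the 33 ms capture interval.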

Conclusions
In this paper, we introduced a marker-less system for human pose estimation by detecting and tracking key body parts: head, hands, and feet. Using multiple RGB-D cameras, our system minimizes the self-occlusion problems by unifying the depth data captured at different viewpoints into a single coordinate system. To accelerate the search process for the candidate points on the body parts, a quadtree-based graph is constructed from the RGB-D image, and the accumulative geodesic distances on the graph are used to select a set of extreme points on the key body parts. During the tracking process, these points are used as input parameters for motion analysis. Whenever there are tracking noises or failures, a Kalman filter-based method for noise removal and recovery is introduced to correct and predict the extreme positions. Unlike the previous approaches using a learning-based model, our approach does not reconstruct a full skeleton structure to estimate human poses from input data. Instead, the input poses are abstracted with a small set of extreme points, making the detecting and tracking process easier without solving the optimization problem for skeleton reconstruction. Using a small set of extremes as input data, our system can be applied to recognize a human action or to synthesize a motion sequence from a few key poses in real time. As demonstrated in the experimental results, our system shows a higher accuracy than the multi-Kinects system, even when the latter uses more RGB-D cameras.
The current system can be easily scaled by adding more RGB-D cameras as needed. For example, placing two more RGB-D cameras behind the user might provide better accuracy for the occluded poses in turning motions if the space and system cost permit. Using other RGB-D cameras such as Intel RealSense [3] was problematic due to noisy depth data and unstable support for the software library. Furthermore, our system is mainly designed to capture one user at a time. For multi-person pose estimation, the current detection method can be exploited to extract multiple independent keys, possibly other than hands and feet, from the input images and to map each set of the keys to a different person.
The proposed system produces higher tracking errors when there are frequent crossings of hands and feet in an input pose. We are currently improving the tracking recovery process for such cases by analyzing the velocity gradients of the hands and feet. In addition, the input data received from multiple Kinect cameras are not synchronized in time because the cameras do not support a triggering signal. An external sync generator can be adopted with more sophisticated cameras; however, such a configuration increases the overall system cost, making the system less applicable for general users.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: