Sensors | Article | Open Access

10 November 2018

Dynamic Pose Estimation Using Multiple RGB-D Cameras

1 Creative Content Research Division, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea
2 School of Games, Hongik University, 2639 Sejong-ro, Jochiwon-eup, Sejong 30016, Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
This article belongs to the Special Issue Sensors Signal Processing and Visual Computing

Abstract

Human poses are difficult to estimate due to the complicated body structure and the self-occlusion problem. In this paper, we introduce a marker-less system for human pose estimation by detecting and tracking key body parts, namely the head, hands, and feet. Given color and depth images captured by multiple red, green, blue, and depth (RGB-D) cameras, our system constructs a graph model with segmented regions from each camera and detects the key body parts as a set of extreme points based on accumulative geodesic distances in the graph. During the search process, local detection using a supervised learning model is utilized to match local body features. A final set of extreme points is selected with a voting scheme and tracked with physical constraints from the unified data received from the multiple cameras. During the tracking process, a Kalman filter-based method is introduced to reduce positional noises and to recover from a failure of tracking extremes. Our system shows an average of 87% accuracy against the commercial system, which outperforms the previous multi-Kinects system, and can be applied to recognize a human action or to synthesize a motion sequence from a few key poses using a small set of extremes as input data.

1. Introduction

The detection of human body parts has been widely researched in the computer vision and pattern recognition fields. Accurate detection of body parts is important in human pose estimation for activity recognition, which is utilized by various smart systems including human computer interaction (HCI), surveillance, healthcare, and entertainment. Recently, activity recognition has also converged with virtual reality (VR) and augmented reality (AR) techniques in the training field [1].
Early approaches using a single camera tried to detect the region of interest by extracting features from illumination, color, and edge information in 2D images. In these approaches, machine learning algorithms such as adaptive boosting (AdaBoost), support vector machine (SVM), and Gaussian mixture model (GMM) are used to extract key body features such as the face, torso, hands, and feet from a large data set. However, a reliable detection of such features is difficult to achieve due to background noises and illumination changes in the images. The recent availability of red, green, blue, and depth (RGB-D) cameras, such as Microsoft Kinect [2] and Intel RealSense [3], provides depth data and suggests a more reliable way to detect the features. Using depth information retrieved from an infrared sensor, the region of interest on the human body can be segmented more precisely without background ambiguities.
The joints of the human body can provide useful information for motion analysis. Using a single RGB-D camera, the approach introduced by Shotton et al. [4] has been widely used to detect human body parts as a hierarchical skeleton structure. In their approach, a list of joint positions is estimated from a user who faces the RGB-D camera by using the random statistical model. However, its accuracy, especially for internal joints, is sensitive to the input pose. For example, the joint positions of occluded body parts might be either skipped or incorrectly estimated from unknown poses. Using multiple cameras around the user, a set of depth images captured from different viewpoints can be combined to complement the occluded body parts [5,6,7,8,9,10,11]. In these approaches, an optimization problem should be solved to track a list of joints in an articulated model from the depth volume.
In this paper, we introduce a marker-less system for human pose estimation by detecting and tracking key body parts, namely, the head, hands, and feet. Given color and depth images captured by multiple RGB-D (two Kinect) cameras, our system detects a set of extreme points on the key body parts from each camera (i.e., the single camera process) and tracks them from the unified data received from the multiple cameras (i.e., the multi-camera process) as shown in Figure 1. During the single camera process, a quadtree-based graph model with segmented regions is constructed from the color and depth images received from a single camera. Using geodesic distances on the graph, a set of candidate points is searched and selected as extreme points through a supervised learning model. During the multi-camera process, the optimal extreme points are selected and tracked from the unified data received from the multiple cameras. For a better tracking performance, a Kalman filter-based method is introduced to reduce positional noises and to recover from a failure of tracking extremes.
Figure 1. Overview of the proposed system: Human pose estimation using multiple red, green, blue, and depth (RGB-D) cameras.
Unlike the previous approaches, our system does not reconstruct a full skeleton structure from input data. Instead, input poses are abstracted with a small set of extreme points on the key body parts, making the detecting and tracking processes easier without solving the optimization problem for skeleton reconstruction. Using a small set of extremes as input data, our system can be applied to recognize a human action or to synthesize a motion sequence from a few key poses in real time. As demonstrated in the experimental results, our system shows an average of 87% accuracy against the commercial system, which outperforms the multi-Kinects system even when the latter uses more RGB-D cameras.
The rest of this paper is organized as follows. Previous approaches for human pose estimation are reviewed in Section 2. The detection of key body parts from each camera is described in Section 3, while tracking them from the unified data received from multiple cameras is detailed in Section 4. The experimental results for tracking accuracy, action recognition, and motion synthesis are demonstrated in Section 5. We conclude this paper with a discussion of potential improvements in Section 6.

3. Single Camera Process

Our system acquires a continuous sequence of RGB-D images from multiple cameras. For each camera, major body parts are detected through three steps: Background subtraction, quadtree-based graph construction, and body part detection using accumulative geodesic distances and a local detector. The details of these steps are described in the subsequent sections.

3.1. Background Subtraction

In our system, the RGB-D cameras provide a continuous sequence of color and depth images with the same resolution, and both images are calibrated. Given a sequence of color and depth images streamed from a single RGB-D camera, the background information is subtracted from the images to isolate a human object based on the depth information, as it is robust to illumination changes. We capture the first frame of the depth sequence, in which no human is visible, and subtract it from the subsequent frames. For the depth images with a human object, a threshold value (i.e., the minimum depth value for each pixel) is used to distinguish between the background and foreground objects. This simple method is sensitive to background noises and can generate false positives, especially at the edges of the human object [24]. As a post-processing step, a morphological erode operation and Sobel kernels are applied to reduce such false positive areas. We compute approximations of the vertical and horizontal derivatives using Sobel kernels and remove noise from the depth images based on the gradient magnitudes. From a filtered depth image, I_D, a corresponding color image, I_C, can be obtained. We refer to the filtered image pair as I = {I_D, I_C}. This simple technique works well for a low-resolution depth image; however, a more sophisticated background subtraction method could be employed with hardware acceleration [25].
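To make this step concrete, the following is a minimal sketch of such a depth-based foreground extraction, assuming OpenCV and NumPy; the function name, thresholds, and kernel sizes are illustrative choices rather than the exact values used in our system.

import cv2
import numpy as np

def subtract_background(depth, background, depth_thresh=50, grad_thresh=500):
    # depth, background: uint16 depth frames (mm) of the same resolution;
    # background is the first frame captured without a human in the scene.
    diff = background.astype(np.int32) - depth.astype(np.int32)
    mask = (diff > depth_thresh).astype(np.uint8) * 255  # foreground candidates

    # Morphological erosion removes thin false-positive fringes at object edges.
    mask = cv2.erode(mask, np.ones((3, 3), np.uint8), iterations=1)

    # Sobel kernels approximate vertical/horizontal derivatives; pixels with large
    # gradient magnitudes (noisy depth edges) are removed from the mask.
    gx = cv2.Sobel(depth.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(depth.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)
    mask[np.hypot(gx, gy) > grad_thresh] = 0

    # Filtered depth image I_D; the aligned color image I_C can be masked identically.
    return np.where(mask > 0, depth, 0)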

3.2. Graph Construction

Inspired by the work identifying geodesic extreme positions using a graph structure [26,27], our detection method assumes an invariant body structure such that the accumulative geodesic distances between the center of the body and the key body parts, namely the head (H), right hand (RH), left hand (LH), right foot (RF), and left foot (LF), do not change regardless of the human pose, as shown in Figure 2. For example, let P_C be the center position of the human object averaged from I_D; the extreme points, P_i, on the key body parts are then the points located farthest from P_C, where P_C, P_i ∈ R^3 and i ∈ {H, RH, LH, RF, LF}. Based on this geodesic characteristic, both I_D and I_C can be represented as a graph model, G. As shown in Algorithm 1, a quadtree-based segmentation is used to group neighboring data efficiently, and each node has a representative value with its center position.
Figure 2. Accumulative geodesic distances between the center of the body and the key body parts such as the head, hands, and feet. Here, body regions are colored white where the distances are closer to the center of the body and red where they are closer to the key body parts.
Algorithm 1 Graph construction. Here, δ_D and δ_C are threshold values used to decide whether a node is split into four children.

Input: image I = {depth (D), color (C)}, initial tree depth t_D = 0
Output: decomposed image with quadtree structure

function Quadtree(I, t_D)
  if t_D >= t_max then
    return node(I)
  else
    σ_D = standard deviation of I_D
    δ_D = divider of I_D
    σ_C = standard deviation of I_C
    δ_C = divider of I_C
    if (σ_D > δ_D) ∨ (σ_C > δ_C) then
      I_sub = split I into four sub-images
      return Quadtree(I_sub, t_D + 1)
    end if
  end if
end function
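The listing below is a minimal sketch of Algorithm 1 in Python, assuming NumPy and aligned depth/color patches; the OR combination of the split tests and the use of the patch median as the representative value are assumptions made for illustration.

import numpy as np

def quadtree(depth, color, y0=0, x0=0, t_d=0, t_max=8, delta_d=8, delta_c=5):
    # Recursively decompose aligned depth/color patches into quadtree leaf nodes,
    # each carrying a representative center position and depth value.
    h, w = depth.shape[:2]
    uniform = depth.std() <= delta_d and color.std() <= delta_c
    if t_d >= t_max or uniform or min(h, w) < 2:
        return [{'center': (y0 + h // 2, x0 + w // 2),
                 'depth': float(np.median(depth))}]
    leaves = []
    for ry, rows in ((0, slice(0, h // 2)), (h // 2, slice(h // 2, h))):
        for rx, cols in ((0, slice(0, w // 2)), (w // 2, slice(w // 2, w))):
            leaves += quadtree(depth[rows, cols], color[rows, cols],
                               y0 + ry, x0 + rx, t_d + 1, t_max, delta_d, delta_c)
    return leaves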

3.3. Body Parts Detection

Algorithm 1 generates an undirected graph G whose vertices, v_j ∈ I_sub, are the leaf nodes of the quadtree and whose weighted edges connect v_j to its neighboring vertices. Each weight value is estimated from the Euclidean distance between the neighboring vertices of v_j.
Let N_k be the number of candidate extreme points and D_k represent the shortest paths from the kth starting vertex, s_k, where k = 1, …, N_k, to the other nodes in G. Using Dijkstra's algorithm, the set of candidate extreme points, P'_k, is searched in G iteratively as follows (a simplified sketch follows the list):

When k = 1,
(1) Set P_C as the start vertex, s_k, and search G.
(2) Save the accumulative geodesic distances of (1) to D_k.
(3) Set P'_k to the end point of D_k with the longest accumulative geodesic distance.
(4) Update P'_k to s_{k+1}.

When 1 < k < N_k + 1,
(1) Set s_k as the start vertex and partially search G such that v_j is nearer to s_k than to s_{k-1}.
(2) Update D_{k-1} to D_k using the result of (1).
(3) Set P'_k to the end point of D_k with the longest accumulative geodesic distance.
(4) Update P'_k to s_{k+1}.
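A simplified sketch of this candidate search is given below, assuming a NetworkX graph whose edge weights are the Euclidean distances between leaf-node centers. Instead of the partial re-search described above, it re-runs a multi-source Dijkstra from all previously selected start vertices, which yields the same accumulative-geodesic extremes at a higher computational cost.

import networkx as nx

def candidate_extremes(graph, center_node, n_k=10):
    # Iteratively pick the node that is geodesically farthest from the body center
    # and from all previously selected extremes (candidate extreme points P'_k).
    sources, extremes = [center_node], []
    for _ in range(n_k):
        dist = nx.multi_source_dijkstra_path_length(graph, sources, weight='weight')
        farthest = max(dist, key=dist.get)  # longest accumulative geodesic end point
        extremes.append(farthest)
        sources.append(farthest)            # becomes the next start vertex s_{k+1}
    return extremes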
Given P'_k, P_i for the key body parts can be classified by matching local features. A supervised learning model such as an SVM requires a relatively small amount of sample data and is well suited to the detection of specific human parts [28,29]. For the P_i classification, image patches of the major joints are collected from I_C, and data augmentation is used to increase the number of patches. The histograms of gradients of the patches are arranged into a 1D feature vector and used to train the SVM [30]. During the test process, P_i is classified within the region of interest around P'_k (i.e., 80 by 80 pixels) by applying a sliding window (i.e., 5 to 20 pixels) at multiple scales for scale-invariant detection. Figure 3 shows the results of each step with t_max = 8, δ_D = 8, δ_C = 5, and N_k = 10 to specify P_i.
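A minimal sketch of such a local detector is shown below, assuming scikit-image and scikit-learn; the patch size, window size, stride, and linear kernel are illustrative choices, and a full detector would also scan multiple scales as described above.

from collections import Counter
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_vector(patch):
    # Histogram-of-gradients features of a grayscale patch, flattened to a 1D vector.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_part_classifier(patches, labels):
    # patches: grayscale image patches of major joints; labels: body-part ids.
    X = np.stack([hog_vector(p) for p in patches])
    return SVC(kernel='linear').fit(X, labels)

def classify_candidate(clf, roi, win=40, stride=10):
    # Slide a window over the region of interest around a candidate extreme point
    # and return the majority body-part label.
    votes = []
    for y in range(0, roi.shape[0] - win + 1, stride):
        for x in range(0, roi.shape[1] - win + 1, stride):
            votes.append(clf.predict(hog_vector(roi[y:y + win, x:x + win])[None, :])[0])
    return Counter(votes).most_common(1)[0][0] if votes else None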
Figure 3. Results of each step in the single camera process: (a) a filtered human object (foreground), (b) a quadtree-based decomposition of color and depth images, (c) accumulative geodesic end points (candidate extreme points), and (d) selected extreme points on the head (H) (yellow), right hand (RH) (white), left hand (LH) (green), right foot (RF) (red), and left foot (LF) (blue) in searched regions.

4. Multi-Camera Process

Our system combines the sets of key body parts detected from each camera into a single space and tracks the body parts with minimum errors. This multi-camera process consists of four steps: Data unification in a single coordinate system, body part tracking based on a voting scheme, noise removal and failure recovery with a Kalman filter-based method, and body orientation estimation using principal component analysis (PCA). The details of each step are described in the subsequent sections.

4.1. Data Unification

Using multiple RGB-D cameras, a set of extreme points, P_i^c, detected from the cth camera can be unified into a common coordinate system by a rigid transformation, T. If one of the cameras is selected as the reference coordinate system, T can be estimated with the ICP method [31] by minimizing the error function, E(T) = E(R_T, L_T), where R_T and L_T are the rotation and translation of the camera data to the reference system, respectively. This convergence method is capable of online performance with the input data obtained from multiple cameras; however, the unified result may be erroneous without enough matching points, M_t, where M_t ∈ R^3 and t ∈ [1, …, N_t]. As shown in Figure 4, P_RH of the calibration pose is traced from the reference, M_t^R, and the cth camera, M_t^c, respectively. Given M_t, E(R_T, L_T) can be evaluated as follows:
E(R_T, L_T) = \frac{1}{N_t} \sum_{t=1}^{N_t} \left\| M_t^R - R_B \left( R_T M_t^c + L_T \right) \right\|^2 .
Figure 4. Data unification for a multi-camera process: (a) a calibration pose for the coordinate unification and (b) two depth images in the same coordinate system (oriented toward a viewer).
Here, L_T = \bar{M}^R - R_T \bar{M}^c, where \bar{M}^R = \frac{1}{N_t} \sum_{t=1}^{N_t} M_t^R and \bar{M}^c = \frac{1}{N_t} \sum_{t=1}^{N_t} M_t^c. Given a correlation matrix, W, between M_t^R and M_t^c,
W = \sum_{t=1}^{N_t} \left( M_t^R - \bar{M}^R \right) \left( M_t^c - \bar{M}^c \right)^\top = U C V^\top .
Here, the optimal solution for E(R_T, L_T) is R_T = U V^\top, with W = U C V^\top obtained from a singular value decomposition (SVD). Furthermore, R_B is the body rotation between the depth images from the reference camera, I_D^R, and the cth camera, I_D^c. It is estimated from the plane defined by the upper three extremes, namely, P_H, P_LH, and P_RH, and is used to enhance the ICP performance. In our system, M_t is collected every 33 ms until N_t = 500.
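The closed-form solution above can be sketched as follows, assuming NumPy; M_ref and M_cam are the N_t x 3 matrices of matched calibration points from the reference and the cth camera (the body-rotation term R_B is omitted for brevity).

import numpy as np

def rigid_transform(M_ref, M_cam):
    # Estimate the rotation R_T and translation L_T mapping camera-c points
    # into the reference coordinate system (Kabsch/SVD solution).
    mean_ref, mean_cam = M_ref.mean(axis=0), M_cam.mean(axis=0)
    W = (M_ref - mean_ref).T @ (M_cam - mean_cam)            # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(W)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against a reflection
    R_T = U @ D @ Vt
    L_T = mean_ref - R_T @ mean_cam
    return R_T, L_T

# Usage: points from the cth camera are unified as (R_T @ P.T).T + L_T.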

4.2. Body Parts Tracking

Given the sets of extreme points for each body part in a single coordinate system, our tracking method uses a voting scheme to set a priority for each point. At first, the set of P_i^c for each body part i from the cth camera forms a candidate group. Within this group, points whose mutual distance is below a threshold value (150 mm) are regarded as the same point, and each such agreement is counted as a vote, ν_i^D. Next, the characteristics of human physical constraints are considered using the accumulative geodesic distance for the vote count. For example, starting from P_C, the end joints such as the head, hands, and feet are generally located farther away than the internal joints such as the neck, elbows, and knees. Similarly, the accumulative geodesic distances from P_C to the end joints are longer than those to the internal joints. Another vote, ν_i^R, is counted for a point if its distance from P_C is larger than another threshold value (a quarter of the user's height). The total vote, ν_i, for P_i^c is counted as follows:
\nu_i = w_i^D \nu_i^D + w_i^R \nu_i^R .
Here, ν_i^D and ν_i^R, where ν_i^D, ν_i^R ∈ [0, 1], are the vote counts from the distance measure and the range measure, respectively. Furthermore, w_i^D and w_i^R are the weight values for ν_i^D and ν_i^R, respectively. In our system, w_i^D and w_i^R are set to 2 and 1 to emphasize the importance of the neighboring factor, ν_i^D.
Once ν_i is counted for P_i^c, P_{i,t} is tracked based on the minimum Euclidean distance between the tracked point at the previous frame, P_{i,t-1}, and the candidate points at the current frame, P_{i,t}^c, by maximizing ν_i as follows:
P_{i,t} = \arg\min_{P_{i,t}^{c,\nu_i}} \left\| P_{i,t}^{c,\nu_i} - P_{i,t-1} \right\|^2 ,
where P_{i,t}^{c,\nu_i} is the extreme point from the cth camera at frame t with ν_i votes. Here, P_{i,t}^c is compared to P_{i,t-1} in order from the largest ν_i to the smallest one. If the maximum of ν_i is 0, the tracking attempt fails and enters the recovery process described in the following section.
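The voting and tracking step can be sketched as follows, assuming NumPy; candidates holds the 3D points reported for one body part by all cameras at the current frame, and the return value of None signals the recovery process of the next section.

import numpy as np

def track_part(candidates, p_prev, p_center, user_height,
               w_d=2.0, w_r=1.0, near_mm=150.0):
    candidates = np.asarray(candidates, dtype=float)
    votes = []
    for p in candidates:
        # Distance vote: at least one other camera reports (nearly) the same point.
        v_d = 1.0 if (np.linalg.norm(candidates - p, axis=1) < near_mm).sum() > 1 else 0.0
        # Range vote: end joints lie far enough from the body center P_C.
        v_r = 1.0 if np.linalg.norm(p - p_center) > 0.25 * user_height else 0.0
        votes.append(w_d * v_d + w_r * v_r)
    votes = np.asarray(votes)
    if votes.max() == 0:
        return None  # tracking failure; recover with the Kalman prediction
    best = votes == votes.max()
    dists = np.linalg.norm(candidates - p_prev, axis=1)
    return candidates[best][np.argmin(dists[best])]  # closest top-voted candidate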

4.3. Noise Removal and Failure Recovery

Whenever the tracking process fails or produces positional noises such as jerky movements in the trajectory of P_{i,t}, our system applies a Kalman filter-based method to correct the erroneous P_{i,t}. Assuming a linear system for the state-space model in a Kalman filter, the system state model and its measurement model can be defined as follows:
x_{t+1} = A x_t + w_t , \quad z_t = H x_t + v_t ,
where t is the time index, A is the state transition model, x_t is the state vector, w_t is the state noise vector, H is the measurement matrix, z_t is the measurement vector, and v_t is the measurement noise vector. Here, w_t and v_t are considered to be white noise following a Gaussian distribution with a mean of 0 and covariance matrices Q = w w^\top and R = v v^\top, respectively. As input arguments to x_t, the position and velocity of P_{i,t} are used, and z_t returns a corrected position of P_{i,t}. In our system, σ^2 in Q and R is set to 0.01 and 1.0, respectively.
Given the state-space model, the Kalman filter estimates a predicted position, \tilde{P}_{i,t}, through the prediction and correction steps applied to P_{i,t}. For example, the prediction step estimates \tilde{P}_{i,t}, while the correction step removes the noises in P_{i,t}. During the prediction step, a predicted state vector, \tilde{x}_t, and a predicted covariance matrix, \tilde{P}_t, are estimated from the posteriori at t-1 as follows:
\tilde{x}_t = A x_{t-1} , \quad \tilde{P}_t = A P_{t-1} A^\top + Q ,
where x_{t-1} and P_{t-1} are the posteriori state estimate and the posteriori error covariance matrix at time t-1, respectively. Here, \tilde{x}_t replaces P_{i,t} whenever it fails to be located during the tracking process. During the correction step, \tilde{x}_t and the Kalman gain matrix, K_t, are used to update x_t as follows:
K_t = \tilde{P}_t H^\top \left( H \tilde{P}_t H^\top + R \right)^{-1} , \quad \dot{x}_t = \tilde{x}_t + K_t \left( P_{i,t} - H \tilde{x}_t \right) .
Here, \dot{x}_t is the updated state vector, which removes the noises and sets a corrected position of P_{i,t}. Finally, the posteriori error covariance matrix at time t, P_t, is estimated as follows:
P_t = \tilde{P}_t - K_t H \tilde{P}_t ,
which is used during the prediction step at t+1. To summarize, \tilde{x}_t from the prediction step and \dot{x}_t from the correction step determine P_{i,t} when it fails to be tracked or when its position needs to be corrected, respectively. The result of this process for P_{LF,t} is shown in Figure 5.
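A minimal sketch of this constant-velocity Kalman filter for one tracked point is given below, assuming NumPy; the 6D state stacks position and velocity, the noise scales follow the sigma^2 values quoted above (0.01 for Q, 1.0 for R), and a 33 ms frame interval is assumed.

import numpy as np

class PointKalman:
    def __init__(self, p0, dt=0.033, q=0.01, r=1.0):
        self.x = np.hstack([p0, np.zeros(3)])                # state: position, velocity
        self.P = np.eye(6)                                   # error covariance
        self.A = np.eye(6); self.A[:3, 3:] = dt * np.eye(3)  # state transition model
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # measure position only
        self.Q, self.R = q * np.eye(6), r * np.eye(3)

    def predict(self):
        # Prediction step: estimate x~_t and P~_t; used directly when tracking fails.
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:3]

    def correct(self, p_meas):
        # Correction step: Kalman gain and updated state remove positional noise.
        K = self.P @ self.H.T @ np.linalg.inv(self.H @ self.P @ self.H.T + self.R)
        self.x = self.x + K @ (p_meas - self.H @ self.x)
        self.P = self.P - K @ self.H @ self.P
        return self.x[:3]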
Figure 5. Comparison of foot trajectories: Initially tracked (green) and Kalman filtered (white). Here, the tracking noises cause sudden positional changes (the white circle) while the tracking failure skips the foot positions (the red circle), which are corrected in the filtered trajectory.

4.4. Body Orientation Estimation

The body orientation of each pose from I_D serves as a useful parameter for motion analysis. In our system, the normal vector at P_C, n_C, is estimated using PCA, which finds the best-fitting plane for the point locations in I_D. When PCA is applied to the selected locations in I_D, the first two eigenvectors define the plane. For example, when a covariance matrix (i.e., of size 3 by 3) is estimated from the matrix of coordinates in I_D (i.e., of size N_s by 3), where N_s is the number of points to be fit, it can be decomposed into a set of eigenvectors and eigenvalues. Here, the first two eigenvectors with the largest eigenvalues define a plane; thus, the cross product of these two eigenvectors defines the normal vector (i.e., body orientation), n_C, on the plane. Figure 6 shows a set of tracked P_i and the body orientation represented by n_C. In our system, n_C is placed 300 mm above P_C for better recognition.
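A minimal sketch of this PCA step is shown below, assuming NumPy; points is the N_s x 3 matrix of selected 3D locations from I_D.

import numpy as np

def body_orientation(points):
    # Fit a plane to the body points via PCA and return its unit normal n_C.
    centered = points - points.mean(axis=0)
    cov = np.cov(centered.T)                     # 3x3 covariance matrix
    _, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    e1, e2 = eigvecs[:, -1], eigvecs[:, -2]      # two largest: best-fitting plane
    n = np.cross(e1, e2)                         # plane normal = body orientation
    return n / np.linalg.norm(n)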
Figure 6. A set of extreme points on the body parts tracked and the body orientation represented by a normal vector (cyan).

5. Experimental Results

Figure 7 shows the prototype of our system, which tracks various dynamic Taekwondo movements from general users. In this system, two RGB-D (Kinect v2 [2]) cameras (Microsoft, Redmond, WA, USA) are placed in front of the user with two displays. From the cameras, only RGB and depth data are retrieved to track a set of extreme points on the key body parts such as the head, hands, and feet, and the body orientation. A self-manufactured smart sandbag, equipped with pressure sensors, is used to detect hitting moments such as punches and kicks in the input motion. Our system is best understood through examples of its uses, as described in the following sections, and the accompanying video (located at https://drive.google.com/open?id=1IJjt0TTs0TimcEcsTSoIbWrxZiiNYv9g).
Figure 7. System setup for action recognition and motion synthesis: Two RGB-D cameras with displays, a smart sandbag to detect user kicks and punches, and the Xsens system (a wearable suit) used for accuracy comparison.

5.1. Tracking Accuracy

To evaluate the tracking accuracy between different systems, the ground-truth data are captured by the Xsens system [32]. As shown in Figure 7, this commercial system uses a set of inertial sensors embedded in a wearable suit such that user motions can be tracked from both camera-based and Xsens systems at the same time. Owing to the differences in the sensing locations on key body parts between the two systems, a set of joint vectors, j , is defined from the center of the body to each of the end-effector joints (i.e., hands and feet) and used to compare the angular differences between two outputs as follows:
\mathrm{Err} = 1 - \frac{\sum_{0}^{N_a} \left( \mathbf{j}^X \cdot \mathbf{j}^C + 1 \right) / 2}{N_a} ,
where j^C and j^X are measured over a total of N_a frames from the camera-based and Xsens systems, respectively.
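Under the reconstruction of Err above (one minus the mean cosine similarity remapped to [0, 1]), the measure can be computed as in the short sketch below, assuming NumPy and unit-length joint vectors.

import numpy as np

def tracking_error(j_cam, j_xsens):
    # j_cam, j_xsens: N_a x 3 arrays of unit joint vectors from the two systems.
    sim = (np.einsum('ij,ij->i', j_xsens, j_cam) + 1.0) / 2.0  # per-frame similarity in [0, 1]
    return 1.0 - sim.mean()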
For this and the following comparisons, a total of 3845 frames are collected from the six Taekwondo actions in Figure 8. As shown in Table 1, it is notable that the head part, which is visible most of the time during the tracking process, shows the highest accuracy. However, the hand parts are less accurate than the head and foot parts. This is mainly because larger noises arise in the hand areas whenever a user takes poses with the hands near the torso. Such poses are frequent in the Taekwondo actions in Figure 8 and make the local detection ambiguous between the hand and torso areas.
Figure 8. Recognition of various Taekwondo actions: Input RGB (left) and output depth data with tracked extreme points (right).
Table 1. Tracking accuracy of our system.
Next, our system is compared against the multi-Kinects system [11]. In this comparison, the body parts tracked by our system are compared against the corresponding major joints (i.e., HEAD, HAND_RIGHT, HAND_LEFT, ANKLE_RIGHT, ANKLE_LEFT) recovered from the Kinect skeletal data. As shown in Table 2, our system with two Kinect cameras outperforms the multi-Kinects system with two or four cameras for all tracked body parts. As expected, the head part shows the highest accuracy, as it is visible most of the time from all cameras. For the other parts, however, the multi-Kinects system suffers from erroneous skeleton reconstruction, especially in the foot areas, and shows relatively lower accuracy. It is noteworthy that using four cameras in our system yields negligible improvements in tracking accuracy due to the predominance of frontal movements in the collected data.
Table 2. Tracking accuracy comparison between our system and multi-Kinects system [10].

5.2. Action Recognition

As shown in Figure 8, our system recognizes various kick and punch motions from a user through three phases: Input motion segmentation, feature vector extraction, and motion type recognition. First, an input motion is segmented by detecting the starting and ending moments of key poses. The starting moment is determined based on the speed and position of the hands and feet. For example, an input motion starts at the moment when the speed of the hands and feet is under a threshold value (1 m/s) while the positions of both feet stay under a height threshold value (10% of the user's height). The motion ends at the moment when the sandbag system detects a hit from the user. To recognize a motion piece by comparing it to a reference, these motions should be aligned in the time-space domain, as each segmented motion differs in temporal length and user body size. Using a number of samples (10 to 15 depending on the complexity of the input poses), a sample set, P_{i,t}^S and n_{C,t}^S, is linearly interpolated from the trajectories of P_{i,t} and n_{C,t} in the segmented motion and defines the feature vector. For normalization of the feature vector, P_{i,t}^S is translated to an origin defined by the average of the initial foot positions, P_{i,0}^S, and divided by the user's height. In addition, n_{C,t}^S is normalized by 360 degrees.
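The feature-vector construction can be sketched as follows, assuming NumPy; traj maps each tracked extreme (H, RH, LH, RF, LF) to a T x 3 trajectory, orient is the T-frame body-orientation angle, and the sample count and origin follow the description above (the helper names are illustrative).

import numpy as np

def resample(track, n_samples):
    # Linearly interpolate a trajectory to a fixed number of samples.
    t_old = np.linspace(0.0, 1.0, len(track))
    t_new = np.linspace(0.0, 1.0, n_samples)
    return np.stack([np.interp(t_new, t_old, track[:, d])
                     for d in range(track.shape[1])], axis=1)

def feature_vector(traj, orient, user_height, origin, n_samples=12):
    # origin: average of the initial foot positions, used to translate the samples.
    parts = []
    for key in ('H', 'RH', 'LH', 'RF', 'LF'):
        samples = resample(traj[key], n_samples)
        parts.append(((samples - origin) / user_height).ravel())  # translate and scale
    parts.append(resample(orient.reshape(-1, 1), n_samples).ravel() / 360.0)
    return np.concatenate(parts)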
Given the feature vector for each input motion, an SVM is utilized to recognize the different motion types. Table 3 shows the training and test sets (a total of 67,925 frames from 1465 training and 392 test samples) used for recognizing 12 types of Taekwondo motions in an offline manner. For this data set, 10 general users performed each motion type. Table 4 shows that our system is capable of recognizing the motion types with an average accuracy of over 96% on the test set. However, some similar motions are incorrectly tracked and misclassified in the test set. For example, about 9% of front punches with the right hand were recognized as front punches with the left hand due to the fast exchanges of the left and right hands in the user input motions.
Table 3. Training and test sets used for action recognition in Taekwondo. The first column represents the action type and motion side to be recognized. A total of 1465 and 392 samples are collected from 10 general users to train (the second column) and to test (the third column) the action recognition, respectively. Here, the total frames for each action (the fourth column) combine the frames used in the training and test samples while each action has a different temporal length (the last column).
Table 4. Accuracy of action recognition in Taekwondo.

5.3. Motion Synthesis

As shown in Figure 9, our system is tested for synthesizing a sequence of dynamic movements from a few key poses. The key poses are captured with the Xsens system. Provided with a set of key poses and input parameters (a set of extremes tracked by our system), a sequence of in-between poses can be generated with a motion blending technique whose weight values are estimated by multi-dimensional scattered-data interpolation [33].
Figure 9. Key poses used for motion synthesis: front kick (left), round kick (middle), and front punch (right).
Figure 10 shows an instance of synthesizing three Taekwondo motions, where each of them is generated using a set of tracked extremes as input parameters and blending five key poses from example motions. As demonstrated in the results, the synthesized motions are comparable to the input motions, exhibiting the key movements of each Taekwondo motion type. It is noteworthy that some of the movement details, such as relative hand and foot positions, are not synthesized in the output motions due to the small number of key poses used to generate the in-between poses. In these results, it took about 3.43 s, 2.98 s, and 5.78 s to synthesize 1406, 1335, and 2485 frames, respectively. Thus, our system can produce over 400 frames per second, showing real-time performance for motion synthesis.
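As a sketch of the weight estimation in the spirit of [33], the snippet below computes normalized blend weights by radial-basis scattered-data interpolation, assuming NumPy; example_params are the parameter vectors of the key poses, query is the parameter vector built from the tracked extremes, and the Gaussian kernel and its width are illustrative choices.

import numpy as np

def rbf_blend_weights(example_params, query, eps=1.0):
    # One Gaussian RBF interpolant per key pose, fit so that each key pose reproduces
    # itself with weight 1 and every other key pose with weight 0.
    X = np.asarray(example_params, dtype=float)
    phi = lambda d2: np.exp(-eps * d2)
    G = phi(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise kernel matrix
    coeffs = np.linalg.solve(G, np.eye(len(X)))
    w = phi(((X - query) ** 2).sum(-1)) @ coeffs
    w = np.clip(w, 0.0, None)                                # keep weights non-negative
    return w / w.sum() if w.sum() > 0 else np.full(len(X), 1.0 / len(X))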
Figure 10. Motion synthesis from a set of key poses and input parameters: Front kick (top), round kick (middle), and front punch (bottom).

6. Conclusions

In this paper, we introduced a marker-less system for human pose estimation by detecting and tracking key body parts: Head, hands, and feet. Using multiple RGB-D cameras, our system minimizes self-occlusion problems by unifying the depth data captured from different viewpoints into a single coordinate system. To accelerate the search process for the candidate points on the body parts, a quadtree-based graph is constructed from the RGB-D image, and the accumulative geodesic distances on the graph are used to select a set of extreme points on the key body parts. During the tracking process, these points are used as input parameters for motion analysis. Whenever there are tracking noises or failures, a Kalman filter-based method for noise removal and recovery is introduced to correct and predict the extreme positions. Unlike the previous approaches using a learning-based model, our approach does not reconstruct a full skeleton structure to estimate human poses from input data. Instead, the input poses are abstracted with a small set of extreme points, making the detecting and tracking processes easier without solving the optimization problem for skeleton reconstruction. Using a small set of extremes as input data, our system can be applied to recognize a human action or to synthesize a motion sequence from a few key poses in real time. As demonstrated in the experimental results, our system shows higher accuracy than the multi-Kinects system even when the latter uses more RGB-D cameras.
The current system can easily be scaled by adding more RGB-D cameras as needed. For example, placing two more RGB-D cameras behind the user might provide better accuracy for the occluded poses in turning motions if space and system cost permit. Using other RGB-D cameras such as the Intel RealSense [3] was problematic due to noisy depth data and unstable support for the software library. Furthermore, our system is mainly designed to capture one user at a time. For multi-person pose estimation, the current detection method can be extended to extract multiple independent keys, possibly other than hands and feet, from the input images and to map each set of keys to a different person.
The proposed system produces higher tracking errors when there are frequent crossings of the hands and feet in an input pose. We are currently improving the tracking recovery process for such cases by analyzing the velocity gradients of the hands and feet. In addition, there is no temporal synchronization between the input data received from the multiple Kinect cameras, which do not support a triggering signal. An external sync generator could be adopted with more sophisticated cameras; however, such a configuration increases the overall system cost, making the system less applicable for general users.

Author Contributions

Conceptualization, S.H. and Y.K.; Data curation, S.H.; Formal analysis, S.H. and Y.K.; Funding acquisition, Y.K.; Investigation, S.H. and Y.K.; Methodology, S.H. and Y.K.; Supervision, Y.K.; Visualization, S.H. and Y.K.; Writing—original draft, Y.K.; Writing—review & editing, S.H. and Y.K.

Funding

This research was supported by the Sports Promotion Fund of Seoul Olympic Sports Promotion Foundation from Ministry of Culture, Sports and Tourism (s072016122016) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2017R1C1B5017000).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RGB-D: Red, green, blue, and depth
HCI: Human computer interaction
VR: Virtual reality
AR: Augmented reality
AdaBoost: Adaptive boosting
SVM: Support vector machine
GMM: Gaussian mixture model
ICP: Iterative closest point
CNN: Convolutional neural network
H: Head
RH: Right hand
LH: Left hand
RF: Right foot
LF: Left foot
SVD: Singular value decomposition
PCA: Principal component analysis
FK: Front kick
SK: Side kick
RK: Round kick
FP: Front punch
WP: Forward punch
FPOS: Front punch opposite side

References

  1. Patrizio, A. Japan Airlines Employs Microsoft HoloLens for Inspections and Training. 2016. Available online: https://www.networkworld.com/article/3098505/software/japan-airlines-employs-microsoft-hololens-for-inspections-and-training.html (accessed on 12 February 2018).
  2. Microsoft Kinect. Available online: https://developer.microsoft.com/en-us/windows/kinect (accessed on 15 March 2018).
  3. Intel RealSense. Available online: https://www.intel.com/content/www/us/en/architecture-and-technology/realsense-overview.html (accessed on 31 August 2018).
  4. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
  5. Zhang, L.; Sturm, J.; Cremers, D.; Lee, D. Real-time human motion tracking using multiple depth cameras. In Proceedings of the International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012; pp. 2389–2395. [Google Scholar]
  6. Kaenchan, S.; Mongkolnam, P.; Watanapa, B.; Sathienpong, S. Automatic multiple kinect cameras setting for simple walking posture analysis. In Proceedings of the International Computer Science and Engineering Conference, Bangkok, Thailand, 4–6 September 2013; pp. 245–249. [Google Scholar]
  7. Kitsikidis, A.; Dimitropoulos, K.; Douka, S.; Grammalidis, N. Dance analysis using multiple kinect sensors. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, 5–8 January 2014; pp. 789–795. [Google Scholar]
  8. Michel, D.; Panagiotakis, C.; Argyros, A.A. Tracking the articulated motion of the human body with two RGBD cameras. Mach. Vis. Appl. 2015, 26, 41–54. [Google Scholar] [CrossRef]
  9. Moon, S.; Park, Y.; Ko, D.W.; Suh, I.H. Multiple kinect sensor fusion for human skeleton tracking using kalman filtering. Int. J. Adv. Robot. Syst. 2016, 13, 1–10. [Google Scholar] [CrossRef]
  10. Kim, Y.; Baek, S.; Bae, B.-C. Motion capture of the human body using multiple depth sensors. ETRI J. 2017, 39, 181–190. [Google Scholar] [CrossRef]
  11. Kim, Y. Dance motion capture and composition using multiple RGB and depth sensors. Int. J. Distrib. Sens. Netw. 2017, 13, 1–11. [Google Scholar] [CrossRef]
  12. Grest, D.; Woetzel, J.; Koch, R. Nonlinear body pose estimation from depth images. In Joint Pattern Recognition Symposium; Springer: Berlin/Heidelberg, Germany, 2005; pp. 285–292. [Google Scholar]
  13. Zhu, Y.; Dariush, B.; Fujimura, K. Controlled human pose estimation from depth image streams. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  14. Ganapathi, V.; Plagemann, C.; Koller, D.; Thrun, S. Real-time human pose tracking from range data. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 738–751. [Google Scholar]
  15. Shuai, L.; Li, C.; Guo, X.; Prabhakaran, B.; Chai, J. Motion capture with ellipsoidal skeleton using multiple depth cameras. Trans. Vis. Comput. Gr. 2017, 23, 1085–1098. [Google Scholar] [CrossRef] [PubMed]
  16. Girshick, R.; Shotton, J.; Kohli, P.; Criminisi, A.; Fitzgibbon, A. Efficient regression of general-activity human poses from depth images. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 415–422. [Google Scholar]
  17. Shen, W.; Deng, K.; Bai, X.; Leyvand, T.; Guo, B.; Tu, Z. Exemplar-based human action pose correction and tagging. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1784–1791. [Google Scholar]
  18. Jung, H.Y.; Lee, S.; Heo, Y.S.; Yun, I.D. Random tree walk toward instantaneous 3D human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 2467–2474. [Google Scholar]
  19. Shafaei, A.; Little, J.J. Real-time Human motion capture with multiple depth cameras. In Proceedings of the Conference on Computer and Robot Vision, Victoria, BC, Canada, 1–3 June 2016; pp. 24–31. [Google Scholar]
  20. Ganapathi, V.; Plagemann, C.; Koller, D.; Thrun, S. Real time motion capture using a single time-of-flight camera. In Proceedings of the Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 755–762. [Google Scholar]
  21. Ye, M.; Wang, X.; Yang, R.; Ren, L.; Pollefeys, M. Accurate 3d pose estimation from a single depth image. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 731–738. [Google Scholar]
  22. Baak, A.; Muller, M.; Bharaj, G.; Seidel, H.-P.; Christian, T. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1092–1099. [Google Scholar]
  23. Helten, T.; Baak, A.; Bharaj, G.; Muller, M.; Seidel, H.-P.; Theobalt, C. Personalization and evaluation of a real-time depth-based full body tracker. In Proceedings of the International Conference on 3D Vision, Verona, Italy, 5–8 September 2013; pp. 279–286. [Google Scholar]
  24. Greff, K.; Brandao, A.; Kraub, S.; Stricker, D.; Clua, E. A comparison between background subtraction algorithms using a consumer depth camera. In Proceedings of the International Conference on Computer Vision Theory and Applications, Rome, Italy, 24–26 February 2012; pp. 431–436. [Google Scholar]
  25. Hwang, S.; Uh, Y.; Ki, M.; Lim, K.; Park, D.; Byun, H. Real-time background subtraction based on GPGPU for high-resolution video surveillance. In Proceedings of the International Conference on Ubiquitous Information Management and Communication, Beppu, Japan, 5–7 January 2017; pp. 109:1–109:6. [Google Scholar]
  26. Plagemann, C.; Ganapathi, V.; Koller, D.; Thrun, S. Real-time identification and localization of body parts from depth images. In Proceedings of the International Conference on Robotics and Automation, Anchorage, Alaska, 4–8 May 2010; pp. 3108–3113. [Google Scholar]
  27. Mohsin, N.; Payandeh, S. Localization and identification of body extremities based on data from multiple depth sensors. In Proceedings of the International Conference on Systems, Man, and Cybernetics, Banff, AB, Canada, 5–8 October 2017; pp. 2736–2741. [Google Scholar]
  28. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part based models. Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
  29. Jiang, G.; Cheng, J.; Pang, J.; Guo, Y. Realtime hand detection based on multi-stage HOG-SVM classifier. In Proceedings of the International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013; pp. 4108–4111. [Google Scholar]
  30. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  31. Besl, P.J.; McKay, N.D. A method for registration of 3-D shapes. Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
  32. Xsens Technologies. Available online: https://xsens.com (accessed on 15 March 2018).
  33. Sloan, P.P.; Rose, C.F.; Cohen, M.F. Shape by Example. In Proceedings of the International Conference on Symposium on Interactive 3D Graphics, Chapel Hill, NC, USA, 26–29 March 2001; pp. 135–144. [Google Scholar]
