Dual Quaternions as Constraints in 4D-DPM Models for Pose Estimation

The goal of this research work is to improve the accuracy of human pose estimation using the Deformation Part Model (DPM) without increasing computational complexity. First, the proposed method seeks to improve pose estimation accuracy by adding the depth channel to DPM, which was formerly defined based only on red–green–blue (RGB) channels, in order to obtain a four-dimensional DPM (4D-DPM). In addition, computational complexity can be controlled by reducing the number of joints by taking it into account in a reduced 4D-DPM. Finally, complete solutions are obtained by solving the omitted joints by using inverse kinematics models. In this context, the main goal of this paper is to analyze the effect on pose estimation timing cost when using dual quaternions to solve the inverse kinematics.


Introduction
Human pose estimation has been extensively studied for many years in computer vision. Many attempts have been made to improve human pose estimation with methods that work mainly with monocular red-green-blue (RGB) images such as [1][2][3][4][5].
With the ubiquity and increased use of depth sensors, methods that use red-green-blue-depth RGB-D imagery are fundamental. One of the methods that used such imagery, and which is currently considered the state of the art for human pose estimation, is Shotton et al. [6], which was commercially developed for the kinect device (Microsoft, Redmond, WA, USA). Shotton's method allows real-time joint detection for human pose estimation based solely on depth channel. Despite the state-of-the-art performance of [6] and the commercial success of kinect, the many drawbacks of [6] make it difficult to be adopted in any other type of three-dimensional (3D) computer vision system. Some of the drawbacks of [6] include copyright and licensing issues, which restrict the use and implementation of the algorithm for working on any other devices. Another drawback of the algorithm is the large number of training examples (hundreds of thousands) that are required to train its deep random forest algorithm, and which could make training cumbersome. Another drawback of [6] is that its model is trained only on depth information, and thus discards potentially important information that could be found in the RGB channels and could help approach human poses more accurately. To alleviate these and other drawbacks in [6], we propose a novel approach that takes advantage of both RGB and depth information combined in a multi-channel mixture of parts for pose estimation in single frame images coupled with a skeleton constrained linear quadratic estimator (Kalman filter) that uses the rigid information of a human skeleton to improve joint tracking in consecutive frames. Unlike kinect, our approach makes our model easily trainable even for nonhuman poses. By adding depth information, we increase the time complexity of the proposed method. For this reason, to speed up the proposed method, we reduced the number of points modeled in the proposed method compared with the original deformation part model DPM. Finally, we propose an inverse kinematics method for the inference of the joints not considered initially, which cuts the training time.
The main contribution of our method extends to: (i) a multi-channel mixture of parts model that allows the detection of parts in RGBD images; (ii) a linear quadratic estimator (KF) that employs rigid information and connected joints of human pose; (iii) a model for unsolved joints through inverse kinematics that allows the model to be trained with fewer joints and in less time. In our previous work, [7,8], it is shown that computational cost is too high. This is the reason why in this paper a dual quaternion solution is introduced to improve the computational cost of the previously proposed method. Our results show significant improvements over the state of the art in both the publicly available CAD60 data set and our own data set.

Related Work
Human pose estimation has been intensely studied for decades in the field of computer vision due to its wide applications. Some of the methods in the literature that attempt to solve this problem date back to the use of pictorial structures (PS) introduced by [9]. More recent methods improve the concept of PS with improved features or inference models, as in [3,[10][11][12][13]. Recently, the launch of low-cost RGB-D sensors (e.g., kinect) has further triggered a large amount of research due to their good performance from extra depth information whose intensities depict an inversely proportional relationship between the distance of the objects to the camera. The existing algorithms can be roughly categorized into three groups, i.e., using only RGB sensor, using only Depth sensors, or using both RGB and Depth sensors. Some approaches in the first group are [1,[14][15][16][17][18]. Some approaches in the second group are [6,[19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38]. Some approaches in the third group are [39,40].
Using RGB sensors, Yang et al. [1] uses a mixtures of parts model based on a robust joint relationships, Sapp et al. [14], in turn, uses a multimodal decomposable model, Bourdev et al. [16] addresses the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet, and Wang et al. [15] considers part-based models by introducing hierarchical poselets. Ionescu et al. [17] describes automatic 3D human pose reconstruction from monocular images, based on a discriminative formulation with latent segmentation inputs. Using depth sensor, Shotton et al. [6], which was developed for the kinect algorithm, has become the state of the art for performing human pose estimation that predicts 3D positions of body joints from a single depth image. As mentioned in [26], to capture the human pose efficiently from multi-view video sequences, a sum of Gaussian (SoG) model was developed in [32]. This simple yet effective shape representation provides a differentiable model-to-image similarity function, allowing a fast and accurate full body pose estimation. The SoG model was also used in [33,36,40] for human or hand pose estimation. Extended from SoG, a generalized SoG model (GSoG) was proposed in [34], where it encapsulated fewer anisotropic Gaussians for human shape modeling, and a similarity function between GSoG and SoG was derived in 3D space. Meanwhile, a sum of anisotropic Gaussians (SAG) model [37] shared the similar spirit with GSoG for hand pose estimation, and it provided an overlap measurement between projected SAG and SoG/SAG in 2D image. Although GSoG and SAG based approaches have improved the pose estimation performance with better model adaptability, their similarity functions are specifically designed for different situations/applications. In addition, the clamping function that aims to handle the model intersection problem in previous SoG-based approaches [32,34,36] leads to a discontinuous energy function that could hinder the gradient-based optimization. In [26], inspired by the classical Kernel Correlation-based algorithm [38], generalizes previous SoG-based methods and derives a unified similarity function from the perspective of Gaussian kernel correlation. Ding et al. [26] embeds a kinematical skeleton into the kernel correlation, which enables us to achieve a fast articulated pose estimation.
Using both RGB and depth sensors, object detection has been done using RGB-D with Markov Random Fields (MRFs) and features from both RGB and Depth [39]. Ding et al. [40] defines a method that can capture a broad range of articulated hand motions at interactive rates.
The proposed method uses both RGB and Depth information and a discriminative method using a deformable parts model combined with a generative method using Kalman filter for tracking the human pose.
We first explain the proposed method, Section 2, using the pre-processing step, Section 2.1, for the depth channels in which the background was removed to improve the accuracy of our algorithm (see Figure 1). Section 2.2 explains the formulation of our four dimensional (4D) mixture of parts model. Section 2.3 explains our structured quadratic linear estimator for correcting joints in consecutive frames. Section 2.4 explains the polisphere model used. Finally, the Section 2.5 describes the strategy to reduce the computational complexity of our proposed method using dual quaternions. Finally, Section 3 shows us the results obtained comparing the proposed method (4D-DPM) with the original method DPM in Section 3.1 and a time complexity analysis in Section 3.2.

Data Pre-Processing
As a processing step of RGB channels, we isolate significant foreground areas in these channels from background noise. This is done by removing regions in the depth images that are most unstable to different thresholds that belong to the background. Such a foreground and background template is then transferred to the RGB images to thus remove noise or conflicting object patterns that would confuse foreground and background features in our method, and would hinder detection accuracies.
The intuition behind this approach is that objects or people in the foreground seen through the depth sensor share areas with similar pixel intensities. The reason for this is that the infra red (IR) rays being reflected from the objects in the foreground are reflected more or less at the same time and with the same intensity. Other objects or areas that are much farther away from the IR camera unevenly reflect such rays, and these areas appear noisier and with varying intensities. Figure 2 shows the different intensities reflected from the IR sensor that represents the depth coordinates of the objects.
Due to this property of the pixel intensities in the depth images, our background removal method, which is used for depth and later applied to the RGB images, uses a maximally stable extremal regions (MSER) based approach [41]. These regions are the most stable ones within a range of all possible threshold values being applied to them. A stability score δ of each region in the depth channels is calculated so that δ = |∆R−R| |R| , where |R| represents the area of the region in question and ∆ represents the intensity variation for the different thresholds. Hence, we remove those MSER regions in which areas are above a T threshold. We train the parameters for MSER based on a subset of the training set.
We can see in Figure 2 the results from our background subtraction method. Note that most of the noisy pixels in the background have been removed.

Multi-Channel Mixture of Parts
Until recently, Yang and Ramanan's method [1] has been a state-of-the-art method for pose estimation in monocular images. Yang and Ramanan's method performs poorly on images that vary from those in its training set, and their method only improves by a small margin even after retraining.
Although there have been other algorithms that have improved Yang and Ramanan's model, such as [2,3,5], all of these methods, including Yang and Ramanan's, use a mixture of parts for only the RGB dimension of channels. Conversely, our method uses a multi-channel mixture of parts model that allows us to extend the number of mixtures of parts to the depth dimension of RGBD images. The depth channel increases time complexity, but this disadvantage has been solved by cutting the number of joints modeled in our 4D-DPM method. On [7,8,42], we can find the main equations changed to introduce the new dimension, depth channel.

Point Detection in Consecutive Frames
To date, we have dealt only with pose estimation for each single frame independently. However, most of the joint movement performed in normal circumstances displays uniform and constant changes of displacement and velocity. Hence, we can use joint velocity and acceleration to predict where joints would most likely be, given their past history. This motion-based prediction could help us validate our frame-based prediction.
One way of predicting joint location based on previous detections is by using a linear quadratic estimator (LQE). Using a simple LQE works well when the joints being tracked are independent of each other and their movement does not correlate. However, in our case, our joints are connected to each other through limbs, which are rigid connections and allow the movement of one joint related to the other one to be connected; e.g., the foot joint movement would be relative to a parent joint like as a knee or hip.
Using the same algorithm as [7,8,42], a Kalman filter is used for tracking the points of interest.

Geometric Model
In the case of improving the results of the Kalman filter, we introduce some restrictions using the geometric model. To do that, we use a polisphere to represent the human body. This representation allows us to detect collisions between the different parts of the body.
In Figure 3, we can see the geometric model used. Green parts are the principal spheres used and delimit each part of the body.

Model Simplification
The additional depth images included in our formulation add computational cost to our training and testing phases.
In this section, we explain a simplification technique that uses inverse kinematic equations in order to infer shoulder and knee joints. The original DPM model calculates the full body parts with 14 joints. By using inverse kinematics, we can lower that number of points to 10. The joints modeled in our proposed 4D-DPM method were reduced, as were the variables to be predicted with KF.
• Human body model: In order to track the human skeleton, we model it as a group of kinematic chains, where each part and joint in the human body corresponds to a link and joint in a kinematic chain. Given the joint positions predicted by the KF, inverse kinematics are used to obtain all of the joints using Dual Quaternions (DQ). • State variables: The human body model is divided into four main kinematic chains (KC) that perform collision detection with their correspondent state variables, in essence: one KC for each arm and one for each leg. Figure 4 shows the state variable for each KC. • DQ model: We use DQ to model each KC. In this sense, we use six joints for each KC for shoulders, hips, hands, and feet (see Figure 4). • DH model: We use the Denavit-Hartenberg (DH) method to obtain the base coordinate system for each joint. After that, we will apply the dual quaternion method. First, we establish the base coordinate system (X 0 , Y 0 , Z 0 ) at the supporting base with the Z 0 axis lying along the axis of motion of joint 1. We have four base coordinate systems (X 0 , Y 0 , Z 0 ), each one located at (X 1 , Y 1 , Z 1 ) from each KC. Then, we establish a joint axis and align the Z i with the axis of motion of joint i + 1.
We also locate the origin of the i th coordinate at the intersection of the Z i and Z i−1 or at the intersection of a common normal between the Z i and Z i−1 . Then, we establish X i = ± (Z i−1 × Z i ) / Z i−1 × Z i or along the common normal between the Z i and Z i−1 axes when they are parallel. We also assign Y i to complete the right-handed coordinate system. Finally, we find the link and joint parameters: θ i (angle of the joint with respect to the new axis), d i (offset of joint along the previous axis to the common normal), a i (length of the common normal), and α i (angle of the common normal with respect to the new axis).
For each KC, we have six variable joints q i . Each q i is placed on the z i axis in Figure 4. (the left leg in Figure 4 has the same coordinate systems as the right leg.) Once we have the coordinate systems for each joint, a dual quaternion method is explained. First, we introduce a DQ representation and then we explain the kinematics. A DQ is: whereq s is a dual scalar,q v is a dual vector, q and q 0 are two quaternions and ε is a dual unit. We define the next expressions: These equations represent different parts of quaternions product: V{q} is a vectorial part, S{q} is a scalar part, S{R{q}} is a scalar part of real part, S{D{q}} is a scalar part of dual part, V{R{q}} is a vectorial part of real part and V{D{q}} is a vectorial part of dual part.
All the movements of the rigid body in the 3D space, with the exception of pure translation, are equivalent to the screw movements, that is, the rotation on the line together with the translation on the line.
If the line passes over the origin, the movement of screw can be written as: where R(θ, d) represents the 3 × 3 rotation matrix on the axis in the direction of the unit vector d through an angle θ.
If the axis of the screw movement does not pass over the origin, it can be written as: We can represent a screw movement as dual quaternions as follows: In a Plücker coordinates, each line can be fully represented by an ordered set of two vectors. The first point is a vector p that indicates the position of an arbitrary point on the line, and the second point is the direction vector d, which gives us the direction of the line. The Plücker coordinate can be represented as follows: where m = p × d is the vector moment of d with respect to the reference origin selected. We can represent one line on Plücker coordiante asl a = l a + m a , and we can transform that expression tol b =q l a q * using the dual quaternion unit.
The representation of the Plücker coordinates is not minimal since it uses six parameters for the representation of the line. The main advantage of the representation of the Plücker coordinates is that it is homogeneous. L p (m, d) represents the same line as L p (km, kd), where k ∈ .
To solve the forward and inverse kinematics using dual quaternions, we use Paden-Kahan subproblems. We have three sub-problems of Paden-Kahan, and Figure 5 shows graphically the three sub-problems. To solve the sub-problem 1, we have y = q ⊗ x ⊗ q * as the general movement equation, where θ = arctan2(S{l ⊗ x ⊗ y }, S{x ⊗ y }), x = x + S{l ⊗ x}l, and y = y + S{l ⊗ y}l, l = [0, l] is the director vector of l.
To solve the sub-problem 3, we have ||y − q ⊗ x ⊗ q * || = ||γ|| as the general movement For forward kinematics, given the six variable joints (q 1 , q 2 , q 3 , q 4 , q 5 , q 6 ), we obtain the coordinates of end effector (x, y, z) with respect to the base of the KC.
As Equation (1), the transformation operators DQ can be obtained as follows: where for prismatic joints q i = [1, 0, 0, 0] and q 0 i = [0, q 0 1 , q 0 2 , q 0 3 ], for revolute joints In addition, where d i is the rotation axis vector, m i is the vector moment, and θ i is the angle of rotation and i = 1, 2, ...n.
We can obtain Equation (9) using the intersection of two orthogonal unit line vectors given by: For inverse kinematics, given the coordinates of end effector, p 6 , and the orientation,l 6 , in Euler parameters, (x, y, z, φ, θ, ψ), we can obtain the six variable joints, (q 1 , q 2 , q 3 , q 4 , q 5 , q 6 ), as we show below.
We use inverse kinematics because we can obtain the base of our KC (shoulders or hips), and where the final effector and the orientation (hands and feet) are; thus, we have these parameters: (x, y, z, φ, θ, ψ) and, using inverse kinematics, we obtain the six variable joints,(θ 1 , θ 2 , θ 3 , θ 4 , θ 5 , θ 6 ), and use them to know where the elbow or knee are located.

Results
• 3D Camera Calibration: Our method works with any RGB-D sensor after correct calibration.
In our experiments, we use a kinect device and calibrate the intrinsic and extrinsic parameters of the monocular and IR sensors. The calibration system is done similarly to [43] or [44,45]. • Data sets: To train and test our method, we use a combination of videos from our own data set and a subset of the publicly available CAD60 data set [46]. • CAD60 data set: The original CAD60 data set [46] contains 60 RGB-D videos, four subjects (two male, two female), four different environments (office, bedroom, bathroom and living room) and 12 different activities. This data set was originally created for the activity recognition task [47][48][49]. The size of images is 320 × 240 pixels. • Our data set: It consists of seven videos with only one person on the scene moving his arms and legs. We had almost 1000 frames of people to obtain specific movements, e.g., crossing arms over one's body, to complement the CAD60 data set. Images were taken indoors in different scenarios. The subject inside the images is a male who wears different clothes. The size of the images is 320 × 240 pixels.
The ground truth of the joints in this data set was obtained by recording predictions from kinect. Thus, in order to make a fair comparison of the predictions from the methods being tested, we provide the videos to our human annotators to manually record the ground truth of the joint positions in the CAD60 data set. Thus, our annotators recorded over 15, 000 frames of videos that correspond to 16 videos from the CAD60 data set with different activities and environments. For training and testing purposes, we use two different splits of such annotations. We chose to manually annotate the CAD60 data set because, to our knowledge, there is no RGBD data set with the ground truth of human pose joints. We will also publicly release our annotated videos for the benefit of the research community.
We can find some other data sets using RGB and depth images for pose estimation, but they can not be used in our proposed method due to annotation problems.
Metrics: The metrics we use in our different experiments are the probability of a correct kypoint (PCK), the average precision keypoint (APK) and error distance.
PCK: The probability of a correct keypoint (PCK) was introduced by Yang and Ramanan [1]. Given the bounding box, a pose estimation algorithm must report back the keypoint locations for body joints. The overlap between the keypoint bounding boxes was measured, which can suffer from quantization artifacts for small bounding boxes. A keypoint is considered correct if it lies within α · max(h, w) of the ground truth bounding box, where h corresponds to the height and w to the width of the corresponding bounding box. α is a parameter that controls the relative threshold to consider the correctness of the keypoint.
APK: In a real system, however, one has no access to annotated bounding boxes at the test time, and one must also address the detection problem. One can cleanly combine the two problems by thinking of body parts (or rather joints) as objects to be detected, and evaluate object detection accuracy with a precision-recall curve. The average precision keypoint is another metric introduced by Yang and Ramanan [1], where, unlike PCK, it penalizes false-positives. Correct keypoints are also determined through the α · max(h, w) relationship.
Error distance: This metric calculates the distance between the results and the correct labeled point. To do this, we calculate the distance error between the predicted result and the ground truth location. For each joint, we obtain an error score that is the mean value calculated from all of the frames. Table 1 compares our results with Yang and Ramanan's [1] original method (Yang*) trained with the same images that we used to train our proposed method (P. Method*). Observing the results obtained in Table 1, and by comparing our proposed method with the original DPM, trained both with the same range of images and tested with the same range of images, but a different one of trained images, we have improved the results with the proposed method by adding depth information, a Kalman filter and using Denavit-Hartenberg (DH), in order to cut the number of points modeled in the DPM. Observing the results in Table 1, and independently of the data set used to test or train parts, our proposed method obtains better solutions. This means that the results can be repeatable with different data sets. Table 1. Experimental comparisons with the state-of-the-art methods on our proposed data set. The probability of a correct kypoint (PCK) and the average precision keypoint (APK) metrics are expressed on %. Error is expressed in pixels.  Table 1 shows the results using KF and DH. , and using DQ will obtain the same results in accuracy as using DH. For this reason, a table comparing the original DPM model with our proposed method is not shown. We discuss in the next section the difference between DH and DQ.

Time Complexity Analysis
For our experiments, we use a system based on Windows 7 (Microsoft, Redmond, WA, USA) with 64 bits and 4 GB RAM. The processor used is Inter Core Quad 2.33 GHz (Intel Corporation, Santa Clara, CA, USA). We calculate for each frame the average time taken for the proposed algorithm to process the frame. The images used have 320 × 240 pixels.
In our previous work, we used (DH) kinematics instead of dual quaternions. Dual quaternions are faster than a transformation matrix used in DH and do not have singularities in their solutions.  Table 2 shows the number of operations for one degree of freedom, for n degrees of freedom, we have on DH 64 (n − 1) products operations and 48 (n − 1) between sums and subtraction operations, while, for DQ. we have 48 (n − 1) products operations and 40 (n − 1) between sums and subtractions operations. Figure 7 shows how many operations we need for each degree of freedom added.  Figure 8 shows the time needed to make the operations. We can see that DQ is faster than DH; for this reason, we opted to use DQ. All of these comparisons about computational cost using dual quaternions are for one kinematic chain. In our proposed method, we are using four kinematic chains for which we have to multiply these results by 4.
Finally, the computational cost of the proposed method is 6.85 s and the original DPM method takes 9.21 s.

Conclusions
In this paper, we have presented a 4D-DPM model using RGB-D information to improve accuracy and timing cost. We use MSER for foreground subtraction. We use dual quaternions to reduce the number of points of interest inside the imagery. We use a polisphere to draw the results and detect collisions between the different parts of the body. All of this allows us to reduce the time complexity during the training part using a smaller fraction of training samples.