Hybrid Orientation Based Human Limbs Motion Tracking Method

One of the key technologies that lays behind the human–machine interaction and human motion diagnosis is the limbs motion tracking. To make the limbs tracking efficient, it must be able to estimate a precise and unambiguous position of each tracked human joint and resulting body part pose. In recent years, body pose estimation became very popular and broadly available for home users because of easy access to cheap tracking devices. Their robustness can be improved by different tracking modes data fusion. The paper defines the novel approach—orientation based data fusion—instead of dominating in literature position based approach, for two classes of tracking devices: depth sensors (i.e., Microsoft Kinect) and inertial measurement units (IMU). The detailed analysis of their working characteristics allowed to elaborate a new method that let fuse more precisely limbs orientation data from both devices and compensates their imprecisions. The paper presents the series of performed experiments that verified the method’s accuracy. This novel approach allowed to outperform the precision of position-based joints tracking, the methods dominating in the literature, of up to 18%.


Introduction
Human limbs motion tracking can be defined as a process of unambiguous limbs joints spatial positioning, devoid of significant delays. Nowadays, it might be exploited within several areas, such as entertainment issues, virtual environments interaction or medical rehabilitation treatment [1]. However, it is the system precision that determines its possible application.
Since 1973, when Gunnar Johanson invented his motion capture system [2], such techniques were available mainly for professionals. However, since 2010, when the Microsoft Company (Redmond, WA, USA) released the Kinect v.1 controller, motion capture became available to almost everyone. The Microsoft Kinect controller is an exemplary implementation of the optical, markerless depth sensor with the most remarkable human motion capture functionality. Though it was designed for, and mainly used in the field of home physical activity games, Kinect became a popular subject for many researchers, whose goal was to find other, more advanced scenarios, where this controller might be applied e.g., Lange et al. [3] and Chang et al. [4] designed Kinect based systems to support physical rehabilitation for people with motor disabilities.
Microsoft Kinect represents a broad group i.e. PrimeSense Carmine (PrimeSense, TelAviv, Israel), ASUS Xtion (AsusTek Computer Inc., Taipei, Taiwan) of mono-view markerless depth sensors that exploit structural light patterns to reconstruct the depth of the scene and track perceived objects, with special attention put to human limbs and body parts tracking (i.e., face tracking). This group reveals data fusion as well as IMU and Kinect fusion. The uniqueness of the proposed method was taking into consideration geometrical constraints of the human pose. If, according to mentioned constraints, the estimated pose was impossible to achieve, the estimation was aligned to the closest, valid value. Thanks to that approach, authors achieved deviation of rotation angle estimation lower than 20 • .
The analysis of the available literature shows that the majority of known data fusion methods is based on different variants of Kalman filter (linear, extended or unscented), to fuse accelerometer with gyroscope data as well as IMU data with these provided by the Kinect device. Only Kalkbrenner et al. decided to fuse accelerometer and gyroscope data with Madgwick's filter, which is designed especially for this kind of data and, regarding the Madgwick's report [13], it has better accuracy and performance than the linear Kalman filter. It is also noticeable that authors of these methods are selectively taking into consideration characteristics and limitations of measurement devices. As an example, Bo et al. [11] and Destelle et al. [12] focused mainly on IMU devices and their intrinsic data fusion, treating Kinect device data as a reliable reference for their methods, despite the known inaccuracy of this device. Kalkbrenner et al. [14] as well as Feng and Murray-Smith [15] tried to compensate some Kinect's measurement problems using proper parameters in the Kalman filter. However, they focused only on joint position estimation fluctuations, e.g., Kalkbrenner recalculated fusion parameters only when the difference in the selected joint position estimation value, between two consecutive measurement frames, was greater than 15 cm. None of the authors included in their methods any compensation for IMU and Kinect's measurement inaccuracy due to the environment conditions (the device operational temperature) or the context of the performed motion (the body rotation angle to the camera, the distance from the measurement device). The fact that these kinds of factors influence noticeably the quality and the accuracy of data measured by both devices may have an impact on the further data processing and on the final joint's position estimation. Authors of the discussed methods fuse the signals of both measuring devices basing on joint positions and they don't take bone orientations into consideration during that process. This fact has an impact on the final results, as joint positions used during the fusion include an additional, significant error caused by the non-fixed bones model length. The data fusion based on bone oation is free of this kind of error and it uses data that are measurable directly by both considered devices. This paper presents the novel approach to depth sensor (Microsoft Kinect) and IMU data fusion that systemically makes up for limitations of both measurement devices and compensates their imperfections. The novel orientation based, human limbs' joint positions tracking method outperforms state-of-the-art hybrid systems' accuracy by 11-18%.

Microsoft Kinect Controller Characteristics
The Microsoft Kinect controller might be described as an RGB-D (Red, Green, Blue and Depth) camera. The first version of this device, originally designed for the Xbox 360 video game console, was built from two CMOS (Complementary metal-oxide-semiconductor) cameras and an integrated infrared (IR) projector. One of these CMOS cameras was responsible for an RGB signal and the second one was calibrated to record IR beam's distribution. Considering the limbs' motion tracking and the body gesture recognition, the most crucial part of Microsoft Kinect controller is the chip made by the PrimeSense company (PrimeSense, TelAviv, Israel). All recognition algorithms that stand behind Kinect's motion tracking were implemented as a firmware for this processing unit. The Microsoft Company has not revealed any detailed description of tracking algorithms and key characteristics, so their retrieval became a subject for scientists and hobbyists [18][19][20]. The good understanding of the Kinect controller constraints seems to be indispensable to create an accurate method that would fuse its data with data collected from any other source.
In the official device specification, it is claimed that the operating range of Kinect varies between 0.8 m and 4 m with the field of view 57 • horizontally ( Figure 1a) and 43 • vertically (Figure 1b). At the same time, the specification does not reveal the possible heterogenity of the measurement's accuracy. However, the experiments conducted by the authors showed that there are distance measurement nonlinearities within the Kinect's working area. Moreover, researchers [21] and hobbyists [22] reported that the operating range is different among device series. A comparison of fluently changing distances between Kinect and limbs' joints, collected simultaneously from Microsoft Kinect controller and Vicon (Vicon Motion Systems, Inc., Los Angeles, CA, USA) (Vicon-high accuracy, reliable, optical marker-based motion capture system) motion capture system, showed that Kinect has tendencies to underestimate the distance in the close range and to overestimate it in the far range. Collected values (squares on Figure 2) allowed to elaborate a distance estimation error function in a form of 3rd order polynomial, as it is presented in Equation (1), calculated based on Equation (2), where X is the matrix of sample points, Y is the matrix of values in these sample points and A is a coefficients matrix used in Equation (1): (2) Figure 2 presents the plot of function defined with Equation (1). The analysis of this chart shows that Kinect's distance estimation is optimal when the user stands about 2 m from the device (2-2.3 m). Outside this distance range, when Kinect works with a noticeable error, the systematic sensor depth correction should be taken into consideration. Such inaccuracy and its tendencies are probably caused by the algorithm implemented in the Kinect controller. Though Microsoft did not reveal any precise technical description of their method for measuring distance between the camera and objects at the scene, original patent forms [23][24][25] owned by PrimeSense, combined with independent research results [26], provide some general overview of how this recognition algorithm works. It is known that the analyzed scene, in front of the Kinect controller, is enlightened with IR light dots structured pattern, which is skew symmetric, so the Kinect can work upright or upside down. Then, Kinect analyses the IR pattern's dots distortion, and, basing on that, it estimates the subject distance. For the depth map reconstruction, Kinect uses two techniques in parallel: dots blurriness analysis [27] and stereo-vision based on a single IR camera and a projector [28]. The achieved depth map is a foundation of a human skeleton estimation algorithm. Depth-retrieved human body parts estimations are subsequently classified into body poses by means of the machine learning approach: random decision forest algorithms as well as object detection algorithms, such as the one proposed by Viola-Jones [29,30].
As the only source of data for that estimation process is an IR camera (RGB camera is not used at all), the observed scene must be free of any external source of the IR light. This requirement restricts the usage of Kinect in outdoor scenarios. A need for the scene isolation from any IR sources other than the Kinect itself is the main reason why Microsoft does not recommend using two or more controllers simultaneously. Nevertheless, scientists invented and published methods on combining signals from several Kinect devices [31][32][33]. It is noticeable that data gathered from the Kinect is incomplete, due to the lack of information about joint rotation along the bone. This is the result of the skeleton model design, where each of 20 tracked joints is described as a single point.
The last major flaw of Microsoft Kinect controller is an occlusion, which occurs when a part of a user's body is covered by another object or is hidden behind any other body part (self-occlusion). The occlusion by an external object seems obvious and does not require any additional explanation; however, self-occlusions are less intuitive. They are connected with Kinect's sensitivity to user's rotation to the camera (α angle in Figure 3), as Microsoft recommends working with Kinect in the face off pose, which is not precisely defined. Self-performed experiments allowed observing α angle changes while the user rotated in front of the camera. Assuming that P Sh L = [p K Sh L ,X , p K Sh L ,Z ] is the position of the left shoulder and P Sh R = [p K Sh R ,X , p K Sh R ,Z ] is the position of the right shoulder, both defined in a Kinect controller coordinating system (limited to X and Z axes in Figure 3), then α can be calculated according to Equation (3).
There are two possible strategies that Kinect uses when the occlusion happens. Either the device tries to estimate the location of the covered joint or, when it is not able to perform even rough estimation of the joint position, it stops tracking of such a joint. The results of self-performed experiments show that considerable occlusions occur when the user is rotated by more than 50 • (α > 50 • ).
Charts presented in Figures 4 and 5 show measured joints tracking states and the elbow angle (angle β in Figure 3) changes during the rotation, respectively. In Figure 4, it is visible that tracking the states of both shoulders joints fluctuate between Tracked and Interferred states, when the subject was rotated more than 50 • (α > 50 • ) in relation to Kinect's camera. It is worth mentioning that even an easily visible right shoulder joint lost tracking state during that rotation. In Figure 5, it can be noticed that the estimation of the right elbow angle β was unstable for considerable body rotation (α > 50 • ), though the hand was fully visible all the time.

IMU Characteristics
IMU devices have been in professional usage from decades. Within the scope of the current study, there are two types of sensors that measure inertial forces affecting them: accelerometers and gyroscopes. All experiments presented in this paper were performed with the usage of IvenSense MPU-6050 module (San Jose, CA, USA), which integrates both of these types of inertial devices and a thermometer. Even though the accelerometer and the gyroscope were integrated on a single PCB (Printed Circuit Board), their data must be processed individually, as they are affected by the different set of external noises with various frequencies that must be filtered out to make them usable. The IMU device used in our research can be classified according to [34] as an industrial device. Characteristics described in the article are thus representative for other devices belonging to this class.
An accelerometer is an inertial sensor that measures linear forces, affecting it along the device coordinating system axes. Usually, the measurements are defined in relation to the gravity force (1 g ≈ 9.81 m/s 2 ), so it is possible to estimate its linear acceleration during the motion. When the device rests, the theoretically correct measurement should be 1 g in the upward direction and 0 g in two other directions. However, due to external noises (mainly high frequency), the resting device measurements are oscillating around theoretical values. The analysis of these oscillations, during calibration, allows for designing low pass filter [35,36] that would be able to filter out at least major part of these high frequency noises. Accelerometers are also sensitive to operating temperature changes and its influence on sensor's accuracy was the subject of some researches [37,38]. The temperature sensitivity is caused by the architecture of the sensor, which is built with several capacitors, hence the operating temperature has influence on their capacity. During the self conducted experiment, the measurement of the g-force was observed when the resting device was heated and then cooled down multiple times in the temperature range 10-50 • C. The results of this experiment are presented in Figure 6. This chart shows that the measurement tends to change from under to overestimation with the temperature change. It is also worth noticing that the measured value of the gravity is up to 4% different than the expected, what may influence the further results. According to the MPU-6050 specification and results of self-conducted experiment, the neutral operating temperature for the considered module is 25 • C. Due to the fact that the operating temperature (T) of the IMU device, placed on a human body, rises and stabilizes at approx. 30 • C, some sort of compensation (f(T)) is required there. Equation (4) presents the formula of the measured error: Gravity (g) Figure 6. Accelerometer gravity measurements in temperature range 10-50 • C.
A gyroscope-the second type of inertial sensors-measures its angular velocity in deg/s units. If the sensor is in the resting state, all measurements should equal 0 for each of three axes. However, the device suffers from the long-term bias that should be limited (ideally-removed). The analyzed noise has low frequency characteristics, thus the appropriate high pass filter [39], limiting its influence, should be applied. As a MEMS (MEMS-microelectromechanical system) based gyroscope is also built from multiple capacitors, their bias is influenced by the temperature as well. However, as the distortions related to the temperature have low frequency nature, their influence on the data is reduced during signal filtration by a high-pass filter.
The last problem, considered by the authors, related to IMU devices is the incompleteness of the data provided by these units. To calculate the accurate orientation of the IMU module, data from the accelerometer and the gyroscope (orientation around corresponding axes) should be fused together. The accelerometer allows for calculating orientation around two axes, even though it measures forces along all three orthogonal directions. The orientation (or the rotation) around the gravity vector is unmeasurable for such device. Theoretically, the gyroscope should retrieve orientation measures around all three directions; however, even the filtered signal contains some errors. Its numerical integration over time results in the significant data drift that makes such measurements useless. An example of the data drift is presented in Figure 7. Despite the sensor remaining motionless during the experiment, the estimations based on the numerically integrated data showed the sensor theoretical rotation about 60 • within 2 min (about 0.5 • /s).

Hybrid, Orientation Based, Human Limbs Motion Tracking Method
As an input to the method, two (Kinect and IMU), independently registered, streams of data representing tracked human skeleton bone oations, were considered. The problem that the motion tracking method needs to face was estimating human skeleton selected joint (j) fused (F) from considered data sources, position (P F j,t = [p F j,x , p F j,y , p F j,z ] t ) within the referenced three-dimensional coordinating system, in a particular moment of time (t). To solve this problem, the authors propose a novel, hybrid human limbs motion tracking method, based on the continuous linear fusion of skeleton bone oations (contrary to state-of-the-art skeleton joint positions fusion) with respect to the current motion context. It takes into consideration the controllers' reliability and compensates measurement devices' imperfections described in previous chapters. The data fusion method has been split into several, sequential phases presented on a diagram in Figure 8 and described in the further part of this article.  The discussed novel data fusion method can be presented as a general formula with Equation (5): The estimation of the selected joint j position (P F j,t ) in particular time t is based on the measurements of the accelerometer (A = [a x , a y , a z ]), the gyroscope (G = [g x , g y , g z ]) and the positions of the selected joint and its parent, measured by the Kinect controller (P . Additionally, to compensate for imperfections of IMU measurements, the current operating temperature T is taken into consideration. In addition, positions of both limb (i.e., shoulder) joints (P K sh L ,t , P K sh R ,t ) are used to determine the reliability of the data and choose the right data fusion formula. The discussed method estimates the joint position with some time interval ∆t aligned to the Microsoft Kinect update interval. IMU are stuck to the body surface, on the bone of length l, between the tracked joint j and its parent joint j − 1. To define the fused position P F j,t in superior coordination space, it has to be aligned to the fused position of its parent joint P F j−1,t . The simplified hierarchical model of the limb (described on a hand example), as well as the placement of IMUs, are presented in Figure 9.

Coordination Space
Both measurement devices have different local coordination spaces that presented in Figure 10. As the majority of the data processed in the method are gathered from the Kinect controller, the global coordination space equals the Kinect's coordination system. This allows for reducing the number of required transformations and to improve the overall performance of the system. To unify both coordination spaces, frame of reference needs to be defined according to which any further rotations are estimated. During the calibration process, when a user stands without motion in T-pose in front of a Kinect camera, such frame of reference is defined. Later, during motion, each IMU estimation ([x, y, z] T I ) needs to be transformed into the Kinect's coordination space ([x, y, z] T K ). As an IMU coordination system is anchored to the gravity vector, and retains its orientation regardless of device rotations, such transformation is defined as a sequence of a rotation by 180 • around the y-axis (R Y (180)) and scale (S) transformations according to Equation (6). A definition of one common coordination space for both measurement devices is necessary to fuse their signals:

IMU Initialization
The goal of the IMU initialization procedure, presented in Figure 11, is to calculate the data (G, A) correction factors matrix (cor = [cor A cor G ] T = [[c a x, c a y, c a z] [c g x, c g y, c g z]] T ) for each IMU individually.   Figure 11. Diagram of the IMU initialization routine.
During this initialization, IMU modules must lay down without any motion, in the face-up position, as this is the position, in which we know exactly the expected data (A 0 = [0, 0, 1] for the accelerometer and G 0 = [0, 0, 0] for the gyroscope). However, the ideal values are impossible to achieve due to the noise that cannot be completely removed. Because of that, the calibration routine works iteratively (s -iteration index) as long as the average IMU measurement errors ([A G] T = [[a x , a y , a z ] [g x , g y , g z ]] T ) are lower than the defined threshold ([A th G th ] T = [[a x,th , a y,th , a z,th ] [g x,th , g y,th , g z,th ]] T ). In this case, a relation of [A G] T ≤ [A th G th ] T is true when each element of the first matrix is lower than the corresponding element of the second matrix. As a result of this calibration method, the matrix cor is calculated, which elements added to the current IMU data, allow to use data with the desired accuracy.
If  During this initialization, IMU modules must lay down without any motion, in the face-up position, as this is the position, in which we know exactly the expected data (A 0 = [0, 0, 1] for the accelerometer and G 0 = [0, 0, 0] for the gyroscope). However, the ideal values are impossible to achieve due to the noise that cannot be completely removed. Because of this, the calibration routine works iteratively (s-iteration index) as long as the average IMU measurement errors ([A G] T = [[a x , a y , a z ] [g x , g y , g z ]] T ) are lower than the defined threshold ([A th G th ] T = [[a x,th , a y,th , a z,th ] [g x,th , g y,th , g z,th ]] T ). In this case, a relation of [A G] T ≤ [A th G th ] T is true when each element of the first matrix is lower than the corresponding element of the second matrix. As a result of this calibration method, the matrix cor is calculated, which elements added to the current IMU data, allow to use data with the desired accuracy.
If [A G] s,i is an i-th out of the n consecutive IMU device measurements for the iteration s, then the average error [A G] T is calculated according to Equation (7): Then, the data correction factors matrix cor, for single measurement, is defined by Equation (8): cor s−1 − diag(1/a x,th , 1/a y,th , 1/a z,th , 1/g x,th , 1/g y,th , 1/g z,th )

IMU Data Fusion Filters Initialization
The Madgwick filter [13], defined with the formula presented in Equation (9), has been used as an IMU module inherent accelerometer and gyroscope data fusion method: This filter estimates the IMU module orientation in the quaternion form (Q I ), using accelerometer data (A), gyroscope data (G) and the filtration factor ( f m ) based on the gyroscope drift characteristics. Madgwick filter processes the IMU data with the time interval ∆t. Before the filter can run on the real data, it needs to be initialized to set up its internal parameters correctly. Madgwick filter initialization requires running the filter multiple times for the measurements of non-moving accelerometer and gyroscope sensors. During this initialization the filtration factor f m needs to be set to the relatively high value e.g., f m = 2. After the completion of the initialization process, f m should be defined according to the average Angle Random Walk (ARW) noise value ω that can be calculated with the Allan's variance [40][41][42]. In case of exploited MPU-6050, IMU modules filtration factor f m is calculated according to the original article [13] equation: f m = 3 4 ω, and equals 0.082.

IMU and Kinect Controller Data Noise Correction
The data noise correction is a crucial phase for the data fusion. Its goal is to remove as much noise as possible from the raw data. All methods used in this phase are related to measurement devices characteristics described earlier in this article.
The first step to improve the IMU module data quality is the accelerometer measurements (A) error compensation due to the device operating temperature (T). For the neutral temperature (T 0 = 25 • C) and the correction factor f T = 0.0011, the correction is defined with Equation (10). The value of f T has been estimated in self-conducted experiments: The low and high pass filtration of accelerometer and gyroscope data is a part of Madgwick filter implementation, so there is no need to repeat this process. The second measurement device-Kinect controller-requires two data correction steps in order to improve the measurement quality: the correction of the distance estimation and smoothing the joint position estimations. The distance estimation correction is done by the function presented in Equation (11), which is the opposite of the distance estimation error model (Equation (1)). The argument of this function (z) is directly taken from the Kinect's distance estimation API: The smoothing of the Kinect's joint positions (P K j,t ) is necessary as coordinates' fluctuations may happen, especially when occlusions occur. While there is an occlusion, the position of the same joint in two consecutive frames (measurements) can differ few centimeters, which is physically and anatomically impossible, while refresh rate of the Kinect is 30 Hz. The smoothing can be done by the Kinect's firmware, but this approach gives a significant delay of the signal interpretation. The alternative approach to positions smoothing, which was exploited in the method, is a simple low pass filter. The method uses the 1st order exponential filter to remove unreliable positions estimations defined with Equation (12) and the filtration factor f LPF = 0.065: The value of f LPF has been chosen as a compromise between error of estimation and smoothed signal delay. Figure 12 presents charts of estimation error and signal delay related to f LPF factor of proposed smoothing function.

IMU and Kinect Synchronization
The data fusion of any two signals requires signal alignment in the time domain. This guarantees fusion of samples that were collected in the same time synchronously. The goal of IMU and Kinect signals synchronization is to find the time offset τ between two data streams. To synchronize both signals, they must have the same frequency, so the IMU signal has to be downsampled from 100 Hz to 30 Hz, which is a nominal frequency of the Kinect controller. The downsampling has been implemented in the form of decimation, where Kinect's new sample availability defines the time of IMU sample picking. Then, the time offset τ between the IMU signal I and the Kinect signal K was defined as a time offset argument of the cross-correlation algorithm that gives the maximum value of correlation between these two signals (variable τ max ) (Equations (13a) and (13b)). The value of τ max is added to the timestamp of IMU samples to align it to the Kinect's signal:

IMU and Kinect Data Fusion
The data fusion method processes time aligned filtered data and takes into consideration their reliability. The most serious concerns about the quality of data used within the fusion were related to the data provided by Microsoft Kinect controller. The decision about Kinect's data quality was based on the several information gathered during the motion. The first one is the combination of the joint tracking state and the noise level of its position estimation. A fully visible joint has its state set to Tracked value and its position can be considered as reliable. Otherwise, if the tracking state value is set to Interferred, the position estimation noise level must be calculated to check if the value can be treated as reliable or not. In the Interferred state, the position of the joint is estimated basing more on the predication than direct measurements, so its accuracy might be low and values may differ significantly between consecutive frames (or in a short time period). Other parameters taken into consideration are: the value of the angle α between the user and the camera, as well as the information about the tracking state of both shoulder joints. Microsoft Kinect has been designed to track the motion of the human who stands frontally to the camera. If the user rotates too much, Kinect is not able to track the motion correctly anymore. During self experiments, it turned out that the maximum reliable angle between the human and the Kinect is 50 • . Exceeding this rotation angle results in unreliable, often random, values of all joint positions. As the value of the α angle is calculated using position values of both shoulder joints, both of them should be fully visible (tracking state set to Tracked), and their position values estimations should be stable in time. This means that the angle measurement standard deviation should be lower than 1.5 • , so there should be no rapid changes in shoulders position estimations, which are characteristic for the significant error in the provided data.
If all of the mentioned conditions are satisfied, signals are fused according to Equation (14). The novelty of the presented method, in comparison with literature approaches, lies in the fact that the conditionally fused data represents skeleton bone oations in the form of Euler angles E = φ θ ψ , instead of joints resulting positions. Microsoft Kinect controller provides natively information about selected bone oation in the form of quaternion, calculated using current bone's joint positions. Used in the presented method, Madgwick's filter also provides IMU orientation as single quaternion due to the fact that orientation fusion, described in this article, was defined with Euler angles. Quaternions from both sources need to be converted to this form before they are fused. To prevent the Euler angle's singularity problem, during the conversion from quaternion, an additional check is performed. This step is supposed to detect if the sensor is oriented vertically up or down and, if it is, convert data to fixed values. The singularity is detected by the sum of multiplications of quaternion's values: If the value of c s is greater that 0.4999, it means that the sensor is oriented vertically up, and, if it is lower than −0.4999, then it is oriented down. In case the singularity appears, the φ, θ, ψ are defined as: The ± is related to the sign of the c s value. The proposed data convertion method is based on the assumption that we need to align estimations to some fixed value around the singularity point. This means that we lose real measurements. On the other hand, ±0.4999 threshold value cuts off the orientation angle with a value exceeding ±87 • . Details about the conversion from the quaternion to the Euler angles can be found in [43]: The originally elaborated weights [w φ , w θ , w ψ ] describe the importance level of IMU measurements. They were based both on the device precision and the data completeness, and were defined as [0.98, 0.05, 0.65], respectively. The very high and very low first and second values are the result of measurement devices' characteristics described in the beginning of this article. As Kinect is not able to detect Roll rotation, the estimation of IMU is promoted (the first value). Similarly, the second value that relates to the Yaw rotation, for which IMU estimation is considerably less accurate than Kinect's estimation, has been depreciated with a low second coefficient value. The Pitch rotation can be estimated by both devices; this is why the third value is defined close to 0.5. Exact values are the results of self-made research about estimating arm and forearm bone oations by both measurement devices while the hand motion was performed only in one, selected degree of freedom. Such constraint guaranteed that possible, accidental rotations around other axes than the considered one have no impact on gathered results. Figures 13-15 present charts of the influence of w factor's value to the estimation error of each rotation during mentioned experiments.   If Kinect's data turn out to be unreliable, they cannot be fused with the IMU data. Averaged bone oations are then estimated basing on the previously (t − 1) fused estimation and the current change of IMU orientation, according to Equation (15): ). (15) However, in the case of Kinect data unreliability, weights [w φ , w θ , w ψ ] are defined differently. In Equation (15), these weights were defined as where t noise coefficient represents the time in seconds, when Kinect stayed unreliable and its maximum accepted value is set to 10s. After this time, the estimation of the new bone orientation (only around two axes) stops, until Kinect is available again, and the method privileges the IMU signal. In this scenario, as the IMU is in fact the only source of up to date data, the initial coeficient of Yaw rotation (w θ ) has been increased compared to the value in the scenario when both Kinect and IMU data are available to fuse. Leaving this value on the low level caused the rotation to not be detectable.

Joints Final Position Estimation
The final joint position (P F j,t ) estimation requires the information regarding the position of the joint's parent in the hierarchical skeleton model (P F j−1,t ), the estimated while calibration (and fixed) length of the bone between the current joint and its parent (l) and the bone orientation estimation . As a result of the data fusion, estimated orientations were presented in the form of Euler angles. However, this representation of the orientation may complicate further calculations. To simplify them, the opposite conversion from Euler angles to the quaternion Q F t is recommended, according to the formula described in [43]. Since the skeletal model is defined as a hierarchical structure, the newly estimated joint position must be aligned to the position of its parent joint. Taking that into consideration, the final joint position calculation procedure is described with Equation (16):

Results
In order to verify the accuracy of the elaborated method, several experiments were performed. They were conducted with the Vicon motion capture system, as a source of ground truth reference data. Hand movements of three subjects were monitored, being tracked simultaneously with Kinect controller, two IMUs attached to the hand's arm and forearm bones on the skin surface, as well as the Vicon tracking system. In current implementation of the proposed method, we have made the assumption that each sensor is statically assigned to the particular limb. This means that dedicated (to arm and forearm) sensors are assigned to arm and forearm, respectively, and can not be misplaced. Automatic sensors identification and their relative hierarchical assignment leave room for further method improvements.
Each user had to perform four exercises, each of them twice with five repetitions per single try. Each try started with the devices' synchronization in the T-Pose, and then five repetitions of the motion began. The initial T-pose has been used as a frame of reference and joint positions were updated according to this position. Exercises focused on the horizontal and vertical hand flexion in the elbow joint, horizontal hand motion and keeping a straight arm in the T-pose with no motion for at least 60 s. All of these exercise schemes are presented in Figure 16. Three out of four motion trajectories used in the verification experiments can be considered as difficult according to the measurement devices' characteristics.The first motion is relatively easy to track by both devices as no joints occlusion was observed there, and it was fully measurable by the accelerometer as well as the gyroscope. The second and the third trajectories define motions around the axis, where the IMU device was not able to correctly measure current orientation and Kinect controller loses visibility of the joint due to the self occlusion. The important difference between these two motions was the number of occluded joints: the elbow only in the second motion and the elbow and the shoulder in the third. This difference had an impact on Kinect's estimations reliability, which was worse when the shoulder was not fully visible to the controller's camera. The last motion focused on IMU's constant measurement drift around the axis perpendicular to the gravity vector. Mentioned physical body movements let us claim that selected exercises represent the most challenging trajectories that the motion tracking system can face, so it should be able to track other trajectories with similar or better accuracy. The pace of the movements was characteristic for the movements of rehabilitation's physical exercises and the subjects could perform the motions according to individual preferences, without any specific constraints. The only formal pace limitation is the Microsoft Kinect tracking performance (Kinect has 30 Hz slower operating frequency than the exploited IMU controller).
For the self made measurement device used in my studies, both IMU work with different addresses in I2C communication channel and, in order to simplify the code and processing in the current version of the method, sensors are statically assigned to the limbs/bones. It means that the sensor with the channel address addr1 should always be attached to the upper arm and addr2 to the forearm (or generally addr1 to the parent bone and addr2 to its child).
The accuracy of three parameter estimations has been observed: the position of the elbow joint, the position of the wrist joint and the angle between the arm and the forearm measured in the elbow joint. Those values have been compared with the corresponding parameters estimated with Kalkbrenner's method [14], which, on the contrary, fused positions instead of orientations of selected joints. The position estimation error has been defined as the root mean squared error (RMSE) of the Euclidean distance (d e ) between joint reference position, measured by the Vicon motion capture system (P V j ), and joint position estimated by two competitive data fusion methods. The first one is the joint position-based Kalkbrenner data fusion method (P FP j ) and the second: the bone rotation-based fusion method (P FO j ) proposed by the authors of this article. The RMSE value has been calculated with Equation (17) [44]. The error for the elbow joint angle β estimation has been calculated in a similar way: The overall error of both methods, for each test exercise, has been calculated as a mean of RMSE defined with eqation of each motion tracking session. To compare Kalkbrenner's method with the authors' novel method, the ratio r between the difference of RMSE of both methods to the RMSE of Kalkbrenner's method has been calculated according to Equation (18) and expressed in the form of percents. Figure 17a,b show the summary of RMSE for tracked upper limb joints: the elbow and the wrist. Above each pair of charts' bars, the value of r ratio has been presented. Based on the results presented in Figure 17a,b, the improvement of the proposed orientation-based fusion method, in reference to the best in the literature reference-Kalkbrenner's method, has been noticed. However, the improvement ratio varies between joints. The error of the elbow joint position estimation has been reduced up to 18% (from about 3.1 cm to 2.6 cm) and the wrist joint position estimation error up to 16% (from about 3.4 cm to 2.9 cm). The difference in the improvement ratio between these two joints is caused by the fact that the elbow joint is closer to the skeleton model root than the wrist joint, so it accumulates less ascending joint position estimation errors. Figure 18 presents the chart of RMSE of the elbow angle β estimation (calculated similarly to Equation (17)). In this value estimation, the improvement has been also noticed and its ratio is up to 11% (from about 4.6 • to 4.1 • ):  [44]. The error for the elbow joint angle β estimation has been calculated in a similar way: The overall error of both methods, for each test exercise, has been calculated as a mean of RMSE defined with eqation of each motion tracking session. To compare Kalkbrenner's method with the authors' novel method, the ratio r between the difference of RMSE of both methods to the RMSE of Kalkbrenner's method has been calculated according to Equation (18) and expressed in the form of percents. Figure 17a,b show the summary of RMSE for tracked upper limb joints: the elbow and the wrist. Above each pair of charts' bars, the value of r ratio has been presented. Based on the results presented in Figure 17a,b, the improvement of the proposed orientation-based fusion method, in reference to the best in the literature reference-Kalkbrenner's method, has been noticed. However, the improvement ratio varies between joints. The error of the elbow joint position estimation has been reduced up to 18% (from about 3.1 cm to 2.6 cm) and the wrist joint position estimation error up to 16% (from about 3.4 cm to 2.9 cm). The difference in the improvement ratio between these two joints is caused by the fact that the elbow joint is closer to the skeleton model root than the wrist joint, so it accumulates less ascending joint position estimation errors. Figure 18 presents the chart of RMSE of the elbow angle β estimation (calculated similarly to Equation (17)). In this value estimation, the improvement has been also noticed and its ratio is up to 11% (from about 4.6 • to 4.1 • ):   8% 4% 11%

Discussion and Conclusions
The authors presented a new, orientation-based method for skeleton joint positioning, which compensates devices' imperfections and considers the context of the motion. It was tested on a set of right hand movements, which appeared to be demanding for measurement devices. Results were compared with those gathered from the position-based fusion method and the reference, professional, ground truth Vicon tracking system. Presented results prove that the newly elaborated, complex approach reduces joint position estimation error. The principle reason of such estimation improvement, identified by the authors, is related to the better stability of bone rotation estimation in reference to the position-based method, as well as the diminished inaccuracy of bone length measurements and the lack of the fixed skeleton model definition support. Moreover, the authors have found that the introduced complementary fusion of the signals, measured by both devices, reduces the influence of source signals' imperfections on the final result. In the case of IMU devices, the information about their 3D space orientation (for IMU devices stuck to the body, this orientation also represents the orientation of the bone) was the default and required less complex data processing. The estimation of the joints' positions, based on IMU measurements, required additional measurement of the selected bone length and additional geometric transformations (geometric transformations required to estimate the joint position based on IMU measurement has been described in detail in [14]), which have an impact on the final joint position estimation error. Considering the Microsoft Kinect controller, the human body skeletal model includes information about each bone length and its orientation. However, these length values are updated continuously and may differ significantly between two consecutive measurement frames. The difference might be even a few centimeters. During experiments, the measured differences in bone length were up to 5 cm for Tracked state joints. Moreover, even if the occlusion doesn't occur, the estimation of angles between selected bones turns to be more stable in time than the estimation of joint positions. This means that the estimation of bone rotation is more reliable than joint position estimation. Furthermore, the bone rotation is the information available from both signal sources with minimal raw data processing, which has an impact on the quality of data used in the fusion.
What differentiates the proposed method from the already known methods described in the literature is the fact that it takes into consideration characteristics of the classes of chosen measurement devices, and provides the complex compensation of detectable errors that exist in their data. It also allows for improving the quality of the source data before their fusion. It is important that many of the characteristics described in this article adaptively respect the context of the performed motion. As an example, the variable accuracy of the distance measured between Kinect and the user, or the human rotation angle relative to the surface of Kinect, has an impact on the stability and reliability of Kinect's data.
The goal of the conducted research was to find the method of human limbs motion tracking, which would take into consideration elaborated characteristics and would reduce joint position estimation error significantly. The proposed method, tested on the example of the upper limbs motion, has met this goal. The results obtained by the authors' method have been compared with the results of the position-based data fusion method with the highest declared accuracy. Three parameters estimated by both methods have been compared: the elbow joint position error, the wrist joint position error and the elbow angle β error. The set of motions ( Figure 16) used to test these methods have been chosen to check how methods work when source data is of poor quality due to the limitations and imperfections of measurement devices.
The results of the comparison of the elbow joint position estimation error are presented in the chart in Figure 17a. It shows that the achieved improvement in this parameter estimation is up to 18%. The comparison of the wrist joint position estimation error is presented on the chart in Figure 17b and the improvement of this parameter estimation is slightly lower than the improvement of the elbow joint position estimation error and its value is up to 16%. The difference in these values is caused by the distance between each of these joints and the root of the skeletal model. The joint that is further from the root accumulates existing errors of all joints that are defined between the root and itself. This explains why the improvement of the estimation error is lower for the wrist than for the elbow. The chart presented in Figure 18 shows the results of the comparison of the third parameter, which is the elbow angle estimation error during the motion. The best improvement in this parameter estimation was close to 11%. However, in exercises 2 and 3, which might be considered as the most difficult from the data quality detection point of view, the improvement is noticeably lower.
Obtained results prove that the novel data fusion approach, based on skeletal bone rotations, might be considered as an improved alternative to the well-known, joint position-based methods.
Author Contributions: Grzegorz Glonek designed the method and experiments and analyzed the data; Adam Wojciechowski contributed and supervised the work; Grzegorz Glonek and Adam Wojciechowski wrote the paper.