State Estimation Using a Randomized Unscented Kalman Filter for 3D Skeleton Posture

: In this study, we propose a method for minimizing the noise of Kinect sensors for 3D skeleton estimation. Notably, it is difﬁcult to effectively remove nonlinear noise when estimating 3D skeleton posture; however, the proposed randomized unscented Kalman ﬁlter reduces the nonlinear temporal noise effectively through the state estimation process. The 3D skeleton data can then be estimated at each step by iteratively passing the posterior state during the propagation and updating process. Ultimately, the performance of the proposed method for 3D skeleton estimation is observed to be superior to that of conventional methods based on experimental results.


Introduction
Action recognition is necessary for a broad range of applications, such as video surveillance, medical diagnosis, and sports analysis. Sensor-based systems provide the information for a skeleton, such as body joints, for poses and activities [1,2]. However, the body joints affect temporal noises because of the nonlinear properties of Kinect sensors. Therefore, it provides accurate information of the body joints for the effective estimation of a 3D skeleton. Moreover, it is possible to exploit it for various applications, such as fitness, gesture recognition [3,4], augmented and virtual reality, bone growth evaluation, and gait analysis [5].
The skeleton model consists of joints, such as ankles, knees, shoulders, elbows, wrists, and limb orientation, incorporating the skeleton model of the body. In addition, it is based on the x, y, and z-coordinates using the depth sensor [6,7] from RGB images. The Azure Kinect camera [8], which has 32 skeleton joint points and depth libraries, is used to obtain the skeleton data. In general, the noisy depth data [9,10] obtained by the Kinect does not affect the videogames and entertainment applications; however, high depth accuracy is required for the construction fields and mobile robot navigation [11]. For the skeleton estimation, high-accuracy joint data are required. In this case, Kinect is able to extract the noisy joint's spatial coordinates [12] through software libraries. These libraries impose errors in the skeleton data while estimating the poses with human physical conditions, and affect the camera depth factors such as object distance, motion surface, and frame rate, caused by the temporal noise [13][14][15][16][17]. While obtaining the skeleton data, the estimated depth at the object point is not stable across frames.
Noise reduction [18][19][20] has been studied in other research to assess skeleton data. The Skeleton data were recorded with the Kinect v2 sensor, and generate significant noise owing to self-occlusion and fast movements. The Tobit Kalman filter (TKF) [21,22] was applied to filter the noise from the acquisition device. Additionally, the extended Kalman filter (EKF) [23,24] and unscented Kalman filter (UKF) [25] are well-known estimation techniques applied in several applications, such as the smoothing of skeleton joints. The techniques are accurate and reduce the posture errors at a specific joint during hip acceleration and trunk posture with motion sensor data. Further, improved skeleton data from the Kinect sensor are applied in the senior fitness test using the UKF [26]. The above-mentioned applications reduce noise and also improve accuracy in specific joints. However, there are limitations to reducing the noise at each joint. This is because conventional methods, such as a Kalman filter [27,28], are adapted to linear systems. The skeleton data from the Kinect camera contains the nonlinear noise. The EKF can be applied to nonlinear systems to reduce errors present in the data; this technique is limited to calculating the Jacobian vector in many complex models.
The main purpose of this study is to estimate the skeleton after filtering the temporal noise in the skeleton data obtained using Kinect cameras, and to compare the performance of the proposed method called a randomized unscented Kalman filter (RUKF). The use of the nonlinear estimator is considered for proper estimation of skeleton data, which consists of nonlinear noise. In addition, the proposed method improves the accuracy of joints and bone length measurements. The remainder of the paper is organized as follows. First, related works are discussed in Section 2. Then, estimation of the 3D skeleton using the proposed method is discussed in detail in Section 3. Section 4 presents the experimental results. Finally, the conclusions of the paper are presented in Section 5.

Related Works
The data acquisition and processing system consist of two parts: data acquisition and data processing, as shown in Figure 1. The data acquisition consists of Microsoft's Azure Kinect camera with inbuilt RGB, depth, Infrared (IR), and a motion sensor (IMU) to acquire the position of the body joints of a user. The Azure Kinect camera uses modulated illumination in the near-IR (NIR) spectrum. It then measures the time the NIR spent traveling from the camera to the object, and back. A depth map is generated by these measurements, and represented as values for every pixel of the image on z-coordinate. The Kinect camera information entered through the software development kit (SDK) interface provides 5, 15, and 30 frames/s. In addition, there are manual frame rate setting options available through program settings. However, the Kinect camera sensor contains a significant amount of nonlinear noise, even at a fixed position, which limits the position of the joint point's changes over time. Therefore, we are using the Kalman filters to estimate the time-varying states for noisy data measurements. techniques applied in several applications, such as the smoothing of skeleton joints. The techniques are accurate and reduce the posture errors at a specific joint during hip acceleration and trunk posture with motion sensor data. Further, improved skeleton data from the Kinect sensor are applied in the senior fitness test using the UKF [26]. The above-mentioned applications reduce noise and also improve accuracy in specific joints. However, there are limitations to reducing the noise at each joint. This is because conventional methods, such as a Kalman filter [27,28], are adapted to linear systems. The skeleton data from the Kinect camera contains the nonlinear noise. The EKF can be applied to nonlinear systems to reduce errors present in the data; this technique is limited to calculating the Jacobian vector in many complex models.
The main purpose of this study is to estimate the skeleton after filtering the temporal noise in the skeleton data obtained using Kinect cameras, and to compare the performance of the proposed method called a randomized unscented Kalman filter (RUKF). The use of the nonlinear estimator is considered for proper estimation of skeleton data, which consists of nonlinear noise. In addition, the proposed method improves the accuracy of joints and bone length measurements. The remainder of the paper is organized as follows. First, related works are discussed in Section 2. Then, estimation of the 3D skeleton using the proposed method is discussed in detail in Section 3. Section 4 presents the experimental results. Finally, the conclusions of the paper are presented in Section 5.

Related Works
The data acquisition and processing system consist of two parts: data acquisition and data processing, as shown in Figure 1. The data acquisition consists of Microsoft's Azure Kinect camera with inbuilt RGB, depth, Infrared (IR), and a motion sensor (IMU) to acquire the position of the body joints of a user. The Azure Kinect camera uses modulated illumination in the near-IR (NIR) spectrum. It then measures the time the NIR spent traveling from the camera to the object, and back. A depth map is generated by these measurements, and represented as values for every pixel of the image on z-coordinate. The Kinect camera information entered through the software development kit (SDK) interface provides 5, 15, and 30 frames/s. In addition, there are manual frame rate setting options available through program settings. However, the Kinect camera sensor contains a significant amount of nonlinear noise, even at a fixed position, which limits the position of the joint point's changes over time. Therefore, we are using the Kalman filters to estimate the time-varying states for noisy data measurements.  Filtering involves the problem of estimating the current state of a system. Based on the estimation of known or unknown states, filters are categorized into two types: batch filters and recursive filters. The estimation of unknown states involves a batch filter, and that of Electronics 2021, 10, 971 3 of 13 the known states involves recursive filters. This study focuses on recursive filters such as the Kalman filter (KF), EKF, UKF, and particle filter (PF) [29][30][31][32]. Recursive filters can be further categorized into two types: linear and non-linear filters. If both the system and measurement model are linear, a linear filter can be utilized (such as a KF). With regard to the KF, the data filtering process uses the covariance between sensor noise and process noise. This method involves the assumption that the noise is Gaussian, and that linear models maintain this noise. The basic system models the state of an object with its position and velocity and uses them to correct the covariance between the predicted state information about the object along with the object coordinates observed by the sensor. If any one of the systems or measurement models is nonlinear, then a nonlinear filter must be used. A skeleton model typically requires a nonlinear method for model construction. For instance, nonlinear modeling methods such as the EKF and UKF exist. In nonlinear filtering, based on approximation and sampling, methods can be categorized as analytical approximation or sample-based methods. The EKF involves the Taylor series-approximated Jacobian estimation method, which is explained in Section 2.1. The UKF involves a sample-based method, which is explained in Section 2.2. In addition, a few examples of sample-based filters are heuristic filters, such as simplex [33] and genetic [34] filters. In this study, we did not focus on heuristic filters, but we discuss how a skeleton system can be incorporated in detail.

Extended Kalman Filter (EKF)
The Kalman filter that can model the extended nonlinear system is known as an EKF. The EKF is the first estimator that linearizes the nonlinear model about an estimate of the current mean and covariance through the Jacobian vector. The EKF process consists of three steps: time update, prediction step, and measurement update. The goal of this filter is to compute and update the Jacobian vector at each time step (k) while considering the initial parameters, i.e., k = 1, initial meanx 0 , initial covariance P 0 , and the predicted statex − k , through the state-transition function f . The following equations include the hat operator (ˆ), which is an estimate of a variable and the superscript ( − ), which denotes the predicted (prior) estimate, as follows: The covariance matrix can be predicted as P − k , from the updated estimate P k−1 , F is a state Jacobian vector, and the process noise covariance Q k , as follows: The Jacobian estimation H k computed in the time update through the observation function h, measurement-noise covariance R k , and measurement prediction as follows: The measurement update of the cross-covariance P x k z k updated through the Jacobian estimation and sensor measurement z k as To combine all observation models, the Kalman gain K k contributes more or less weight to predict the observation model as To combine all the updated statex k , and covariance P k , information from statetransition and observation models is given as follows, After being updated at k = k + 1, the Jacobian estimation is expressed as The pseudocode of Algorithm 1 is shown below.

7:
Update the cross-covariance P x k z k (defined in Equation (6)), Kalman gain K k (defined in Equation (7)), state estimatex k (defined in Equation (8)), and state covariance P k (defined in Equation (9)) return estimated data ( x k ) end for

Unscented Kalman Filter (UKF)
The model, which is highly nonlinear for both state transition and observation models, is known as UKF. The UKF selects the sigma points from the mean through unscented transform (UT). The new mean and covariance are obtained from sigma points, which are propagated through the nonlinear functions. The procedure of UKF consists of three steps: time update, prediction step, and measurement update. The goal of this filter is to generate the 2L + 1 sigma points twice at each time step (k), while considering initial parameters, i.e., k = 1, initial meanx 0 , initial covariance P 0 , and set of points called sigma points χ, weights W and for both mean reconstruction ( m ), and covariance reconstruction ( c ). In the time update, the sigma points are propagated, and the state can be expressed through the nonlinear function as Electronics 2021, 10, 971 5 of 13 The predicted state, covariance matrix, and process noise covariance Q k can be ex- The regenerating 2L + 1 sigma points χ k|k−1 (L is the dimension of the state) are obtained as These newly generated sigma points are instantiated through the observation model h for the measured observation state and predicted observation state Z k|k−1 through the weights followed by the covariance matrix, and measurement-noise R k covariance, as follows In the measurement update, the cross-covariance is computed with the sensor measurement z k with the state transition model and Kalman gain as To combine all the updated state and covariance information from state-transition and observation models are as follows; The regenerating 2L + 1 sigma points χ k−1 at k = k + 1 are obtained as follows According to the equations, 2L + 1 sigma points are generated, where L is defined as the dimension of the state. In our proposed method, the number of sigma points is set to 13 because the dimension of the state is six. Moreover, our simulation occurs separately for the lower body and upper body using two UKF models. At this time, the number of total sigma points generated is 436, of which those for the lower body and upper body are 101 and 335, respectively. When nonlinearity increased in the system, then UT automatically deteriorates. Therefore, we proposed state estimation using a randomized unscented Kalman filter for 3D skeleton posture in order to improve the performance of conventional methods. The pseudocode of Algorithm 2 is shown below. Algorithm 2: Unscented Kalman Filter (UKF) [26] Input: z k denotes the sensor-measured values of the k-frames. Output: x k denotes the estimated values of the k-frames. 1: Initial parameters each time step k = 1, initial meanx 0 = 0, initial covariance P 0 = 0 2: for k = 1 to N (total number of iterations) do 3: Generate sigma points: χ k−1 using the Equation (14) Time update (prediction): 4: State prediction step: Predictx − k , and P − k using χ * k|k−1 (defined in Equa tion (11)). 5: Regenerate 2L + 1 (defined in Equation (14)) sigma points χ k|k−1 usingx − k 6: Measurement prediction step: Predictẑ − k and Pẑ kẑk using Z k|k−1 (de fined in Equation (15)). Measurement update (correction):

7:
Update the cross-covariance P x k z k (defined in Equation (18)), Kalman gain K k (defined in Equation (19)), state estimatex k (defined in Equation (20)) and state covariance P k (defined in the Equation (21)) end for return estimated data (x k )

Proposed 3D Skeleton Posture Estimation Using a Randomized Unscented Kalman Filter
An EKF linearizes the nonlinear process model to estimate the state vector through Jacobian calculation. The EKF struggles to obtain the Jacobian estimate, which assumes nearly linear behavior for both the transition and observation function for more motion poses. This causes a significant difference in error while reconstructing the non-motion skeleton estimation and motion skeleton estimation. It is known that UKF is one of the methods for noise reduction while estimating a skeleton. It caused occasional instability in terms of state and measurement when reconstructing a skeleton estimation. To overcome these limitations, a novel noise reduction method based on an RUKF is proposed. The proposed method uses the data obtained from Microsoft Azure SDK. The concept of a skeleton is provided from the Kinect camera. The entire process, from the data acquisition to 3D skeleton estimation, through the filtering methods and expected simulated skeleton poses, is shown in Figure 2. These sensor data were modeled using conventional methods; as there is noise present in the data, to minimize, we proposed our method that includes key steps such as predict and update. The state transition function is used to predict the next state for the transition matrix, which is also called a state transition vector.

Proposed 3D Skeleton Posture Estimation Using a Randomized Unscented Kalman Filter
An EKF linearizes the nonlinear process model to estimate the state vector through Jacobian calculation. The EKF struggles to obtain the Jacobian estimate, which assumes nearly linear behavior for both the transition and observation function for more motion poses. This causes a significant difference in error while reconstructing the non-motion skeleton estimation and motion skeleton estimation. It is known that UKF is one of the methods for noise reduction while estimating a skeleton. It caused occasional instability in terms of state and measurement when reconstructing a skeleton estimation. To overcome these limitations, a novel noise reduction method based on an RUKF is proposed. The proposed method uses the data obtained from Microsoft Azure SDK. The concept of a skeleton is provided from the Kinect camera. The entire process, from the data acquisition to 3D skeleton estimation, through the filtering methods and expected simulated skeleton poses, is shown in Figure 2. These sensor data were modeled using conventional methods; as there is noise present in the data, to minimize, we proposed our method that includes key steps such as predict and update. The state transition function is used to predict the next state for the transition matrix, which is also called a state transition vector.   Let us consider the state transition function f , state x, and there is uncertainty w (mean 0, and covariance Q), as follows: The observation function h used to obtain the transformations which use the rotation matrix to find the next joint obtained by the Euler angle and uncertainty v (mean 0, and covariance R) along with the next joint state, as follows The procedure of RUKF consists of three steps: time update, prediction step, and measurement update. The proposed method is based on a stochastic integration rule (SIR) [35]. The SIR computed using the UT with spread and rotation N max sigma point sets is random, which is represented as a randomized unscented transform (RUT) [36]. The purpose of this filter is to update the state covariance function at each time step (k). The initial parameters, i.e., k = 1, initial meanx 0 = 0, initial covariance P 0 = 0, maximum number of iteration steps N max = 1, process noise covariance Q k , predicted statex − k , and covariance P − k , can be expressed through the state transition function as follows: The measurement estimate prediction can be obtained through the observation function h by the stochastic integration algorithm (SI alg) aŝ The defined measurement covariance update function a(x), obtained through the observation function and measurement estimate predictionẑ − k as The measurement covariance prediction Pẑ kẑk can be obtained through covariance update function a(x), measurement-noise covariance R k , and cross-covariance update function b(x); computed from the observation function h as follows: In the measurement update, cross-covariance P x k z k , can be obtained from sensor measurement, a cross-covariance update function b(x); and to combine all observation models, the Kalman gain K k contributes more or less weight to predict the observation as All the updated statex k and covariance P k information from the state transition and observation models are combined aŝ At k = k + 1, the updated state covariance update g(x) is expressed as Therefore, an RUKF is used to predict, update, and filter the kinematics motion for different skeleton poses. These skeleton poses are estimated by reducing the temporal noise for nonlinear estimation in state and observation transition models through the propagation and updating process iteratively. For each iteration, it will obtain the data from both the propagation and updating of RUKF. Moreover, it is possible to obtain the coordinates of each joint at computing state through the observation function. According to the authors of [30], the convergence of the system depends on an arbitrary function. We implemented the RUKF with an improved version of the SIR. Therefore, the arbitrary function is approximately constant, enabling the proposed algorithm to reach convergence. The pseudocode of Algorithm 3 is shown below.

Measurement update (correction)
9: Update the cross-covariance P x k z k defined in Equation (31), Kalman gain K k defined in Equation (32), state estimatex k defined in Equation (33) and sate covariance P k (defined in Equation (34)) end for return estimated data (x k )

Experimental Results
The proposed method, tested with twelve poses, is considered as a manually annotated [37][38][39] dataset to evaluate its performance. It has been difficult to find standard public datasets for pose and skeleton estimation. Therefore, reference actions from [40,41] chose twelve poses, and the experiment was conducted with 20 participants to process the data. In this paper, the twelve pose datasets are as follows: Crossing arms, extending arms, flexion and extending, hands up, left standing, right standing, standing, unipedal, right flexion and extending, left flexion and extending, o-leg stand, and x-leg stand.
The tested dataset consists of 32 joints with the 30-s data, including the RGB and Depth information of each pose. The person would perform the pose in front of the Kinect camera to capture the pose. In the skeleton, each joint has its own joint coordinate system which is related to the depth camera 3D coordinate system. A skeleton includes 32 joints, with each connection linking the parent joint with a child joint. The pelvis is the root joint, fixed in the center of the human body followed by the child joints. For instance, the pelvis, which is the root joint, links to the spine_naval, which is the child joint; and the spine_navel (parent joint) links to the spine_chest (child joint). Euler angles and angular velocities are needed for the next child joint as the state, except for the root position, rather than using the 3D position coordinates of all skeletons. In each frame, the x, y, and z-coordinates of each joint are used to create a joint object instance for each of the 32 joints. Using the depth-first search algorithm, bone lengths are calculated and stored in the skeleton. To generate the skeleton map, the human body kinematic model splits into two: upper and lower body. In the calibration phase, the initial mean state vector of the lower body, and the initial mean state vector of the upper body, will be returned.
A model-filter controller function utilizes the lower and upper body properties to apply the proposed method. The Kinect data contained a number of frames, where the estimated data of the proposed method used every single frame. Subsequently, the estimated data were calculated with the help of filter objects, made in the previous step. The algorithm for one iteration was as follows. From the skeleton object list, the kth number of skeleton objects was selected, which was the skeleton for the kth frame. Then, the joint position of each joint of this frame was obtained, and the measurement vector (z k ) was calculated. Additionally, z k was used to calculate the updated state estimate for the kth frame. The updated mean and covariance are computed in the filter update method of the proposed method. From the updated state mean, the updated data for the current step are calculated by passing them through the observation function again. The updated mean and covariance are computed for lower and upper body models separately.
Finally, filter controller function retrieves and returns the covariance matrices for the upper and lower body using the proposed method. For comparison, we obtained the manually annotated ground truth [24,[42][43][44][45][46] for the set of twelve poses captured by the camera. To analyze the original data and the estimated data, we employed the proposed method for determining the coordinates of each joint by incorporating the observation function at each iteration. The numerical comparison was conducted by incorporating the RMSE mean value (RMSE_mean) and the RMSE standard deviation value (RMSE_(±SD)) associated with the estimated data and ground truth, which were measured in millimeters (mm). The results show a comparison of the RMSE _mean and standard deviation (±SD) values in comparison with UKF, EKF and the proposed method. Examples of four joint points, i.e., pelvis, spine_naval, head and nose, are shown in Figure 3. In Figure 3a, the x-axis depicts the 30-s frame data, and the y-axis depicts the RMSE. There are rapid enhancements and large errors on the EKF.   Joints of each pose are affected by the noise; the effected nonlinear temporal noise was filtered with nonlinear filters. Further, the EKF has the largest RMSE at the ear_right (5.83 ± 17.63) and the smallest RMSE at the pelvis (1.34 ± 0.57). Utilizing the UKF method, RMSE at the ear_right is (4.63 ± 15.6) and the value of the pelvis is (0.58 ± 0.40), the data demonstrate that there is a possibility to reduce the noise further. Finally, upon imple- The UKF and proposed method have similar results on Figure 3. Further, we will compare these with skeleton data on x, y, and z-coordinate systems. The joint variables and mean (±SD) of the RMSE values among the three nonlinear filters EKF, UKF, and proposed method are presented in Table 1. Joints of each pose are affected by the noise; the effected nonlinear temporal noise was filtered with nonlinear filters. Further, the EKF has the largest RMSE at the ear_right (5.83 ± 17.63) and the smallest RMSE at the pelvis (1.34 ± 0.57). Utilizing the UKF method, RMSE at the ear_right is (4.63 ± 15.6) and the value of the pelvis is (0.58 ± 0.40), the data demonstrate that there is a possibility to reduce the noise further. Finally, upon implementing the proposed method, the RMSE value for "ear_right" was found to be (1.60 ± 4.78), and the RMSE value for the pelvis was found to be (0.35 ± 0.26). Specifically, parts of the head including 'eye_right' and 'ear_right' moved more rapidly than the other joints of the body. This resulted in an increase in the SD compared to the other joints.
Finally, depending on the RMSE, the proposed method is proven to have superior performance across all the joints of the twelve poses. The estimated results of the x, y, z joint by UKF and the proposed method on the x, y, and z-coordinate systems are shown in Figure 4. This presents examples of four joint positions, i.e., pelvis, spine naval, head, and nose. Here, the x-axis is on a time domain, and the y-axis shows the position and its error. The error results of the UKF is indicated with the red line and the proposed method with the blue line. The error fluctuation by the proposed method is less than that by the UKF, as shown in Figure 4. This is because the proposed method compensates for the error, while using both the observation and state transition functions on a RUKF method. The performance of joint position connects to the skeleton estimation directly. The graphical results of skeleton estimate for twelve poses according to each algorithm are presented in Figure 5.
The skeleton estimation by the UKF and RUKF method is indicated as blue and green lines, respectively. These captured images are shown only single frame among 30-s in each pose. Therefore, there are some limitations to showing the experiments entirely. However, it is possible to evaluate the performance of the RUKF method using Table 1. The 3D skeleton map estimation by the proposed method has better results in terms of the quantitative results of RMSE_mean (±SD) than that of the UKF method in Figure 5. and nose. Here, the x-axis is on a time domain, and the y-axis shows the position and its error. The error results of the UKF is indicated with the red line and the proposed method with the blue line. The error fluctuation by the proposed method is less than that by the UKF, as shown in Figure 4. This is because the proposed method compensates for the error, while using both the observation and state transition functions on a RUKF method. The performance of joint position connects to the skeleton estimation directly. The graphical results of skeleton estimate for twelve poses according to each algorithm are presented in Figure 5.  The skeleton estimation by the UKF and RUKF method is indicated as blue and green lines, respectively. These captured images are shown only single frame among 30-s in each pose. Therefore, there are some limitations to showing the experiments entirely. However, it is possible to evaluate the performance of the RUKF method using Table 1. The 3D skeleton map estimation by the proposed method has better results in terms of the quantita-

Conclusions
This study proposed a method of 3D skeleton pose estimation using a randomized unscented Kalman filter to reduce nonlinear noises on a Kinect camera. The proposed method analyzed the propagation and updating process for temporal noises of skeleton data on the x, y, and z-coordinate systems. The accuracy of state estimation and prediction of the data are enhanced owing to the use of randomized UT. This enables the proposed method to correct errors that occur with the nonlinear temporal noise. Therefore, the experimental results show that the performance of the proposed method reduces the RMSE and SD while estimating the 3D skeleton map.