Markerless 3D Skeleton Tracking Algorithm by Merging Multiple Inaccurate Skeleton Data from Multiple RGB-D Sensors

Skeleton data, which is often used in the HCI field, is a data structure that can efficiently express human poses and gestures because it consists of 3D positions of joints. The advancement of RGB-D sensors, such as Kinect sensors, enabled the easy capture of skeleton data from depth or RGB images. However, when tracking a target with a single sensor, there is an occlusion problem causing the quality of invisible joints to be randomly degraded. As a result, multiple sensors should be used to reliably track a target in all directions over a wide range. In this paper, we proposed a new method for combining multiple inaccurate skeleton data sets obtained from multiple sensors that capture a target from different angles into a single accurate skeleton data. The proposed algorithm uses density-based spatial clustering of applications with noise (DBSCAN) to prevent noise-added inaccurate joint candidates from participating in the merging process. After merging with the inlier candidates, we used Kalman filter to denoise the tremble error of the joint’s movement. We evaluated the proposed algorithm’s performance using the best view as the ground truth. In addition, the results of different sizes for the DBSCAN searching area were analyzed. By applying the proposed algorithm, the joint position accuracy of the merged skeleton improved as the number of sensors increased. Furthermore, highest performance was shown when the searching area of DBSCAN was 10 cm.


Research Background
The captured human pose or gestured data can provide a lot of useful information for developing human-robot interaction (HRI) or human action recognition. The skeleton data, which consists of human joint position, is one of the most commonly used pieces of information in human research. This is because the number of joints that comprise the skeleton does not vary according to the shape of the human body or gender, maintaining a standardized structure that represents the human pose.
Due to these advantages, skeleton data is widely used in developing games for humans, industrial environments, and research on pathological diagnosis through recognition of gestures. M. Ma, et al. [1] developed a multi-planar full-body rehabilitation game named Mystic Isle using Microsoft Kinect V2. The user can interact with virtual environment by using their body. The Kinect V2 sensor tracked the user's body, and the body tracking SDK provided the tracked skeleton data, which included 25 joints. A. Taha et al. [2] attempted to obtain descriptive labeling of complex human activities using Kinect V2 skeleton data. They proposed building the specific feature vector using identified skeleton joint coordinates as input for the Hidden Markov model.
In [3], M. Varshney, et al. proposed the rule-based classifier method that recognizes a view-invariant multiple human activity recognition in real time. They also used a single Kinect V2 sensor and body tracking SDK to track the human skeleton data. The skeleton data was then used as input for the proposed classifier method. E. Cippitelli et al. [4] attempted to recognize human activity using skeleton data obtained from RGB-D sensors. They also extracted a specific feature vector from skeleton pose data and used a support vector machine to classify human actions. Bari proposes a gait recognition model based on deep learning architecture in his paper [5]. They also trained the suggested deep learning model using the public 3D skeleton gait database recorded with the Microsoft Kinect V2 sensor. Based on skeleton data, two unique geometric features, the joint relative cosine dissimilarity and joint relative triangle area, are constructed. Consequently, the skeleton data is utilized in wide research areas that target human gesture or interaction as the raw or input data. Skeleton-based approaches, on the other hand, do not always ensure good performance. If the skeleton data is incorrect or noisy, they perform poorly. In other words, the quality of the skeleton data may be a determinant of the performance of the human tracking system or gesture recognition model [6].
To capture the human gesture using skeleton data accurately, the typical system is the marker-based motion capture system such as VICON (Vicon Motion Systems Ltd., Oxford, UK) and Opti-track (Natural Point Inc., St Corvallis, OR, USA) because of their proven accuracy. These systems require wearing a suit attached with a reflective marker or a process for attaching the marker to the human body. This procedure takes a long time, and the attached marker makes the human's movement unnatural [7]. Therefore, the motion capture system is difficult to apply in a variety of environments for capturing human gestures.
In many recent studies, a human motion capturing system using an RGB-Depth (RGB-D) sensor such as Kinect (Microsoft Corp., Redmond, WA, USA), Xtion (ASUS, Taipei, Taiwan), Astra (Orbbec 3D Technology International, Inc., Troy, MI, USA), or Realsense (Intel Corp., Santa Clara, CA, USA) that does not require marker attachment is widely used as an alternative method [8][9][10]. The MS Kinect sensor is the most widely used in motion capture research, and it includes not only the sensor SDK, which captures RGB and depth data, but also the body tracking SDK, which captures 3D skeleton data from the depth image [11]. The Azure Kinect sensor is the most recent addition to the Kinect sensor series. The Azure Kinect body tracking SDK provides skeleton data consisting of 3D positional information for 32 joints, as shown in Figure 1. The random forest algorithm is adopted in Kinect v2 whereas a deep learning-based algorithm is adopted to Azure Kinect's body tracker [12,13]. The GPU can be used to run the deep learning model extracting human skeleton data. Furthermore, because the sensor SDK has been upgraded to allow multiple sensors to be operated on a single PC, it can be effectively applied in a variety of research or industrial areas. Although skeleton tracking for a whole body is possible using Azure Kinect, poor skeleton quality often occurs due to a problem called self-occlusion [14]. This problem is a limitation of the single sensor system and occurs when the target joint is obscured by other body parts. The issue causes the quality of skeleton data to degrade at random. One of the simplest ways to solve this problem is to use a motion capture system with multiple Although skeleton tracking for a whole body is possible using Azure Kinect, poor skeleton quality often occurs due to a problem called self-occlusion [14]. This problem is a limitation of the single sensor system and occurs when the target joint is obscured by other body parts. The issue causes the quality of skeleton data to degrade at random. One of the simplest ways to solve this problem is to use a motion capture system with multiple sensors that can cover the entire workspace area [15,16]. For example, if an occlusion problem causes an error in the skeleton data, the inaccurate information of the obscured joint can be compensated by using information from another sensor. In other words, a single sensor's self-occlusion problem can be overcome by appropriately combining skeleton data from multiple sensors.
In this paper, we developed a new algorithm that merges the multiple skeleton data obtained by multiple RGB-D sensors in real time. We used Azure Kinect sensors of Microsoft because of the sensor's convenient expandability meaning that the multiple sensors can be utilized on a single PC. The skeleton data was obtained using the Azure Kinect body tracking SDK. Because the Azure Kinect sensor provides a function that can synchronize time steps between sensors by linking them together, no additional work for time synchronization is required when using multiple sensors. Specifically, we adopted the density-based spatial clustering of applications with noise (DBSCAN) which is generally used for clustering as the error filter on the skeleton merging process. After the merging process, we used the Kalmal filter to minimize the tremble error in joint movements.
There are three main contributions. First, the proposed algorithm can merge the skeleton data accurately in real time. We demonstrated how the number of sensors (TNOS) increased the joint accuracy of the merged skeleton. Second, the error caused by selfocclusion is avoided during the merging process of skeletons obtained by multiple Kinects using DBSCAN, a clustering algorithm. Third, we reduced joint position tremble error by using a Kalman filter on merged skeleton data. With these contributions, the proposed method can help improve the performance of various skeleton-based research and applications by obtaining accurate skeleton data in all directions.

Related Works
There have been many studies dealing with markerless skeleton tracking to overcome the difficult usability of motion capture equipment. Among them, studies using RGB-D sensors are representative. The development of many kinds of RGB-D sensors provides the human pose information in the form of the skeleton data extracted from depth images. However, there is a problem known as self-occlusion, which is caused by the sensor's limited viewing area. Furthermore, when the subject is facing the sensor, the skeleton tracking algorithm can provide the best accuracy [17]. In other words, the subject can be tracked more reliably when facing the sensor in a pose with no invisible joint [14].
To address the occlusion issue, several studies adopted the multiple RGB-D sensors system to minimize invisible body parts. They made several attempts to obtain optimized skeleton data by combining multiple skeleton data obtained from multiple sensors installed in various views. Several studies have been conducted to merge multiple skeleton data sets using constraint rules determined by the structure of the human skeleton. These attempts were conducted by assigning different weights to inaccurate joint candidates in the merging process. These types of trials may necessitate an initial configuration process to determine structural components such as bone length.
Y. Kim, et al. [18] proposed a motion capture system using multiple Kinect V2 sensors for capturing the dynamic gestures of humans in a 3D environment. A posture reconstruction method was adopted for tracking human gestures consistently. They proposed a tracking method for the torso joints and limb joints separately, based on the consistent bone length of the human body. The mean value of the candidates within the bone length in the direction from the parent joint to the target joint was calculated in the case of the torso. In the case of limb, a joint candidate with the smallest sum of rotation direction and rotation angle compared to the previous joint coordinates was chosen from among the candidate groups within the bone length threshold. J. Colombel et al. [19] presented a fusion algorithm for tracking the joint's center position using multiple skeleton data from multiple depth cameras to improve human motion analysis. The proposed system adopted an extended Kalman filter for the fusion of the joint candidates into the joint center position and applied the anthropomorphic constraints of human skeleton structure. As the measurement model of the extended Kalman filter, a specific forward kinematics model representing the human locomotor system with fixed bone lengths was used. The measurement fusion method was chosen among the fusion methods based on the Kalman filter because the proposed algorithm should be run in real time. N. Chen et al. [20] describe a method for combining two skeleton data sets from two Kinect V2 sensors. They proposed a data fusion strategy that weights the candidate of the target joint based on human physiological movement constraints related to both bone length and joint angle.
The other approach for the development of a multiple skeleton fusion algorithm is the definition of confidence value for a joint candidate in the merging process. The confidence value is usually determined according to the state in which the joint or skeleton data is detected in the sensor's view. In [21], the authors proposed the human pose estimation method by fusing the multiple skeleton data and tracking the merged skeleton data. They considered the confidence value at both the whole skeleton and each joint level, and they filtered the inaccurate skeleton or joint data by using a confidence value threshold. Finally, the fused skeleton was tracked using the Kalman filter. Y. Wu et al. [22] created a real-time full-body tracking system with three Kinect V2 sensors. They used an adaptive weighting adjustment fusion method to build merged 3D skeleton data regardless of the subject's orientation. Each candidate of the target joint obtained from each sensor was weighted according to the angle between sensor and subject and participated in the merging process. K. Desai and S. Raghuraman [23] proposed a real-time skeletal pose tracking method that aims to get merged skeleton data using multiple inaccurate skeletons. They determined a new confidence value named probability of an accurate joint (PAJ) for each target joint. Several factors were considered when determining the PJA. First, PAJ was calculated using the skeleton's orientation, which is the angle between the subject's facing direction and the sensors. The second state is the joint state, which indicates that the target joint is visible in the sensor's visible area. The third factor is the bone angle, which is calculated between the bone and the capture plane, as well as the fixed length of a human's bone. Taking into account all components, the merged skeleton's joint position was determined using a distance-constrained consensus approach that maximizes the overall PAJ.
Some studies tried to design a new merging process for calculating the position of a joint. Moon and others [24] developed a human skeleton tracking system using Kalman filter framework with weighted measurement fusion method for merging five inaccurate skeleton data. The five Kinect V2 sensors were used to capture the subject, and the measurement noise of the Kalman filter was controlled based on the predicted state and joint motion continuity. In addition, H. Zhang et al. [25] proposed a method for combining multiple skeleton data sets. The proposed method's first strategy was to filter outliers among target joint candidates using spatial region constraints and K-means clustering. The second step was to combine the inlier candidates into a single skeleton and apply the proposed adaptive weighted fusion rules. K. Ryselis et al. [26] presented a practical solution for performing multiple skeleton data fusion algorithms in vector space using algebraic operations. They aimed to track the human with non-standard poses such as squatting, sitting, and lying. As in the studies described above, the algorithm developed in this paper does not estimate the skeleton using raw data such as RGB, Depth, and Pointcloud, but creates optimized skeleton data by merging multiple inaccurate skeleton data measured from various angles. In particular, as in [25], we also propose a method of filtering inaccurate joint candidates by applying a clustering algorithm in the merging process.
The remainder of the paper is structured as follows: Section 2 describes all sensor's coordinate system calibration methods, a proposed merging algorithm that uses DBSACN, and settings of our experiment; Section 3 describes the result of the experiment; Section 4 discusses the experimental result and the future works of this study. Finally, we present our conclusions in Section 5.

Calibration for Coordinate Systems of Sensors
To use the 3D data acquired using multiple RGB-D sensors, a calibration process must be performed first. Moreover, since the calibration accuracy can have a great effect on the joint position of the merged skeleton data, an accurate calibration process must be performed. The calibration process refers to matching the coordinate systems of each sensor into one global coordinate system by calculating a rigid transformation matrix M = [R, T]. Here, R is rotation matrix parameterized by the three rotations θ x , θ y , and θ z . Additionally, T is a translation matrix consisting of three translation offset values for the x, y, and z axes, respectively. In this paper, two steps for the calibration process were adopted as described in Figure 2. The first one is sensor-to-sensor calibration matching the coordinate system of each sensor to the coordinate system of the reference sensor set as a master sensor. Another is the sensor-to-marker calibration resetting the coordinate system of all sensors calibrated with the reference sensor to the global origin customized by the user.

Sensor-to-Sensor Calibration
The sensor-to-sensor calibration process was conducted by constructing correspondence trajectories composed of the 3D centroid points of the sphere object. Figure 3 depicts a sphere object with a radius of 24 cm and a red tone color. The RGB image was first converted into HSV color space to track the center coordinates of a spherical object. Additionally, the histogram backprojection algorithm [27] was used to extract the area of a specific hue value. The Pointcloud Data (PCD) corresponding to the sphere object is then obtained by extracting the pixels in the depth image that correspond to the pixels in the HSV image that correspond to the sphere. Finally, the RANSAC [28] algorithm identified iteratively all the 3D points that satisfy the surface equation of the sphere object with radius R to detect the robust 3D centroid points of sphere object. The 3D centroid points of the sphere object extracted for several frames from each sensor constitute a correspondence trajectory. In the process of capturing the sphere, frames in which one or more sensors do not detect the object are filtered out in the construction trajectory.

Sensor-to-Sensor Calibration
The sensor-to-sensor calibration process was conducted by constructing correspondence trajectories composed of the 3D centroid points of the sphere object. Figure 3 depicts a sphere object with a radius of 24 cm and a red tone color. The RGB image was first converted into HSV color space to track the center coordinates of a spherical object. Additionally, the histogram backprojection algorithm [27] was used to extract the area of a specific hue value. The Pointcloud Data (PCD) corresponding to the sphere object is then obtained by extracting the pixels in the depth image that correspond to the pixels in the HSV image that correspond to the sphere. Finally, the RANSAC [28] algorithm identified iteratively all the 3D points that satisfy the surface equation of the sphere object with radius R to detect the robust 3D centroid points of sphere object. The 3D centroid points of the sphere object extracted for several frames from each sensor constitute a correspondence trajectory. In the process of capturing the sphere, frames in which one or more sensors do not detect the object are filtered out in the construction trajectory.
converted into HSV color space to track the center coordinates of a spherical object. Additionally, the histogram backprojection algorithm [27] was used to extract the area of a specific hue value. The Pointcloud Data (PCD) corresponding to the sphere object is then obtained by extracting the pixels in the depth image that correspond to the pixels in the HSV image that correspond to the sphere. Finally, the RANSAC [28] algorithm identified iteratively all the 3D points that satisfy the surface equation of the sphere object with radius R to detect the robust 3D centroid points of sphere object. The 3D centroid points of the sphere object extracted for several frames from each sensor constitute a correspondence trajectory. In the process of capturing the sphere, frames in which one or more sensors do not detect the object are filtered out in the construction trajectory.  To obtain a rigid transformation matrix between the coordinate systems of the master and target sensors, singular value decomposition (SVD) was calculated on a pair of trajectories configured in the corresponding sensors [29]. Let the C M be the trajectory captured in the coordinate system of the master sensor and C p be the trajectory captured in the coordinate system of target sensor number p, we have the rigid transformation matrix RT between C p and C M as: To compute unknown parameters of the R pM , T pM matrix, the trajectory with 3 or more correspondences is required. In this study, we collected 1000 points of correspondence to create a trajectory. The optimal transformation matrix could then be computed by minimizing the error function shown below.
where N is the number of correspondences of the trajectory. To compute the rotation matrix, we used the least-squares-based method described in [30]. Thus, the optimal solution for E R (pM) , T (pM) could be obtained by computing the covariance matrix as follows: where,Ĉ i is the centroid of trajectory captured on the coordinate system of target sensor number p, andĈ M is the centroid of trajectory captured on the coordinate system of the master sensor. By using SVD, the covariance matrix Cov is decomposed as Cov = USV T . Then, the rotation matrix could be calculated as R (pM) = UV T and translation matrix T (pM) =Ĉ M −Ĉ p . This procedure occurs between all coordinate systems of target sensors and the master sensor; as a result of this procedure, all coordinate systems of target sensors are calibrated with the coordinate system of the master sensor. In this paper, we performed marker calibration to obtain the collected 3D skeleton data based on the desired coordinate system. This is accomplished by placing a plane marker with a specific pattern in the capture area and recognizing it in master Kinect's RGB view. As shown in Figure 3, we used a plate printed with four ARUCO markers. The method of [31] implemented in the OpenCV ver3.4.0 library was used for marker recognition. After detecting the marker in the master Kinect's RGB image, the PCD corresponding to each marker can be extracted from the depth image collected along with the RGB image. Then, the centroid of the PCD is calculated to obtain the three-dimensional center points of all markers.
To set all coordinate systems to a custom coordinate system defined by the user, we calculate a rigid transformation matrix by setting the desired coordinate axis and origin point using the center points. As shown in Figure 3, the vector between the centroid of marker number 2 and the centroid of marker number 1 serves as the x-axis in this study, while the vector between the centroid of marker number 0 and the centroid of marker number 1 serves as the y-axis. The z-axis was set in the direction from the floor to the ceiling by calculating the cross-product of the x and y axes, and the origin point was set as the centroid of marker number 1.
After sensor-to-marker calibration, the coordinate systems of all target sensors transformed to the master sensor are transformed again to the custom coordinate system. Throughout this procedure, all data obtained from multiple RGB-D sensors can be taken based on the desired coordinate system and origin set by the user.

Skeleton Merging Algorithm
As mentioned in Section 1, a self-occlusion problem that randomly degrades the quality of the skeleton data could occur when using a single RGB-D sensor to track the target. For handling this issue, we adopted multiple RGB-D sensors to overcome the self-occlusion problem. If a joint in a position invisible to one sensor is visible to another, it will be possible to compensate using the appropriate merging algorithm. The main issue with tracking skeletons with multiple sensors is figuring out how to combine multiple inaccurate skeleton data sets. In this paper, we designed a skeleton merging algorithm that increases the accuracy of the merged skeleton according to the TNOS, as shown in Figure 4. After capturing skeleton data and applying calibration, the merging process is applied. The proposed algorithm's first component is the rearrangement of a joint's directions, and the second is DBSCAN, which is used as a noise filtering method in the merging process [32]. Lastly, the third one is the tracking method based on Kalman filter to track the target joint and make the movement of joint smooth by canceling the tremble error.

Arrangement of Skeleton to Correct the Misoriented Joints
Before we apply the merging process, the arrangement procedure that corrects the direction of joints is needed because there is a misoriented problem mentioned in [21,23,33]. In more detail, the misoriented problem refers to a phenomenon in which the left-right direction of the joint in the measured skeleton data differs from the actual joint's left-right direction. For example, while one of the measured skeleton's joints is labeled as the left shoulder, for the actual skeleton it may be the right shoulder. In addition, the left and the right directions of each joint may be recognized differently for each sensor. In Figure 5, three examples of misoriented problems are described. The yellow and green spheres indicate the positions of the elbow joint recognized by the four sensors as left and right, respectively. Since the spheres of the same color are not adjacent to each other but are mixed, it is confirmed that a misoriented problem occurs in the skeleton data obtained by the SDK. As a result, if joints are merged without alignment, it is inevitable to obtain crushed skeleton data such as the red skeleton in Figure 5. rate skeleton data sets. In this paper, we designed a skeleton merging algorithm that increases the accuracy of the merged skeleton according to the TNOS, as shown in Figure 4. After capturing skeleton data and applying calibration, the merging process is applied. The proposed algorithm's first component is the rearrangement of a joint's directions, and the second is DBSCAN, which is used as a noise filtering method in the merging process [32]. Lastly, the third one is the tracking method based on Kalman filter to track the target joint and make the movement of joint smooth by canceling the tremble error.

Arrangement of Skeleton to Correct the Misoriented Joints
Before we apply the merging process, the arrangement procedure that corrects the direction of joints is needed because there is a misoriented problem mentioned in [21,23,33]. In more detail, the misoriented problem refers to a phenomenon in which the left-right direction of the joint in the measured skeleton data differs from the actual joint's left-right direction. For example, while one of the measured skeleton's joints is labeled as the left shoulder, for the actual skeleton it may be the right shoulder. In addition, the left and the right directions of each joint may be recognized differently for each sensor. In Figure 5, three examples of misoriented problems are described. The yellow and green spheres indicate the positions of the elbow joint recognized by the four sensors as left and right, respectively. Since the spheres of the same color are not adjacent to each other but are mixed, it is confirmed that a misoriented problem occurs in the skeleton data obtained by the SDK. As a result, if joints are merged without alignment, it is inevitable to obtain crushed skeleton data such as the red skeleton in Figure 5. For that problem, the arrangement process for each of the left and right joint sets is adopted. The body tracking SDK of Azure Kinect provides the confidence value of tracked joints. There are three levels of confidence values: medium, low, and none. If the tracker can track the joint with an average level of confidence, the joint will be assigned a medium level of confidence value. Additionally, the low level is assigned to a joint that is not tracked but is estimated by the tracker because it is occluded or invisible. The joint with the none level indicates that the target joint is not in the field of view. Additionally, this level of assurance was used as the arrangement's standard.
At first, for each of the left and right joints, the reference position for arrangement is set by the average point of the joint positions that have a medium confidence level (if there is no medium level joint, the standard will be low level). After, the distance comparing will be conducted. Let where the , are the reference positions of the right and left joint, respectively. Additionally, i is the number of the set of left and right joints that should be verified. Then, the distance between the reference position and the joints in both directions is calculated and compared to correct the misoriented problem as the following function. For that problem, the arrangement process for each of the left and right joint sets is adopted. The body tracking SDK of Azure Kinect provides the confidence value of tracked joints. There are three levels of confidence values: medium, low, and none. If the tracker can track the joint with an average level of confidence, the joint will be assigned a medium level of confidence value. Additionally, the low level is assigned to a joint that is not tracked but is estimated by the tracker because it is occluded or invisible. The joint with the none level indicates that the target joint is not in the field of view. Additionally, this level of assurance was used as the arrangement's standard.
At first, for each of the left and right joints, the reference position for arrangement is set by the average point of the joint positions that have a medium confidence level (if there is no medium level joint, the standard will be low level). After, the distance comparing will be conducted. Let P Rr where the P Rr i , P Rl i are the reference positions of the right and left joint, respectively. Additionally, i is the number of the set of left and right joints that should be verified. Then, the distance between the reference position and the joints in both directions is calculated and compared to correct the misoriented problem as the following function.
Here, the Dist P 1 , P 2 means the Euclidian distance (mm) between P 1 and P 2 , and the P Tr i , P Tl i are the positions of the target joint of the right and left sides, respectively. Then, we defined the checking rules to detect the misoriented problem as: When the misoriented joint set is detected, the direction label of the target joint set composed of left and right is corrected. The Azure Kinect body tracking SDK was used for this study, and the tracker provided skeleton data with 32 joints. However, we only used 16 joints from the torso (pelvis, spine, chest, neck, head, and shoulder) and limbs (elbow, wrist, hip, knee, and ankle). This is because other joints do not have high accuracy, including a large error when the subject performs a large movement action. The proposed arrangement process is applied to all limb joints, hip joints, and shoulder joints in this study to correct misaligned joints.

Skeleton Merging and Noise Filtering Using DBSCAN
In the merging process, we tried to conduct the merging process with only inlier candidates by noise filtering based on the positions of candidates. Noise candidates were defined as incorrectly recognized joints collected from sensors that had issues recognizing target joints. When compared to the inlier candidate, the noise candidate is relatively far from the actual target joint position. In this study, we use DBSCAN to filter these noise candidates during the merging process, as shown in Figure 6.

Skeleton Merging and Noise Filtering Using DBSCAN
In the merging process, we tried to conduct the merging process with only inlier candidates by noise filtering based on the positions of candidates. Noise candidates were defined as incorrectly recognized joints collected from sensors that had issues recognizing target joints. When compared to the inlier candidate, the noise candidate is relatively far from the actual target joint position. In this study, we use DBSCAN to filter these noise candidates during the merging process, as shown in Figure 6. DBSCAN is a clustering method based on the density of target data distribution. It operates under the assumption that candidates belonging to the same cluster will be distributed close to each other. Because it operates by including adjacent data in the same cluster based on the density of data, DBSCAN can perform clustering well on data of unspecified shapes. Furthermore, DBSCAN can classify noise data while clustering, which DBSCAN is a clustering method based on the density of target data distribution. It operates under the assumption that candidates belonging to the same cluster will be distributed close to each other. Because it operates by including adjacent data in the same cluster based on the density of data, DBSCAN can perform clustering well on data of unspecified shapes. Furthermore, DBSCAN can classify noise data while clustering, which mitigates the degradation of clustering performance caused by outlier participation. As a result of merging joint candidates that can be randomly distributed, noise can be appropriately filtered by using DBSCAN. It operates with hyperparameters composed of the neighboring data searching area ∈ and the minimum number of neighboring data N c . The detailed operation procedure of DBSCAN is described in Algorithms 1 and 2.

Input:
candidates set χ, searching area , minimum number of neighboring data N c Output:

Input:
data point χ, searching area Output: neighbors Our proposed algorithm defines the probability that the data in the cluster are the same as the actual location of the target joint considering both the number and the density of the cluster. In other words, it is assumed that the more densely the positions of candidate joints recognized from various angles belong, the higher the probability that the data constituting the cluster is the same as the actual joint coordinates. We use two more tricks in this case. First, the candidate for the reference position mentioned in Section 2.2.1 was chosen. In the clustering process, it gives more weight to candidates with a high confidence value of tracking state. Second, the previous position of the target joint was included as a clustering candidate. Even if an occlusion problem occurs during the skeleton data recognition process, the noise may be too large for DBSCAN to filter. Furthermore, there are cases where the movement of the recognized joint exceeds the actual joint movement distance or there is completely no recognized joint movement. To solve this problem, we added a smoothing effect to the movement of the joint by including the previous joint position in the clustering process.
After applying DBSCAN, we selected the cluster containing the most points as the candidate group for the target joint. Furthermore, we used the centroid of the chosen cluster as the merged joint position. The reference position and previous coordinates of the target joint, which were also included, were not used for the centroid calculation at this time to ensure the range of the actual joint movement. Among the hyperparameters of DBSCAN described above, N c was fixed to 1. In addition, the neighboring data searching area ∈ cannot be specified. Therefore, we conducted experiments on the searching areas of 5, 10, 15, and 20 cm in Section 3.

Joint Position Tracking Using Kalman Filter
Even after the merging process, the tremble noise remains in the target joint. Therefore, we applied a Kalman filter-based tracking method to each joint to make the movement of the target joint smooth [34]. We denote X j t = X j x,t , X j y,t , X j z,t the state vector of the joint number j at time step t and Z j t = x j Z,t , y j Z,t , z j Z,t is the measurement vector for this joint resulting from the previously described merging algorithm. Then, we designed a linear system for a state model with process noise, and its measurement model with measurement noise as follows: where A is the so-called state space transition model, H is the measurement matrix, and w p and v m are the process noise and measurement noise, respectively. Here, w p and v m are the white noise that complies with the Gaussian normal distribution with zero mean and covariance Q t and R t , respectively. In our system, Q t and R t are set to 0.01 and 1.0, respectively. For the input argument X j t , the 3D coordinate data of the target joint is used, and Z j t is a corrected position of X j j t t from the measurement step. The designed state model estimates a predicted position of joint, X j t , through the prediction and correction steps of the Kalman filter with X j t . In detail, the measurement step removes the noise in X j t and then the prediction step estimates X j t . The predicted state vector X j t and the predicted covariance matrix P j t are estimated in the prediction step at the time step t − 1 as follows: whereX j t−1 and P j t−1 are the posteriori state estimate and the posteriori error covariance matrix at the time step t − 1, respectively. Here, X j t is used as the tracked target joint position in this study. For the correction step, X j t , P j t , and the Kalman gain K t are used to calculate the posteriori system estimateX j t which removes the noise. Thus,X j t was calculated as the corrected position by using the measurement value p j t and the posteriori covariance matrix P j t as follows: which will be used for the prediction step at time step t + 1. By applying a Kalman filter to each joint and performing a tracking method, it is possible to obtain the smooth movement of joints with unrecognized movement or tremors.

Experiment Setting
In this section, the experiment settings and environment are described. All experiments were conducted on an Intel 19-11900F octa-core microprocessor clocked at 2.50 Ghz with 32 GB RAM. Additionally, we used two GeForce RTX 2060 super GPU for operating the Azure Kinect body tracking SDK. In the experiment, five Kinect Azure sensors were used to capture all of the subjects' gestures. The sensor SDK version was 1.4.1, and the body tracking SDK version was 1.1.0. Additionally, every procedure was developed in the C++ environment. The proposed algorithm's goal is to track skeleton data in real-time. However, when the body tracking SDK for five sensors was operated on two GPU, the capturing speed was less than 10 frames per second. Therefore, we chose the lite-model that had a 2 times performance increase and 5% accuracy decrease among the models of body tracking SDK (as described in https://docs.microsoft.com/en-us/azure/kinect-dk/ accessed on 17 April 2022). Additionally, the depth mode of the sensor was narrow field-of-view that has the smallest depth image size. Then, with five sensors in a single PC, we could achieve a capturing rate of 30 frames per second (obtaining data using all sensors). Furthermore, the proposed skeleton merging algorithm generates results at a speed of 1-2 msec per frame, on average. As a result, the final tracking ran at approximately 28-30 frames per second, allowing real-time tracking of skeleton data. In addition, by connecting multiple devices to each other, all sensors' time steps were synchronized. Therefore, no additional work for time synchronization was required because the sensor SDK manages the trigger timing between linked sensors.
We conducted an experiment to evaluate the proposed skeleton tracking algorithm, capturing six gestures performed by six different people using four Azure Kinect sensors. The gestures are both hands up and down, jump, squat, lunge, walking, and moving the body in a standing pose (random movement). The examples of gestures are described in Figure 7. All subjects repeatedly performed each gesture for 1000 frames. Additionally, all gestures were started with a standing pose. In the case of the jump gesture, all subjects jumped naturally with their hands raised. In the case of random movement, all subjects moved their body in standing pose, for example, waving arms, leaning, or crossing arms. Many studies that test the accuracy of skeleton data use motion capture equipment as the ground truth. However, there is an interference problem between the RGB-D sensor and the motion capture system [35,36]. This is caused by the interference of infrared (IR) wavelength between the RGB-D sensor and the IR sensor of the motion capture system. This issue prevents the RGB-D sensor from measuring depth data and causes serious issues when the RGB-D sensor's body tracker estimates skeleton data. Furthermore, because the retroreflective marker used in the motion capture system reflects IR, the RGB-D sensor cannot extract depth information from the corresponding area, affecting skeleton data recognition. Additionally, the position of each joint in the skeleton data provided by motion capture does not perfectly match the skeleton data of the RGB-D sensor body tracker. As a result, it is not appropriate to use the motion capture system as the ground truth in quantitatively evaluating the proposed algorithm. retroreflective marker used in the motion capture system reflects IR, the RGB-D sensor cannot extract depth information from the corresponding area, affecting skeleton data recognition. Additionally, the position of each joint in the skeleton data provided by motion capture does not perfectly match the skeleton data of the RGB-D sensor body tracker. As a result, it is not appropriate to use the motion capture system as the ground truth in quantitatively evaluating the proposed algorithm. Therefore, similar to the evaluation method of [21], we adopted the skeleton data from the best view sensor as the ground truth for the evaluation. The performance of Kinect body tracking is the best when entire body parts could be observed in the depth view of the sensor [14]. Additionally, when the sensor measures the subject from the front view of the subject, the largest number of body parts can be measured [17]. In other words, when the skeleton data are measured from the front view, the result of body tracking SDK could have the best accuracy. Therefore, the skeleton data measured from the front of the subject was adopted as the ground truth. By comparison, we evaluated how different the skeleton data merged by the proposed algorithm was from the ground truth.
For the experiment, we installed four sensors to capture the gestures of subjects and one more sensor was installed additionally for the ground truth. Figure 8   Therefore, similar to the evaluation method of [21], we adopted the skeleton data from the best view sensor as the ground truth for the evaluation. The performance of Kinect body tracking is the best when entire body parts could be observed in the depth view of the sensor [14]. Additionally, when the sensor measures the subject from the front view of the subject, the largest number of body parts can be measured [17]. In other words, when the skeleton data are measured from the front view, the result of body tracking SDK could have the best accuracy. Therefore, the skeleton data measured from the front of the subject was adopted as the ground truth. By comparison, we evaluated how different the skeleton data merged by the proposed algorithm was from the ground truth.
For the experiment, we installed four sensors to capture the gestures of subjects and one more sensor was installed additionally for the ground truth. Figure 8 depicts the locations of all capturing sensors with gray color as well as the areas where the subjects performed the gestures. The positions of capturing sensors measuring the skeleton data of the subject were fixed. The subjects performed all gestures for each direction described as the blue arrow-cross in Figure 8. According to directions of the subject, the reference sensor was moved to obtain the reference data that measured the subject from the front (the candidate positions of reference sensor are described in Figure 8 with green color). The best view sensor was capable of capturing all gestures without occlusion.
Sensors 2022, 22, x FOR PEER REVIEW 15 of the subject were fixed. The subjects performed all gestures for each direction descr as the blue arrow-cross in Figure 8. According to directions of the subject, the refer sensor was moved to obtain the reference data that measured the subject from the (the candidate positions of reference sensor are described in Figure 8 with green co The best view sensor was capable of capturing all gestures without occlusion. When all capturing sensors measure the skeleton data of the subject, the self-o sion problem arises in any direction the subject performs a gesture. We defined the tance between the joint positions of the ground truth and the joint positions of the mer process as an error for the evaluation in millimeters. The standard deviation of the values is also calculated. Through a designed experiment, the difference between the eton data merged by the proposed algorithm and the skeleton data measured from front was compared. The analysis of the results is described in the following section the raw data (RMSE, STD) is provided in the Appendixes A and B. In the case of Orientation Resetting and Average (A2), since there no significant difference in performance with A1 in our evaluation data, it was exclu from the comparison group. In other words, there was no misorientation case in ev tion data. However, a misorientation situation was observed when testing the body t ing SDK with a sensor height of 180 cm. Furthermore, since many studies reported misorientation error of skeleton data measured by Kinect body tracking SDK, the rea ment process of the joint's orientation is determined as essential [21,23,33]. For the r When all capturing sensors measure the skeleton data of the subject, the self-occlusion problem arises in any direction the subject performs a gesture. We defined the distance between the joint positions of the ground truth and the joint positions of the merging process as an error for the evaluation in millimeters. The standard deviation of the error values is also calculated. Through a designed experiment, the difference between the skeleton data merged by the proposed algorithm and the skeleton data measured from the front was compared. The analysis of the results is described in the following section and the raw data (RMSE, STD) is provided in the Appendices A and B. . Each comparison group represents the elements constituting the proposed merging algorithm. In the case of Orientation Resetting and Average (A2), since there was no significant difference in performance with A1 in our evaluation data, it was excluded from the comparison group. In other words, there was no misorientation case in evaluation data. However, a misorientation situation was observed when testing the body tracking SDK with a sensor height of 180 cm. Furthermore, since many studies reported the misorientation error of skeleton data measured by Kinect body tracking SDK, the realignment process of the joint's orientation is determined as essential [21,23,33]. For the result of A5, the average value of the results of setting the DBSCAN searching area to 5, 10, 15, and 20 cm was used. The first experiment's evaluation is divided into joints corresponding to the torso, upper limb, and lower limb, and the details are as follows. of A5, the average value of the results of setting the DBSCAN searching area to 5, 10, 15, and 20 cm was used. The first experiment's evaluation is divided into joints corresponding to the torso, upper limb, and lower limb, and the details are as follows. As a result of A1, the torso joints had an average error (AE) of 41.1 mm, the upper limb joints had an AE of 90.9 mm, and the lower limb joints had an AE of 45.2 mm. Additionally, the standard deviation (STD) was 11.02, 44.89, and 22.0, respectively. In the case of A3, the AE of the torso was 40.8 mm, the upper limb was 66.0 mm, and the lower limb was 38.9 mm, with STD values of 13.1, 40.1, and 17.8, respectively. There was an improvement in joint position accuracy in the joints corresponding to the upper and lower limbs, and the STD of error was reduced in the case of the lower limb, resulting in a smoothing effect. As a result of A4, torso had an AE of 41.1 mm, upper limb 60.2 mm, and lower limb 38.9 mm. STD was 12.0, 33.8, and 17.3, respectively. A4 also showed improvement in performance in the joints corresponding to the upper and lower limbs, also showing a smoothing effect by reducing the STD of error in the joints corresponding to the upper limb. Finally, in the case of A5, the torso had an AE of 39.9 mm, the upper limb had an AE of 55.9 mm, and the lower limb had an AE of 36.2 mm, with the STD of error being 10.3, 29.3, and 14.2, respectively. The performance for positioning accuracy in all joints improved as a result of A5, and the standard deviation of error was also reduced. Consequently, the performance of A5 proposed in this paper had the highest accuracy among all comparison groups. The raw data of the experimental result of algorithm improvement was described in Tables A1-A4 in Appendix A.

Result of Different Searching Areas of DBSCAN
The second analysis is a comparison of results according to the searching area of DBSCAN used in A3, A4, and A5. Additionally, the comparison group consisted of 5, 10, 15, and 20 cm. As in the above analysis, the results for all gestures are divided into torso, upper limb, and lower limb as shown in Figure 10. As a result of the searching area of 5 cm, the joints of torso had an AE of 40.6 mm, the joints of the upper limb had an AE of 60.2 mm, and the joints of the lower limb had an AE of 36.0 mm. For STD, the results were 13.5, 36.2, and 14.4, respectively. This indicates that the search area is insufficiently large. In particular, in the case of a joint corresponding to a fast-moving limb, a sufficient number of inlier candidates cannot participate in the merging process. Furthermore, the point used for 1 frame smoothing prevents fast-moving inlier candidates from taking part in the merge process. When the searching area was set to 10 cm, the torso had an AE of 39.1 mm, the upper limb had an AE of 51.6 mm, and the lower limb had an AE of 35.7 mm, with the The performance for positioning accuracy in all joints improved as a result of A5, and the standard deviation of error was also reduced. Consequently, the performance of A5 proposed in this paper had the highest accuracy among all comparison groups. The raw data of the experimental result of algorithm improvement was described in Tables A1-A4 in Appendix A.

Result of Different Searching Areas of DBSCAN
The second analysis is a comparison of results according to the searching area of DBSCAN used in A3, A4, and A5. Additionally, the comparison group consisted of 5, 10, 15, and 20 cm. As in the above analysis, the results for all gestures are divided into torso, upper limb, and lower limb as shown in Figure 10. As a result of the searching area of 5 cm, the joints of torso had an AE of 40.6 mm, the joints of the upper limb had an AE of 60.2 mm, and the joints of the lower limb had an AE of 36.0 mm. For STD, the results were 13.5, 36.2, and 14.4, respectively. This indicates that the search area is insufficiently large. In particular, in the case of a joint corresponding to a fast-moving limb, a sufficient number of inlier candidates cannot participate in the merging process. Furthermore, the point used for 1 frame smoothing prevents fast-moving inlier candidates from taking part in the merge process. When the searching area was set to 10 cm, the torso had an AE of 39.1 mm, the upper limb had an AE of 51.6 mm, and the lower limb had an AE of 35.7 mm, with the error STD values being 9.6, 26.7, and 13.8, respectively. As a result of setting the searching area to 15 cm, the torso had an AE of 39.9 mm, the upper limb had an AE of 53.5 mm, and the lower limb had an AE of 36.2 mm. Additionally, they had STDs of 9.0, 26.6, and 14.0, respectively. As a result of the searching area of 20 cm, the torso had an AE of 39.9 mm, upper limb 58.1 mm, and lower limb 36.8 mm, with STDs of 9.0, 27.8, and 14.4, respectively. Setting the search area to 15 or 20 cm is too large for noise filtering. As a result, the error increased because points corresponding to noise candidates participated in the merge process without being filtered. Overall, the case of the 10 cm searching area performed the best out of all comparison groups. Furthermore, the standard deviation of the error with searching areas of 15 and 20 cm is less than 10 cm, but there is no statistically significant difference.
error STD values being 9.6, 26.7, and 13.8, respectively. As a result of setting the searching area to 15 cm, the torso had an AE of 39.9 mm, the upper limb had an AE of 53.5 mm, and the lower limb had an AE of 36.2 mm. Additionally, they had STDs of 9.0, 26.6, and 14.0, respectively. As a result of the searching area of 20 cm, the torso had an AE of 39.9 mm, upper limb 58.1 mm, and lower limb 36.8 mm, with STDs of 9.0, 27.8, and 14.4, respectively. Setting the search area to 15 or 20 cm is too large for noise filtering. As a result, the error increased because points corresponding to noise candidates participated in the merge process without being filtered. Overall, the case of the 10 cm searching area performed the best out of all comparison groups. Furthermore, the standard deviation of the error with searching areas of 15 and 20 cm is less than 10 cm, but there is no statistically significant difference.

Result according to TNOS
Finally, a third analysis was performed to evaluate the performance of the proposed algorithm in improving the accuracy of the skeleton data by merging the skeletons obtained from multiple sensors. This analysis also made use of data from the described gestures performed by all subjects. The comparison was carried out by adjusting the TNOS used for merging, and the results are depicted in Figure 11. The TNOS values used in the evaluation are 1, 2, 3, and 4. In the case of a single sensor, the raw data was used instead of the merging algorithm. The comparison components (AE, STD) were calculated as the average value of all combinations with the same size as the TNOS used for merging. In addition, based on the above experimental results, the DBSCAN searching area was fixed to 10 cm that had the best performance.

Result According to TNOS
Finally, a third analysis was performed to evaluate the performance of the proposed algorithm in improving the accuracy of the skeleton data by merging the skeletons obtained from multiple sensors. This analysis also made use of data from the described gestures performed by all subjects. The comparison was carried out by adjusting the TNOS used for merging, and the results are depicted in Figure 11. The TNOS values used in the evaluation are 1, 2, 3, and 4. In the case of a single sensor, the raw data was used instead of the merging algorithm. The comparison components (AE, STD) were calculated as the average value of all combinations with the same size as the TNOS used for merging. In addition, based on the above experimental results, the DBSCAN searching area was fixed to 10 cm that had the best performance.  Regarding the result of a single sensor, the number of combinations was 4, had an AE of 63.3 mm for the joints corresponding to the torso, 125.6 mm for the joints belonging to the upper limb, and 60.2 mm for the lower limb. The STD values of error were 15.8, 64.2, and 28.6, respectively. The number of combinations in the case of using two sensors for merging was 6, and the torso had an AE of 50.9 mm, the upper limb had an AE of 103.5 mm, and the lower limb had an AE of 46.8 mm. Additionally, the standard deviations of error were 13.6, 57.5, and 20.3, respectively. The number of combinations for the use of three sensors was 4, and the AE of torso joints was 43.4 mm, upper limb 66.3 mm, and lower limb 38.3 mm. The standard deviations of error were 12.2, 37.1, and 14.5, respectively. Lastly, in the case of four sensors with a single combination, the AE of torso joints was 39.6 mm, the upper limb 51.8 mm, and the lower limb 35.8 mm. Additionally, the STD values of error were 9.8, 26.0, and 13.2, respectively. Consequently, the accuracy of merged skeleton data increased according to the increase in the TNOS used for merging. The result of TONS experiment was described in Tables A5 and A6 in Appendix B.

Discussion
In this study, we proposed the markerless skeleton tracking algorithm to track skeleton data accurately. The main strategy of the algorithm is filtering the noise candidate joints that occurred due to the self-occlusion problem in the merging process. The proposed algorithm was evaluated by comparing the ground truth obtained with the best view sensor that measures the subject from the front. The detailed analysis for this experimental result is as follows: The analysis was conducted based on the results of the searching area of 10 cm with the best performance. Among the joints corresponding to the torso (pelvis, spine naval, neck, left hip, right hip, left shoulder, right shoulder, and head joints) they could be measured in all gestures because of the low installation height of all sensors. Furthermore, because the amount of change in the positions of the torso joints was not large while the subjects performed all gestures, the proposed algorithm outperformed A1 by less than 7 mm.
In the case of the upper limb joints, the proposed algorithm showed at least 15 mm better results than A1 in all gestures. Among them, the squat gesture improved performance the most. Furthermore, when compared to the results of the elbow joint, the performance of the wrist joint improved by an average of 10 mm or more. In comparison to A1, A3 improved by 20.1 mm, A4 by 23.4 mm, and A5 by 27.0 mm as a result of the elbow joint in all gestures. In the case of the squat gesture, the A3 improved by 34.3 mm, the A4 by 39.3 mm, and the A5 by 42.7 mm for elbow joint. Moreover, in the case of the wrist joint, there was an improvement in the performance of 29.8 mm for A3, 45.4 mm for A4, and 51.6 mm for A5 compared with A1 for all gestures. Additionally, the same as in the case of the elbow, it showed the maximum performance in the squat gesture, and there were performance improvements of 44.0, 80.0, and 83.9 mm for each algorithm. The subjects frequently raised their upper limb toward the sensor during the squatting gesture, resulting in self-occlusion problems at the elbow and wrist joints.
In the case of the joints corresponding to the lower limb, the difference in improvement of performance between the knee and ankle joints is not large, within 7 mm except for the lunge gesture. This is because, due to the characteristics of all gestures, the effect of occlusion affecting the lower limb was not large. Thus, we described the result of the lunge gesture below. In the results of the knee joint, the coordinate accuracy of the joint corresponding to the lower limb was improved by 25.5 mm in A3, 27.7 mm in A4, and 27.6 mm in A5 compared with A1, respectively. Additionally, in the case of the ankle joint, the improvement of A3 was 28.4 mm, A4 32.2 mm, A5 34.5 mm, respectively. Consequently, while there was a large error in the result of A1 in the merged skeleton, the skeletal data composed of the precise joint position could be obtained using the algorithm proposed in this paper.
Our results are comparable to the performance in existing studies related to tracking accurate skeleton data by merging multiple inaccurate skeleton data. Existing skeleton merging algorithms were mainly developed on the basis of Kinect V2. In addition, the superiority of the algorithm was evaluated compared to the performance of a single Kinect. In [22], the authors reported an error of 87 mm for the entire joint in the T-pose and walking around gesture. For the experiment, eight Kinect v2 sensors were used and a marker-based motion capture system was used as the ground truth. They mentioned that the position of the skeleton recognized using the motion capture device and the skeleton of the Kinect SDK did not match perfectly, so there was a problem in accurate comparison. To overcome this, the position of the marker closest to the skeleton joint among the markers attached to the body of the target was adopted as the ground truth, and there was an average distance difference of 100 mm between the joint of skeleton data and marker position. In [23], a merged skeleton was obtained using seven Kinect V2 sensors, and the skeleton data measured in the best view was adopted as the ground truth in the same way as in this study. They reported an average error of 80.3 mm in the experiments on standing, rotating, walking, roaming, and free motion gestures performed by seven subjects. The best view sensor was automatically selected among the capturing sensors using the factor used in the PJA algorithm proposed in the paper. In [24], the authors merged skeletons obtained from five Kinect V2 sensors into one skeleton. They adopted a marker-based motion capture system as the ground truth and evaluated the performance of the proposed merged algorithm in terms of the gestures of walking, spinning, sitting, running, kicking, punching, and crossing of limbs. As a result, the average errors for all joints were 97.1, 91.2, and 69.5 mm for single Kinect (centre-Kinect), simply average, and the proposed merging method, respectively. They also reported that motion capture skeleton and Kinect skeleton did not match perfectly, as in [22], and there was an average difference of 55 mm between the skeleton of Kinect and motion capture system. In [26], the authors obtained a merged skeleton using three Kinect V2 sensors. They evaluated the performance of the proposed merging algorithm using a specific training protocol and non-standing posture, including standing, bending, squat, lying, crossing arms or legs. They also adopted the skeleton obtained from the marker-based motion capture system as the ground truth. As a result, they reported that the accuracy of the merged skeleton was improved by 15.7% compared to the single Kinect, and the average error was measured to be less than 55 mm for all joints.
As with the results in this study, most of the studies have reported that the performance improvement in the joints belonging to the limbs is more pronounced compared to the torso joints. Ref. [26] also reported that the joint belonging to the limbs had a higher performance improvement than the torso. In [25], the authors obtained merged skeleton data of 20 people walking on the treadmill using five Kinect V2 sensors. They evaluated the performance of the merging algorithm using the STD for the difference between the pre-measured bone length of the subjects and that of the merged skeleton. As a result, the STD of the torso was 5.9, 12.0 for the upper limb and 21.8 for the lower limb. In [19], the authors proposed a skeleton merging algorithm for each Kinect V2 and Azure Kinect skeleton. Three of each sensor types were used, and a marker-based motion capture system was adopted as the ground truth. They evaluated the proposed algorithm for running, kicking, punching, crossing arms, crossing legs, crossing arms and legs, bowing from the waist, sitting on the chair, spinning, and walking around. As a result, Kinect V2 reported an error of 46.2 mm for the torso joints, 105 mm for the upper limb, and 135.5 mm for the lower limb. According to the Azure Kinect result, an error of 31 mm for the torso, 59.5 mm for the upper limb, and 121.5 for the lower limb was measured. As mentioned before, the interference problem and IR-reflective marker issue were reported also. For this issue, controlling the trigger between the motion capture system and Azure Kinect was necessary. Additionally, the miniature markers of 2.5 mm in diameter were used to avoid the problem that depth information is not measured by the markers. However, despite these attempts, the problem that occurs when the motion capture system and Azure Kinect are running at the same time cannot be completely eliminated. In [21], the authors proposed a skeleton merging algorithm using four Kinect V2 sensors. Similar to this study, the skeleton measured using the manually selected best view sensor was adopted as the ground truth. The authors collected validation data for walking, walking and spinning, walking while moving arms, walking and bending down, spin arms, and jumping jack gestures performed by six subjects (three for training; three for testing). The performance of the proposed algorithm was evaluated by comparing with the results of a single sensor.
As a result, the error for the single sensor was 128.6 mm in the torso, 187.2 mm in the upper limb, and 157.3 mm in the lower limb. The error for the merged skeleton was 86.0 mm in the torso, 94.5 mm in the upper limb, and 97.5 mm in the lower limb.
As a result of analyzing the performance improvement of the components constituting the algorithm, the merged skeleton using A5 (Orientation Resetting and 1 frame Smoothing DBSCAN with Kalman Filter) was the most accurate. Furthermore, the highest accuracy was obtained when 10 cm was applied to the DBSCAN searching area. Similar to the results of other research, the proposed algorithm improved the accuracy of the merged skeleton compared to the single sensor, and the accuracy improvement for the joints belonging to the limbs was greater than the joints belonging to the torso. In addition, our result was comparable or greater to the performances in existing studies about development of skeleton merging algorithms. Especially, the error of the proposed algorithm was relatively low compared to [21,23], which used the best view evaluation method as in this study. In addition, the accuracy of the merged skeleton increased as TNOS increased. However, the evaluation method performed to evaluate the performance of the proposed algorithm only evaluates the difference between the merged skeleton and the skeleton captured using the best view. Therefore, in order to evaluate the performance of a wider range, it is necessary to compare it with an interference-free measuring device that can measure actual human behavior, such as motion capture using IMU.
Consequently, the proposed system can track the skeleton as accurately as a skeleton measured in front of the subject. The proposed system can be utilized in the field of behavioral monitoring research targeting human tracking. Furthermore, it can be used for a variety of interactive content such as games or education. The proposed algorithm, on the other hand, is intended to apply to a single person. In this study, we focus on applying the algorithm to only a single person. Therefore, extending the study is needed to apply the proposed algorithm for tracking multiple people. For the next study, we have a plan to implement a game-like interaction with a large number of participants. A study on the standards of gestures (actual motion shape and speed) that can be actually applied by using the skeleton data will be conducted to define the possible interaction. Additionally, discussion of the proper installation of sensors to track multiple people will also be covered.

Conclusions
The goal of this paper was development of the markerless skeleton tracking system using multiple RGB-D sensors. We proposed the algorithm to merge multiple skeleton data, which includes an error in the position of joints due to self-occlusion problems, into accurate single skeleton data. The main issue with this approach was determining how to filter the noise candidates reported by the multiple RGB-D sensors during the merging process. To address this issue, we used a clustering algorithm called DBSCAN. We proposed additional tricks to increase the weight of inlier candidates participating in the merging process. For the evaluation of the proposed algorithm, we conducted the experiment capturing the six gestures performed by six subjects using four capturing RGB-D sensors and a single best view sensor for obtaining ground truth. As a result of the analysis, the proposed algorithm showed the most accurate performance. Additionally, the proposed method showed relatively lower errors than other related studies. Furthermore, the result of the 10 cm searching area of DBSCAN showed the highest accuracy. Consequently, using the algorithm proposed in this study, it was possible to acquire skeleton data as accurate as the skeleton data measured from the front of the subject. Institutional Review Board Statement: Ethical review and approval were waived for this study due to the data used in this study was lab-data. Also, the The data does not contain any information about the subject's identity.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The authors declare no conflict of interest.