How the Processing Mode Influences Azure Kinect Body Tracking Results

The Azure Kinect DK is an RGB-D camera popular in research and studies with humans. For good scientific practice, it is relevant that Azure Kinect yields consistent and reproducible results. We noticed that the results it yielded were inconsistent. Therefore, we examined 100 body tracking runs per processing mode provided by the Azure Kinect Body Tracking SDK on two different computers using a prerecorded video. We compared those runs with respect to spatiotemporal progression (spatial distribution of joint positions per processing mode and run), derived parameters (bone length), and differences between the computers. We found a previously undocumented converging behavior of joint positions at the start of the body tracking. Euclidean distances of joint positions varied by clinically relevant amounts of up to 87 mm between runs for CUDA and TensorRT; CPU and DirectML had no differences on the same computer. Additionally, we found noticeable differences between two computers. Therefore, we recommend choosing the processing mode carefully, reporting the processing mode, and performing all analyses on the same computer to ensure reproducible results when using Azure Kinect and its body tracking in research. Consequently, results from previous studies with Azure Kinect should be reevaluated, and until then, their findings should be interpreted with caution.


Introduction
Reproducibility is one of the main quality criteria in research. Measurement errors of instruments should be small and stable to ensure reproducible and consistent results. Like any other measurement instrument, popular depth cameras do not provide error-free measurements. A frequently used depth camera in research is the Microsoft Azure Kinect DK RGB-D camera with its Azure Kinect Body Tracking SDK. Among other things, it is used for movement analysis [1,2] as well as posture analysis [3,4]. The Azure Kinect has a 1-megapixel time-of-flight (ToF) camera installed for depth measurement, which has a typical systematic error of <11 mm + 0.1% of the distance to the object and a random error with a standard deviation of ≤17 mm according to its manufacturer [5]. These numbers were confirmed in a study by Kurillo et al. [6]. The systematic error describes the difference between the measured depth (time average over several frames) and the correct depth. Multi-path interference, i.e., the condition when a sensor pixel integrates light reflected from multiple objects, is not considered here. The random error, on the other hand, is the standard deviation of the depth over time [7]. One source of depth noise is introduced by the internal temperature of the depth sensor. Tölgyessy et al. found that the camera needed to be warmed up for about 50-60 min before a stable depth output was acquired [8].
Not only does the sensory system itself introduce measurement errors; all elements in the entire processing chain, such as the body tracking using Microsoft's Azure Kinect Body Tracking SDK [9], introduce additional measurement errors. Measurement errors propagate and might be amplified further up the chain. Albert et al. found a mean Euclidean distance between 10 mm and 60 mm, depending on the tracked joint, in a comparison between the optical marker-based Vicon system (Vicon Motion Systems, Oxford, UK) as the gold standard and the Azure Kinect [10]. Ma et al. found a Root Mean Square Error (RMSE) of the joint angles in the lower extremities between 7.2° and 32.3° compared to the Vicon system [11].
According to Tölgyessy et al., the depth mode of the camera (wide field of view (WFOV) or narrow field of view (NFOV)), as well as the distance of the person to the camera, also have a large influence on the accuracy of the body tracking [12]. They showed a strong increase in the standard deviation at distances of more than 2.2 m from the camera in WFOV and more than 3.7 m in NFOV. Romeo et al. have shown that ambient light conditions also influence the accuracy of body tracking. They found an up to 2.4 times higher mean distance error when the subject was illuminated with 1750 lux compared to 10 lux [13]. However, neither light condition is realistic for a study with human subjects. Furthermore, we question whether the halogen light source used by Romeo et al. provided a considerable source of infrared light, which might have influenced the results. Our original goal was to investigate the influences of ambient (natural and artificial) light on body tracking accuracy.
During the initial analysis of our data collected under different light conditions, we found considerable differences in the body tracking results when running the Azure Kinect Body Tracking SDK using the processing mode CUDA (short for Compute Unified Device Architecture, developed by NVIDIA) repeatedly on the same video. These differences in body tracking mean that results cannot be reproduced, and therefore, quality assurance and good scientific practice cannot be ensured. At first sight, the differences were not explainable by the known sources of measurement error from previous studies described above. These studies focused mainly on the validity of the Kinect's measurements and its body tracking. In contrast, the differences we observed are signs of a repeatability problem rather than solely a validity problem. As a consequence, the inconsistent results of repeated body tracking runs may have influenced the validity analyses in previous studies. Additionally, the magnitude of the measurement error introduced by repeated body tracking runs might be clinically relevant, although unknown, since we did not find any literature describing differences between multiple runs of the body tracking. Therefore, we decided to perform additional experiments to further investigate the error introduced by the body tracking processing mode before investigating the influence of ambient light. These additional experiments and their results are the focus of this paper. The aim of this paper is to analyze and quantify the effects of the chosen processing mode on the results when using the Azure Kinect DK in combination with repeated body tracking runs using the Azure Kinect Body Tracking SDK.
The structure of this paper is as follows: Section 2 describes the experimental setup as well as the used hardware and software. Section 3.1 presents the methods and results of changes over time. Section 3.2 deals with the methods and results of the comparison of processing modes. Section 3.3 handles the comparison of different computers, followed by the discussion and conclusion in Sections 4 and 5, respectively.

Materials
In this paper, we aim to quantify the different results caused by the body tracking's processing mode. For all analyses, the experimental setup described in Section 2.1 was used together with the hardware and software described in Section 2.2.

Experimental Setup
To analyze the effects of running body tracking multiple times, we created an experimental setup designed to minimize the effects of other sources of noise that might influence the results. Therefore, we recorded a single 30-s video that served as input for all our experiments. The video was recorded in a windowless room without reflective surfaces.
To exclude the influences of external light sources, which might have an effect on the body tracking, as described by Romeo et al. [13], we turned off all lights in the room during the recording. To exclude the effects of movement on the results, we used a mannequin of 1.72 m in height instead of a human (Figure 1). The mannequin was stationary, i.e., did not move during the experiments, and was positioned in a neutral pose such that all joints were within the camera's field of view without (self-)occlusions (Figure 1b). The mannequin was placed frontally to the camera at a distance of 1.9 m. This distance represents the middle of the optimal range (0.5-3.86 m) for the NFOV of the camera [6,8]. According to the literature, a lateral view of the person is useful when either only the side of the body facing the camera is to be analyzed [14] or the side of the body facing away from the camera occludes itself [15]. However, since there is no occlusion in this study and both sides of the body are to be analyzed equally, the camera was placed frontally to the mannequin. The camera was placed on a tripod in front of the mannequin at approximately 1 m height to ensure a centered position of the mannequin in the depth camera's field of view. In addition, the camera case was aligned horizontally using a spirit level on top of the camera. A schematic overview of the setup is shown in Figure 1a.
In accordance with the results of Tölgyessy et al. [8], the camera was warmed up for an hour before recording. To check whether body tracking was able to estimate the skeleton of the mannequin, we used the k4abt_simple_3d_viewer from the Azure Kinect Body Tracking SDK. As shown in Figure 1c, the mannequin's skeleton was recognized by the body tracking.
Figure 1. Overview of the experimental setup. (a) Schematic setup. In addition, the coordinate system of the depth camera is shown, which is tilted downwards by 6° with respect to the camera's case. (b) Mannequin from the camera's point of view in a windowless dark room (picture taken with the lights turned on). (c) Point cloud of the mannequin with overlaid body tracking. Screenshot of k4abt_simple_3d_viewer from the Azure Kinect Body Tracking SDK.

Hardware and Software
The Azure Kinect DK depth camera's coordinate system is defined as follows: the x-axis points to the right, the y-axis to the bottom, and the z-axis to the front (seen from the camera's perspective and visualized in Figure 1a). The depth camera is tilted downward by 6° relative to the camera's RGB lens and the camera's case [16].
For the body tracking, we used the recorded video as input for all analyses utilizing the Microsoft Azure Kinect Body Tracking SDK version 1.1.2. The video was processed using the offline_processor from the Azure Kinect Samples on GitHub downloaded on 30 June 2022 [17]. The results from the offline_processor were stored in a JSON file.
The offline_processor was executed 100 times for each of the four processing modes (Central Processing Unit (CPU), Compute Unified Device Architecture (CUDA), Direct Machine Learning (DirectML), and TensorRT) provided by the Azure Kinect Body Tracking SDK. This process was performed on two desktop computers: computer A, with an Intel Core i9-10980XE 18-Core Processor running at 4.80 GHz and an NVIDIA GeForce RTX 3080 graphics card; and computer B, with an AMD Ryzen 7 5800X 8-Core Processor running at 3.80 GHz and an NVIDIA GeForce RTX 3070Ti graphics card. Both computers were running Windows 10. In the remainder, we present the results from computer A for all analyses unless stated otherwise. In Section 3.3, we present and compare the results from both computers.
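The 400 body tracking runs (100 per processing mode) can be automated with a small driver script. The following Python sketch uses hypothetical file names (offline_processor.exe, mannequin_30s.mkv) and assumes the offline_processor takes the recording and an output JSON path as arguments; how the processing mode is selected depends on how the sample is built and configured, so here the mode only appears in the output file name for bookkeeping.

```python
import subprocess
from pathlib import Path

# Hypothetical paths; adjust to your local build of the Azure Kinect Samples
# offline_processor and your recorded .mkv video.
OFFLINE_PROCESSOR = Path("offline_processor.exe")
VIDEO = Path("mannequin_30s.mkv")
MODES = ["CPU", "CUDA", "DirectML", "TensorRT"]
RUNS = 100

def build_command(mode: str, run: int) -> list:
    """Assemble the command line for one body tracking run.

    The processing mode is only encoded in the output file name here;
    selecting it inside the offline_processor is assumed to happen in
    the sample's configuration.
    """
    out_file = f"{mode}_run{run:03d}.json"
    return [str(OFFLINE_PROCESSOR), str(VIDEO), out_file]

def run_all() -> None:
    # Execute every processing mode 100 times on the same recording.
    for mode in MODES:
        for run in range(1, RUNS + 1):
            subprocess.run(build_command(mode, run), check=True)
```

Calling run_all() then produces one JSON result file per mode and run (e.g., CUDA_run001.json ... CUDA_run100.json) for the subsequent analyses.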
Of the 32 calculated joints provided by the Azure Kinect Body Tracking SDK, we limited our analyses to the set of main joints listed in Table 1. This was performed to provide a better overview of the relevant joints for posture analysis and because the excluded joints are difficult for the camera to recognize accurately. The results of the body tracking were analyzed using Python 3.8.10, except for the calculations of the ellipsoids for which we used MATLAB (version 2022a, The MathWorks Inc., Natick, MA, USA).

Table 1. Included and excluded joints from the Azure Kinect Body Tracking SDK joints in our analysis (for reference, see [18]).
Included: PELVIS, SPINE_NAVEL, SPINE_CHEST, NECK, SHOULDER_LEFT, ELBOW_LEFT, WRIST_LEFT, SHOULDER_RIGHT, ELBOW_RIGHT, WRIST_RIGHT, HIP_LEFT, KNEE_LEFT, ANKLE_LEFT, FOOT_LEFT, HIP_RIGHT, KNEE_RIGHT, ANKLE_RIGHT, FOOT_RIGHT
Excluded: CLAVICLE_LEFT, CLAVICLE_RIGHT, HAND_LEFT, HANDTIP_LEFT, THUMB_LEFT, HAND_RIGHT, HANDTIP_RIGHT, THUMB_RIGHT, HEAD, NOSE, EYE_LEFT, EYE_RIGHT, EAR_LEFT, EAR_RIGHT

Methods and Results
To analyze the body tracking and its differing results by processing mode, three main experiments were conducted. In the first experiment (Section 3.1), we processed the video of the static mannequin multiple times in each processing mode to analyze possible changes in joint positions over time caused by the body tracking. The second experiment (Section 3.2) deals with the effects of the individual processing modes on the joint positions, analyzing both the spatiotemporal distribution and the effect on derived parameters (e.g., bone length). In the last experiment (Section 3.3), these analyses were performed on two computers, and their results were compared. Figure 2 shows a schematic overview of the data processing and the experiments.

Methods
The joint positions of the joints included in Table 1 for the setup described in Section 2 were plotted in relation to the depth sensor's x-, y-, and z-axes for all body tracking runs and each processing mode.
In the resulting visualization, we noticed a converging behavior of the joint positions in the first few seconds from the start of the body tracking (described in detail below in Section 3.1.2). To model and quantify this converging behavior, we fitted an exponential curve for each axis, processing mode, and body tracking run using Python's scipy.optimize.curve_fit function and the following curve fitting formula:

f(x) = a · e^(−b·x) + c, (1)

where a, b, c ∈ R are constant coefficients calculated by Python's curve_fit function and x ∈ R. For each of the fitted curves, we calculated the half-life time T1/2 using the formula:

T1/2 = ln(2) / b, (2)

with b from Equation (1). We considered four times T1/2 (93.75%) as the cut-off point between the convergence and the random noise in the signal, i.e., considered the end of the convergence at x = 4 · T1/2. For each axis, joint, and processing mode, the mean was calculated. The overall cut-off point was calculated using the 85%-quantile over these means.
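The fitting procedure above can be sketched in a few lines. This is a minimal illustration, assuming the exponential-decay form f(x) = a·e^(−b·x) + c for Equation (1); the synthetic data and the initial guess p0 are our own choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, b, c):
    # Exponential convergence toward the steady-state value c (Equation (1)).
    return a * np.exp(-b * x) + c

def convergence_cutoff(frames, positions):
    """Fit the exponential model and return the cut-off frame 4 * T1/2."""
    p0 = (positions[0] - positions[-1], 0.05, positions[-1])  # rough initial guess
    (a, b, c), _ = curve_fit(exp_model, frames, positions, p0=p0, maxfev=10000)
    t_half = np.log(2) / b  # half-life time, Equation (2)
    return 4 * t_half       # ~93.75% of the convergence completed

# Synthetic example: a joint coordinate (in mm) converging over the first 90 frames.
frames = np.arange(90, dtype=float)
positions = 30.0 * np.exp(-0.1 * frames) + 500.0
cutoff = convergence_cutoff(frames, positions)  # about 4 * ln(2) / 0.1, i.e., ~27.7 frames
```

In the actual analysis, this fit would be repeated per joint, axis, processing mode, and run, with the overall cut-off taken as the 85%-quantile over the per-group means.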

Results
We observed an undocumented converging initialization behavior of joint positions estimated by the Azure Kinect Body Tracking SDK in the first seconds. Figure 3 shows this behavior for two exemplary joints in the first 90 frames: (a) the PELVIS and (b) the WRIST_LEFT. The PELVIS (Figure 3a) showed similar converging behavior to a steady state for all processing modes and all axes. For other joints, such as the WRIST_LEFT (Figure 3b), this converging behavior took only a few frames (e.g., CPU/DirectML x-axis in Figure 3b) or was not that distinctive (e.g., CUDA/TensorRT y-axis in Figure 3b). The duration of the converging behavior varied by joint and processing mode and ranged from 2 frames (SPINE_CHEST; DirectML) to 360 frames (FOOT_RIGHT; TensorRT). It is also noticeable that while most of the joints stabilized from left to right (x-axis), top to bottom (y-axis), and front to back (z-axis), some stabilized the other way around, e.g., the x-axis of the PELVIS stabilized from right to left. The range of stabilization differed between the various joints, axes, and processing modes. The range of stabilization for the y-axis of the WRIST_LEFT for processing mode CPU, for example, was less than 5 mm, while the z-axis of the same joint using CUDA showed a converging range of about 60 mm.

Methods
In Section 3.1.2, we observed a previously undocumented converging initialization behavior in the first seconds of the body tracking. Since these partially large variations can have a substantial influence on the quantification of the differences between the processing modes, we decided to discard the data in which the converging behavior was observed. Since the duration of the converging behavior highly varied by joint, axis, and processing mode, we decided to take the 85%-quantile (55 frames; Section 3.1.2) and round it up to the next full second of the video (frame 60, i.e., 2 s @30 FPS). Therefore, the first 60 frames were discarded in all analyses from this point on, i.e., only frames 61 until 900 were included in the analyses.
Differences in body tracking caused by the processing mode were quantified using three different metrics: (1) Volume of ellipsoids containing positions of one joint each: As the joint positions scatter in all three dimensions, they form a volume. The first observations showed that the scattering was not equally distributed over all axes. Since there was often a dominant direction, an ellipsoid was chosen over a sphere. These ellipsoids were calculated per processing mode and joint (Table 1) using MATLAB and the data of each of the 100 body tracking runs per processing mode. The function PLOT_GAUSSIAN_ELLIPSOID, written by Gautam Vallabha and available on the MATLAB Central File Exchange [19], was used for these calculations. As parameters for this function, we used SD = 2 and NPTS = 20 to create an ellipsoid with 20 faces that encapsulates approximately 86.5% of all data points [20]. The volume of the resulting ellipsoid was then calculated using MATLAB's convhull function by feeding it the data points of the ellipsoid.
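The same quantity can be computed analytically in Python instead of via a discretized MATLAB ellipsoid: the 2-SD Gaussian ellipsoid has semi-axes 2·√λi along the principal axes of the covariance matrix, so its volume is (4/3)·π·∏(2·√λi). A sketch under that assumption (note that taking the convex hull of a 20-face ellipsoid, as in the MATLAB approach, slightly underestimates this analytic value):

```python
import numpy as np

def ellipsoid_volume(points, sd=2.0):
    """Volume of the sd-standard-deviation Gaussian ellipsoid of 3D points.

    points: array of shape (n, 3) with the x/y/z positions of one joint
    over all frames and runs. The semi-axes are sd * sqrt(eigenvalue)
    along the principal axes of the covariance matrix.
    """
    cov = np.cov(points, rowvar=False)       # 3x3 covariance of x/y/z
    eigvals = np.linalg.eigvalsh(cov)        # principal-axis variances
    semi_axes = sd * np.sqrt(eigvals)
    return 4.0 / 3.0 * np.pi * np.prod(semi_axes)
```

Scaling all positions by a factor k scales the volume by k³, which makes the metric sensitive to the dominant scatter direction, as intended.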
(2) Euclidean distance between positions of a joint between the 100 runs: The Euclidean distances d_t(p, q) were calculated using the following formula:

d_t(p, q) = √((p_x,i − q_x,j)² + (p_y,i − q_y,j)² + (p_z,i − q_z,j)²),

where t = time in frame number 61 . . . 900, joint ∈ included joints (Table 1), p_{x/y/z},i = joint position for joint in run i for the x/y/z axes, respectively, q_{x/y/z},j = joint position for joint in run j for the x/y/z axes, respectively, i = body tracking run 1 . . . 100, j = body tracking run i . . . 100. The Euclidean distances were calculated for each frame and joint position from one body tracking run to all other runs within the same processing mode, i.e., when p was the position of the PELVIS in frame 61 for processing mode CPU in run 1, q was the position of the PELVIS in frame 61 for processing mode CPU in runs 2 until 100.
(3) Bone length of the extremities: The bone lengths of the extremities were calculated with s as the start joint and e as the end joint from Table A1 using the following formula:

l_t(bone) = √((s_x − e_x)² + (s_y − e_y)² + (s_z − e_z)²),

where t = time in frame number 61 . . . 900, bone ∈ bones (Table A1), s_{x/y/z} = position of the start joint (Table A1) for bone for the x/y/z axes, respectively, e_{x/y/z} = position of the end joint (Table A1) for bone for the x/y/z axes, respectively. The bone lengths were calculated for each frame and processing mode within the same body tracking run.
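Both distance metrics are straightforward to vectorize with NumPy. A minimal sketch, assuming the joint positions have already been parsed from the JSON output into arrays of shape (runs, frames, 3); the array layout is our own choice for illustration:

```python
import numpy as np

def euclidean_distances(runs):
    """Pairwise per-frame distances of one joint across body tracking runs.

    runs: array of shape (n_runs, n_frames, 3) holding the x/y/z position
    of a single joint (frames 61..900 after discarding the convergence
    phase). Returns an array of shape (n_pairs, n_frames), one row per
    run pair (i, j) with i < j.
    """
    diff = runs[:, None, :, :] - runs[None, :, :, :]  # (n_runs, n_runs, n_frames, 3)
    dist = np.linalg.norm(diff, axis=-1)              # per-frame Euclidean distance
    i, j = np.triu_indices(len(runs), k=1)            # unique run pairs i < j
    return dist[i, j]

def bone_lengths(start, end):
    """Per-frame bone length between a start and an end joint.

    start, end: arrays of shape (n_frames, 3) from the same run.
    """
    return np.linalg.norm(end - start, axis=-1)
```

From the returned arrays, the minimum, maximum, mean, median, and standard deviation per joint or bone follow directly via the corresponding NumPy reductions.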
We calculated and reported the minimum, maximum, mean, median, and standard deviation for each metric, processing mode, and joint or bone length, respectively.

Results
After discarding the first 60 frames of the body tracking results, the x-, y-, and z-axes of the positions for all four processing modes looked like the exemplary plots for PELVIS and FOOT_LEFT in Figure 5. Note that the converging stabilization phase disappeared. The positions were either stable in a distinct value range, i.e., a steady state (for example, the z-axis of PELVIS for CUDA or DirectML in Figure 5a), or they switched between two (or more) steady value ranges (e.g., the z-axis of FOOT_LEFT for CPU or CUDA in Figure 5b). Furthermore, it can be seen that while DirectML and CPU showed the same progression of position in the 100 runs, represented by the single line, CUDA and TensorRT showed differences in position during the 100 runs.

Figure 6. Ellipsoid volumes per joint and processing mode (see also Table A2 in the appendix). As the standard deviation for processing modes CPU and DirectML is zero, only the mean is shown for these modes. Note: the y-axis has a logarithmic scale.
It can be observed that the volume of the ellipsoids for processing modes CPU and DirectML did not differ between the 100 body tracking runs and, therefore, had a standard deviation of 0.0 mm³ for all joints. The minimal ellipsoid volume for the processing mode CPU was 2.2 mm³ (SPINE_NAVEL), and the maximal ellipsoid volume was 1797.0 mm³ (FOOT_LEFT). The ellipsoid volume for the processing mode DirectML was between 1.4 mm³ (SPINE_NAVEL) and 392.6 mm³ (FOOT_LEFT).
In general, the ellipsoid volumes, i.e., the variations over time and across runs, were the largest for the processing modes CUDA and TensorRT. The variations in the outer extremities (elbows, wrists, knees, ankles, feet) were significantly higher than the variations in the upper body (PELVIS, spines, hips, shoulders) (Figure 6; Table A2).
The ellipsoids of the joints did not only differ in volume; their directions differed as well. The PELVIS showed very little variation over time, i.e., had a small ellipsoid volume. Its variations were mainly in the x-y-plane (see Figure 1a for reference), as shown for processing mode CUDA in Figure 7. The KNEE_LEFT mainly showed variations over time in the y-z-plane (Figure 8a, shown for processing mode CUDA), except for DirectML (Figure 8b), which showed little variation in all directions. FOOT_LEFT showed very large variations in the z-direction and much smaller variations in the x- and y-directions for all processing modes (Figure 9a, as shown for CUDA), except for DirectML (Figure 9b), which showed relatively little variation in the z-direction. We visually analyzed the directions of the ellipsoids' semi-axes; however, no clear pattern emerged from this analysis. The ellipsoids were neither clearly rotated in the direction of the camera's laser rays nor distorted closer to the edge of the depth camera's field of view.

The spatiotemporal distribution for each of the joints is visualized for each processing mode in Figure 10. It becomes clear that FOOT_LEFT had the biggest distribution in the z-direction. Furthermore, this visualization confirms that points of the outer extremities had a larger spatiotemporal distribution than the points of the upper body. Considering all joints, the processing mode DirectML overall had the smallest distributions, followed by CPU. CUDA had the biggest distributions of all processing modes, closely followed by TensorRT. Visual analysis of the closest point toward the camera for all joints showed that all points lie behind the body surface. The only exception is WRIST_LEFT, where the foremost points lie slightly in front of or inside the body surface.

Euclidean Distances of Joint Positions between the Processing Modes
The Euclidean distance between frames in different body tracking runs within the same processing mode was calculated as a metric to quantify the differences in the joint position between each body tracking run. The processing modes CPU and DirectML produced the same result in each run, i.e., their Euclidean distance was 0.0 mm (Table A3).
For the processing mode CUDA (Figure 11a), the minimal Euclidean distance was 0.0 mm for all joints, and the maximum Euclidean distances were between 6.2 (PELVIS) and 87.2 mm (FOOT_LEFT). The means were between 0.9 (SPINE_NAVEL) and 17.9 mm (FOOT_LEFT), and the medians were between 0.7 (PELVIS) and 4.4 mm (FOOT_LEFT). The standard deviations were between 0.7 (SPINE_NAVEL) and 26.7 mm (FOOT_LEFT).
Compared to CUDA (Figure 11a), TensorRT had wider confidence intervals (Figure 11b) with fewer outliers. Similar to the ellipsoid volume from Figure 6, it becomes clear that the Euclidean distance was higher for the outer extremities than for the upper body.

Variations in Bone Length between Body Tracking Runs
The bone length was calculated as a metric for the differences in the size of the skeleton caused by the variations in body tracking between the various runs. As shown in Figure 12 and Table A4, the bone lengths varied much less over time and between the runs compared to the ellipsoid volume and Euclidean distance described above. In general, the mean bone lengths for processing modes CUDA, DirectML, and TensorRT were quite similar; the mean bone lengths for processing mode CPU were a few millimeters longer. DirectML showed the smallest standard deviation, followed by CPU and CUDA. TensorRT showed the largest standard deviations for the bone lengths but had no outliers.

Methods
Up to this point, we only considered the differences between body tracking runs on a single computer (computer A). Since we found substantial differences in the ellipsoid volumes, Euclidean distances, and bone lengths described above, we repeated the analyses on a second computer (computer B, described in Section 2.2).
To find similarities and differences between the results of computers A and B, the joint positions over time at the sensor's x-, y-, and z-axes, as well as the calculated bone lengths from both computers, were compared.

Results
When looking at Figure 13a, one can see in these exemplary plots that for processing modes CUDA and TensorRT, the behaviors of computers A and B looked similar. Similar behavior was observed for both processing modes for the remaining axes and joints (Table 1). The behavior of the processing modes CPU and DirectML, on the other hand, was only partially similar between the two computers. For example, the z-axis of FOOT_LEFT in processing mode CPU showed a switch to another steady value range around frame 280 on computer B (Figure 13c). However, a similar switch on computer A was first observed around frame 800. DirectML switched between steady value ranges about 2 mm apart on the x-axis of the PELVIS and about 65 mm apart on the z-axis of the FOOT_LEFT on computer B, but both remained stable on computer A (Figure 13d). On the y-axis of FOOT_LEFT in processing mode CPU, the values on computer B were only slightly larger compared to computer A. A similar, though not identical, pattern was seen for DirectML (computer A slightly larger than computer B; Figure 13b). Here, the steady value ranges for both computers were not exactly the same but did not show clear, distinct differences compared to the steady value ranges in Figure 13c,d. In general, the joint positions of the two computers differed for the processing modes DirectML and CPU, where the following distinct behaviors were observed: (1) one computer was continuously in the same steady value range, while the other computer switched to another steady value range (Figure 13d); (2) both computers switched, but at different points in time (Figure 13c); (3) the steady value ranges of both computers were very close to each other (Figure 13b).
The bone lengths showed similar properties for processing mode CUDA on both computers, as shown in Figure 14. For processing mode TensorRT, the bone lengths had a similar median. However, computer A had a bigger interquartile range and no outliers, whereas computer B had a much smaller interquartile range and many outliers. Thus, TensorRT on computer B produced results similar to CUDA on computers A and B.
The processing modes CPU and DirectML, on the other hand, had a difference in the median. CPU on computer B and DirectML on computer A had a median similar to CUDA and TensorRT, while CPU on computer A and DirectML on computer B had a median that was, on average, 2.7 mm higher.
It is striking that the box plots of one computer and processing mode show the same pattern for all bone lengths (distance of the quartiles to the median, distance of the whiskers, and positioning of the outliers).

Discussion
The aim of this paper was to analyze and quantify the effects of the chosen processing mode on the results when using the Azure Kinect Body Tracking SDK. For this purpose, spatiotemporal changes in the joint positions, the differences within and among processing modes, as well as differences between two different computers were analyzed. We have shown that there were considerable differences between the processing modes, between different runs of the body tracking, and between different computers.

Consideration of Change over Time
In Section 3.1.2, we described a converging behavior of body tracking in the first seconds. To the best of our knowledge, this behavior of stabilization has not been described before. It seems to originate from some kind of initialization phase of body tracking. Similar behavior was observed when body tracking was started after the first 100 frames of the video instead of from the beginning, although the pattern seen was ambiguous. However, since the converging behavior seemed to be more pronounced when starting body tracking from the first frame, it could be that the depth sensor itself needs a few seconds to initialize or to focus. Furthermore, there might be some kind of pre- or post-processing in the body tracking that needs a few frames to stabilize. Unfortunately, the Azure Kinect Body Tracking SDK is closed-source and, therefore, a black box, so we can only speculate on the causes of the observed behavior. This phase of stabilization could be the subject of future work, although the many possible camera settings, as well as external influences on the recording and body tracking, make it difficult to analyze and isolate the reason for this stabilization phase.
Since the Azure Kinect Body Tracking SDK needs some time (around 60 frames @30 FPS) to stabilize in a position, we recommend starting body tracking from the first frame but waiting approximately two seconds (@30 FPS) before starting the subsequent analysis of the joint positions.

Comparison of Processing Modes
Our analyses in Section 3.2.2 showed that the spatiotemporal distribution of joint positions over all body tracking runs was the smallest using processing mode DirectML, followed closely by CPU, and larger for CUDA and TensorRT. One reason is that DirectML and CPU showed the same results for all 100 runs, while CUDA and TensorRT produced different joint positions in each run. Thus, DirectML and CPU, unlike CUDA and TensorRT, seem to yield reproducible data and thereby achieve results that are compatible with quality assurance and good scientific practice.
The Euclidean distances between the joint positions in all runs, as well as the volume of the ellipsoids, represent the spatiotemporal distribution of the joint positions over all body tracking runs and frames. We found that both were considerably higher for the outer extremities (especially feet, but also wrists) than for the upper body (PELVIS, spine, hips, shoulders). Our results are consistent with the results of Albert et al., who calculated the Euclidean distances between Azure Kinect DK body tracking and a Vicon system [10]. The increasing difference might be related to the fact that the skeleton of the Azure Kinect Body Tracking SDK is built up like a tree, with the PELVIS as its root and the feet, thumbs, hand tips, as well as the points in the face, as its leaves [18]. Consistently, the average errors of the outer extremities (feet, ankles, wrists, hands) were higher than those of the upper body (PELVIS, hips, spines, shoulders, clavicles). The speed of convergence in the first frames, on the other hand, seemed to be independent of the tree structure of the skeleton.
Furthermore, the body tracking seemed to stabilize in different steady value ranges; either it stayed in one steady range or it switched between different steady ranges. For CUDA and DirectML, less than half of the plots showed switches in the x-, y-, and z-axes. In contrast, these switches occurred in more than half of all plots for the processing modes CPU and TensorRT. Furthermore, the distribution of the switches over the frames varied for all processing modes. Therefore, it can be assumed that the switches were not caused by noise in the recorded video, but depended on the selected processing mode.
Moreover, it is noticeable that the distribution of the calculated bone lengths over the runs and frames showed a similar pattern for all bones. This suggests that the recognized skeleton probably has only minor changes in the size ratio of the individual bones and is merely scaled differently from frame to frame (and run to run). It is reasonable to assume that body tracking keeps some kind of (historical) skeletal model when detecting joint positions. Colombel et al. also suggest that the Azure Kinect Body Tracking SDK tracks individuals in an anatomically consistent manner as additional anthropomorphic constraints had only little effect on body tracking results [21]. It is also interesting that the maximum Euclidean distance between two runs was 87.2 mm (CUDA, FOOT_LEFT); the maximum difference in bone length, however, was only 11.5 mm (CUDA, Torso) and thus had significantly smaller variations.
All in all, when deciding which processing mode to use, one has to consider that CPU and DirectML had no differences in multiple body tracking runs on the same computer, whereas CUDA and TensorRT had differences of up to 87 mm. This means CPU and DirectML yielded seemingly consistent and repeatable results (important for quality assurance); in contrast, CUDA and TensorRT yielded inconsistent results. Besides that, one should be aware that the outer extremities have a higher spatiotemporal distribution than the upper body. For further analyses, such as body posture analyses or research involving humans, it must, therefore, be noted that the results become less accurate when focusing on the outer extremities compared to the upper body.

Comparison of Different Computers
Our results in Section 3.3.2 showed that CUDA and TensorRT produced similar results on both computers for both the spatiotemporal joint positions and the bone lengths. The processing modes CPU and DirectML, on the other hand, showed clear differences in the joint positions and bone lengths. We did not find any studies that compared body tracking between multiple computers.
One reason why CUDA and TensorRT were similar on both computers could be that they yielded different results between the runs, while DirectML and CPU yielded the same results in every run. As described above, we observed different steady value ranges of the joint positions. Multiple runs of CUDA and TensorRT covered several steady ranges. Conversely, CPU and DirectML covered only one of the steady ranges or switched from one steady range to another. As a result, the differences in CUDA and TensorRT between the two computers more or less averaged out, i.e., showed a regression to the mean. On the other hand, when the processing modes CPU or DirectML cover one steady range on one computer and another range on the other computer (e.g., Figure 13c or Figure 13d), large differences in the joint positions between the computers can occur. Consequently, this can result in larger differences in bone lengths. Because of this phenomenon, we recommend performing all body tracking on a single computer.
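The switches between steady value ranges described here can be located with a simple jump heuristic on a single joint coordinate over time. This is our own illustrative check, not part of the SDK; a proper change-point detection method would be more robust against noise.

```python
import numpy as np

def detect_switches(signal: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of frames where a joint coordinate jumps between steady ranges.

    signal: 1-D array of one joint coordinate over frames (e.g., in mm).
    threshold: minimum frame-to-frame jump (same unit) counted as a switch.
    Deliberately simple heuristic for illustration only.
    """
    jumps = np.abs(np.diff(signal))
    return np.flatnonzero(jumps > threshold) + 1

# A signal that sits near 0 mm, switches to ~10 mm, and back.
sig = np.r_[np.zeros(50), np.full(30, 10.0), np.zeros(20)]
print(detect_switches(sig, threshold=5.0))  # prints [50 80]
```

Counting and locating such switch indices per run and processing mode would allow the distribution of switches over frames, as discussed above, to be compared quantitatively.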

Implications of the Results
Our results have shown that running body tracking repeatedly yielded clinically relevant differences in joint positions with Euclidean distances of up to 87 mm, depending on the processing mode used for the body tracking. Furthermore, we have shown that the computational hardware used can have an impact on the joint positions. Therefore, the results of previous studies using the Azure Kinect Body Tracking SDK might originate from differences in the body tracking algorithm instead of actual physiological effects being measured. Their results probably originate from a single body tracking run and might not be reproducible. It is difficult to assess the accuracy of their results since the processing mode used is usually not specified. As a consequence, the results from previous studies should be reevaluated, and until then, their findings should be interpreted with caution.
Not only is the interpretation of previous studies affected; future studies also need to take our results into account. The fact that body tracking might yield different results over multiple runs and with different processing modes implies that one should consider which data should be stored from a study. Several aspects should be taken into account: (i) data volume, (ii) privacy, and (iii) measurement error. Possible sets of data to store are: (a) raw data of the recorded video (RGB and depth); (b) raw data of the recorded video (depth only); (c) body tracking data of one run; or (d) body tracking data of multiple runs. Each of these sets has its specific (dis)advantages.

(a) The raw RGB and depth data require a lot of storage space (i); the 30-s video used in this paper was 1.6 GB. At the same time, privacy (ii) is not assured, since the subject can be identified from the video data. However, no erroneous data (iii) are stored, and body tracking can be executed again using future, improved body tracking methods.
(b) When storing only the raw depth data, the data volume (i) is still high. However, the privacy (ii) of the subject is protected somewhat better, since it is more difficult to identify a person from depth data. Additionally, body tracking can be repeated at a later time (iii), as in (a).
(c) Storing only the body tracking data of one run requires the least amount of storage space (i) of all options. Privacy (ii) is ensured, since no identifiable data are stored. However, differences between body tracking runs, processing modes, and computers might strongly influence the result (iii).
(d) Body tracking data of multiple runs, on the other hand, require little storage space (i) and ensure privacy (ii). In addition, the measurement error could be reduced by aggregating or filtering the results of multiple body tracking runs (iii) to achieve reasonable accuracy. However, it should be noted that aggregating and filtering body tracking runs is not trivial.
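As an illustration of option (d), one simple aggregate over multiple stored runs is the component-wise median of the joint positions per frame. This is our own sketch of one possible aggregation, with the caveat noted above that such aggregation is not trivially correct.

```python
import numpy as np

def aggregate_runs(runs: np.ndarray) -> np.ndarray:
    """Frame-wise median of joint positions over multiple body tracking runs.

    runs: (n_runs, n_frames, n_joints, 3) joint positions of several runs.
    Returns (n_frames, n_joints, 3). The component-wise median damps
    run-to-run differences, but it is only one possible aggregate and,
    e.g., is not invariant under rotation of the coordinate system.
    """
    return np.median(runs, axis=0)
```

When all runs are identical (as we observed for CPU and DirectML on one computer), the aggregate trivially equals any single run; for CUDA and TensorRT, it suppresses outlier runs.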
Saving a single body tracking run would normally suffice if the body tracking SDK yielded reproducible results. Since we have shown that this is not the case for the Azure Kinect Body Tracking SDK, we recommend saving the raw data or, when that is not possible, at least storing multiple body tracking runs to ensure good scientific practice.

Limitations and Recommendations for Future Work
Our results showed that the processing modes of the Azure Kinect Body Tracking SDK introduce a number of anomalies. However, it should be noted that our study has several limitations. First, to exclude the influence of movements, we used a mannequin instead of a human. Although the mannequin was very well recognized by the body tracking, it cannot be ruled out that body tracking of a real human differs. Second, for the same reason, we analyzed a static scene; dynamic human movement might aggravate or reduce the differences found. Third, we analyzed a single pose; other poses (e.g., with self-occlusion) might produce different results. Fourth, we analyzed a single recording with a single frame rate and a single field-of-view setting to isolate the effects of the body tracking settings as much as possible. In future work, the findings presented in this paper should be confirmed for the other settings provided by the Azure Kinect (e.g., 15 FPS, 5 FPS, WFOV, binned versus unbinned). Fifth, we used the latest version (1.1.2) of the Azure Kinect Body Tracking SDK; other versions might exhibit different behavior. Sixth, we analyzed the differences in joint positions between the processing modes; however, it remains unknown which processing mode is closest to the real joint positions. We recommend comparing the joint positions against a ground truth in future work, e.g., obtained using a Vicon system similar to the experiments by Albert et al. [10]. However, one should be aware of the interference between the Azure Kinect and the Vicon system when used simultaneously [14]. Seventh, we recorded the video used as input for the body tracking in a windowless dark room to exclude influences of external light. That environment is not suitable for studies with human subjects. As already shown by Romeo et al. [13], ambient light can have an influence on body tracking.
We recommend investigating the presented findings for different light conditions, preferably using illumination levels workable in studies with human subjects, and without infrared light.

Conclusions
This is, to the best of our knowledge, the first article that analyzes the differences in the Azure Kinect Body Tracking SDK across multiple runs, the four possible processing modes, and different computers. We found substantial differences in the body tracking results depending on the processing mode and computer used. The cause of these differences remains unclear because of the closed-source nature of the SDK. However, our results might have major consequences for all research performed using the Azure Kinect DK camera together with the Azure Kinect Body Tracking SDK, since differences found in analyses of the body tracking might be caused by the processing mode instead of an actual physical effect on the measured subject.
To partially counteract these consequences, or at least create awareness of the effects of the processing mode, we recommend the following for future studies that want to use the Azure Kinect DK (at least to evaluate static human poses):
• Be aware that running body tracking multiple times on the same recording might produce different results;
• Choose your processing mode wisely: CPU and DirectML seem to yield reproducible data (on the same computer), while CUDA and TensorRT do not;
• Report the processing mode in your publication;
• Do not start your analysis from the beginning of the body tracking, but skip a few frames (e.g., 60 frames) to let the joint positions converge to a steady state;
• Generate all body tracking results for your analyses on the same computer, since different computers result in different joint positions; and
• In case it is not possible to save the raw data of the recording (due to data volume constraints and/or privacy concerns), store multiple runs of body tracking data to reduce possible error effects.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Acknowledgments:
We thank Dennis Bussenius for uncovering the differences between body tracking runs, Nils Strodthoff for his support in answering our questions about deep learning, and Noah Rickermann for proofreading the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: