Motion capture via marker-less tracking reduces the preparation effort associated with marker-based systems such as the VICON system. Such systems can determine joint positions with an accuracy between 0.28 and 0.35 mm, depending on the density of cameras, quality of calibration, and recording conditions [1]. Among other domains, accurate motion capture systems are of special interest for gait recognition in the medical domain, where an accurate determination of joint positions enables the extraction of kinematic and kinetic gait parameters to determine patients' functional abilities, as required to detect increased fall risks [2] and progression in neurocognitive diseases [4]. However, since marker-based approaches are currently too time-consuming and cost-intensive for routine use [5], marker-less tracking-based gait analysis is a promising alternative.
For instance, Fudickar et al. [7] and Hellmers et al. [8] showed that light barriers and/or inertial measurement units achieve very good results in automating the timed "up & go" (TUG) test. In addition, Dubois et al. [9] indicated that depth cameras can determine the number of steps and the step length in the TUG test sufficiently accurately to distinguish fallers from non-fallers [10]. An automated sit-to-stand test can also be used to detect neurodegenerative diseases: Jung et al. [11] used a chair with embedded load cells as well as a LiDAR sensor within a semi-automated implementation of the Short Physical Performance Battery to observe subjects during the 5-time sit-to-stand test. Hellmers et al. [12] used an inertial measurement unit integrated into a belt to develop an automatic chair-rise-test detection and evaluation system, while Dentamaro et al. [13] classified dementia via an automatic video-based diagnosis system for sit-to-stand phase segmentation. Yang et al. [14] detected gait using a single RGB camera, although this has the disadvantage that joints can only be tracked in 2D space. Arizpe-Gomez et al. [15], on the other hand, used three Azure Kinect cameras for automatic gait-feature detection in people with and without Parkinson's disease; limitations here are that the data captured by the Kinect are relatively noisy and that the transitions between the cameras introduce inaccuracies.
A fundamental requirement for calculating and analyzing gait parameters with marker-less motion capture is the accurate detection of joints in still images.
Several implementations for marker-less pose and joint recognition in RGB images have been proposed, such as the prominent OpenPose [16] and the subsequent HRNet [17]. These implementations follow either a bottom-up or a top-down approach. In the bottom-up approach, keypoint positions are first detected on the complete image and then fitted to complete skeletal representations per person. In the top-down approach, individual persons are first recognized, and joint positions are then detected in the corresponding image segments. The bottom-up approach is much faster in detecting multiple persons within one image, while single persons are detected much more accurately via the top-down approach. Both approaches are commonly applied, as shown by the state-of-the-art joint-recognition systems OpenPose and HRNet.
In OpenPose, keypoints of the body, the lower and upper extremities, and the head are recognized via the bottom-up approach and a convolutional neural network (CNN). Its sensitivity to detect keypoints with an average precision (AP) of 0.642 was shown for the Common Objects in Context (COCO) keypoint dataset via the evaluation metrics of COCO [16]. The evaluation metrics of COCO combine 10 metrics: average precision (AP), average recall (AR), and their variants AP50 and AR50 (AP and AR at an OKS threshold of 0.50), AP75 and AR75 (at an OKS threshold of 0.75), APM and ARM (AP and AR for medium-sized objects), and APL and ARL (AP and AR for large-sized objects). To measure the similarity between ground-truth objects and predicted objects, an object keypoint similarity (OKS) is calculated as follows [18]:

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

Here, $d_i$ is the Euclidean distance between the ground-truth and the detected keypoint position, $v_i$ indicates whether the keypoint is visible (0 for no, 1 and 2 for yes), $s$ is the square root of the area of the person for whom the keypoint was recognized, and $k_i = 2\sigma_i$ is the per-keypoint constant that controls falloff, where $\sigma_i$ is 0.026, 0.025, 0.035, 0.079, 0.072, 0.062, 0.107, 0.087, and 0.089 for the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and feet, respectively. Consequently, an OKS of 1.0 indicates perfectly recognized keypoints, whereas an OKS of 0.0 indicates complete keypoint delocation [18].
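For illustration, a minimal NumPy sketch of this OKS computation (the constant $k_i = 2\sigma_i$ follows the COCO convention; function and variable names are ours):

```python
import numpy as np

# COCO per-keypoint sigmas: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
SIGMAS = np.array([0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072,
                   0.072, 0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089])

def oks(d, v, area, sigmas=SIGMAS):
    """d: Euclidean distances between ground-truth and detected keypoints,
    v: visibility flags (0 = not labeled, 1/2 = visible), area: person area (= s**2)."""
    k = 2 * sigmas                      # per-keypoint falloff constants
    e = d ** 2 / (2 * area * k ** 2)    # exponent with s**2 * k_i**2 in the denominator
    visible = v > 0                     # only labeled keypoints contribute
    return np.exp(-e[visible]).sum() / max(visible.sum(), 1)
```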
The corresponding optimal results reported for HRNet and OpenPose are summarized in Table 1.
In contrast to OpenPose, HRNet applies the top-down approach via a CNN on RGB images. Due to the top-down approach, HRNet requires image sections (of 288 × 384 px) that cover only a single person instead of entire images. When evaluated under the same conditions as OpenPose, HRNet shows significantly better results than OpenPose, with an AP of 0.77.
While OpenPose and HRNet focus on the detection of poses and keypoints in the 2D RGB image plane, extensions of these approaches have been proposed for 3D human pose estimation, that is, to additionally estimate the depth of keypoints. Approaches that consider depth information typically combine it with RGB images, as proposed by Véges et al. [19]. Herein, pixel coordinates of keypoints are detected via a 2D PoseNet (HRNet with a Mask R-CNN as bounding-box detector), and the corresponding depth per keypoint coordinate is then extracted from the DepthNet (which estimates the depth per pixel). Subsequently, the data are combined to determine the 3D position per keypoint and to calculate the 3D pose from the keypoints. Evaluated on a multi-person dataset with indoor and outdoor videos (MuPoTS-3D) and the Panoptic dataset, consisting of several RGB-D videos captured with a Kinect, the model achieved an absolute mean per-joint position error (A-MPJPE) of 255 mm and a relative mean per-joint position error (MPJPE) of 108 mm, with a detection rate of 93%. Thereby, keypoint detection relies on evaluating RGB data, while depth information is only considered for estimating the depth of the joints; this represents the state of the art.
However, limited research has been conducted on estimating 3D keypoints solely from depth images, without consideration of RGB information. Corresponding research was conducted by Ye et al. [20], who proposed the combination of body surface models and skeleton models. Herein, the depth image is initially filtered, irrelevant objects are removed, and the poses are refined by considering pre-recorded motion patterns. This resulted in a mean detection error distance for keypoint positions of 38 mm on the dataset of Ganapathi et al. [21]. A similar approach was applied by Shotton et al. [22] to estimate 3D locations of skeletal joints from a single Kinect depth image. They achieved a mean average precision of 0.984 on their real test set with ground-truth labels of head, shoulders, elbows, and hands. In addition, Wei et al. [23] used a single Kinect depth camera to detect and track 3D poses simultaneously, showing an error of about 5.0 cm per joint per frame compared to the VICON system. Compared to [21], the system has a higher accuracy per joint (approximately between 0.85 and 1 per joint), with a deviation of up to 0.1 m from the ground truth assumed to be "correct".
Existing approaches for human pose and joint detection require the availability of color images or RGB-D videos (i.e., color images with additional depth information), or depend on individualized person data such as body surface models or skeleton models. However, with their robustness to light variation [22], their color and texture invariance [22], and the easier elimination of the background [25], depth images have some advantages over RGB images. Access to the real depth value [24] also makes it possible to recognize the position of joints in 3D space, which provides further information for gait analyses and similar applications. Given the advances in pose estimation and depth cameras, the applicability of HRNet or OpenPose to depth images for marker-less tracking holds potential benefits and should be further investigated.
Correspondingly, the article at hand introduces HRDepthNet to detect keypoints of persons in depth images instead of RGB data. The model is based on the HRNet CNN model, which is retrained on annotated depth images. To evaluate the sensitivity of using an HRNet-like model for keypoint detection via depth images instead of RGB images, the algorithm's sensitivity is evaluated using COCO's evaluation metrics and compared to the sensitivity of HRNet applied to the RGB images of the same dataset. To quantify the position error, the spatial error in cm is analyzed per keypoint.
2. Materials and Methods
To evaluate the suitability of pure depth images for joint recognition, we propose corresponding pre-processing (Section 2.1), a convolutional neural network based on HRNet (Section 2.2), and post-processing steps (Section 2.3). The validity of detecting and locating keypoints was evaluated in a study (see Section 2.4), and the corresponding dataset is described in Section 2.5.
2.1. Data Preprocessing
To enhance the sensitivity of the HRNet-based model, initial pre-processing is necessary to convert the available depth information into a format suitable for model training and use. The model expects only the image segment covering the person at a certain pixel size, not the entire depth image. Therefore, background subtraction (via a Python binding to the Point Cloud Library [26]), gray-value normalization, image cropping, and scaling are conducted.
In segmentation, the area a person occupies is detected. The point cloud of the depth images is used, as it represents a more robust basis for background elimination than the depth images themselves. As part of the segmentation procedure, smoothing and background elimination were applied:
For smoothing, Moving Least Squares [27] with a search radius of 0.03 was applied. For background elimination, cropping of the surroundings was favored over general background subtraction, as we found it more robust to variations, benefiting from the stable sensor setup and a clear path of movements. For example, removal of the floor was implemented via the planar segmentation algorithm (as included in the PCL) with a threshold distance of 0.02 m, which applies the random sample consensus (RANSAC) algorithm, known to be robust regarding outliers ([28], p. 117).
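For illustration, a minimal sketch of these two steps using the python-pcl binding (the specific binding and the input file are assumptions; the paper only states that a Python binding to the PCL was used):

```python
import pcl

cloud = pcl.load("frame.pcd")  # hypothetical file holding one depth frame as a point cloud

# Smoothing via Moving Least Squares with a search radius of 0.03
mls = cloud.make_moving_least_squares()
mls.set_search_radius(0.03)
smoothed = mls.process()

# Floor removal: fit a plane via RANSAC with a threshold distance of 0.02 m
seg = smoothed.make_segmenter()
seg.set_model_type(pcl.SACMODEL_PLANE)
seg.set_method_type(pcl.SAC_RANSAC)
seg.set_distance_threshold(0.02)
floor_indices, _coefficients = seg.segment()
no_floor = smoothed.extract(floor_indices, negative=True)  # keep all non-floor points
```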
Then, the depth axis is cropped based on the center of mass along the z-axis (the mean z-value over all remaining points, representing the person): points that exceed a threshold distance of 50 cm from this center of mass are subtracted as background.
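Continuing the sketch above, this depth cropping reduces to a simple mask over the remaining points:

```python
import numpy as np

pts = no_floor.to_array()                          # N x 3 array of remaining points
center_z = pts[:, 2].mean()                        # center of mass along the depth axis
person = pts[np.abs(pts[:, 2] - center_z) < 0.5]   # keep points within 50 cm
```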
For clarity, in Figure 1 the elimination of points in the x-direction is marked in green, the elimination of the floor in light blue, and the elimination in the z-direction in purple, leaving the person marked in blue after the background elimination.
With HRNet requiring RGB images of specific proportions as input, these depth images are finally converted. From the filtered point cloud, a depth image is generated as follows: For the conversion of a point $P = (x, y, z)^T$ in 3D space to a point $p = (u, v)^T$ on an image plane, one needs the rotation matrix $R$, the translation vector $t$, and the intrinsic matrix $K$. Since in this case there is no rotation and translation ($R = I$, $t = 0$), the matrix $K$, consisting of the focal lengths in the x and y direction ($f_x$, $f_y$) and the center of projection in the x and y direction ($c_x$, $c_y$), is constructed as follows:

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

The depth image $D$ is then calculated with

$$u = f_x \frac{x}{z} + c_x, \quad v = f_y \frac{y}{z} + c_y, \quad D(u, v) = -z,$$

where $z$ is multiplied by −1 for easier interpretation of the depth information, since point-cloud depth values are given in the negative direction.
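As a sketch, this projection can be implemented with NumPy as follows (the 640 × 480 px resolution matches the recordings described in Section 2.4; the intrinsics would be read from the camera):

```python
import numpy as np

def pointcloud_to_depth(points, fx, fy, cx, cy, height=480, width=640):
    """Project filtered 3D points onto the image plane (pinhole model, R = I, t = 0)."""
    depth = np.zeros((height, width), dtype=np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = np.round(fx * x / z + cx).astype(int)      # column index
    v = np.round(fy * y / z + cy).astype(int)      # row index
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth[v[inside], u[inside]] = -z[inside]       # negate: point-cloud z-values are negative
    return depth
```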
The resulting array encodes the depth per pixel, with the background represented as 0 and the remaining depth measures as positive values, which are encoded as gray values normalized to the range [0, 255]. This normalized depth value is then encoded in a gray-scale image (see Section 2.4).
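A minimal sketch of this gray-value normalization (assuming `depth` from the projection above; mapping the person's depth values to [1, 255] so that 0 remains reserved for the background is our assumption):

```python
import numpy as np

gray = np.zeros(depth.shape, dtype=np.uint8)
mask = depth > 0                                   # background pixels stay 0
d = depth[mask]
gray[mask] = np.round((d - d.min()) / (d.max() - d.min()) * 254 + 1).astype(np.uint8)
```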
The resulting gray-scale image is further cropped and scaled according to the HRNet input requirements. From the generated gray-scale image, a section with the appropriate aspect ratio of 3:4 that covers the person and all pixels containing depth information >0 is extracted. This image section is then scaled to a size of 288 × 384 px, the required input image size of the HRNet model. The scaling factor and the location of the cropping window are stored for location reconstruction in post-processing.
2.2. Machine Learning
For keypoint detection in depth images, transfer learning via HRNet forms the basic model for training with depth images. HRNet was preferred over OpenPose as the basic model for the following reasons: HRNet is more accurate in joint detection than OpenPose, with a 15.2-point higher AP. Furthermore, HRNet uses the top-down approach, which is well suited for deriving highly precise joint positions (considering the corresponding raw data instead of interpolating them). Given this increased spatial position accuracy and the fact that in medical applications typically only a single person is visible per image, the top-down approach is more suitable for the medical domain. For retraining, the PyTorch library version 1.4.0 of HRNet is applied with the generated depth images, heatmaps, and visible-lists (see Section 2.5). The HRNet model pose_resnet_152 with an image size of 384 × 288 is used for retraining, as the models with ResNet as backbone are more suitable for integration into our own system, and as this particular model showed the best results of all HRNet models with a ResNet backbone on the COCO val2017 dataset [17]. The images are encoded as tensors, scaled to the interval [0, 1], and normalized. For normalization, PyTorch was used with mean values (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225), the default values for the ImageNet dataset, which is the dataset used for ResNet, the foundation of HRNet.
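This corresponds to the standard torchvision pipeline; a sketch, assuming the usual transform composition:

```python
from torchvision import transforms

# ToTensor encodes the image as a tensor and scales values to [0, 1];
# Normalize applies the ImageNet means and standard deviations per channel.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```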
With the resulting normalized and scaled tensors, 20 epochs were trained, and the most accurate model over the 20 epochs was finally selected. Per epoch, the batch composition was randomized. The adaptive moment estimation (Adam) optimizer was used [30]. For loss calculation, JointsMSELoss was chosen, which calculates the average of the individual mean squared errors (MSELoss, suitable for regression challenges such as the given one) of all visible keypoints between the reference and detected annotation:

$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} v_k \, \mathrm{MSE}\!\left(\hat{H}_k, H_k\right)$$

where $K$ is the number of keypoints, $v_k$ the visibility flag, and $\hat{H}_k$ and $H_k$ the predicted and reference heatmaps of keypoint $k$.
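A simplified PyTorch sketch of this visibility-masked heatmap loss (shapes and names are assumptions; HRNet's reference implementation differs in minor details, e.g., an additional factor of 0.5):

```python
import torch
import torch.nn as nn

class JointsMSELoss(nn.Module):
    """Average MSE over the heatmaps of the visible keypoints."""
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, pred, target, visible):
        # pred, target: (batch, K, h, w) heatmaps; visible: (batch, K) flags in {0, 1}
        batch, num_joints = pred.shape[:2]
        loss = 0.0
        for j in range(num_joints):
            w = visible[:, j].view(batch, 1)          # mask out invisible keypoints
            p = pred[:, j].reshape(batch, -1) * w
            t = target[:, j].reshape(batch, -1) * w
            loss = loss + self.mse(p, t)
        return loss / num_joints
```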
For evaluation, ten models are trained.
2.3. Post-Processing: Keypoint Visibility
The resulting ML model's heatmap encodes, per coordinate, the probability of holding a keypoint. Per keypoint, the location with the highest probability is chosen. As the heatmap also encodes invisible keypoints (e.g., keypoints hidden by other body parts), separate filtering is applied to remove invisible keypoints from the evaluation. Only keypoints that surpass a probability threshold of 0.83 are accepted as valid; all others are rejected as invisible.
The corresponding threshold was determined as the optimal combination of true positive rate ($tpr$) and false positive rate ($fpr$): per threshold value $t$, the distance

$$d(t) = \sqrt{(1 - tpr(t))^2 + fpr(t)^2}$$

to the point $(fpr, tpr) = (0, 1)$ was calculated (see Figure 2). The used threshold $t$ was determined as the one holding minimal distance $d(t)$.
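With scikit-learn, this threshold selection reduces to a few lines (a sketch; `y_true` holds the ground-truth visibility flags and `y_score` the predicted keypoint probabilities):

```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)
d = np.sqrt((1 - tpr) ** 2 + fpr ** 2)   # distance to the ideal point (fpr, tpr) = (0, 1)
best_threshold = thresholds[np.argmin(d)]
```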
In addition, the location of the keypoints on the original images is reconstructed from the cropped and scaled depth image via the saved scaling factor and location of the clipping window.
2.4. Study Design
For training and evaluation of the proposed HRDepthNet, a corresponding training and evaluation dataset was generated in a study. The study considers humans conducting the TUG test, an established geriatric functional assessment that includes activities such as walking, turning, and sitting, and the corresponding transitions in between. This assessment covers multiple critical body postures prone to increased classification errors [31]. The recording took place in the Unsupervised Screening System (USS) [7], shown in Figure 3. The USS includes a chair (c), where subjects start and finish each test run in a seated position. After crossing a yellow line (e) 3 m in front of the chair, the subjects were expected to turn around and head back to the chair, following the TUG test protocol.
The test execution was recorded using an Intel RealSense™ D435 RGB-D camera (f), which was placed at knee height 4.14 m in front of the chair, facing towards the participants and the chair (see Figure 4b). Depth images are recorded via an active infrared stereo sensor. Both RGB and depth images were captured at a resolution of 640 × 480 px at 30 fps, and recordings were stored in .bag files. The sensor orientation (see Figure 3) sets the x-axis horizontal and the y-axis vertical, so that the z-axis covers the depth (as the distance from the sensor). Depending on calibration settings and ambient light conditions, the reported range of the depth sensor spans from 20 to 1000 cm. The RGB-D camera was connected to a PC running a Python script for starting and stopping recordings and for data storage.
Participants who met the following inclusion criteria were considered and invited to participate: adults who have no neurological or psychiatric disorders, can walk 6 min without a break, and have full or corrected vision. Participants were informed about the study procedure and signed informed consent (in accordance with the Declaration of Helsinki [32]). Their sex, age, height (cm), weight (kg), leg length (cm), and pre-existing conditions related to mobility were collected.
Participants conducted the TUG test repetitively at varying walking paces (normal, fast, and slow), with at least five repetitions per pace, and were asked to include breaks as needed. Pace variations were subject to personal perception. The repetition of these test blocks was extended depending on participants' willingness and capacity and ended after at most 2 h. Per TUG execution, a new recording was initialized.
As no multi-person view is considered yet, images intentionally cover only single persons. By covering the most common activities and including a varying distance of 1 to 4.2 m from the camera, facing the persons' front, the corresponding recordings form a well-suited training and evaluation dataset.
The study was approved by the Commission for Research Impact Assessment and Ethics at the Carl von Ossietzky University of Oldenburg (Drs.EK/2019/094) and was conducted in accordance with the Declaration of Helsinki [32].
2.5. Preparation of the Dataset
As no suitable dataset was yet available, we created a dataset as follows. Since the RGB-D videos were captured at 30 Hz, subsequent video frames are remarkably similar regarding keypoint positions. However, HRDepthNet is not intended to track keypoints but to detect keypoint positions in still images. For this reason, a high frame rate is not necessary, and consecutive video frames have only limited benefit for model training and evaluation. Thus, a frequency of 5 Hz was found sufficient, and only every sixth video frame is considered in the dataset. In addition, the initial frames per recording were discarded, since the camera's exposure control takes some milliseconds to adjust to the lighting situation, making correct manual recognition of joint positions impossible for these frames. From the considered frames of the RGB-D videos, depth images were extracted as input for keypoint detection, while the corresponding RGB images were used for annotation and HRNet evaluation.
From the depth videos, individual images were extracted, aligned with the RGB images (via the align and process commands), and stored as point clouds via the pyrealsense2 library. In addition to the pre-processing of the depth images (see Section 2.1), the RGB image size is scaled to that of the depth images, and the RGB images are mirrored along the y-axis. By correcting lens distortions in both image types, the validity of keypoint coordinates for the depth images is ensured.
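A minimal sketch of this extraction with pyrealsense2 (the file name is hypothetical; the align and pointcloud helpers are used as documented by the library):

```python
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_device_from_file("tug_run.bag")   # hypothetical recording file
pipeline.start(config)

align = rs.align(rs.stream.color)               # align depth frames to the RGB stream
pc = rs.pointcloud()

frames = pipeline.wait_for_frames()
aligned = align.process(frames)
depth_frame = aligned.get_depth_frame()
color_frame = aligned.get_color_frame()
points = pc.calculate(depth_frame)              # point cloud for pre-processing (Section 2.1)
pipeline.stop()
```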
Images are annotated regarding the actual keypoint positions and the person overlapping polygons (as required input for the COCO evaluation metric). Keypoint positions were manually annotated in the two-dimensional coordinate system via the VGG Image Annotator (VIA) software [33
] and visual inspection based on RGB images and guidelines for keypoint positioning as described in Section 1
. Thereby, using RGB images instead of depth images for annotation ensures greater positioning accuracy. This is because keypoint recognition is more challenging with depth images for human annotators—especially at greater distances from the camera—due to a reduced signal-to-noise ratio. To increase the efficiency of the annotation process, keypoint positions in RGB images were pre-labeled using the HRNet model, transferred into a VIA conform file-format, and then were adjusted manually via VIA. Afterwards, the corrected 2D keypoint positions were converted into HRNet conform heatmaps. These heatmaps are generated per visible keypoint, under HRNet. Per keypoint, a 2D Gaussian Function with a standard deviation of 1 pixel is run from the annotated keypoint coordinate. A visible-list is generated per image. In the visible-list, all keypoints, which are annotated are encoded binary as 1.
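A sketch of this heatmap generation (the function name is ours; sigma = 1 px as stated above):

```python
import numpy as np

def keypoint_heatmap(kx, ky, height, width, sigma=1.0):
    """2D Gaussian centered at the annotated keypoint coordinate (kx, ky)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
```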
The COCO metric also requires, for evaluating the keypoint similarity, a specification of the area occupied by humans per image (typically estimated via the polygon of their contour). Thus, this contour was also annotated manually as a polygon in the RGB images via the VIA software and visual inspection, and the polygons and the outlined areas are stored in a COCO-conformant JSON file.
Images were grouped into approximately 70% training set, 15% validation set, and 15% test set. For the test set, images of one randomly selected subject were used exclusively (leave one out).
To achieve comparability among results, the evaluation of the proposed approach considers the evaluation metrics of COCO [34]. Evaluating the keypoint accuracy by mimicking the evaluation metrics used for object detection, these metrics include average precision (AP) and average recall (AR) (both including the variants APM/ARM and APL/ARL for medium and large person representations), the corresponding variants at OKS thresholds of 0.50 and 0.75, and the object keypoint similarity (OKS).
To investigate the generalizability of the approach, ten models have been trained and evaluated via these metrics. For comparability with the original HRNet (pose_hrnet_w48 with image size 384 × 288), which showed the best results of all HRNet models on the COCO val2017 dataset [17], the original HRNet was evaluated on the corresponding dataset's RGB images. As the COCO evaluation of the OKS only considers the spatial correctness of the visible keypoints in the ground truth, it does not consider the correct detection of non-visible keypoints. To evaluate the suitability of the proposed threshold-based filtering of non-visible keypoints, we first generated receiver operating characteristic (ROC) curves to identify the rates of TP and FP for each of the ten generated ML models and to determine the best threshold for each model, which was used for these analyses. The ROC curve was also plotted for HRNet on the dataset's RGB images.
To determine the most accurate model, we evaluated the sensitivity and specificity by which the keypoints are correctly detected (as visible or invisible) via the F1-score. The F1-score is calculated from precision ($p$) and recall ($r$) as follows:

$$F_1 = \frac{2 \cdot p \cdot r}{p + r}$$

In addition to the mean F1-score over all keypoints, as required for model selection, we also calculated keypoint-specific F1-scores.
Further evaluation was conducted only on the best among the ten models. To evaluate the validity of detecting only visible keypoints, a corresponding confusion matrix was calculated.
For the consideration of individual keypoints, true-positive and true-negative rates and associated parameters are evaluated on a per-joint level. In addition, to analyze the accuracy of the keypoint positions, deviations from the reference labels were mapped. Only keypoints considered visible by the model are included. Furthermore, joints with invalid depth values, as indicated by a sensor reading of zero, were excluded to avoid these consequentially erroneous positions. Per true-positive (TP) keypoint location, median spatial errors are calculated per axis, and corresponding boxplots have been generated.