1. Introduction
3D human kinematics involves measuring joint angles between body segments, which is essential in the day-to-day practice of experts. Skilled physicians can judge, just by looking at a specific motion of a patient, whether it is healthy or abnormal. Skilled sports coaches can help their coachees achieve better performance and lower injury risk by evaluating their movements through observation. However, these visual examinations of human kinematics remain inherently subjective, leading to variation between and within human observers. Modern systems and sensors could reduce this variation through more objective observations, yet they make the measurement of human motion more costly and time-consuming. A system with the availability and ease of use of visual estimation would help physicians and coaches make objective observations more often, ultimately raising their subjects' quality of life. Digital cameras have made the estimation of human kinematics more accessible, but at the cost of reduced accuracy. Compared to traditional optical motion capture (OMC) systems, markerless motion capture (MMC) systems do not require specialized cameras and markers attached to the subject being monitored; instead, they use normal RGB cameras in combination with image-based automatic pose estimation algorithms. Rather than detecting physical markers, pose estimation algorithms detect the centers of major joints of the human body, such as the shoulders, hips, and knees. These detected centers are usually referred to as key points.
Multiple commonly used markerless motion capture methods rely on 2D pose estimation methods [1,2,3,4]. These methods often still need more than one camera to generate a good estimation of the key points in 3D, which requires additional cameras to be set up. On the other hand, an increasing number of methods use single-view (monocular) 3D pose estimation methods [5,6,7], which estimate a 3D pose from a single camera. This makes MMC systems faster and more accessible, as they do not require the additional time and expertise to place markers on the subject or calibrate multiple cameras. However, MMC systems assume that current pose estimation algorithms can accurately replace marker-based motion capture systems for, e.g., biomechanical applications [8,9,10].
Commonly used pose estimation algorithms introduce mistakes in kinematic estimation pipelines due to systematic errors in their predictions. To detect key points, most pose estimation methods are trained on a combination of images of a person and ground truth annotations which map pixels in the image to the corresponding joint center. These ground truth annotations are often created manually by non-expert annotators, and the resulting personal biases in the training data lead to inaccuracies in the pose estimations [9]. For example, Needham et al. [11] compared three often-used pose estimation algorithms, OpenPose [12], DeepLabCut [13] and AlphaPose [14], against an OMC system and showed errors in the estimation of joint centers of 30 mm to 50 mm, with variations of 12 mm to 25 mm in marker placement. Cronin [9] provides an overview of additional problems with 2D pose estimation for kinematic analysis. These differences are most likely due to a mismatch between the applications that pose estimation algorithms are typically developed for and, e.g., the biomedical domain, which has different accuracy requirements [8]. Wade et al. [10] proposed to solve this problem by re-annotating existing large-scale datasets; this, however, is a time-consuming process, considering that, for example, the COCO keypoint dataset
https://cocodataset.org/#keypoints-2020 (accessed on 2 December 2022) consists of more than 250,000 labeled poses. For the evaluation of pose estimation algorithms, these labeling errors simply appear as a baseline error shared by all algorithms trained on the same data. However, for applications in the biomedical domain and in situations such as kinematic estimation, where the pose is just an intermediate step, errors can propagate to subsequent tasks.
Errors in the estimated pose cannot be corrected by most kinematic estimation pipelines because they all roughly follow a ‘multi-step’ approach, which consists of:
1. Detection of the 3D pose (in one or more steps);
2. (Optional) modeling of the pose with a (musculo)skeletal model;
3. Calculation of kinematics and/or downstream tasks such as gait parameters or dynamics.
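The decoupling of these steps can be sketched as three independent functions chained together. This is purely illustrative; all names and return values below are hypothetical placeholders, not a real API, and step 2 would in practice involve scaling a real (musculo)skeletal model:

```python
def detect_3d_pose(video_frames):
    # Step 1: monocular 3D pose estimation, one pose (keypoint dict) per frame.
    return [{"hip": (0.0, 0.9, 0.0), "knee": (0.0, 0.5, 0.0)} for _ in video_frames]

def fit_skeletal_model(pose_sequence):
    # Step 2 (optional): scale a generic skeletal model to the subject's body.
    return {"scale": 1.0, "poses": pose_sequence}

def compute_kinematics(model):
    # Step 3: inverse kinematics, producing joint angles per frame.
    return [{"knee_flexion_deg": 0.0} for _ in model["poses"]]

frames = range(5)
angles = compute_kinematics(fit_skeletal_model(detect_3d_pose(frames)))
```

Because each step only consumes the output of the previous one, an error made in step 1 is invisible to steps 2 and 3 and simply propagates forward.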
For example, Kidzinski et al. [1] used OpenPose to first predict key points from a video and then trained a convolutional neural network (CNN) to predict the walking parameters of patients with cerebral palsy. Liao et al. [6] first estimate the 2D pose with OpenPose, then create a 3D pose using data-driven matching, and finally estimate 3D gait parameters. Noteboom et al. [7] first used VideoPose3D [15] to estimate a 3D pose, followed by modeling in OpenSim [16] for the estimation of dynamics from a single camera. Pagnon et al. [2,3] developed the handy Pose2Sim tool, which first combines 2D OpenPose estimations from multiple cameras into a 3D pose and then models it in OpenSim. Because the pose estimation step is de-coupled from kinematic estimation, errors in pose estimation propagate through to the estimation of kinematics. Uchida and Seth [
17] showed that 20 mm of marker uncertainty leads to a variation of 15.9° in peak ankle plantarflexion angle and impacts downstream tasks such as joint moment estimation. Della Croce et al. [18] showed precision variations of 13 mm to 25 mm, which lead to differences in estimated joint angles of up to 10°. Fonseca et al. [19] showed that marker misplacement of up to 10 mm can lead to errors of 7°, depending on the marker. With estimation errors of 30 mm to 50 mm in keypoint estimation [11], it is to be expected that these errors will substantially influence kinematic estimation from markerless motion capture. Low-pass filters [2,20] or bi-directional Kalman filters [20] have been applied to compensate for noisy keypoint estimations, but these cannot correct for faults in keypoint detection; subsequent modeling and kinematic calculation steps can only partially compensate for such inaccuracies. This ‘multi-step’ approach is probably inspired by the steps of a traditional OMC method: in traditional OMC systems, the pose detection step is performed by a separate system and is thus isolated from the other steps. In camera-based kinematic estimation pipelines, however, the de-coupling of individual steps is no longer necessary.
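To illustrate why filtering alone cannot fix faulty detections, consider a minimal moving-average low-pass filter (a generic sketch, not the implementation used in the cited works): it removes frame-to-frame jitter but leaves a systematic offset, such as a consistently mislabeled joint center, untouched.

```python
def moving_average(signal, window=5):
    """Simple low-pass filter: centered moving average with edge clamping."""
    n = len(signal)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

# Jitter around a true marker position of 100 mm is averaged out...
noisy = [100 + d for d in (1, -1, 2, -2, 0, 1, -1)]
smoothed = moving_average(noisy)
# ...but a consistent 40 mm offset (a systematic detection error) survives:
biased = [v + 40 for v in noisy]
smoothed_biased = moving_average(biased)
```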
Deep neural networks have often demonstrated their ability to outperform multi-step systems by implicitly learning the individual steps through end-to-end training between an input and the desired output [21,22]. The main strength of deep neural networks lies in their ability to break down a highly complex task, in this case the estimation of kinematics from videos, into a sequence of simpler tasks, without the need for intermediate ‘hand-crafted’ representations [23,24]. Because neural networks are fully differentiable, an error in estimation during training can influence all stages of the network and adjust them accordingly [24]. This allows deep neural networks to directly estimate kinematics.
In this work, we challenge the notion of the classical multi-step approach of pose estimation, fitting of a musculoskeletal model and kinematic estimation. To this end, we propose a novel end-to-end method that allows for direct estimation of human kinematics, which is directly optimized for kinematic estimation while treating pose estimation only as an auxiliary task to constrain the estimations of the network.
Figure 1 shows a general overview of our method.
Contributions
To the best of our knowledge, we are the first to present an end-to-end trainable network that directly generates joint angles, joint positions, scale factors and marker positions of a biomechanical model from a monocular video. We propose a method that directly regresses from a video to joint angles and scale factors using deep neural networks. We investigate the influence of various temporal smoothing methods to increase the accuracy of our algorithm. We introduce a novel type of network layer that calculates the 3D pose from the estimated kinematics during training, so that the network can be trained simultaneously on pose and kinematic labels.
3. Experiments
3.1. Experiment 1: Direct vs. Multi-Step Estimation
To evaluate the accuracy of our direct 3D kinematic estimation approach (D3KE) for joint angle estimation, we compare its performance against multiple versions of the multi-step approach.
3.1.1. Experiment 1-A: 3D Pose Based Kinematic Estimation
We first compare our direct estimation of kinematics against a 3D pose estimation multi-step baseline. To create a fair comparison between direct and multi-step estimation, we implement a custom multi-step approach (CMS) that is trained on the same data as our direct approach. For the CMS, we combine a 3D human pose estimation method with subsequent musculoskeletal modeling in OpenSim. We modify the metric-scale heatmaps [25] of the convolutional network to predict marker positions and SMPL keypoint positions in metric scale. As in D3KE, we use a sequence network to refine marker positions at the target frame. More specifically, the convolutional network initially infers marker positions under a calibration pose (T-pose), and OpenSim uses the predicted marker data for body scaling, where the general musculoskeletal model is scaled to the participant’s body size. Refined marker positions are then used to run inverse kinematics with the scaled musculoskeletal model to obtain joint angles. The main difference between the CMS approach and D3KE is that CMS uses multiple steps to estimate the kinematics and is only supervised on the marker/pose estimation task, while D3KE is directly trained on the kinematic estimation task; this way, we can compare direct vs. multi-step estimation of kinematics. We use multiple metrics to compare D3KE and the CMS. The mean per bony landmark position error (MPBLPE) is used to evaluate bony landmark positions. Bony landmarks are markers placed where bones are close to the surface, such as at the elbow. This metric is inspired by the mean per joint position error (MPJPE), which is often used in 3D pose estimation. MPBLPE first aligns estimations and ground truth at the root position and then calculates the average Euclidean distance. We directly evaluate the body scale factors by the root mean square error (RMSE) on the scale factors predicted by the network. However, to present the results in a more intuitive format, we choose the axis along the longest dimension of each body scale, convert the scale of that axis into millimeters, and calculate the mean absolute error (MAE).
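The MPBLPE metric, root alignment followed by an average Euclidean distance, can be sketched with the standard library alone. The landmark layout and root index below are hypothetical and serve only to show that a shared translation between prediction and ground truth is removed by the alignment step:

```python
import math

def mpblpe(pred, gt, root=0):
    """Mean per bony landmark position error: align both skeletons at the
    root landmark, then average the Euclidean distances over all landmarks
    and frames (a sketch of the metric, not the authors' exact code)."""
    total, count = 0.0, 0
    for p_frame, g_frame in zip(pred, gt):
        pr, gr = p_frame[root], g_frame[root]  # root positions to subtract
        for p, g in zip(p_frame, g_frame):
            pa = [pi - ri for pi, ri in zip(p, pr)]
            ga = [gi - ri for gi, ri in zip(g, gr)]
            total += math.dist(pa, ga)
            count += 1
    return total / count

# One frame, two landmarks; the prediction is offset by a constant 10 mm
# translation, which root alignment removes entirely:
gt = [[(0, 0, 0), (100, 0, 0)]]
pred_shifted = [[(10, 0, 0), (110, 0, 0)]]
# A genuinely misplaced landmark, however, still contributes error:
pred_wrong = [[(0, 0, 0), (103, 4, 0)]]
```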
3.1.2. Experiment 1-B: 2D-Pose Based Kinematic Estimation
In the previous experiment, we evaluated the multi-step baseline with a 3D body pose estimation method. However, the use of fully trained 2D pose estimation algorithms is common in kinematic estimation works [2,3,20]. Therefore, we conduct experiments to compare our method with these 2D-based kinematic estimation methods. In contrast to our CMS method, which estimates the 3D pose from a single camera, 2D pose estimation methods require at least two calibrated cameras for the estimation of 3D key points. The use of an additional camera to generate the pose could be an advantage that the CMS method does not have. We chose a naive implementation of the OpenPose algorithm [12,58], which has been used extensively in related work [2,3,20]. Additionally, we test the MediaPipe implementation of the BlazePose algorithm [59] as a more modern 2D algorithm. MediaPipe is easy to use since it is available as a Python library; in contrast to OpenPose, it runs faster, allows for additional smoothing of its predictions, provides more key points, and is trained on different keypoint labels.
For OpenPose and MediaPipe, we project the key points to 3D using the BMLMovi camera parameters
https://github.com/saeed1262/MoVi-Toolbox (accessed on 2 August 2022). For OpenPose, we connected missing points (due to self-occlusion) using linear interpolation. For MediaPipe, we chose the highest model complexity (2) and enabled landmark smoothing for continuous frames of a video. For modeling and inverse kinematics, we follow the same steps as for the CMS method, only redefining the positions of the markers on the OpenSim model to fit the provided key points.
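The gap-filling step for missing OpenPose detections can be sketched as one-dimensional linear interpolation applied per coordinate. This is an illustrative implementation, not the exact code used; the handling of gaps at the start or end of a sequence (holding the nearest detection) is our assumption:

```python
def interpolate_missing(xs):
    """Fill None entries (e.g., self-occluded detections) by linear
    interpolation between the nearest detected neighbours (1-D; apply
    separately to each keypoint coordinate)."""
    xs = list(xs)
    n = len(xs)
    i = 0
    while i < n:
        if xs[i] is None:
            j = i
            while j < n and xs[j] is None:
                j += 1  # find the end of the gap
            left = xs[i - 1] if i > 0 else None
            right = xs[j] if j < n else None
            for k in range(i, j):
                if left is None:
                    xs[k] = right        # leading gap: hold first detection
                elif right is None:
                    xs[k] = left         # trailing gap: hold last detection
                else:
                    t = (k - i + 1) / (j - i + 1)
                    xs[k] = left + t * (right - left)
            i = j
        else:
            i += 1
    return xs

filled = interpolate_missing([0.0, None, None, 3.0])
```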
We compare against an average across both camera views for CMS and D3KE, as MediaPipe and OpenPose need at least two cameras to work. We evaluate performance based on the mean absolute error (MAE), the standard deviation of the errors, and the smoothness of the predictions, expressed as the mean velocity of the angle, MV (°/s). The mean velocity is calculated from the derivative of the landmark position and joint angle data with respect to time:

MV = (1/(n − 1)) · Σ_{t=1}^{n−1} |x_{t+1} − x_t| / Δt,

with x_t an individual marker position (or joint angle) at time t, Δt the amount of time between time steps, and n the total number of timesteps.
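The mean-velocity smoothness metric described above amounts to the mean absolute first difference per unit time; a short stdlib sketch, with illustrative angle traces sampled at an assumed 30 Hz:

```python
def mean_velocity(series, dt):
    """Mean absolute first difference per unit time:
    MV = (1/(n-1)) * sum_t |x[t+1] - x[t]| / dt."""
    n = len(series)
    return sum(abs(series[t + 1] - series[t]) for t in range(n - 1)) / ((n - 1) * dt)

# A jittery angle trace has a higher MV than a smooth ramp over the same range:
smooth = [0, 1, 2, 3, 4]   # degrees, sampled at dt = 1/30 s
jitter = [0, 3, 1, 4, 2]
mv_smooth = mean_velocity(smooth, 1 / 30)
mv_jitter = mean_velocity(jitter, 1 / 30)
```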
3.2. Experiment 2: Sequential Network Variants
Since we have multiple options for the sequential networks, we evaluate three to determine whether the additional modeling of temporal dependencies in the data improves the accuracy of our method or not. For subsequent smoothing and reduction of self-occlusion artifacts of the estimations, we test three different networks including LSTM [
34], temporal convolutional networks (TCNs) [
15], and a lifting Transformer [
33] as the sequential network. As smoothing is known to improve the accuracy of multi-step approaches [
20], we also evaluate combinations of our CMS model with these sequential networks.
For the LSTM, we implement a bidirectional architecture with a hidden size of 128, three recurrent layers and a dropout probability of 0.1. For the TCNs, we follow [15] in using a receptive field of 243 frames and letting the momentum of batch normalization decay from 0.1 to 0.001. For the lifting Transformer, we use a hidden size of 256 and 8 parallel attention heads in the self-attention layer, and a channel size of 512 in the convolutional layer. Each sequential network is trained with a sequence length of 243 frames and a batch size of 128 over 50 epochs with the Adam optimizer [60]. The learning rate decays exponentially over the course of training.
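An exponential learning-rate decay of this kind can be sketched as a geometric interpolation between a start and an end rate. The start/end values below are illustrative only; the exact rates used in training are not reproduced here:

```python
def exp_decay_lr(step, total_steps, lr_start, lr_end):
    """Exponentially interpolate the learning rate from lr_start to lr_end:
    at step 0 the rate is lr_start, at total_steps it is lr_end, and the
    ratio between consecutive steps is constant (geometric decay)."""
    frac = step / total_steps
    return lr_start * (lr_end / lr_start) ** frac

# Hypothetical schedule over 50 epochs, decaying from 1e-3 to 1e-5:
schedule = [exp_decay_lr(e, 50, 1e-3, 1e-5) for e in range(51)]
```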
We use the same metrics for the comparison of the individual network variants as for the comparison of D3KE and CMS. In addition, to investigate the smoothness of the predicted sequences, we estimate the mean velocity (MV) on bony landmark positions and joint angles.
3.3. Experiment 3: Processing Speed
One important property of our proposed method for clinical applications is its processing speed. As applications for camera-based kinematic estimation should form an alternative to visual examinations in the future, they should ideally run fast enough to estimate kinematics from video frames as fast as they are collected by the camera, typically between 15 and 30 frames per second.
We compare the running times on Windows 10 with a four-core CPU, 52 GB of RAM and an NVIDIA T4 GPU. We compare the CMS and D3KE methods, as both use the same type of convolutional network, and choose the lifting Transformer as the sequential architecture in both. For the CMS method, OpenSim is executed in parallel on four cores. We report results in frames per second.
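A frames-per-second comparison of this kind can be sketched with a simple timing harness. The dummy per-batch and per-frame costs below are made up purely to illustrate why batched processing raises throughput (a fixed overhead is amortized over more frames); they do not reflect the measured runtimes:

```python
import time

def throughput_fps(process_batch, n_frames, batch_size):
    """Measure processing speed in frames per second for a given batch size.
    process_batch(b) stands in for one forward pass over b frames."""
    start = time.perf_counter()
    done = 0
    while done < n_frames:
        b = min(batch_size, n_frames - done)
        process_batch(b)
        done += b
    return n_frames / (time.perf_counter() - start)

def dummy_batch(b, per_batch=0.001, per_frame=0.0001):
    # Hypothetical 'network': fixed overhead per batch plus a per-frame cost.
    time.sleep(per_batch + per_frame * b)

fps_small = throughput_fps(dummy_batch, 64, 1)
fps_large = throughput_fps(dummy_batch, 64, 32)
```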
3.4. Experiment 4: Generalization Performance
The goal of camera-based kinematic estimation is ultimately to create tools for researchers and clinicians to analyze and diagnose human movement; these tools should not discriminate between different subjects and movements. We therefore analyze whether our method generalizes to different subjects, movements, and joints.
To assess how well D3KE generalizes, we compare the estimates of the proposed method to the ground truth on the time series of each of the 16 participants in the test set with respect to the performed movement, the joint, the camera view and the individual participant. For this test, we use the best-performing model from experiment 1, with the lifting Transformer.
For all time series of joint angles, the mean absolute error (MAE) and Pearson’s correlation coefficient (r) were calculated between the estimation from D3KE and the ground truth.
Central tendencies in the data are reported as the median and interquartile range of MAE, RMSE and r, as the data are not normally distributed, as assessed through visual inspection and confirmed by the Shapiro–Wilk test. For completeness, the mean and standard deviation are also reported. The absolute values of r were categorized as weak, moderate, strong or excellent according to established thresholds [61].
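Both headline metrics can be computed with the standard library alone. The toy series below also illustrates why both are reported: a constant offset between estimation and ground truth leaves r perfect while the MAE is nonzero:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two time series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mae(xs, ys):
    """Mean absolute error between two time series."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical joint-angle traces (degrees): a constant 2-degree offset
# gives perfect correlation but a 2-degree MAE.
gt = [10, 20, 30, 40]
pred = [12, 22, 32, 42]
```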
3.5. Software and Tools
All data analysis was conducted in Python 3 [62], using the pandas library to generate descriptive statistics, the SciPy library for the calculation of MAE and RMSE, and the Pingouin library for the calculation of r.
5. Discussion
In summary, we compared a direct approach to estimating joint angles from video images with the more traditional multi-step approach found in most recent works. The traditional method first estimates key points from a video of a subject and then calculates joint angles using a (musculo)skeletal model through an inverse-kinematics process. We developed a method (D3KE) consisting of a convolutional neural network and a sequential network, both including a specialized layer that performs the kinematic transforms of a (musculo)skeletal model, which allows for direct optimization of the predicted joint angles while treating the prediction of key points only as an auxiliary task. We compared our direct estimation approach against naive implementations of algorithms often used in the related literature, as well as against a self-implemented custom multi-step approach (CMS) trained on the same data as our direct approach. We show that direct estimation of kinematics yields higher accuracy in the predicted joint angles compared to the traditional multi-step approach. Our results indicate that direct estimation can help the future development of algorithms for fast and accessible kinematic analysis for researchers and clinicians.
5.1. Direct vs. Multi-Step Estimation
5.1.1. 3D-Pose Kinematic Estimation
To compare direct against multi-step estimation, we compared our D3KE method with a 3D-pose-based multi-step approach (CMS) with a comparable network architecture, trained on the same training data. Compared to the CMS, our proposed method improves the accuracy of joint angle estimation. For all model combinations, we see a consistent improvement in the accuracy of the estimated joint angles. Our results support the feasibility of our proposed method. It delivers improvements by directly optimizing the predicted joint angles and scaling factors while using the pose estimates only as an auxiliary task; the auxiliary task effectively imposes a constraint on the network’s estimation. We show that direct optimization is preferable to the multi-step approach when using videos from a single camera view. We expect that, using additional specialized layers, a network might be able to directly optimize for individual muscle forces with comparable accuracy from a monocular video.
5.1.2. 2D-Pose Kinematic Estimation
We compared our direct approach (D3KE) and the self-trained multi-step approach (CMS) against two multi-step approaches commonly used in the literature. Compared to the more traditional implementations of the OpenPose and MediaPipe algorithms, both our proposed method and our CMS method show superior performance. From
Table 4 and
Figure 6, we see that the estimations from D3KE are far smoother than those of the traditional methods. This is likely due to multiple reasons. The predicted key points of OpenPose suffer from systematic errors due to inaccuracies in their training data [9], which can explain the drop in performance. For MediaPipe, the accuracy of the labels in its training data is not known, as the paper only states that the annotators were human [59], not whether they had expertise in labeling anatomical key points. The lack of smoothing of the predicted OpenPose key points can also contribute to its overall worse performance; MediaPipe, which uses internal smoothing, shows better results in comparison. In general, we expected worse performance from OpenPose and MediaPipe, as they only predict 18 or 33 key points, respectively, while we supervise a total of 77. Both traditional methods predict key points representing joint centers rather than markers on body segments, which makes it hard to distinguish rotations between different body segments during the musculoskeletal modeling step.
5.2. Network Variants
We compared three commonly used sequential networks to improve our estimation, and show that all of them improve estimation accuracy. One possible explanation is that the network is better able to handle ‘self-occlusion’ artifacts. The estimation of a 3D pose from a single camera is an ill-posed problem, as a 3D point projected onto a 2D image can originate from any position along the ray(s) that fall on the image sensor and form the corresponding pixel. This can lead to self-occlusion, such as the torso and left arm occluding the right arm during a right-arm swing in frames recorded from the left sagittal view. During self-occlusion, it is difficult for frame-based networks to make a good estimate, as they lack the temporal information of previous arm angles to extrapolate from. Sequential networks, on the other hand, have access to temporal information, which can allow for more accurate estimations. Although we only see a slight improvement in the MAE of the joint angle estimation, we see a clear improvement from the sequential models in the smoothness of the predicted angles (Table 4). This might be due to the network learning to interpolate motions during occurrences of self-occlusion.
5.3. Processing Speed
Compared to the multi-step baseline, CMS, our D3KE approach shows increased calculation speeds for larger batch sizes. Both CMS and D3KE use the same ResNeXt50 architecture, which should show approximately the same performance increase with increasing batch sizes for both methods. D3KE could be expected to be slower, as it has the additional time cost of calculating the pose from the estimated kinematics in the skeletal-model layer. However, due to its multi-step nature, CMS has to perform an additional inverse kinematics calculation. This calculation appears to form a bottleneck in the processing speed of the CMS approach, restricting it to a framerate of 8 fps. Other multi-step algorithms will most likely encounter the same problem; in the case of OpenPose, which runs at about 4 fps [59], even lower frame rates can be expected for a complete pipeline. This shows the advantage of our direct approach in processing time.
For a method to be usable in everyday life, it should run reasonably fast. Processing speeds of 15 to 30 frames per second are favorable, as they show that a method can process a video as fast as its frames are collected. However, our results might not directly translate to every real-world scenario. To process multiple images simultaneously as batches, D3KE currently requires GPUs that are not available in mobile devices, which prevents it from being portable. In addition, we use the Faster R-CNN object detection network to crop our images. This step was not included in the processing speed evaluation, as it is highly dependent on the chosen object-detection algorithm. However, with inference speeds of 12 fps, Faster R-CNN would form a bottleneck in applying our method in real-time applications. Given the speed of development in the field of object detection, Faster R-CNN can by now be regarded as an old algorithm, and newer and faster object detectors should be used instead; the YOLOv7 algorithm [63], which performs object detection at up to 286 fps, could be considered. In general, the current architecture is not optimized for speed or for a specific technology, and we are using off-the-shelf, fairly standard convolutional and sequential architectures, for which smaller and faster alternatives might be found in the future. With these optimizations, we predict the proposed method could run on mobile devices within a few years, effectively making a 3D kinematic analysis instrument available to everyone with a mobile phone or tablet.
5.4. Generalization Performance
Our method generalizes well on the tested data. As shown in
Figure 4 and
Table 6, the variation of the estimates across participants and movements is small. We conclude that our method generalizes to different camera views, participants, and performed movements within the tested dataset. Our results indicate that D3KE could be applied to a wide variety of people and movements, including clinical and sports applications, e.g., physiotherapists and athletes, when trained on sufficient additional data.
Although we show good generalization performance on the BMLMovi database, it is difficult to estimate how well our method will generalize in a real-world scenario. In machine learning settings, training data are often not representative of the network’s real-world task [64] and can introduce biases when applied to scenarios that differ greatly from those represented in the training data. Unfortunately, there is currently a lack of deep learning datasets for kinematic analysis [8,9,10,20]. In addition, while the BMLMovi database is excellent for training neural networks due to its large number of participants performing movements and the diversity of execution styles, it might not be extensive enough to train a network for biomedical applications in the real world. However, to evaluate the current method fully, such an extensive dataset would be necessary. In general, we expect a drop in accuracy when our method is applied to a scenario different from the BMLMovi database. As we train on just two calibrated cameras, we expect our method to be most vulnerable to alternative camera positions that do not show people in either frontal or sagittal view. Future research should investigate the stability of direct estimation methods when applied to data that differ significantly from the training data.
5.5. Future Work
To improve the accuracy of the algorithm and provide further insight into the strengths and weaknesses of monocular joint angle estimation, a new dataset with dedicated annotation is needed. A dataset specifically designed for the estimation of joint angles and/or kinetics could improve the accuracy of the algorithm. Such a dataset should provide a large number of camera views, including top-down views for better estimation of movements in the transverse plane, and participants should perform movements that exercise the full range of motion of individual joints, including the upper extremities, as well as movements that are relevant for health care professionals, such as physiotherapy exercises and other clinical tests. In addition, the inclusion of abnormal movement patterns could give better insight into the clinical relevance of newly developed methods.
Transfer learning could be explored to apply D3KE in settings where little training data are available. Videos that are very different from the BMLMovi training data, for example showing people wearing more clothes, in different surroundings, or filmed from a different camera view, will most probably yield worse accuracy than shown in this paper. Transfer learning of a pre-trained D3KE on a small portion of a dataset could be investigated as an alternative to the time-consuming collection of a novel dataset.
The capabilities of D3KE as an adapter for the kinetic analysis of a movement in OpenSim could be explored. Given data similar to BMLMovi, or successful transfer learning on relevant data beforehand, our method provides an easy way to skip the tedious steps of scaling a musculoskeletal model (MSM) and running inverse kinematics on it. This enables the quick generation of MSMs for kinetic analysis from just a single video. Even if this kinematic estimation comes at the cost of reduced accuracy, it could provide coarse insights into collected data, which can later be confirmed through finer analysis with manually scaled MSMs.
D3KE could be made more generally applicable if the underlying model of the skeletal-model layer were not fixed, as it currently is. Future iterations could explore combinations of the PyTorch and OpenSim Python libraries to allow training a network on a self-defined model, or to allow a pre-trained model to be refined through transfer learning for, e.g., estimating only the joint angles around the shoulder.
Existing explainable AI tools should be applied to better understand the inner workings of D3KE. Deep neural networks are capable of high-accuracy estimation because of their ability to break down highly complex tasks into simpler ones [24], but understanding what these simpler tasks are is non-trivial. Research in explainable AI has generated tools and frameworks that allow one to better understand the basis of the final predictions of a network [65]. Applying these tools could help users and researchers alike to better understand the biases and limitations of our method. D3KE can still predict joint angles even when the corresponding joints are occluded, which means it must make assumptions; what these assumptions are and how they came to be is important for estimating the trustworthiness of this algorithm in a real-world scenario.
6. Conclusions
In this paper, we presented a novel end-to-end neural network for the estimation of the segment joint angles of the human body. Compared to previous methods, we directly regress the joint angles and scale factors of individual segments from the input video. We trained our method from scratch on the BMLMovi database and compared it against a 3D pose estimation method, on whose output we used the inverse kinematics tool of OpenSim to obtain the kinematics.
We conclude that direct estimation of joint angles is preferable in a single-camera setting, as it is more accurate than the common approach of fitting an estimated pose to a musculoskeletal model and performing inverse kinematics. By allowing the network to directly optimize for the joint angles and scaling factors, our method is less prone to errors in the keypoint labels used for pose estimation. In addition, the use of a sequential model is important when designing a neural network architecture for kinematic estimation, as it allows predictions to be smoothed over time to create better estimates of limb positions and joint angles during self-occlusion.
While the use of deep learning for biomedical solutions is still in its infancy, the presented method shows that training networks from scratch for specialized tasks is a viable way to estimate joint angles from a single-camera video. With further advancements in the underlying algorithms as well as in computational performance, we predict that the presented methodology will assist biomedical and clinical practitioners in measuring and monitoring human movement in the near future.